Volcano

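Volcano collects its internal information about chip faults, parameter-plane network faults, and node faults, and publishes it in a Kubernetes ConfigMap so that external components can query and use it.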

To query it, run kubectl describe cm -n volcano-system vcjob-fault-npu-cm. Sample command output is shown below; see Table 1 for descriptions of the key fields.
Name:         vcjob-fault-npu-cm
Namespace:    volcano-system
Labels:       <none>
Annotations:  <none>

Data
====
fault-node:
----
[{"FaultDeviceList":[{"fault_type":"CardNetworkUnhealthy","npu_name":"Ascend910-0","fault_level":"PreSeparateNPU","fault_handling":"PreSeparateNPU","large_model_fault_level":"PreSeparateNPU","fault_code":"81078603"},{"fault_type":"CardUnhealthy","npu_name":"Ascend910-4","fault_level":"SeparateNPU","fault_handling":"SeparateNPU","large_model_fault_level":"SeparateNPU","fault_code":"A8028801,A4028801,80E18402,80E18401"}],"NodeName":"node133","UnhealthyNPU":["Ascend910-4"],"NetworkUnhealthyNPU":["Ascend910-0"],"NodeDEnable":true,"NodeHealthState":"CardUnhealthy","UpdateTime":1744182212}]
remain-retry-times:
----


BinaryData
====

Events:  <none>
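The fault-node value shown above is a JSON string stored inside the ConfigMap, so a consumer has to decode it a second time after reading the ConfigMap itself. The following is a minimal sketch, assuming the ConfigMap was exported with kubectl get cm -n volcano-system vcjob-fault-npu-cm -o json; the helper names load_fault_nodes and summarize are hypothetical, and SAMPLE_CM is a trimmed stand-in built from the sample output rather than real cluster data.

```python
import json

# Trimmed stand-in for the JSON form of the ConfigMap. ConfigMap data values
# are strings, so the fault-node value is itself a JSON document.
SAMPLE_CM = json.dumps({
    "metadata": {"name": "vcjob-fault-npu-cm", "namespace": "volcano-system"},
    "data": {
        "fault-node": json.dumps([{
            "FaultDeviceList": [
                {"fault_type": "CardNetworkUnhealthy", "npu_name": "Ascend910-0",
                 "fault_handling": "PreSeparateNPU", "fault_code": "81078603"},
                {"fault_type": "CardUnhealthy", "npu_name": "Ascend910-4",
                 "fault_handling": "SeparateNPU",
                 "fault_code": "A8028801,A4028801,80E18402,80E18401"},
            ],
            "NodeName": "node133",
            "UnhealthyNPU": ["Ascend910-4"],
            "NetworkUnhealthyNPU": ["Ascend910-0"],
            "NodeDEnable": True,
            "NodeHealthState": "CardUnhealthy",
            "UpdateTime": 1744182212,
        }]),
        "remain-retry-times": "",
    },
})


def load_fault_nodes(cm_json: str) -> list:
    """Return the decoded fault-node list from a ConfigMap JSON document."""
    data = json.loads(cm_json).get("data") or {}
    raw = data.get("fault-node", "")
    # Assumption: an empty or missing key means there are no faulty nodes.
    return json.loads(raw) if raw.strip() else []


def summarize(nodes: list) -> dict:
    """Map each node name to its health state and its two faulty-NPU lists."""
    return {
        n["NodeName"]: {
            "state": n.get("NodeHealthState"),
            "unhealthy": n.get("UnhealthyNPU") or [],
            "network_unhealthy": n.get("NetworkUnhealthyNPU") or [],
        }
        for n in nodes
    }


summary = summarize(load_fault_nodes(SAMPLE_CM))
print(summary)
```

In a real cluster the string passed to load_fault_nodes would come from the kubectl command above or from a Kubernetes client library; the decoding steps stay the same.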
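Table 1 Field descriptions for vcjob-fault-npu-cm

| Name | Description | Value | Remarks |
|------|-------------|-------|---------|
| fault-node | Information about faulty nodes | - | - |
| - NodeName | Node name | String | - |
| - UpdateTime | - | 64-bit integer | - |
| - UnhealthyNPU | Set of chips on the faulty node that have chip faults | Slice of strings | - |
| - NetworkUnhealthyNPU | Set of chips on the faulty node that have network faults | Slice of strings | - |
| - NodeDEnable | Whether the node status detection switch is turned on | True; False | - |
| - NodeHealthState | Node health state | String | - |
| FaultDeviceList | - | - | - |
| - fault_type | Fault type | CardUnhealthy: chip fault; CardNetworkUnhealthy: chip network fault; NodeUnhealthy: node fault | - |
| - npu_name | Name of the faulty chip; empty for node faults | String | - |
| - fault_level | Fault handling type; empty for node faults | NotHandleFault: no handling; RestartRequest: in inference scenarios the inference request must be re-executed, in training scenarios the training job is re-executed; RestartBusiness: the service must be re-executed; FreeRestartNPU: service execution is affected, and the chip must be reset once it becomes idle; RestartNPU: reset the chip immediately and re-execute the service; SeparateNPU: isolate the chip; PreSeparateNPU: pre-isolate the chip, and decide whether to reschedule based on how the training job is actually running | Note: fault_level, fault_handling, and large_model_fault_level serve the same purpose; fault_handling is recommended. |
| - fault_handling | Same as fault_level | Same as fault_level | Same as fault_level |
| - large_model_fault_level | Same as fault_level | Same as fault_level | Same as fault_level |
| - fault_code | Fault codes, joined into a single string with commas | String | Disconnected: the chip network is not connected; heartbeatTimeOut: the node status is lost |
| remain-retry-times | Remaining rescheduling information for jobs | - | - |
| - UUID | Job UID | String | - |
| - Times | Number of times the job can still be rescheduled | Integer | - |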
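The field descriptions for FaultDeviceList can be illustrated with a short sketch that reads the recommended fault_handling field, splits the comma-joined fault_code string, and groups chips by handling type. The helper names are hypothetical, and treating UpdateTime as a Unix timestamp in seconds is an assumption based only on the magnitude of the sample value; the table does not state its unit.

```python
from datetime import datetime, timezone

# Device entries copied from the sample fault-node output.
DEVICES = [
    {"fault_type": "CardNetworkUnhealthy", "npu_name": "Ascend910-0",
     "fault_level": "PreSeparateNPU", "fault_handling": "PreSeparateNPU",
     "large_model_fault_level": "PreSeparateNPU", "fault_code": "81078603"},
    {"fault_type": "CardUnhealthy", "npu_name": "Ascend910-4",
     "fault_level": "SeparateNPU", "fault_handling": "SeparateNPU",
     "large_model_fault_level": "SeparateNPU",
     "fault_code": "A8028801,A4028801,80E18402,80E18401"},
]


def handling_of(device: dict) -> str:
    """Prefer fault_handling; fall back to the two equivalent fields."""
    for key in ("fault_handling", "fault_level", "large_model_fault_level"):
        value = device.get(key)
        if value:
            return value
    return ""  # node faults leave the handling type empty


def fault_codes(device: dict) -> list:
    """Split the comma-joined fault_code string into individual codes."""
    raw = device.get("fault_code", "")
    return [code.strip() for code in raw.split(",") if code.strip()]


def group_by_handling(devices: list) -> dict:
    """Map each handling type to the names of the chips that report it."""
    grouped = {}
    for device in devices:
        grouped.setdefault(handling_of(device), []).append(device.get("npu_name", ""))
    return grouped


def update_time_utc(update_time: int) -> datetime:
    # Assumption: UpdateTime is a Unix timestamp in seconds.
    return datetime.fromtimestamp(update_time, tz=timezone.utc)


grouped = group_by_handling(DEVICES)
codes = fault_codes(DEVICES[1])
print(grouped)
print(codes)
```

Reading fault_handling first follows the recommendation in Table 1; the fallback to the other two fields is only a defensive choice for entries where that key is absent.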