Volcano

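Volcano collects information on chip faults, parameter-plane network faults, and node faults internally, and publishes it in a Kubernetes ConfigMap so that external components can query and use it.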

To query it, run kubectl describe cm -n volcano-system vcjob-fault-npu-cm. A sample of the command output is shown below; see Table 1 for a description of the key parameters.
Name:         vcjob-fault-npu-cm
Namespace:    volcano-system
Labels:       <none>
Annotations:  <none>
Data
====
checkCode:
----
e2d9860917e667cb0c5f3c989938fa99162d7f2bd598ff17a65e2bec37caaed5
fault-job-910x8:
----
fault-node:
----
[{"NodeName":"k8smaster","FaultDeviceList":[{"fault_type":"CardUnhealthy","npu_name":"Ascend910-4","fault_level":"SeparateNPU","large_model_fault_level":"SeparateNPU","fault_code":"A4028801,A8028801,80E18402,80E18401"}],"UpdateTime":1700019078,"UnhealthyNPU":["Ascend910-4"],"NetworkUnhealthyNPU":null,"IsFaultNode":true,"NodeDEnable":false,"NodeHealthState":"CardUnhealthy","AllCards":["Ascend910-2","Ascend910-3","Ascend910-4"],"FaultCards":[{"IsFaultCard":false,"NPUName":"Ascend910-2","NodeName":"k8smaster","FaultType":"Healthy"},{"IsFaultCard":false,"NPUName":"Ascend910-3","NodeName":"k8smaster","FaultType":"Healthy"},{"IsFaultCard":true,"NPUName":"Ascend910-4","NodeName":"k8smaster","FaultType":"Unhealthy"}],"HeartbeatInterval":5,"OldHeartbeatTime":1700468411,"NewHeartbeatTime":1700468416,"UpdateHeartbeatTime":1700468417}]
node-heartbeat:
----
[{"NodeName":"k8smaster","HeartbeatTime":1700468416,"UpdateTime":1700468417}]
node-rankIndex-Occurrence:
----
{}
remain-retry-times:
----
{}
Events:  <none>
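The fault-node value in the output above is a JSON array, so it can be processed with any JSON library once it has been read from the ConfigMap. The following is a minimal Python sketch, assuming the value is already available as a string (how it is fetched is not shown). The function name summarize_fault_nodes and the trimmed sample string are illustrative only and are not part of Volcano.

```python
import json

# Trimmed copy of the fault-node value from the sample output above.
SAMPLE_FAULT_NODE = (
    '[{"NodeName":"k8smaster",'
    '"FaultDeviceList":[{"fault_type":"CardUnhealthy","npu_name":"Ascend910-4",'
    '"fault_level":"SeparateNPU","large_model_fault_level":"SeparateNPU",'
    '"fault_code":"A4028801,A8028801,80E18402,80E18401"}],'
    '"UnhealthyNPU":["Ascend910-4"],"NetworkUnhealthyNPU":null,'
    '"IsFaultNode":true,"NodeHealthState":"CardUnhealthy"}]'
)


def summarize_fault_nodes(raw):
    """Return {node name: unhealthy NPU names} for nodes flagged as faulty."""
    summary = {}
    for node in json.loads(raw):
        if not node.get("IsFaultNode"):
            continue
        # Either list may be null in the ConfigMap, so fall back to [].
        npus = (node.get("UnhealthyNPU") or []) + (node.get("NetworkUnhealthyNPU") or [])
        summary[node["NodeName"]] = npus
    return summary


print(summarize_fault_nodes(SAMPLE_FAULT_NODE))
```

For the sample string, the function reports a single faulty node, k8smaster, with Ascend910-4 as its only unhealthy NPU.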
Table 1 Description of output parameters

Parameter

Description

fault-node

Fault information at the node level

NodeName

Node name

FaultDeviceList

List of faults

- fault_type

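Fault type. Each fault object contains six fields: fault_type, npu_name, large_model_fault_level, fault_level, fault_handling, and fault_code. Valid values of fault_type:

  • NodeUnhealthy: node fault
  • CardUnhealthy: chip fault
  • CardNetworkUnhealthy: parameter-plane network fault (a fault related to the chip network)

- npu_name

Name of the faulty chip; empty for a node fault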

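- large_model_fault_level
- fault_level
- fault_handling

Fault handling type; empty for a node fault

  • NotHandleFault: no handling is performed
  • RestartRequest: in inference scenarios, the inference request must be re-executed; in training scenarios, the training job is re-executed
  • RestartBusiness: the service must be re-executed
  • FreeRestartNPU: reset the chip directly and re-execute the service
  • RestartNPU: reset the chip directly and re-execute the service
  • SeparateNPU: isolate the chip
  • PreSeparateNPU: pre-isolate the chip; whether to reschedule is decided by how the training job is actually running

Note:

large_model_fault_level, fault_level, and fault_handling serve the same purpose; fault_handling is recommended.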

- fault_code

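Fault codes, as a single string joined with commas

  • Disconnected: chip network is disconnected
  • heartbeatTimeOut: node heartbeat is lost
  • For details on other fault codes, see the chip fault code reference documentation.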
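Because fault_code is a comma-joined string and the three handling fields carry the same information, a consumer typically normalizes each FaultDeviceList entry before acting on it. The sketch below shows one way to do that in Python. It assumes that a missing fault_handling field (as in the sample output, which only carries fault_level and large_model_fault_level) can be substituted by either of the other two fields; the function name parse_fault_device and the returned dictionary layout are hypothetical.

```python
def parse_fault_device(device):
    """Normalize one FaultDeviceList entry (a dict decoded from JSON)."""
    # Prefer fault_handling, then fall back to the two equivalent fields.
    handling = (
        device.get("fault_handling")
        or device.get("fault_level")
        or device.get("large_model_fault_level")
        or ""
    )
    # fault_code is one comma-joined string; split it into individual codes.
    raw_codes = device.get("fault_code") or ""
    codes = [code.strip() for code in raw_codes.split(",") if code.strip()]
    return {
        "fault_type": device.get("fault_type", ""),
        "npu_name": device.get("npu_name", ""),
        "handling": handling,
        "codes": codes,
    }


# Entry taken from the sample output of the query command.
sample = {
    "fault_type": "CardUnhealthy",
    "npu_name": "Ascend910-4",
    "fault_level": "SeparateNPU",
    "large_model_fault_level": "SeparateNPU",
    "fault_code": "A4028801,A8028801,80E18402,80E18401",
}
print(parse_fault_device(sample))
```

For the sample entry, handling resolves to SeparateNPU and codes contains the four individual fault codes.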