Volcano
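MindCluster Volcano collects internal information about chip faults, parameter-plane network faults, and node faults, and publishes it in a K8s ConfigMap so that external components can query and use it.
To query it, run kubectl describe cm -n volcano-system vcjob-fault-npu-cm. A sample of the command output is shown below. See Table 1 for descriptions of the key parameters.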
Name: vcjob-fault-npu-cm
Namespace: volcano-system
Labels: <none>
Annotations: <none>
Data
====
checkCode:
----
e2d9860917e667cb0c5f3c989938fa99162d7f2bd598ff17a65e2bec37caaed5
fault-job-910x8:
----
fault-node:
----
[{"NodeName":"k8smaster","FaultDeviceList":[{"fault_type":"CardUnhealthy","npu_name":"Ascend910-4","fault_level":"SeparateNPU","large_model_fault_level":"SeparateNPU","fault_code":"A4028801,A8028801,80E18402,80E18401"}],"UpdateTime":1700019078,"UnhealthyNPU":["Ascend910-4"],"NetworkUnhealthyNPU":null,"IsFaultNode":true,"NodeDEnable":false,"NodeHealthState":"CardUnhealthy","AllCards":["Ascend910-2","Ascend910-3","Ascend910-4"],"FaultCards":[{"IsFaultCard":false,"NPUName":"Ascend910-2","NodeName":"k8smaster","FaultType":"Healthy"},{"IsFaultCard":false,"NPUName":"Ascend910-3","NodeName":"k8smaster","FaultType":"Healthy"},{"IsFaultCard":true,"NPUName":"Ascend910-4","NodeName":"k8smaster","FaultType":"Unhealthy"}],"HeartbeatInterval":5,"OldHeartbeatTime":1700468411,"NewHeartbeatTime":1700468416,"UpdateHeartbeatTime":1700468417}]
node-heartbeat:
----
[{"NodeName":"k8smaster","HeartbeatTime":1700468416,"UpdateTime":1700468417}]
node-rankIndex-Occurrence:
----
{}
remain-retry-times:
----
{}
Events: <none>
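The fault-node value in the output above is a JSON array, so it can be processed programmatically. The following is a minimal sketch, not part of MindCluster: it assumes the value has already been read as a string (for example from `data["fault-node"]` in the output of `kubectl get cm -n volcano-system vcjob-fault-npu-cm -o json`), and the function name `unhealthy_npus_by_node` is hypothetical.

```python
import json


def unhealthy_npus_by_node(fault_node_value: str) -> dict:
    """Map each faulty node name to the NPUs reported as unhealthy on it.

    Hypothetical helper: it relies only on the field names visible in the
    sample output (NodeName, IsFaultNode, UnhealthyNPU, NetworkUnhealthyNPU).
    """
    result = {}
    # An empty value (as for fault-job-910x8 in the sample) is treated as no nodes.
    for node in json.loads(fault_node_value or "[]"):
        if not node.get("IsFaultNode"):
            continue
        # Both lists may be null in the ConfigMap, so fall back to [].
        npus = (node.get("UnhealthyNPU") or []) + (node.get("NetworkUnhealthyNPU") or [])
        result[node["NodeName"]] = npus
    return result


# Abbreviated copy of the fault-node entry from the sample output.
sample = (
    '[{"NodeName":"k8smaster","UnhealthyNPU":["Ascend910-4"],'
    '"NetworkUnhealthyNPU":null,"IsFaultNode":true}]'
)
print(unhealthy_npus_by_node(sample))  # {'k8smaster': ['Ascend910-4']}
```

Treating null lists as empty keeps the helper usable for nodes that report only chip faults or only network faults.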
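| Parameter | Description |
|---|---|
| fault-node | Fault information at the node level. |
| NodeName | Node name. |
| FaultDeviceList | Fault list. |
| - fault_type | Fault type object. The object contains six fields: fault_type, npu_name, large_model_fault_level, fault_level, fault_handling, and fault_code. |
| - npu_name | Name of the faulty chip. Empty for node faults. |
| - large_model_fault_level | Fault handling type. Empty for node faults. Note: large_model_fault_level, fault_level, and fault_handling serve the same purpose; fault_handling is recommended. |
| - fault_level | Same as large_model_fault_level. |
| - fault_handling | Same as large_model_fault_level. |
| - fault_code | Fault codes, given as a single string joined with commas. |
| FaultTasks | List of fault information at the task level. Contains the Reason field. |
| - Reason | Fault cause. A string formed by concatenating the fields of FaultDeviceList. |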
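Because large_model_fault_level, fault_level, and fault_handling carry the same information and fault_code is a comma-joined string, a consumer usually normalizes a FaultDeviceList entry before using it. The sketch below is one possible way to do that under stated assumptions: the helper name `summarize_fault` and the fallback order (fault_handling first, then fault_level, then large_model_fault_level) are illustrative choices, not behavior defined by MindCluster.

```python
def summarize_fault(device: dict) -> dict:
    """Normalize one FaultDeviceList entry (hypothetical helper).

    Prefers fault_handling, as the table recommends, and falls back to the
    two equivalent fields when it is absent, as in the sample output.
    """
    handling = (
        device.get("fault_handling")
        or device.get("fault_level")
        or device.get("large_model_fault_level")
        or ""  # empty for node faults
    )
    # fault_code is a comma-joined string; split it into individual codes.
    raw_codes = device.get("fault_code") or ""
    codes = [code.strip() for code in raw_codes.split(",") if code.strip()]
    return {
        "npu": device.get("npu_name") or "",
        "type": device.get("fault_type") or "",
        "handling": handling,
        "codes": codes,
    }


# FaultDeviceList entry copied from the sample output (no fault_handling field).
entry = {
    "fault_type": "CardUnhealthy",
    "npu_name": "Ascend910-4",
    "fault_level": "SeparateNPU",
    "large_model_fault_level": "SeparateNPU",
    "fault_code": "A4028801,A8028801,80E18402,80E18401",
}
summary = summarize_fault(entry)
print(summary["handling"], summary["codes"])
```

Returning an empty string and an empty list for missing fields lets the same code path handle node faults, where npu_name and the handling fields are empty.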