Volcano
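Volcano collects its internal chip fault, parameter-plane network fault, and node fault information and publishes it in a Kubernetes ConfigMap so that external components can query and use it.
To query it, run kubectl describe cm -n volcano-system vcjob-fault-npu-cm. The following is an example of the command output. For descriptions of the key parameters, see Table 1.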
Name:         vcjob-fault-npu-cm
Namespace:    volcano-system
Labels:       <none>
Annotations:  <none>
Data
====
fault-node:
----
[{"FaultDeviceList":[{"fault_type":"CardNetworkUnhealthy","npu_name":"Ascend910-0","fault_level":"PreSeparateNPU","fault_handling":"PreSeparateNPU","large_model_fault_level":"PreSeparateNPU","fault_code":"81078603"},{"fault_type":"CardUnhealthy","npu_name":"Ascend910-4","fault_level":"SeparateNPU","fault_handling":"SeparateNPU","large_model_fault_level":"SeparateNPU","fault_code":"A8028801,A4028801,80E18402,80E18401"}],"NodeName":"node133","UnhealthyNPU":["Ascend910-4"],"NetworkUnhealthyNPU":["Ascend910-0"],"NodeDEnable":true,"NodeHealthState":"CardUnhealthy","UpdateTime":1744182212}]
remain-retry-times:
----
BinaryData
====
Events:  <none>
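The value stored under fault-node is a JSON array with one object per faulty node, so it can be processed programmatically. The following is a minimal Python sketch, not an official tool: the function name parse_fault_nodes and the shape of the returned summary are hypothetical, and the sample string is a shortened copy of the output above. It assumes the raw string has already been extracted from the ConfigMap.

```python
import json


def parse_fault_nodes(raw):
    """Summarize the JSON string stored under the fault-node key.

    Returns one dict per faulty node. An empty or blank value yields an
    empty list, which is treated here as "no faults reported" (an assumption).
    """
    if not raw or not raw.strip():
        return []
    summary = []
    for node in json.loads(raw):
        summary.append({
            "node": node.get("NodeName", ""),
            "state": node.get("NodeHealthState", ""),
            # Chips with chip faults and chips with network faults are reported separately.
            "unhealthy_npu": node.get("UnhealthyNPU") or [],
            "network_unhealthy_npu": node.get("NetworkUnhealthyNPU") or [],
            "update_time": node.get("UpdateTime"),
        })
    return summary


# Shortened sample taken from the command output above (FaultDeviceList emptied for brevity).
SAMPLE = (
    '[{"FaultDeviceList":[],"NodeName":"node133",'
    '"UnhealthyNPU":["Ascend910-4"],"NetworkUnhealthyNPU":["Ascend910-0"],'
    '"NodeDEnable":true,"NodeHealthState":"CardUnhealthy","UpdateTime":1744182212}]'
)

if __name__ == "__main__":
    for entry in parse_fault_nodes(SAMPLE):
        print(entry["node"], entry["state"], entry["unhealthy_npu"])
```

One way to obtain the raw string is to run kubectl get cm -n volcano-system vcjob-fault-npu-cm -o json and read the "fault-node" entry of the "data" object in the result.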
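| Name | Purpose | Value | Remarks |
|---|---|---|---|
| fault-node | Information about faulty nodes | - | - |
| - NodeName | Node name | String | - |
| - UpdateTime | - | 64-bit integer | - |
| - UnhealthyNPU | Set of chips with chip faults on the faulty node | Slice of strings | - |
| - NetworkUnhealthyNPU | Set of chips with network faults on the faulty node | Slice of strings | - |
| - NodeDEnable | Whether the node status detection switch is enabled |  | - |
| - NodeHealthState | Node health state | String | - |
| FaultDeviceList | - | - | - |
| - fault_type | Fault type object |  | - |
| - npu_name | Name of the faulty chip; empty for a node fault | String | - |
| - fault_level | Fault handling type; empty for a node fault |  | Note: fault_level, fault_handling, and large_model_fault_level serve the same purpose. fault_handling is recommended. |
| - fault_handling | Same as fault_level |  | See the note for fault_level. |
| - large_model_fault_level | Same as fault_level |  | See the note for fault_level. |
| - fault_code | Fault codes, joined into one string with commas | String |  |
| remain-retry-times | Information about the remaining rescheduling attempts of jobs | - | - |
| - UUID | Job UID | String | - |
| - Times | Number of remaining rescheduling attempts for the job | Integer | - |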
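The per-device fields can be consumed as in the following Python sketch. It is illustrative only: the function name device_faults and the fallback order are assumptions. It prefers fault_handling, as recommended in the table, and falls back to fault_level and then large_model_fault_level only if the preferred field is missing or empty. It also splits fault_code on commas. The sample node is copied from the example output earlier in this section.

```python
# Preferred field first; the other two serve the same purpose per the table.
HANDLING_KEYS = ("fault_handling", "fault_level", "large_model_fault_level")


def device_faults(node):
    """Return one row per entry in a node's FaultDeviceList."""
    rows = []
    for dev in node.get("FaultDeviceList") or []:
        # Take the first non-empty handling field (all empty for a node fault).
        handling = next((dev[k] for k in HANDLING_KEYS if dev.get(k)), "")
        # fault_code is a single comma-joined string; split it into a list.
        raw_codes = dev.get("fault_code") or ""
        codes = [c.strip() for c in raw_codes.split(",") if c.strip()]
        rows.append({
            "npu": dev.get("npu_name", ""),
            "type": dev.get("fault_type", ""),
            "handling": handling,
            "codes": codes,
        })
    return rows


# One node object, copied from the example output.
SAMPLE_NODE = {
    "FaultDeviceList": [
        {"fault_type": "CardNetworkUnhealthy", "npu_name": "Ascend910-0",
         "fault_level": "PreSeparateNPU", "fault_handling": "PreSeparateNPU",
         "large_model_fault_level": "PreSeparateNPU", "fault_code": "81078603"},
        {"fault_type": "CardUnhealthy", "npu_name": "Ascend910-4",
         "fault_level": "SeparateNPU", "fault_handling": "SeparateNPU",
         "large_model_fault_level": "SeparateNPU",
         "fault_code": "A8028801,A4028801,80E18402,80E18401"},
    ],
    "NodeName": "node133",
}

if __name__ == "__main__":
    for row in device_faults(SAMPLE_NODE):
        print(row["npu"], row["type"], row["handling"], row["codes"])
```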