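Volcano collects its internal chip fault, parameter-plane network fault, and node fault information and publishes it in a Kubernetes ConfigMap so that external components can query and use it.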
    Name:         vcjob-fault-npu-cm
    Namespace:    volcano-system
    Labels:       <none>
    Annotations:  <none>

    Data
    ====
    fault-node:
    ----
    [{"FaultDeviceList":[{"fault_type":"CardNetworkUnhealthy","npu_name":"Ascend910-0","fault_level":"PreSeparateNPU","fault_handling":"PreSeparateNPU","large_model_fault_level":"PreSeparateNPU","fault_code":"81078603"},{"fault_type":"CardUnhealthy","npu_name":"Ascend910-4","fault_level":"SeparateNPU","fault_handling":"SeparateNPU","large_model_fault_level":"SeparateNPU","fault_code":"A8028801,A4028801,80E18402,80E18401"}],"NodeName":"node133","UnhealthyNPU":["Ascend910-4"],"NetworkUnhealthyNPU":["Ascend910-0"],"NodeDEnable":true,"NodeHealthState":"CardUnhealthy","UpdateTime":1744182212}]
    remain-retry-times:
    ----

    BinaryData
    ====

    Events:  <none>
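ConfigMap values are always strings, so the `fault-node` entry is a JSON array stored inside a string and needs its own decoding step. The following is a minimal Python sketch of that step, assuming the ConfigMap has been exported as JSON (for example with `kubectl get configmap vcjob-fault-npu-cm -n volcano-system -o json`). The function name `load_fault_nodes` is hypothetical and not part of Volcano.

```python
import json


def load_fault_nodes(configmap_json: str) -> list:
    """Return the decoded fault-node list from a ConfigMap exported as JSON.

    Assumption: `configmap_json` is the text of the ConfigMap object in JSON
    form. A missing or empty `fault-node` value is treated as "no faulty nodes".
    """
    configmap = json.loads(configmap_json)
    raw = (configmap.get("data") or {}).get("fault-node") or ""
    if not raw.strip():
        return []
    # Second decode: the value itself is a JSON array stored as a string.
    return json.loads(raw)


if __name__ == "__main__":
    # Hand-built stand-in for an exported ConfigMap, trimmed for brevity.
    exported = json.dumps({
        "kind": "ConfigMap",
        "metadata": {"name": "vcjob-fault-npu-cm", "namespace": "volcano-system"},
        "data": {
            "fault-node": json.dumps(
                [{"NodeName": "node133", "UnhealthyNPU": ["Ascend910-4"]}]
            ),
            "remain-retry-times": "",
        },
    })
    for node in load_fault_nodes(exported):
        print(node["NodeName"], node["UnhealthyNPU"])
```

Returning an empty list for an empty value keeps callers free of special cases when no node is currently faulty.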
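| Name | Purpose | Value | Remarks |
|---|---|---|---|
| fault-node | Faulty node information | - | - |
| - NodeName | Node name | String | - |
| - UpdateTime | - | 64-bit integer | - |
| - UnhealthyNPU | Set of chips on the faulty node that have chip faults | String slice | - |
| - NetworkUnhealthyNPU | Set of chips on the faulty node that have network faults | String slice | - |
| - NodeDEnable | Whether the node status detection switch is enabled | | - |
| - NodeHealthState | Node health state | String | - |
| FaultDeviceList | - | - | - |
| - fault_type | Fault type object | | - |
| - npu_name | Name of the faulty chip; empty for a node fault | String | - |
| - fault_level | Fault handling type; empty for a node fault | | Note: fault_level, fault_handling, and large_model_fault_level serve the same purpose; fault_handling is recommended. |
| - fault_handling | Fault handling type; empty for a node fault | | Same note as fault_level. |
| - large_model_fault_level | Fault handling type; empty for a node fault | | Same note as fault_level. |
| - fault_code | Fault codes, joined into a single string with commas | String | |
| remain-retry-times | Remaining rescheduling information for jobs | - | - |
| - UUID | Job UID | String | - |
| - Times | Number of rescheduling attempts the job has left | Integer | - |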
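The field descriptions above can be turned into a small summary routine. The sketch below is illustrative only: `NpuFault` and `summarize_fault_node` are hypothetical names, the input is one already-decoded element of the `fault-node` array, and reading `UpdateTime` as Unix seconds is an assumption based on the sample value rather than something the table states.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass
class NpuFault:
    npu_name: str      # empty for a node fault
    fault_type: str
    handling: str      # taken from fault_handling when present
    fault_codes: list  # fault_code split on commas


def summarize_fault_node(node: dict) -> dict:
    """Summarize one decoded element of the fault-node array."""
    faults = []
    for dev in node.get("FaultDeviceList") or []:
        # fault_handling is the recommended field; the other two carry the
        # same information and are used here only as fallbacks.
        handling = (
            dev.get("fault_handling")
            or dev.get("fault_level")
            or dev.get("large_model_fault_level")
            or ""
        )
        codes = [c.strip() for c in (dev.get("fault_code") or "").split(",") if c.strip()]
        faults.append(
            NpuFault(dev.get("npu_name", ""), dev.get("fault_type", ""), handling, codes)
        )
    return {
        "node": node.get("NodeName", ""),
        "state": node.get("NodeHealthState", ""),
        # Assumption: UpdateTime is a Unix timestamp in seconds.
        "updated": datetime.fromtimestamp(node.get("UpdateTime", 0), tz=timezone.utc),
        "unhealthy": set(node.get("UnhealthyNPU") or []),
        "network_unhealthy": set(node.get("NetworkUnhealthyNPU") or []),
        "faults": faults,
    }


if __name__ == "__main__":
    sample = {
        "FaultDeviceList": [
            {"fault_type": "CardUnhealthy", "npu_name": "Ascend910-4",
             "fault_handling": "SeparateNPU",
             "fault_code": "A8028801,A4028801,80E18402,80E18401"},
        ],
        "NodeName": "node133",
        "UnhealthyNPU": ["Ascend910-4"],
        "NetworkUnhealthyNPU": ["Ascend910-0"],
        "NodeHealthState": "CardUnhealthy",
        "UpdateTime": 1744182212,
    }
    summary = summarize_fault_node(sample)
    for fault in summary["faults"]:
        print(summary["node"], fault.npu_name, fault.handling, fault.fault_codes)
```

Splitting `fault_code` up front lets callers match individual codes without repeating the string handling.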