Fault Information
Ascend Device Plugin collects internal processor faults, parameter plane network faults, and node faults, and publishes them in a Kubernetes ConfigMap. Each ConfigMap holds the information for one node and is available for external query and use.
Query command: kubectl describe cm -n kube-system mindx-dl-deviceinfo-${node_name}
{"DeviceInfo":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-0,Ascend910-1,Ascend910-2,Ascend910-3,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-Fault":"[{\"fault_type\":\"CardNetworkUnhealthy\",\"npu_name\":\"Ascend910-0\",\"large_model_fault_level\":\"PreSeparateNPU\",\"fault_level\":\"PreSeparateNPU\",\"fault_handling\":\"PreSeparateNPU\",\"fault_code\":\"81078603\",\"fault_time_and_level_map\":{\"81078603\":{\"fault_time\":1744168468259,\"fault_level\":\"PreSeparateNPU\"}}},{\"fault_type\":\"CardUnhealthy\",\"npu_name\":\"Ascend910-4\",\"large_model_fault_level\":\"SeparateNPU\",\"fault_level\":\"SeparateNPU\",\"fault_handling\":\"SeparateNPU\",\"fault_code\":\"A8028801,A4028801,80E18402,80E18401\",\"fault_time_and_level_map\":{\"80E18401\":{\"fault_time\":1744167455784,\"fault_level\":\"NotHandleFault\"},\"80E18402\":{\"fault_time\":1744167455784,\"fault_level\":\"SeparateNPU\"},\"A4028801\":{\"fault_time\":1744167455784,\"fault_level\":\"NotHandleFault\"},\"A8028801\":{\"fault_time\":1744167455784,\"fault_level\":\"SeparateNPU\"}}}]","huawei.com/Ascend910-NetworkUnhealthy":"Ascend910-0","huawei.com/Ascend910-Recovering":"","huawei.com/Ascend910-Unhealthy":"Ascend910-4"},"UpdateTime":1744182144},"SuperPodID":-2,"ServerIndex":-2,"CheckCode":"a550811fdfafb5717555526816af2ca4ac6c3e102f5907574048578e0c8fcc73"}
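
In the sample above, the value of huawei.com/Ascend910-Fault is itself a JSON-encoded string, so a consumer has to decode it a second time. The following Python sketch shows one way to do that. It assumes the payload has already been extracted from the ConfigMap as a string. The helper names split_names and parse_device_info are hypothetical and are not part of Ascend Device Plugin, and the sample is a shortened payload built to mirror the structure shown above.

```python
import json


def split_names(value):
    """Split a comma-separated processor list; an empty string gives []."""
    return [name for name in value.split(",") if name]


def parse_device_info(raw, resource="huawei.com/Ascend910"):
    """Decode the payload and the nested, JSON-encoded fault array."""
    doc = json.loads(raw)
    device_list = doc["DeviceInfo"]["DeviceList"]
    # The -Fault value is a JSON string inside JSON, so decode it again.
    faults = json.loads(device_list.get(resource + "-Fault") or "[]")
    return {
        "network_unhealthy": split_names(device_list.get(resource + "-NetworkUnhealthy", "")),
        "unhealthy": split_names(device_list.get(resource + "-Unhealthy", "")),
        "recovering": split_names(device_list.get(resource + "-Recovering", "")),
        "faults": faults,
        "update_time": doc["DeviceInfo"]["UpdateTime"],
    }


# Shortened sample (hypothetical) that mirrors the structure of the output above.
sample_faults = [{
    "fault_type": "CardUnhealthy",
    "npu_name": "Ascend910-4",
    "fault_handling": "SeparateNPU",
    "fault_code": "80E18402",
}]
sample = json.dumps({
    "DeviceInfo": {
        "DeviceList": {
            "huawei.com/Ascend910-Fault": json.dumps(sample_faults),
            "huawei.com/Ascend910-NetworkUnhealthy": "Ascend910-0",
            "huawei.com/Ascend910-Recovering": "",
            "huawei.com/Ascend910-Unhealthy": "Ascend910-4",
        },
        "UpdateTime": 1744182144,
    },
    "SuperPodID": -2,
    "ServerIndex": -2,
})

info = parse_device_info(sample)
for fault in info["faults"]:
    print(fault["npu_name"], fault["fault_type"], fault["fault_handling"])
```

Treating an empty string as an empty list keeps fields such as huawei.com/Ascend910-Recovering from producing a spurious empty processor name.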
| Parameter | Description |
|---|---|
| huawei.com/Ascend910 | Names of the available processors on the current node. Multiple processors are separated by commas (,). NOTE: This field is no longer used and will not be displayed in later versions. By default, the available processors of a node are maintained by Volcano, and this field does not take effect. To make the field take effect, set the Volcano configuration parameter self-maintain-available-card to false. |
| huawei.com/Ascend910-NetworkUnhealthy | Names of the processors with an unhealthy network on the current node. Multiple processors are separated by commas (,). |
| huawei.com/Ascend910-Unhealthy | Names of the unhealthy processors on the current node. Multiple processors are separated by commas (,). |
| huawei.com/Ascend910-Recovering | Names of the processors being restored on the current node. Multiple processors are separated by commas (,). |
| huawei.com/Ascend910-Fault | Array of objects. Each object contains fault_type, npu_name, large_model_fault_level, fault_level, fault_handling, fault_code, and fault_time_and_level_map. |
| - fault_type | Fault type. |
| - npu_name | Name of the faulty processor. This parameter is left empty if the node is faulty. |
| - large_model_fault_level | Fault handling type. This parameter is left empty if the node is faulty. |
| - fault_level | |
| - fault_handling | |
| - fault_code | Fault codes, as a string separated by commas (,). For details about processor fault codes, see Processor Fault Code Reference Documents. |
| - fault_time_and_level_map | Map keyed by fault code. Each value contains the fault occurrence time and the fault handling level. |
| SuperPodID | SuperPoD ID. |
| ServerIndex | Relative position of the current node in a SuperPoD. |
| CheckCode | Verification code. |
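
To illustrate how fault_code relates to fault_time_and_level_map, the Python sketch below expands one fault entry into one row per fault code. The entry is copied from the sample output earlier in this section. The helper name expand_fault_codes is hypothetical, and the conversion assumes that fault_time is a Unix timestamp in milliseconds. That unit is inferred from the magnitude of the sample values and is not stated in this document.

```python
from datetime import datetime, timezone


def expand_fault_codes(fault):
    """Return one row per fault code of a single huawei.com/Ascend910-Fault entry."""
    rows = []
    level_map = fault.get("fault_time_and_level_map", {})
    for code in fault["fault_code"].split(","):
        detail = level_map.get(code, {})
        millis = detail.get("fault_time")
        rows.append({
            "npu_name": fault["npu_name"],
            "fault_code": code,
            "fault_level": detail.get("fault_level"),
            # Assumption: fault_time is Unix epoch time in milliseconds.
            "fault_time": (
                datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
                if millis is not None else None
            ),
        })
    return rows


# Entry for Ascend910-4, taken from the sample output in this section.
fault = {
    "fault_type": "CardUnhealthy",
    "npu_name": "Ascend910-4",
    "fault_level": "SeparateNPU",
    "fault_code": "A8028801,A4028801,80E18402,80E18401",
    "fault_time_and_level_map": {
        "80E18401": {"fault_time": 1744167455784, "fault_level": "NotHandleFault"},
        "80E18402": {"fault_time": 1744167455784, "fault_level": "SeparateNPU"},
        "A4028801": {"fault_time": 1744167455784, "fault_level": "NotHandleFault"},
        "A8028801": {"fault_time": 1744167455784, "fault_level": "SeparateNPU"},
    },
}

rows = expand_fault_codes(fault)
for row in rows:
    print(row["npu_name"], row["fault_code"], row["fault_level"], row["fault_time"])
```

In this sample, the per-code levels in fault_time_and_level_map differ from one code to another (NotHandleFault and SeparateNPU), while the entry as a whole reports SeparateNPU.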