Fault Information

Ascend Device Plugin collects internal processor faults, parameter plane network faults, and node faults, and publishes them in Kubernetes ConfigMaps for external query and use. Each ConfigMap holds the information for one node.

Query command: kubectl describe cm -n kube-system mindx-dl-deviceinfo-${node_name}

The following uses an Atlas A3 training product as an example to illustrate the query result. Parameters in the command output may vary by device type. Table 1 describes the key parameters.
{"DeviceInfo":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-0,Ascend910-1,Ascend910-2,Ascend910-3,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-Fault":"[{\"fault_type\":\"CardNetworkUnhealthy\",\"npu_name\":\"Ascend910-0\",\"large_model_fault_level\":\"PreSeparateNPU\",\"fault_level\":\"PreSeparateNPU\",\"fault_handling\":\"PreSeparateNPU\",\"fault_code\":\"81078603\",\"fault_time_and_level_map\":{\"81078603\":{\"fault_time\":1744168468259,\"fault_level\":\"PreSeparateNPU\"}}},{\"fault_type\":\"CardUnhealthy\",\"npu_name\":\"Ascend910-4\",\"large_model_fault_level\":\"SeparateNPU\",\"fault_level\":\"SeparateNPU\",\"fault_handling\":\"SeparateNPU\",\"fault_code\":\"A8028801,A4028801,80E18402,80E18401\",\"fault_time_and_level_map\":{\"80E18401\":{\"fault_time\":1744167455784,\"fault_level\":\"NotHandleFault\"},\"80E18402\":{\"fault_time\":1744167455784,\"fault_level\":\"SeparateNPU\"},\"A4028801\":{\"fault_time\":1744167455784,\"fault_level\":\"NotHandleFault\"},\"A8028801\":{\"fault_time\":1744167455784,\"fault_level\":\"SeparateNPU\"}}}]","huawei.com/Ascend910-NetworkUnhealthy":"Ascend910-0","huawei.com/Ascend910-Recovering":"","huawei.com/Ascend910-Unhealthy":"Ascend910-4"},"UpdateTime":1744182144},"SuperPodID":-2,"ServerIndex":-2,"CheckCode":"a550811fdfafb5717555526816af2ca4ac6c3e102f5907574048578e0c8fcc73"}
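In the output above, the huawei.com/Ascend910-Fault value is itself a JSON-encoded string nested inside the outer JSON document, so a consumer has to decode it a second time. The following Python sketch shows one way this might be done. The helper names (parse_device_info, split_names, fault_timeline) are hypothetical and not part of Ascend Device Plugin, the sample payload is abbreviated from the output above, and treating fault_time as milliseconds since the Unix epoch is an assumption based on the magnitude of the sample values.

```python
import json
from datetime import datetime, timezone

FAULT_KEY = "huawei.com/Ascend910-Fault"


def parse_device_info(raw: str) -> dict:
    """Decode the payload and expand the JSON-encoded fault list in place."""
    info = json.loads(raw)
    device_list = info["DeviceInfo"]["DeviceList"]
    # The fault array is stored as a JSON string inside the outer document.
    device_list[FAULT_KEY] = json.loads(device_list.get(FAULT_KEY) or "[]")
    return info


def split_names(value: str) -> list:
    """Split a comma-separated processor list; an empty string yields []."""
    return [name for name in value.split(",") if name]


def fault_timeline(fault: dict) -> list:
    """Return (fault_code, fault_level, time) tuples sorted by fault code."""
    rows = []
    for code, entry in fault["fault_time_and_level_map"].items():
        # Assumption: fault_time is milliseconds since the Unix epoch.
        when = datetime.fromtimestamp(entry["fault_time"] / 1000, tz=timezone.utc)
        rows.append((code, entry["fault_level"], when))
    return sorted(rows)


# Abbreviated sample in the same shape as the command output shown above.
sample_faults = [{
    "fault_type": "CardNetworkUnhealthy",
    "npu_name": "Ascend910-0",
    "fault_handling": "PreSeparateNPU",
    "fault_code": "81078603",
    "fault_time_and_level_map": {
        "81078603": {"fault_time": 1744168468259, "fault_level": "PreSeparateNPU"}
    },
}]
sample_raw = json.dumps({
    "DeviceInfo": {
        "DeviceList": {
            FAULT_KEY: json.dumps(sample_faults),
            "huawei.com/Ascend910-NetworkUnhealthy": "Ascend910-0",
            "huawei.com/Ascend910-Recovering": "",
            "huawei.com/Ascend910-Unhealthy": "Ascend910-4",
        },
        "UpdateTime": 1744182144,
    },
    "SuperPodID": -2,
    "ServerIndex": -2,
})

info = parse_device_info(sample_raw)
devices = info["DeviceInfo"]["DeviceList"]
for fault in devices[FAULT_KEY]:
    print(fault["npu_name"], fault["fault_type"], fault["fault_handling"])
    for code, level, when in fault_timeline(fault):
        print("  ", code, level, when.isoformat())
print("Recovering:", split_names(devices["huawei.com/Ascend910-Recovering"]))
```

Decoding the nested string once, up front, lets the rest of the consumer treat the fault list as ordinary structured data.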
Table 1 Parameters

Parameter

Description

huawei.com/Ascend910

Names of the available processors on the current node. Multiple processors are separated by commas (,).

NOTE:

This field is deprecated and will not be displayed in later versions. By default, Volcano maintains the available processors of a node, so this field does not take effect. To make the field take effect, set the Volcano configuration parameter self-maintain-available-card to false.

huawei.com/Ascend910-NetworkUnhealthy

Names of the processors with an unhealthy network on the current node. Multiple processors are separated by commas (,).

huawei.com/Ascend910-Unhealthy

Names of the unhealthy processors on the current node. Multiple processors are separated by commas (,).

huawei.com/Ascend910-Recovering

Names of the processors being restored on the current node. Multiple processors are separated by commas (,).

huawei.com/Ascend910-Fault

Array of fault objects. Each object contains fault_type, npu_name, large_model_fault_level, fault_level, fault_handling, fault_code, and fault_time_and_level_map.

- fault_type

Fault type.

  • CardUnhealthy: processor fault.
  • CardNetworkUnhealthy: parameter plane network fault (processor network fault).
  • NodeUnhealthy: node fault.

- npu_name

Name of the faulty processor. This parameter is left empty if the node is faulty.

- large_model_fault_level

Fault handling type. This parameter is left empty if the node is faulty.

  • NotHandleFault: requires no handling.
  • RestartRequest: re-executes inference requests in the inference scenario, or re-executes services in the training scenario.
  • RestartBusiness: re-executes services.
  • FreeRestartNPU: resets idle processors when faults affect service execution.
  • RestartNPU: directly resets processors and re-executes services.
  • SeparateNPU: isolates processors.
  • PreSeparateNPU: pre-isolates processors and determines whether to perform rescheduling based on the actual running status of the training job.
NOTE:
  • large_model_fault_level, fault_handling, and fault_level serve the same function. Using fault_handling is recommended.
  • When an inference job subscribes to fault information, a RestartRequest fault on its inference card will not trigger rescheduling if the fault lasts 60 seconds or less. If the fault persists for more than 60 seconds, the related processor is isolated and the job is rescheduled.

- fault_level

Same as large_model_fault_level.

- fault_handling

Same as large_model_fault_level. This is the recommended field.

- fault_code

Fault codes, as a string in which multiple codes are separated by commas (,). For details about processor fault codes, see Processor Fault Code Reference Documents.

- fault_time_and_level_map

Map keyed by fault code. Each entry records the fault occurrence time (fault_time) and the fault handling level (fault_level) of that code.

SuperPodID

SuperPoD ID.

ServerIndex

Relative position of the current node in a SuperPoD.

NOTE:
  • If the driver reports 0xffffffff for SuperPodID or ServerIndex, the corresponding field is set to -1.
  • The value of SuperPodID or ServerIndex is -2 in the following situations:
    • The current device does not support the query of SuperPoD information.
    • The SuperPoD information fails to be obtained due to a driver problem.

CheckCode

Verification code.
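The special values of SuperPodID and ServerIndex described in Table 1 could be interpreted by a consumer roughly as follows. This is a minimal sketch under stated assumptions: the function names are hypothetical, and normalize_driver_value only mirrors the rule stated in the NOTE rather than the plugin's actual code.

```python
DRIVER_UNSET = 0xFFFFFFFF  # raw driver value that is reported as -1


def normalize_driver_value(raw: int) -> int:
    """Hypothetical helper: apply the 0xffffffff -> -1 rule from the NOTE."""
    return -1 if raw == DRIVER_UNSET else raw


def describe_superpod_field(value: int) -> str:
    """Explain a SuperPodID or ServerIndex value read from the ConfigMap."""
    if value == -1:
        return "driver reported 0xffffffff"
    if value == -2:
        # Either the device does not support SuperPoD queries,
        # or the query failed due to a driver problem.
        return "SuperPoD information unsupported or could not be obtained"
    return f"reported value {value}"


for field, value in {"SuperPodID": -2, "ServerIndex": -2}.items():
    print(field, "->", describe_superpod_field(value))
```

Checking for -1 and -2 before using either field avoids treating a sentinel as a real SuperPoD ID or node position.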