Fault Event Information

The fault events collected by Ascend Device Plugin can be reported through Kubernetes events. The query command is kubectl get events -n kube-system. The following uses Atlas training product as an example to illustrate the query result. For details about the parameters, see Table 1.
NAMESPACE     LAST SEEN   TYPE      REASON     OBJECT                                         MESSAGE
kube-system   8s          Warning   Occur      pod/ascend-device-plugin-daemonset-910-dlpmv   device fault, nodeName:k8smaster, assertion:Occur, cardID:2, deviceID:0, faultCodes:8C084E00, faultLevelName:RestartBusiness, alarmRaisedTime:2023-11-21 05:36:53
Table 1 Parameters

Parameter

Description

NAMESPACE

Name of the namespace. The value is kube-system.

LAST SEEN

Event occurrence time.

TYPE

Event type. The value can be Normal or Warning.

REASON

Event cause. The options are as follows:

  • Occur: fault occurred
  • Recovery: fault recovery
  • Notice: notification

OBJECT

Event object. The format is pod/Pod name of Ascend Device Plugin, for example, pod/ascend-device-plugin-daemonset-910-dlpmv.

MESSAGE

Event description. The fields in the event content are described as follows:

  • nodeName: node name
  • assertion: information type
    • Occur: fault occurred
    • Recovery: fault recovery
    • Notice: notification
  • cardID: NPU module ID (NPU ID)
  • deviceID: device ID
  • faultCodes: fault code, for example, 8C084E00.
  • faultLevelName: fault level name
    • NotHandleFault: requires no handling.
    • RestartRequest: re-executes service requests when faults affect service execution.
    • RestartBusiness: re-executes services when faults affect service execution.
    • FreeRestartNPU: resets idle processors when faults affect service execution.
    • RestartNPU: directly resets processors and re-executes services.
    • SeparateNPU: isolates processors.
    • PreSeparateNPU: services are not affected temporarily. Jobs will not be scheduled to the processor.
    • SubHealthFault: see the value of subHealthyStrategy in the YAML file.
  • alarmRaisedTime: fault occurrence time