Node Resources

mindx-dl-nodeinfo-<nodename>

When a fault occurs on a node, NodeD creates the node-info-cm ConfigMap, named mindx-dl-nodeinfo-<nodename>, to report the fault.

Table 1 mindx-dl-nodeinfo-<nodename>

Parameter

Description

NodeInfo

Node fault information

FaultDevList

List of faulty devices on a node

- DeviceType

Faulty device type

- DeviceId

ID of the faulty device

- FaultCode

Fault code, a hexadecimal string consisting of letters and digits.

- FaultLevel

Fault handling level.

  • NotHandleFault: No handling is required.
  • PreSeparateFault: A job already running on the node is not interrupted, but no new job is scheduled to the node.
  • SeparateFault: Jobs on the node are rescheduled.

NodeStatus

Node health status, which is determined by the device with the highest fault handling level on the node.

  • Healthy: The fault handling level on the node does not exceed NotHandleFault. The node is healthy and training jobs can run on it normally.
  • PreSeparate: The highest fault handling level on the node is PreSeparateFault. The node is pre-isolated and may not affect running jobs for the time being. Once the tasks on the node are interrupted, the system does not schedule other jobs to the node.
  • UnHealthy: The fault handling level on the node is SeparateFault. The node is faulty and affects training. Jobs should be migrated off the node immediately.
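
The relationship between FaultLevel and NodeStatus described in Table 1 can be sketched in a few lines of Python. This is only an illustration of the rule "the highest fault handling level on the node decides the node status", not NodeD's actual implementation. The function name derive_node_status, the sample device types, IDs, and fault codes are all hypothetical, and treating an empty FaultDevList as Healthy is an assumption.

```python
# Fault handling levels ordered from least to most severe (names from Table 1).
LEVEL_ORDER = ["NotHandleFault", "PreSeparateFault", "SeparateFault"]

# NodeStatus that corresponds to the highest fault handling level on the node.
STATUS_BY_LEVEL = {
    "NotHandleFault": "Healthy",
    "PreSeparateFault": "PreSeparate",
    "SeparateFault": "UnHealthy",
}


def derive_node_status(fault_dev_list):
    """Return the NodeStatus implied by the highest FaultLevel in the list.

    Hypothetical helper: an empty list is assumed to mean a healthy node.
    An unknown FaultLevel raises ValueError.
    """
    highest = 0  # index of NotHandleFault
    for dev in fault_dev_list:
        highest = max(highest, LEVEL_ORDER.index(dev["FaultLevel"]))
    return STATUS_BY_LEVEL[LEVEL_ORDER[highest]]


# Invented sample entries; only the field names come from Table 1.
sample_fault_dev_list = [
    {"DeviceType": "CPU", "DeviceId": 0,
     "FaultCode": "0100001D", "FaultLevel": "NotHandleFault"},
    {"DeviceType": "Memory", "DeviceId": 1,
     "FaultCode": "0300000A", "FaultLevel": "PreSeparateFault"},
]

print(derive_node_status(sample_fault_dev_list))  # PreSeparate
```

With the sample list, the most severe entry is PreSeparateFault, so the node is reported as PreSeparate. Adding a single device at SeparateFault would change the result to UnHealthy regardless of the other entries.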