NodeD

NodeD collects node hardware fault and health information and stores it in a Kubernetes ConfigMap so that external components can query and use it.

The query command is kubectl describe cm mindx-dl-nodeinfo-<nodename> -n mindx-dl. The command output is as follows. For details about key parameters, see Table 1.
Name:         mindx-dl-nodeinfo-<nodename>
Namespace:    mindx-dl
Labels:       <none>
Annotations:  <none>

Data
====
NodeInfo:
----
{"NodeInfo":{"FaultDevList":[{"DeviceType":"CPU","DeviceId":1,"FaultCode":["00000011"],"FaultLevel":"SeparateFault"}],"NodeStatus":"UnHealthy"},"CheckCode":"3a2934c3cb875f2256c770c75a6fdf24594fcf64481ac6cd0d0f74b8fea88855"}
Events:  <none>
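The NodeInfo value in the output above is a single JSON string, so it can be read programmatically. The following is a minimal sketch in Python, assuming the payload has exactly the shape shown in the sample output. The helper name summarize_node_info is hypothetical and is not part of NodeD.

```python
import json

# Sample value of the "NodeInfo" key, copied from the command output above.
raw = (
    '{"NodeInfo":{"FaultDevList":[{"DeviceType":"CPU","DeviceId":1,'
    '"FaultCode":["00000011"],"FaultLevel":"SeparateFault"}],'
    '"NodeStatus":"UnHealthy"},'
    '"CheckCode":"3a2934c3cb875f2256c770c75a6fdf24594fcf64481ac6cd0d0f74b8fea88855"}'
)


def summarize_node_info(payload: str) -> dict:
    """Hypothetical helper: extract the node status and per-device faults."""
    doc = json.loads(payload)
    info = doc["NodeInfo"]
    # Assumption: FaultDevList may be missing or null when no device is faulty.
    faults = [
        (dev["DeviceType"], dev["DeviceId"], dev["FaultLevel"], list(dev["FaultCode"]))
        for dev in info.get("FaultDevList") or []
    ]
    return {"status": info["NodeStatus"], "faults": faults}


summary = summarize_node_info(raw)
print(summary["status"])
for device_type, device_id, level, codes in summary["faults"]:
    print(f"{device_type} {device_id}: {level} {codes}")
```

In practice the string would come from the ConfigMap's data field rather than a literal. The sketch does not check CheckCode, because the way it is computed is not described here.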
Table 1 Parameters in the command output

Parameter

Description

NodeInfo

Node fault information

FaultDevList

List of faulty devices on a node

- DeviceType

Faulty device type

- DeviceId

ID of the faulty device

- FaultCode

Fault code, a hexadecimal string consisting of letters and digits.

- FaultLevel

Fault handling level

  • NotHandleFault: No handling is required.
  • PreSeparateFault: If a task is running on the node, the fault is not handled, and no further tasks are scheduled to the node.
  • SeparateFault: The job is rescheduled.

NodeStatus

Node health status, which is determined by the device with the highest fault handling level on the node.

  • Healthy: The fault handling level on the node does not exceed NotHandleFault. The node is healthy, and training jobs can run on it normally.
  • PreSeparate: The fault handling level on the node does not exceed PreSeparateFault. The node is pre-isolated. Running tasks may be unaffected for the time being, but once they are interrupted, the system does not schedule other tasks to the node.
  • UnHealthy: The fault handling level on the node is SeparateFault. The node is faulty and will affect the training job. Migrate tasks off the node immediately.

CheckCode

Verification code.
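The relationship between FaultLevel and NodeStatus described in Table 1 can be sketched as follows. This is an illustration only, not NodeD's implementation. The helper name derive_node_status is hypothetical, and the sketch assumes that a node with no faulty devices is reported as Healthy.

```python
# Fault handling levels ordered from least to most severe, per Table 1.
LEVEL_ORDER = ["NotHandleFault", "PreSeparateFault", "SeparateFault"]

# NodeStatus that corresponds to the highest fault handling level on the node.
STATUS_BY_LEVEL = {
    "NotHandleFault": "Healthy",
    "PreSeparateFault": "PreSeparate",
    "SeparateFault": "UnHealthy",
}


def derive_node_status(fault_dev_list):
    """Hypothetical helper: map the highest FaultLevel to a NodeStatus."""
    highest = "NotHandleFault"  # assumption: no faulty devices means Healthy
    for dev in fault_dev_list:
        level = dev["FaultLevel"]
        # list.index raises ValueError for a level that Table 1 does not list.
        if LEVEL_ORDER.index(level) > LEVEL_ORDER.index(highest):
            highest = level
    return STATUS_BY_LEVEL[highest]


sample = [
    {"DeviceType": "CPU", "DeviceId": 1, "FaultCode": ["00000011"],
     "FaultLevel": "SeparateFault"},
]
print(derive_node_status(sample))
```

For the sample entry taken from the command output, the function returns UnHealthy, which matches the NodeStatus shown there.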