NodeD

Application Scenario

If the CPU, memory, or hard drive of a node is faulty, training jobs running on that node fail. NodeD detects these node exceptions so that affected training jobs can exit quickly and new jobs are not scheduled to the faulty node.

Component Function

  • Obtain node exceptions from IPMI and report the exceptions to the upper-layer service for resource scheduling.
  • Periodically send node status information to the upper-layer service for resource scheduling.
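The two functions above can be pictured as a single loop that collects faults and then sends a status message at a fixed interval. The following is only a minimal sketch under stated assumptions, not NodeD's actual implementation: `NodeStatus`, `run_reporter`, `collect_faults`, and `send` are hypothetical names, the fault source and the upper-layer service are passed in as plain callables, and the reporting interval is an arbitrary parameter.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class NodeStatus:
    """Hypothetical status message sent to the upper-layer service."""
    node_name: str
    healthy: bool
    faults: List[str] = field(default_factory=list)
    timestamp: float = 0.0


def run_reporter(
    node_name: str,
    collect_faults: Callable[[], Iterable[str]],  # stand-in for the IPMI query
    send: Callable[[NodeStatus], None],           # stand-in for the upper-layer service
    interval_s: float,
    rounds: int,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.time,
) -> None:
    """Collect faults and send one status message per round."""
    for i in range(rounds):
        faults = list(collect_faults())
        status = NodeStatus(
            node_name=node_name,
            healthy=not faults,  # the node counts as healthy only if no fault is found
            faults=faults,
            timestamp=clock(),
        )
        send(status)
        # Wait between rounds, but not after the last one.
        if i < rounds - 1:
            sleep(interval_s)
```

Passing `sleep` and `clock` as parameters is a design choice for this sketch: it lets the loop be exercised without real delays, for example `run_reporter("node-1", lambda: [], print, 5.0, 1)`.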

Upstream and Downstream Dependencies

Figure 1 Upstream and downstream dependencies
  1. Obtain the CPU, memory, and hard drive fault information of compute nodes from IPMI.
  2. Report the CPU, memory, and hard drive fault information of compute nodes to ClusterD.
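The two steps above can be sketched as a function that groups fault records by component and serializes them into one report. This is an illustration only: the record fields (`component`, `description`), the component keys, the JSON layout, and the name `build_fault_report` are assumptions made for this example. They do not describe the real IPMI output format or the real interface of ClusterD.

```python
import json
from typing import Dict, Iterable, List

# Components named in the steps above; the key spellings are assumptions.
COMPONENTS = ("cpu", "memory", "hard_drive")


def build_fault_report(node_name: str, records: Iterable[Dict[str, str]]) -> str:
    """Group hypothetical fault records by component and return a JSON report."""
    faults: Dict[str, List[str]] = {component: [] for component in COMPONENTS}
    for record in records:
        component = record.get("component")
        # Records for other components are ignored in this sketch.
        if component in faults:
            faults[component].append(record.get("description", ""))
    report = {
        "node": node_name,
        # Only components that actually have faults appear in the report.
        "faults": {c: d for c, d in faults.items() if d},
    }
    return json.dumps(report, sort_keys=True)
```

In this sketch, a node with no faults in the three components produces a report whose `faults` object is empty, which a receiver could treat as a healthy node.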