NodeD
Application Scenario
A training job fails if the CPU, memory, or hard drive of its node is faulty. NodeD detects node exceptions so that training jobs can exit quickly when a node is faulty and new jobs are not scheduled to the faulty node.
Component Function
- Obtain node exceptions from IPMI and report the exceptions to the upper-layer service for resource scheduling.
- Periodically send node status information to the upper-layer service for resource scheduling.
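The periodic reporting described above can be pictured as a simple collect-and-report loop. The following is a minimal sketch only, not NodeD's actual implementation: `NodeStatus`, `run_reporter`, and the `collect` and `report` callables are hypothetical names introduced for illustration, and the real message format and reporting interval are not specified in this document.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class NodeStatus:
    """Hypothetical status record; the real NodeD message format is not shown here."""
    node_name: str
    healthy: bool
    faults: List[str] = field(default_factory=list)


def run_reporter(
    collect: Callable[[], NodeStatus],
    report: Callable[[NodeStatus], None],
    interval_seconds: float,
    max_cycles: int,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Collect and report node status once per interval; return the cycles completed."""
    cycles = 0
    while cycles < max_cycles:
        # One cycle: read the current status, then hand it to the upper-layer service.
        report(collect())
        cycles += 1
        if cycles < max_cycles:
            sleep(interval_seconds)
    return cycles


if __name__ == "__main__":
    sent: List[NodeStatus] = []
    run_reporter(
        collect=lambda: NodeStatus("node-1", healthy=True),  # stand-in for an IPMI query
        report=sent.append,                                   # stand-in for the upstream call
        interval_seconds=0.0,
        max_cycles=3,
    )
    print(len(sent))
```

Passing `collect`, `report`, and `sleep` as parameters keeps the loop independent of any particular IPMI library or transport, so the sketch can be exercised without hardware.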
Upstream and Downstream Dependencies
Figure 1 Upstream and downstream dependencies


- Obtain the CPU, memory, and hard drive fault information of compute nodes from IPMI.
- Report the CPU, memory, and hard drive fault information of compute nodes to ClusterD.
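As a rough illustration of this data flow, the sketch below groups hardware events by component and serializes them into a report. Everything in it is an assumption made for the example: the event dictionaries, the keyword mapping, and the JSON layout are hypothetical, real IPMI sensor names vary by vendor, and the actual interface between NodeD and ClusterD is not described here.

```python
import json
from typing import Dict, List

# Hypothetical keyword-to-component mapping; real IPMI sensor names vary by vendor.
COMPONENT_KEYWORDS = {
    "cpu": ("processor", "cpu"),
    "memory": ("memory", "dimm"),
    "hard_drive": ("drive", "disk"),
}


def classify(sensor_name: str) -> str:
    """Map a sensor name to 'cpu', 'memory', 'hard_drive', or 'other'."""
    lowered = sensor_name.lower()
    for component, keywords in COMPONENT_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return component
    return "other"


def build_fault_report(node_name: str, events: List[Dict[str, str]]) -> str:
    """Group CPU, memory, and hard drive events into a JSON report (assumed layout)."""
    faults: Dict[str, List[str]] = {"cpu": [], "memory": [], "hard_drive": []}
    for event in events:
        component = classify(event["sensor"])
        if component in faults:  # events for other components are ignored
            faults[component].append(event["description"])
    return json.dumps({"node": node_name, "faults": faults}, sort_keys=True)


if __name__ == "__main__":
    # Made-up sample events standing in for data read from IPMI.
    sample_events = [
        {"sensor": "CPU1 Status", "description": "IERR"},
        {"sensor": "DIMM A1", "description": "Uncorrectable ECC"},
        {"sensor": "Fan 3", "description": "Lower critical"},
    ]
    print(build_fault_report("node-1", sample_events))
```

Only the three component types named in this section are reported; the fan event in the sample input is dropped.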