ClusterD

Application Scenario

A node can experience multiple faults. If each node handles its faults independently, a single task may be pushed into several recovery scenarios simultaneously. ClusterD, deployed on the management node, coordinates fault handling to prevent this. It gathers data on cluster tasks, resources, faults, and the impact scope of those faults, analyzes this information along the task, processor, and fault dimensions, and defines a standardized framework of fault-handling levels and policies.
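
The coordination idea can be illustrated with a minimal sketch. Every name in it (Level, Fault, pick_policy_per_task) and the four level values are hypothetical; they are not ClusterD's actual API or its real fault levels. The sketch only assumes that each fault carries a severity level and a list of affected tasks, and that the most severe level affecting a task determines the single recovery action for that task.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Dict, Iterable, Tuple


class Level(IntEnum):
    """Hypothetical fault-handling levels, ordered from least to most severe."""
    IGNORE = 0
    RETRY_STEP = 1
    RESTART_PROCESS = 2
    RESCHEDULE_TASK = 3


@dataclass(frozen=True)
class Fault:
    """Hypothetical fault record reported for one processor on one node."""
    node: str
    processor: str
    level: Level
    tasks: Tuple[str, ...]  # names of the tasks affected by this fault


def pick_policy_per_task(faults: Iterable[Fault]) -> Dict[str, Level]:
    """Return exactly one handling level per task: the most severe one seen."""
    decision: Dict[str, Level] = {}
    for fault in faults:
        for task in fault.tasks:
            current = decision.get(task, Level.IGNORE)
            decision[task] = max(current, fault.level)
    return decision


if __name__ == "__main__":
    reported = [
        Fault("node-1", "npu-0", Level.RETRY_STEP, ("job-a", "job-b")),
        Fault("node-2", "npu-3", Level.RESCHEDULE_TASK, ("job-a",)),
    ]
    for task, level in sorted(pick_policy_per_task(reported).items()):
        print(task, level.name)
```

Deciding in one place means that job-a, which is hit by faults on two nodes, receives a single action instead of two competing ones.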

Component Function

Obtain the processor, node, and network information from Ascend Device Plugin and NodeD, and obtain public fault information from ConfigMap or gRPC.
Summarize the preceding fault information and make it available for the upper-layer cluster scheduling services to call.
Establish a connection with the training container and control the training process to perform recomputation.
Interact with out-of-band services and transmit task information.

Upstream and Downstream Dependencies

Figure 1 Upstream and downstream dependencies

Obtain the processor information from Ascend Device Plugin on each compute node.
Obtain the health status of the CPU, memory, and hard drive of each compute node, DPC shared storage fault information, and UnifiedBus network fault information from NodeD on each compute node.
Obtain public fault information from ConfigMap or gRPC.
Summarize the resource information of the entire cluster and report the information to Ascend-volcano-plugin.
Monitor cluster task information and report information such as the task status and resource usage to CCAE.
Interact with processes in the container to control the training process for recomputation.
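
The aggregation steps in this list can be sketched as follows. This is an illustration under stated assumptions, not ClusterD's implementation: the NodeSummary type, the summarize_cluster function, and the shape of the three inputs (plain dictionaries keyed by node name) are all hypothetical stand-ins for the data that Ascend Device Plugin, NodeD, and the public fault source would supply.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NodeSummary:
    """Hypothetical per-node view assembled on the management node."""
    healthy_processors: List[str] = field(default_factory=list)
    faulty_processors: List[str] = field(default_factory=list)
    node_faults: List[str] = field(default_factory=list)


def summarize_cluster(
    device_reports: Dict[str, Dict[str, bool]],  # node -> processor -> healthy? (assumed shape)
    node_reports: Dict[str, List[str]],          # node -> node-level faults (assumed shape)
    public_faults: Dict[str, List[str]],         # node -> public faults (assumed shape)
) -> Dict[str, NodeSummary]:
    """Merge the three hypothetical inputs into one summary per node."""
    summary: Dict[str, NodeSummary] = {}

    # Processor health, as a device plugin might report it.
    for node, processors in device_reports.items():
        entry = summary.setdefault(node, NodeSummary())
        for name, healthy in sorted(processors.items()):
            target = entry.healthy_processors if healthy else entry.faulty_processors
            target.append(name)

    # Node-level faults and public faults are appended to the same list.
    for source in (node_reports, public_faults):
        for node, faults in source.items():
            summary.setdefault(node, NodeSummary()).node_faults.extend(faults)

    return summary
```

A scheduler-side consumer could then read one NodeSummary per node instead of querying each source separately.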
