Component Installation Positions
Table 1 shows the installation positions of the components.
Node |
Component |
Function |
|---|---|---|
Master node |
HCCL-Controller |
A plug-in developed based on the Kubernetes informer mechanism. It automatically generates the HCCL configuration file (ranktable file, also called the hccl.json file) for NPU training jobs. |
Volcano |
Enhances the affinity scheduling function of AI Processors based on the cluster scheduling function of the open source Volcano. |
|
Resilience-Controller |
Provides resilience control for the minimum training system. When the hardware used by a training job is faulty, the hardware is removed to continue the training. |
|
Worker node |
Ascend Device Plugin |
Provides the common device plugin mechanism and standard device APIs for Kubernetes to use devices. |
NPU-Exporter |
Monitors Ascend AI Processors and container-level allocation status based on Prometheus. |
|
NodeD |
Provides node monitoring functions, such as node heartbeat reporting. |
|
Ascend Docker Runtime |
Provides container-based Ascend NPUs (Ascend AI Processors) for all AI jobs so that AI jobs can smoothly run on Ascend devices as Docker containers. |
|
Training container |
Elastic-Agent |
Provides functions such as the dying gasp (the last CKPT file) for resumable training and restoration policy in data parallel and hybrid parallel scenarios. To use the dying gasp function, you need to install this component in the training container. |