Component Installation Positions

Table 1 shows the installation positions of the components.

Table 1 Component installation positions

| Node | Component | Function |
| --- | --- | --- |
| Master node | HCCL-Controller | A plug-in built on the Kubernetes informer mechanism. It automatically generates the HCCL configuration file (the ranktable file, also called the hccl.json file) for NPU training jobs. |
| Master node | Volcano | Adds affinity scheduling for AI Processors on top of the cluster scheduling provided by open-source Volcano. |
| Master node | Resilience-Controller | Provides resilience control for the minimum training system. When hardware used by a training job becomes faulty, the faulty hardware is removed so that training can continue. |
| Worker node | Ascend Device Plugin | Provides the common device plugin mechanism and standard device APIs so that Kubernetes can use the devices. |
| Worker node | NPU-Exporter | Monitors Ascend AI Processors and their container-level allocation status, based on Prometheus. |
| Worker node | NodeD | Provides node monitoring functions, such as node heartbeat reporting. |
| Worker node | Ascend Docker Runtime | Makes Ascend NPUs (Ascend AI Processors) available inside containers so that AI jobs can run on Ascend devices as Docker containers. |
| Training container | Elastic-Agent | Provides functions such as the dying gasp (the last CKPT file) for resumable training, and restoration policies in data-parallel and hybrid-parallel scenarios. To use the dying gasp function, install this component in the training container. |
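
When planning or auditing a deployment, it can help to hold the placements from Table 1 as data. The following is a minimal sketch, not part of any component's API: the dictionary name `INSTALL_POSITIONS` and the helper functions are hypothetical, and only the position and component names come from the table.

```python
# Placement data transcribed from Table 1.
# INSTALL_POSITIONS and the helpers below are illustrative names (assumptions),
# not identifiers defined by any of the listed components.
INSTALL_POSITIONS = {
    "Master node": ["HCCL-Controller", "Volcano", "Resilience-Controller"],
    "Worker node": [
        "Ascend Device Plugin",
        "NPU-Exporter",
        "NodeD",
        "Ascend Docker Runtime",
    ],
    "Training container": ["Elastic-Agent"],
}


def components_for(position):
    """Return a copy of the components Table 1 lists for a position."""
    return list(INSTALL_POSITIONS.get(position, []))


def position_of(component):
    """Return the position where a component is installed, or None if unknown."""
    for position, components in INSTALL_POSITIONS.items():
        if component in components:
            return position
    return None


def missing_components(position, installed):
    """Return components expected at a position that are not in `installed`."""
    present = set(installed)
    return [c for c in components_for(position) if c not in present]


if __name__ == "__main__":
    # Example: a worker node on which only two components have been installed.
    print(missing_components("Worker node", ["NodeD", "NPU-Exporter"]))
    print(position_of("Elastic-Agent"))
```

The list of installed components passed to `missing_components` would have to come from your own inventory; the sketch does not query a cluster.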