Cluster Scheduling Scenario (Full-Stack)

This applies to the scenario where you have one or more NPU servers and need to use Kubernetes for management. In addition, features such as NPU Device Management, NPU Scheduling Optimization, Resumable Training, and Rescheduling Upon Inference Card Faults can be used. In this scenario, Docker, Kubernetes, and NPU cluster scheduling components are installed on the NPU server.

**Table 1** Components to be installed in the cluster scheduling scenario (full-stack)
Component	Description
Docker	Container engine
K8s	Container orchestration system
Ascend Docker Runtime	Allows containers to use Ascend NPUs.
Ascend Device Plugin	Supports NPU device management.
Volcano	Supports NPU scheduling optimization, resumable training, and rescheduling upon inference card faults.
HCCL-Controller	Generates the ranktable file (also called the hccl.json file) for NPU training jobs.
NodeD	Supports resumable training (upon node faults).
NPU-Exporter	Supports monitoring of NPU device management status.

Parent topic: Supported Installation Scenarios