Cluster Scheduling Scenario (Full-Stack)
Scenario
This scenario applies when you have one or more NPU servers and want to manage them with Kubernetes. It also lets you use features such as NPU Device Management, NPU Scheduling Optimization, Resumable Training, and Rescheduling Upon Inference Card Faults. In this scenario, Docker, Kubernetes, and the NPU cluster scheduling components are installed on the NPU server.
List of Components to Be Installed
| Component | Description |
|---|---|
| Docker | Container engine |
| K8s | Container orchestration system |
| Ascend Docker Runtime | Allows containers to use Ascend NPUs. |
| Ascend Device Plugin | Supports NPU device management. |
| Volcano | Supports NPU scheduling optimization, resumable training, and rescheduling upon inference card faults. |
| HCCL-Controller | Generates the ranktable file (also called the hccl.json file) for NPU training jobs. |
| NodeD | Supports resumable training (upon node faults). |
| NPU-Exporter | Supports monitoring of NPU device management status. |
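
To see at a glance which components relate to a given feature, the component table above can be transcribed into a small lookup. The following is a minimal sketch only: the names `COMPONENTS` and `components_mentioning` are hypothetical helpers written for illustration and are not part of any Ascend or Kubernetes tool. It restates the table's descriptions and does not imply that any component can be left out of a full-stack installation.

```python
# Component descriptions copied from the table above.
# COMPONENTS and components_mentioning are illustrative names, not an official API.
COMPONENTS = {
    "Docker": "Container engine",
    "K8s": "Container orchestration system",
    "Ascend Docker Runtime": "Allows containers to use Ascend NPUs.",
    "Ascend Device Plugin": "Supports NPU device management.",
    "Volcano": (
        "Supports NPU scheduling optimization, resumable training, "
        "and rescheduling upon inference card faults."
    ),
    "HCCL-Controller": (
        "Generates the ranktable file (also called the hccl.json file) "
        "for NPU training jobs."
    ),
    "NodeD": "Supports resumable training (upon node faults).",
    "NPU-Exporter": "Supports monitoring of NPU device management status.",
}


def components_mentioning(keyword):
    """Return component names whose description contains the keyword (case-insensitive)."""
    needle = keyword.lower()
    return [name for name, desc in COMPONENTS.items() if needle in desc.lower()]


if __name__ == "__main__":
    # List the components whose description mentions each feature.
    for feature in ("device management", "resumable training", "inference card faults"):
        print(feature, "->", components_mentioning(feature))
```

The lookup is a plain substring match on the descriptions, so it only reflects the wording of the table; it is a reading aid, not a dependency resolver.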