Cluster Scheduling Scenario
Scenario
In this scenario, you already have a Kubernetes cluster and need to manage new NPU servers. In addition, features such as NPU Device Management, NPU Scheduling Optimization, Resumable Training, and Rescheduling Upon Inference Card Faults can be used. You need to deploy the NPU management component on the master node of the existing Kubernetes cluster and deploy the NPU management component on the worker node on the newly managed NPU node.
List of Components to Be Installed
Component |
Description |
|---|---|
Ascend Docker Runtime |
Allows containers to use Ascend NPUs. |
Ascend Device Plugin |
Supports NPU device management. |
Volcano |
Supports NPU scheduling optimization, resumable training, and rescheduling upon inference card faults. |
(Optional) HCCL-Controller |
Generates the ranktable file (also called the hccl.json file) for NPU training jobs. Install it when you need to use the function. |
(Optional) NodeD |
Supports resumable training upon node faults. Install it when you need to use the function. |
(Optional) NPU-Exporter |
Supports monitoring of NPU device management status. Install it when you need to use the function. |