Cluster Scheduling Scenario

Scenario

In this scenario, you already have a Kubernetes cluster and need to manage new NPU servers. In addition, features such as NPU Device Management, NPU Scheduling Optimization, Resumable Training, and Rescheduling Upon Inference Card Faults can be used. You need to deploy the NPU management component on the master node of the existing Kubernetes cluster and deploy the NPU management component on the worker node on the newly managed NPU node.

List of Components to Be Installed

**Table 1** Components to be installed in the cluster scheduling scenario
Component	Description
Ascend Docker Runtime	Allows containers to use Ascend NPUs.
Ascend Device Plugin	Supports NPU device management.
Volcano	Supports NPU scheduling optimization, resumable training, and rescheduling upon inference card faults.
(Optional) HCCL-Controller	Generates the ranktable file (also called the hccl.json file) for NPU training jobs. Install it when you need to use the function.
(Optional) NodeD	Supports resumable training upon node faults. Install it when you need to use the function.
(Optional) NPU-Exporter	Supports monitoring of NPU device management status. Install it when you need to use the function.

Parent topic: Supported Installation Scenarios