Instructions
This section describes how to use features of cluster scheduling components, including the scenario description, feature description, relationship between components and features, and the list of products supported by the Volcano scheduler and other schedulers.
Volcano and other schedulers cannot manage the same node resources.
Scenario Description
Training scenarios support the following features: resource monitoring, full NPU scheduling, static vNPU scheduling, resumable training, and elastic training.
Inference scenarios support the following features: resource monitoring, full NPU scheduling, static vNPU scheduling, dynamic vNPU scheduling, recovery of inference card faults, and rescheduling upon inference card faults.
In a cluster, training jobs and inference jobs may coexist. However, features exclusive to either training (resumable training and elastic training) or inference (dynamic vNPU scheduling, recovery of inference card faults, and rescheduling upon inference card faults) cannot be used concurrently in the same job.
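The exclusivity rule above can be sketched as a small check. This is a minimal illustration, not part of the cluster scheduling components: the function name `conflicting_features` and the lowercase feature strings are assumptions, and the grouping follows only the paragraph above, not product-specific exceptions noted elsewhere in this section.

```python
# Feature groups as listed in the paragraph above (names are illustrative strings).
TRAINING_ONLY = {"resumable training", "elastic training"}
INFERENCE_ONLY = {
    "dynamic vNPU scheduling",
    "recovery of inference card faults",
    "rescheduling upon inference card faults",
}


def conflicting_features(requested):
    """Return (training_only, inference_only) if one job mixes both groups, else None."""
    requested = set(requested)
    training = requested & TRAINING_ONLY
    inference = requested & INFERENCE_ONLY
    if training and inference:
        return training, inference
    return None


# A job that asks for features from both groups is flagged.
print(conflicting_features(["elastic training", "dynamic vNPU scheduling"]))
```

Shared features such as full NPU scheduling or resource monitoring never trigger the check, because they belong to neither group.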
Using the Volcano Scheduler
Table 1 describes the mapping between features supported by the cluster scheduling components and products. √ indicates that this feature can be used for training or inference jobs. × indicates that this feature is not supported.
| Feature | Training Job | Training Job | Inference Job | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Product portfolio | Atlas training product | Inference server (equipped with Atlas 300I inference cards) | Atlas inference product | Atlas 800I A2 inference server | A200I A2 Box heterogeneous component | Atlas 800I A3 SuperPoD Server | | | | |
| Containerization | √ | √ | √ | √ | √ | √ | √ | √ | √ | √ |
| Resource monitoring | √ | √ | √ | √ | × | × | √ | √ | √ | √ |
| Full NPU scheduling | √ | √ | √ | √ | × | × | √ | √ | √ | √ |
| Static vNPU scheduling | √ | × | × | × | × | × | √ | × | × | × |
| Dynamic vNPU scheduling | × | × | × | × | × | × | √ | × | × | × |
| Resumable training | √ | √ | √ | × | × | × | × | × | × | |
| Elastic training | √ | × | × | × | × | × | × | × | × | × |
| Recovery of inference card faults | × | × | × | √ | × | × | √ | √ | √ | √ |
| Rescheduling upon inference card faults | × | × | × | √ | × | × | √ | √ | √ | √ |
- 1: Currently, this feature can be used only for MindIE Motor inference jobs.
- The Atlas 200I SoC A1 core board does not support dynamic vNPU scheduling.
- Currently, among Atlas A3 training products, only the Atlas 900 A3 SuperPoD and the Atlas 800T A3 SuperPoD Server support full NPU scheduling and resumable training.
| Component Installation Position | Component Name | Full NPU Scheduling or Static vNPU Scheduling (Training) | Full NPU Scheduling or Static vNPU Scheduling (Inference) | Containerization (Training and Inference) | Resource Monitoring (Training and Inference) | Resumable Training (Training) | Elastic Training (Training) | Resumable Training (Inference) | Dynamic vNPU Scheduling (Inference) | Recovery of Inference Card Faults (Inference) | Rescheduling Upon Inference Card Faults (Inference) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Management node | Volcano | √ | √ | × | × | √ | √ | √ | √ | √ | √ |
| Management node | Resilience Controller | × | × | × | × | × | √ | × | × | × | × |
| Management node | Ascend Operator | √ | √ | × | × | √ | √ | √ | × | × | √ |
| Management node | ClusterD | √ | √ | × | × | √ | √ | √ | √ | √ | √ |
| Compute node | Ascend Device Plugin | √ | √ | × | × | √ | √ | √ | √ | √ | √ |
| Compute node | Ascend Docker Runtime | √ | √ | √ | × | √ | √ | × | √ | √ | √ |
| Compute node | NodeD | √ | √ | × | × | √ | √ | √ | √ | √ | √ |
| Compute node | NPU Exporter | × | × | × | √ | × | × | × | × | × | × |
| Training container | Elastic Agent | × | × | × | × | √ | × | × | × | × | × |
| Training container | TaskD | × | × | × | × | √ | × | × | × | × | × |
In the preceding table, resumable training can be used only for MindIE Motor inference jobs in the inference scenario.
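The component table above can be read as a lookup from a feature and job type to the components marked √ for it. The sketch below transcribes the table into a Python dictionary for illustration; the `Y`/`N` string encoding, the lowercase feature keys, and the function name `components_marked_for` are assumptions made for this example and are not an interface of any component.

```python
# Column order follows the component table above: (feature, job type).
COLUMNS = [
    ("full NPU or static vNPU scheduling", "training"),
    ("full NPU or static vNPU scheduling", "inference"),
    ("containerization", "training and inference"),
    ("resource monitoring", "training and inference"),
    ("resumable training", "training"),
    ("elastic training", "training"),
    ("resumable training", "inference"),
    ("dynamic vNPU scheduling", "inference"),
    ("recovery of inference card faults", "inference"),
    ("rescheduling upon inference card faults", "inference"),
]

# One character per column: Y for a check mark, N for a cross (illustrative encoding).
MARKS = {
    "Volcano": "YYNNYYYYYY",
    "Resilience Controller": "NNNNNYNNNN",
    "Ascend Operator": "YYNNYYYNNY",
    "ClusterD": "YYNNYYYYYY",
    "Ascend Device Plugin": "YYNNYYYYYY",
    "Ascend Docker Runtime": "YYYNYYNYYY",
    "NodeD": "YYNNYYYYYY",
    "NPU Exporter": "NNNYNNNNNN",
    "Elastic Agent": "NNNNYNNNNN",
    "TaskD": "NNNNYNNNNN",
}


def components_marked_for(feature, job_type):
    """List the components whose table cell is a check mark for this column."""
    col = COLUMNS.index((feature, job_type))  # raises ValueError for unknown columns
    return [name for name, marks in MARKS.items() if marks[col] == "Y"]


print(components_marked_for("resource monitoring", "training and inference"))
```

Keeping the marks as one string per component makes each entry easy to compare against the corresponding table row by eye.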
Using Other Schedulers
If Volcano is not used as the scheduler, only containerization, resource monitoring, full NPU scheduling, static vNPU scheduling, and recovery of inference card faults are supported, as shown in Table 3. √ indicates that this feature can be used for training or inference jobs. × indicates that this feature is not supported.
| Feature | Training Job | Training Job | Inference Job | | | | | | | |
|---|---|---|---|---|---|---|---|---|---|---|
| Product portfolio | Atlas training product | Inference server (equipped with Atlas 300I inference cards) | Atlas inference product | Atlas 800I A2 inference server | A200I A2 Box heterogeneous component | Atlas 800I A3 SuperPoD Server | | | | |
| Containerization | √ | √ | √ | √ | √ | √ | √ | √ | √ | √ |
| Resource monitoring | √ | √ | √ | √ | × | × | √ | √ | √ | √ |
| Full NPU scheduling | √ | √ | √ | √ | × | × | √ | √ | √ | √ |
| Static vNPU scheduling | √ | × | × | × | × | × | √ | √ | √ | × |
| Recovery of inference card faults | × | × | × | √ | × | × | √ | √ | √ | √ |
- The Atlas 200I SoC A1 core board does not support dynamic vNPU scheduling.
- Currently, among Atlas A3 training products, only the Atlas 900 A3 SuperPoD and the Atlas 800T A3 SuperPoD Server support full NPU scheduling and resumable training.
| Component Installation Position | Component Name | Full NPU Scheduling or Static vNPU Scheduling (Training) | Full NPU Scheduling or Static vNPU Scheduling (Inference) | Containerization (Training and Inference) | Resource Monitoring (Training and Inference) | Recovery of Inference Card Faults (Inference) |
|---|---|---|---|---|---|---|
| Management node | Resilience Controller | × | × | × | × | × |
| Management node | Ascend Operator | √ | √ | × | × | × |
| Management node | ClusterD | √ | √ | × | × | √ |
| Compute node | Ascend Device Plugin | √ | √ | × | × | √ |
| Compute node | Ascend Docker Runtime | √ | √ | √ | × | √ |
| Compute node | NodeD | √ | √ | × | × | √ |
| Compute node | NPU Exporter | × | × | × | √ | × |
| Training container | Elastic Agent | × | × | × | × | × |
| Training container | TaskD | × | × | × | × | × |
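The feature restriction for schedulers other than Volcano, stated at the start of this subsection, can be expressed as a simple set difference. The following is a minimal sketch; the function name `unsupported_without_volcano` and the lowercase feature strings are assumptions for illustration only.

```python
# Features that this subsection lists as supported when Volcano is not the scheduler.
NON_VOLCANO_FEATURES = {
    "containerization",
    "resource monitoring",
    "full NPU scheduling",
    "static vNPU scheduling",
    "recovery of inference card faults",
}


def unsupported_without_volcano(requested):
    """Return the requested features that are not in the non-Volcano list, sorted."""
    return sorted(set(requested) - NON_VOLCANO_FEATURES)


# Elastic training is not in the list above, so it is reported.
print(unsupported_without_volcano(["elastic training", "resource monitoring"]))
```

An empty result means every requested feature appears in the non-Volcano list; per-product support still has to be checked against the product table.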