Instructions

This section describes how to use features of cluster scheduling components, including the scenario description, feature description, relationship between components and features, and the list of products supported by the Volcano scheduler and other schedulers.

Volcano and other schedulers cannot manage resources on the same node.

Scenario Description

Training scenarios support the following features: resource monitoring, full NPU scheduling, static vNPU scheduling, resumable training, and elastic training.

Inference scenarios support the following features: resource monitoring, full NPU scheduling, static vNPU scheduling, dynamic vNPU scheduling, recovery of inference card faults, and rescheduling upon inference card faults.

In a cluster, training jobs and inference jobs may coexist. However, features exclusive to either training (resumable training and elastic training) or inference (dynamic vNPU scheduling, recovery of inference card faults, and rescheduling upon inference card faults) cannot be used concurrently in the same job.
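The constraint above can be sketched as a small pre-submission check. This is a minimal illustration only: `check_job_features` is a hypothetical helper, not part of any cluster scheduling component, and the feature names are simply the strings used in this section. It follows the rule exactly as stated here and does not model the MindIE Motor case noted under Table 1.

```python
# Hypothetical helper: feature names are copied from the text above,
# the function is not part of any cluster scheduling component.
TRAINING_ONLY = {"resumable training", "elastic training"}
INFERENCE_ONLY = {
    "dynamic vNPU scheduling",
    "recovery of inference card faults",
    "rescheduling upon inference card faults",
}


def check_job_features(features):
    """Return a list of problems; an empty list means the mix is allowed."""
    requested = set(features)
    training = sorted(requested & TRAINING_ONLY)
    inference = sorted(requested & INFERENCE_ONLY)
    if training and inference:
        return [
            f"training-only features {training} cannot be combined "
            f"with inference-only features {inference} in the same job"
        ]
    return []


if __name__ == "__main__":
    print(check_job_features(["resource monitoring", "elastic training"]))
    print(check_job_features(["resumable training", "dynamic vNPU scheduling"]))
```

Features shared by both scenarios (resource monitoring, full NPU scheduling, static vNPU scheduling) pass the check in either kind of job.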

Using the Volcano Scheduler

Table 1 maps the features supported by the cluster scheduling components to the products that support them. √ indicates that the feature can be used for training or inference jobs on that product. × indicates that the feature is not supported.

**Table 1** Supported product models
Feature	Training Job	Training Job	Training Job	Inference Job	Inference Job	Inference Job	Inference Job	Inference Job	Inference Job	Inference Job
Product portfolio	Atlas training product	Atlas A2 training product	Atlas A3 training product	Inference server (equipped with Atlas 300I inference cards)	Atlas 200/300/500 inference product	Atlas 200I/500 A2 inference product	Atlas inference product	Atlas 800I A2 inference server	A200I A2 Box heterogeneous component	Atlas 800I A3 SuperPoD Server
Containerization	√	√	√	√	√	√	√	√	√	√
Resource monitoring	√	√	√	√	×	×	√	√	√	√
Full NPU scheduling	√	√	√	√	×	×	√	√	√	√
Static vNPU scheduling	√	×	×	×	×	×	√	×	×	×
Dynamic vNPU scheduling	×	×	×	×	×	×	√	×	×	×
Resumable training	√	√	√	×	×	×	×	√ (1)	×	×
Elastic training	√	×	×	×	×	×	×	×	×	×
Recovery of inference card faults	×	×	×	√	×	×	√	√	√	√
Rescheduling upon inference card faults	×	×	×	√	×	×	√	√	√	√

1: Currently, this feature can be used only for MindIE Motor inference jobs.
The Atlas 200I SoC A1 core board does not support dynamic vNPU scheduling.
Among Atlas A3 training products, only the Atlas 900 A3 SuperPoD and Atlas 800T A3 SuperPoD Server currently support full NPU scheduling and resumable training.
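For scripted checks, rows of Table 1 can be transcribed into a lookup structure. The sketch below is an assumption about how one might do that, not an interface provided by the components: `PRODUCTS`, `ROWS`, `supported`, and `products_supporting` are hypothetical names, and only three rows are transcribed to keep the example short.

```python
# Product columns in the order used by Table 1.
PRODUCTS = [
    "Atlas training product",
    "Atlas A2 training product",
    "Atlas A3 training product",
    "Inference server (equipped with Atlas 300I inference cards)",
    "Atlas 200/300/500 inference product",
    "Atlas 200I/500 A2 inference product",
    "Atlas inference product",
    "Atlas 800I A2 inference server",
    "A200I A2 Box heterogeneous component",
    "Atlas 800I A3 SuperPoD Server",
]

# Three rows transcribed from Table 1, one character per product column.
ROWS = {
    "Static vNPU scheduling": "√×××××√×××",
    "Dynamic vNPU scheduling": "××××××√×××",
    "Elastic training": "√×××××××××",
}


def supported(feature, product):
    """True if the Table 1 cell for (feature, product) is a check mark."""
    return ROWS[feature][PRODUCTS.index(product)] == "√"


def products_supporting(feature):
    """List the products whose Table 1 cell for this feature is a check mark."""
    return [p for p, mark in zip(PRODUCTS, ROWS[feature]) if mark == "√"]


if __name__ == "__main__":
    print(products_supporting("Static vNPU scheduling"))
    print(supported("Dynamic vNPU scheduling", "Atlas A2 training product"))
```

The notes under the table (for example, the Atlas 200I SoC A1 core board restriction) are not captured by the check marks and would need separate handling.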

**Table 2** Component features
Component Installation Position	Component Name	Full NPU Scheduling or Static vNPU Scheduling	Full NPU Scheduling or Static vNPU Scheduling	Containerization	Resource Monitoring	Resumable Training	Elastic Training	Resumable Training	Dynamic vNPU Scheduling	Recovery of Inference Card Faults	Rescheduling Upon Inference Card Faults
Component Installation Position	Component Name	Training	Inference	Training and Inference	Training and Inference	Training	Training	Inference	Inference	Inference	Inference
Management node	Volcano	√	√	×	×	√	√	√	√	√	√
Management node	Resilience Controller	×	×	×	×	×	√	×	×	×	×
Management node	Ascend Operator	√	√	×	×	√	√	√	×	×	√
Management node	ClusterD	√	√	×	×	√	√	√	√	√	√
Compute node	Ascend Device Plugin	√	√	×	×	√	√	√	√	√	√
Compute node	Ascend Docker Runtime	√	√	√	×	√	√	×	√	√	√
Compute node	NodeD	√	√	×	×	√	√	√	√	√	√
Compute node	NPU Exporter	×	×	×	√	×	×	×	×	×	×
Training container	Elastic Agent	×	×	×	×	√	×	×	×	×	×
Training container	TaskD	×	×	×	×	√	×	×	×	×	×

In Table 2, resumable training in the inference scenario can be used only for MindIE Motor inference jobs.
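Table 2 can also be read column by column to see which components are marked √ for a given feature. The following sketch transcribes the table into strings and looks up one column. The names `COLUMNS`, `COMPONENTS`, and `components_for` are hypothetical, and treating √ as "this component is involved in the feature" is an interpretation of the table rather than a statement from the component documentation.

```python
# Column order of Table 2 as (feature, scenario) pairs.
COLUMNS = [
    ("Full NPU Scheduling or Static vNPU Scheduling", "Training"),
    ("Full NPU Scheduling or Static vNPU Scheduling", "Inference"),
    ("Containerization", "Training and Inference"),
    ("Resource Monitoring", "Training and Inference"),
    ("Resumable Training", "Training"),
    ("Elastic Training", "Training"),
    ("Resumable Training", "Inference"),
    ("Dynamic vNPU Scheduling", "Inference"),
    ("Recovery of Inference Card Faults", "Inference"),
    ("Rescheduling Upon Inference Card Faults", "Inference"),
]

# Rows transcribed from Table 2, one character per column above.
COMPONENTS = {
    "Volcano": "√√××√√√√√√",
    "Resilience Controller": "×××××√××××",
    "Ascend Operator": "√√××√√√××√",
    "ClusterD": "√√××√√√√√√",
    "Ascend Device Plugin": "√√××√√√√√√",
    "Ascend Docker Runtime": "√√√×√√×√√√",
    "NodeD": "√√××√√√√√√",
    "NPU Exporter": "×××√××××××",
    "Elastic Agent": "××××√×××××",
    "TaskD": "××××√×××××",
}


def components_for(feature, scenario):
    """Return the components marked with a check mark in the given column."""
    col = COLUMNS.index((feature, scenario))
    return [name for name, marks in COMPONENTS.items() if marks[col] == "√"]


if __name__ == "__main__":
    print(components_for("Elastic Training", "Training"))
    print(components_for("Resource Monitoring", "Training and Inference"))
```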

Using Other Schedulers

If Volcano is not used as the scheduler, only containerization, resource monitoring, full NPU scheduling, static vNPU scheduling, and recovery of inference card faults are supported, as shown in Table 3. √ indicates that the feature can be used for training or inference jobs on that product. × indicates that the feature is not supported.

**Table 3** Supported product models
Feature	Training Job	Training Job	Training Job	Inference Job	Inference Job	Inference Job	Inference Job	Inference Job	Inference Job	Inference Job
Product portfolio	Atlas training product	Atlas A2 training product	Atlas A3 training product	Inference server (equipped with Atlas 300I inference cards)	Atlas 200/300/500 inference product	Atlas 200I/500 A2 inference product	Atlas inference product	Atlas 800I A2 inference server	A200I A2 Box heterogeneous component	Atlas 800I A3 SuperPoD Server
Containerization	√	√	√	√	√	√	√	√	√	√
Resource monitoring	√	√	√	√	×	×	√	√	√	√
Full NPU scheduling	√	√	√	√	×	×	√	√	√	√
Static vNPU scheduling	√	×	×	×	×	×	√	√	√	×
Recovery of inference card faults	×	×	×	√	×	×	√	√	√	√

The Atlas 200I SoC A1 core board does not support dynamic vNPU scheduling.
Among Atlas A3 training products, only the Atlas 900 A3 SuperPoD and Atlas 800T A3 SuperPoD Server currently support full NPU scheduling and resumable training.
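When planning a deployment without Volcano, it can help to check a list of desired features against the reduced feature set stated in this section. The sketch below is one way to do that. `features_needing_volcano` is a hypothetical helper and the feature names are the strings used in this section.

```python
# Feature names follow this section; the helper itself is hypothetical.
ALL_FEATURES = {
    "containerization",
    "resource monitoring",
    "full NPU scheduling",
    "static vNPU scheduling",
    "dynamic vNPU scheduling",
    "resumable training",
    "elastic training",
    "recovery of inference card faults",
    "rescheduling upon inference card faults",
}

# Features listed in this section as available when Volcano is not the scheduler.
WITHOUT_VOLCANO = {
    "containerization",
    "resource monitoring",
    "full NPU scheduling",
    "static vNPU scheduling",
    "recovery of inference card faults",
}


def features_needing_volcano(requested):
    """Return the requested features that are only available with Volcano."""
    requested = set(requested)
    unknown = requested - ALL_FEATURES
    if unknown:
        raise ValueError(f"unknown features: {sorted(unknown)}")
    return sorted(requested - WITHOUT_VOLCANO)


if __name__ == "__main__":
    print(features_needing_volcano(["full NPU scheduling", "elastic training"]))
```

An empty result means every requested feature is listed as available with other schedulers. Per-product support still has to be checked against Table 3.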

**Table 4** Component features
Component Installation Position	Component Name	Full NPU Scheduling or Static vNPU Scheduling	Full NPU Scheduling or Static vNPU Scheduling	Containerization	Resource Monitoring	Recovery of Inference Card Faults
Component Installation Position	Component Name	Training	Inference	Training and Inference	Training and Inference	Inference
Management node	Resilience Controller	×	×	×	×	×
Management node	Ascend Operator	√	√	×	×	×
Management node	ClusterD	√	√	×	×	√
Compute node	Ascend Device Plugin	√	√	×	×	√
Compute node	Ascend Docker Runtime	√	√	√	×	√
Compute node	NodeD	√	√	×	×	√
Compute node	NPU Exporter	×	×	×	√	×
Training container	Elastic Agent	×	×	×	×	×
Training container	TaskD	×	×	×	×	×
