Instructions

This section describes how to use features of cluster scheduling components, including the scenario description, feature description, relationship between components and features, and the list of products supported by the Volcano scheduler and other schedulers.

Volcano and other schedulers cannot manage the same node resources.
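
The constraint above can be checked before deployment. The following is a minimal sketch, assuming a hypothetical inventory that maps each node name to the schedulers configured to manage its resources. The mapping, the function name, and the scheduler names are illustrative assumptions, not part of the cluster scheduling components.

```python
def nodes_with_scheduler_conflict(node_schedulers):
    """Return the nodes whose resources are managed by Volcano and another scheduler.

    node_schedulers is a hypothetical inventory: node name -> iterable of
    scheduler names configured to manage that node's resources.
    """
    conflicts = []
    for node, schedulers in node_schedulers.items():
        names = {name.lower() for name in schedulers}
        # A conflict exists when Volcano shares the node with any other scheduler.
        if "volcano" in names and len(names) > 1:
            conflicts.append(node)
    return sorted(conflicts)


# Illustrative inventory (node and scheduler names are made up for this sketch).
inventory = {
    "node-a": ["volcano"],
    "node-b": ["volcano", "default-scheduler"],
    "node-c": ["default-scheduler"],
}
conflicting_nodes = nodes_with_scheduler_conflict(inventory)
```

In this sketch only node-b is reported, because it is the only node shared by Volcano and a second scheduler.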

Scenario Description

Training scenarios support the following features: resource monitoring, full NPU scheduling, static vNPU scheduling, resumable training, and elastic training.

Inference scenarios support the following features: resource monitoring, full NPU scheduling, static vNPU scheduling, dynamic vNPU scheduling, recovery of inference card faults, and rescheduling upon inference card faults.

In a cluster, training jobs and inference jobs may coexist. However, features exclusive to either training (resumable training and elastic training) or inference (dynamic vNPU scheduling, recovery of inference card faults, and rescheduling upon inference card faults) cannot be used concurrently in the same job.
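
The rule above can be expressed as a simple set check. The following is a minimal sketch that uses the feature names from this section as plain strings; the function name and the data layout are assumptions made for illustration, and the sketch does not model per-product support or the MindIE Motor exception noted under Table 1.

```python
# Features exclusive to one scenario, as listed in this section.
TRAINING_ONLY = frozenset({"resumable training", "elastic training"})
INFERENCE_ONLY = frozenset({
    "dynamic vNPU scheduling",
    "recovery of inference card faults",
    "rescheduling upon inference card faults",
})


def conflicting_features(requested):
    """Return (training_only, inference_only) features requested together.

    Returns None when the job does not mix training-exclusive and
    inference-exclusive features.
    """
    requested = set(requested)
    training = requested & TRAINING_ONLY
    inference = requested & INFERENCE_ONLY
    if training and inference:
        return sorted(training), sorted(inference)
    return None


# A job that mixes exclusive features from both scenarios is flagged.
mixed = conflicting_features(["elastic training", "dynamic vNPU scheduling"])
# A job that uses shared features plus one training-exclusive feature is fine.
valid = conflicting_features(["full NPU scheduling", "resumable training"])
```

Features shared by both scenarios, such as resource monitoring and full NPU scheduling, never trigger a conflict in this check.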

Using the Volcano Scheduler

Table 1 describes the mapping between the features supported by the cluster scheduling components and the products. √ indicates that the feature can be used for training or inference jobs. × indicates that the feature is not supported.

Table 1 Supported product models

Feature

Training Job

Training Job

Inference Job

Product portfolio

Atlas training product

Atlas A2 training product

Atlas A3 training product

Inference server (equipped with Atlas 300I inference cards)

Atlas 200/300/500 inference product

Atlas 200I/500 A2 inference product

Atlas inference product

Atlas 800I A2 inference server

A200I A2 Box heterogeneous component

Atlas 800I A3 SuperPoD Server

Containerization

Resource monitoring

×

×

Full NPU scheduling

×

×

Static vNPU scheduling

×

×

×

×

×

×

×

×

Dynamic vNPU scheduling

×

×

×

×

×

×

×

×

×

Resumable training

×

×

×

×

1

×

×

Elastic training

×

×

×

×

×

×

×

×

×

Recovery of inference card faults

×

×

×

×

×

Rescheduling upon inference card faults

×

×

×

×

×

  • 1: Currently, this feature can be used only for MindIE Motor inference jobs.
  • The Atlas 200I SoC A1 core board does not support dynamic vNPU scheduling.
  • Currently, among Atlas A3 training products, only the Atlas 900 A3 SuperPoD and Atlas 800T A3 SuperPoD Server support full NPU scheduling and resumable training.

Table 2 Component features

Component Installation Position

Component Name

Full NPU Scheduling or Static vNPU Scheduling (Training)

Full NPU Scheduling or Static vNPU Scheduling (Inference)

Containerization (Training and Inference)

Resource Monitoring (Training and Inference)

Resumable Training (Training)

Elastic Training (Training)

Resumable Training (Inference)

Dynamic vNPU Scheduling (Inference)

Recovery of Inference Card Faults (Inference)

Rescheduling Upon Inference Card Faults (Inference)

Management node

Volcano

×

×

Resilience Controller

×

×

×

×

×

×

×

×

×

Ascend Operator

×

×

×

×

ClusterD

×

×

Compute node

Ascend Device Plugin

×

×

Ascend Docker Runtime

×

×

NodeD

×

×

NPU Exporter

×

×

×

×

×

×

×

×

×

Training container

Elastic Agent

×

×

×

×

×

×

×

×

×

TaskD

×

×

×

×

×

×

×

×

×

In the preceding table, resumable training in the inference scenario can be used only for MindIE Motor inference jobs.

Using Other Schedulers

If Volcano is not used as the scheduler, only containerization, resource monitoring, full NPU scheduling, static vNPU scheduling, and recovery of inference card faults are supported, as shown in Table 3. √ indicates that the feature can be used for training or inference jobs. × indicates that the feature is not supported.
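
The restriction above can be sketched as a lookup against the five features listed in this paragraph. The function name and the string labels are assumptions made for illustration; the sketch covers only the scheduler-level restriction and does not encode the per-product support in Table 3.

```python
# Features available when a scheduler other than Volcano is used,
# taken from the paragraph above.
SUPPORTED_WITHOUT_VOLCANO = frozenset({
    "containerization",
    "resource monitoring",
    "full NPU scheduling",
    "static vNPU scheduling",
    "recovery of inference card faults",
})


def features_requiring_volcano(requested):
    """Return the requested features that are unavailable without Volcano."""
    return sorted(set(requested) - SUPPORTED_WITHOUT_VOLCANO)


# Example: a job plan that includes two features outside the supported set.
plan = ["resource monitoring", "elastic training", "dynamic vNPU scheduling"]
blocked = features_requiring_volcano(plan)
```

An empty result means the requested features can all be used with a scheduler other than Volcano, subject to the product support listed in Table 3.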

Table 3 Supported product models

Feature

Training Job

Training Job

Inference Job

Product portfolio

Atlas training product

Atlas A2 training product

Atlas A3 training product

Inference server (equipped with Atlas 300I inference cards)

Atlas 200/300/500 inference product

Atlas 200I/500 A2 inference product

Atlas inference product

Atlas 800I A2 inference server

A200I A2 Box heterogeneous component

Atlas 800I A3 SuperPoD Server

Containerization

Resource monitoring

×

×

Full NPU scheduling

×

×

Static vNPU scheduling

×

×

×

×

×

×

Recovery of inference card faults

×

×

×

×

×

  • The Atlas 200I SoC A1 core board does not support dynamic vNPU scheduling.
  • Currently, among Atlas A3 training products, only the Atlas 900 A3 SuperPoD and Atlas 800T A3 SuperPoD Server support full NPU scheduling and resumable training.

Table 4 Component features

Component Installation Position

Component Name

Full NPU Scheduling or Static vNPU Scheduling (Training)

Full NPU Scheduling or Static vNPU Scheduling (Inference)

Containerization (Training and Inference)

Resource Monitoring (Training and Inference)

Recovery of Inference Card Faults (Inference)

Management node

Resilience Controller

×

×

×

×

×

Ascend Operator

×

×

×

ClusterD

×

×

Compute node

Ascend Device Plugin

×

×

Ascend Docker Runtime

×

NodeD

×

×

NPU Exporter

×

×

×

×

Training container

Elastic Agent

×

×

×

×

×

TaskD

×

×

×

×

×