Feature Description

Basic scheduling involves the following features:

- Training job: full NPU scheduling, elastic training, and static vNPU scheduling. For details about how to use resumable training, see Resumable Training.
- Inference job: full NPU scheduling, static vNPU scheduling, dynamic vNPU scheduling, recovery of inference card faults, and rescheduling upon inference card faults.

These features are implemented by different components. For details, see Basic Scheduling.

This document describes how to deploy and run a model training or inference job on NPUs. The examples in this section are for reference only. Because production environments differ, adjust the configurations to suit your own environment.

Job Types

Ascend Operator provides the following methods to configure resource information:

- Configuring resource information using environment variables: Ascend Operator provides environment variables for distributed training jobs of different AI frameworks. For details, see Environment Variables of Ascend Operator. This method can create only Ascend Job (acjob) objects.
- Configuring resource information using a file: The resource information is provided in the collective communication configuration file of a training job (the RankTable file, also referred to as hccl.json). This method can create three types of objects: Volcano Job (vcjob), Ascend Job (acjob), and Deployment (deploy).

The job types are as follows:
- (Recommended) Ascend Job (acjob): a job type customized by MindCluster. You can start a training or inference job by configuring resource information using environment variables or files.
  Each acjob YAML file contains some fixed fields, such as apiVersion and kind. For more information about these fields, see Key Fields in acjob.
- Volcano Job (vcjob): applies to batch processing jobs that run to completion and therefore have the completed status.
- Deployment (deploy): applies to jobs that run continuously in the background and never reach the completed status. Select this type when you need to train continuously, occupy resources continuously, debug training jobs, or provide inference service APIs.
  
  A Deployment job cannot be updated. To update a Deployment job, delete it and then create another one.
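The relationship between the two configuration methods and the three job types can be summarized in a small lookup. This is only an illustrative sketch: the dictionary keys (`environment_variables`, `ranktable_file`) and the function name `methods_for` are hypothetical labels chosen for this example and are not identifiers used by Ascend Operator.

```python
# Job object types that each resource-configuration method can create,
# as stated in this section. The keys are illustrative labels only,
# not names defined by Ascend Operator.
SUPPORTED_JOB_TYPES = {
    "environment_variables": {"acjob"},
    "ranktable_file": {"vcjob", "acjob", "deploy"},
}


def methods_for(job_type: str) -> list[str]:
    """Return the configuration methods that can create the given job type."""
    return sorted(
        method
        for method, job_types in SUPPORTED_JOB_TYPES.items()
        if job_type in job_types
    )


print(methods_for("acjob"))   # ['environment_variables', 'ranktable_file']
print(methods_for("vcjob"))   # ['ranktable_file']
print(methods_for("deploy"))  # ['ranktable_file']
```

The lookup makes the asymmetry explicit: only acjob supports both methods, while vcjob and deploy rely on the RankTable file.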

Scheduling Time Description

The following describes reference scheduling times for an acjob on the Atlas 800T A2 training server in the Volcano multi-job and single-job scenarios. To reach these reference times, ensure that the CPU frequency is at least 2.60 GHz and the API Server latency is less than 80 ms. The scheduling time is the time from when a job is delivered until its pods are in the Running status.

Multi-job scheduling time description
- A maximum of 100 single-server single-processor jobs can be created concurrently from 100 YAML files, with a scheduling time of 107 seconds.
- When five single-server single-processor jobs are created per second, 300 jobs are created after one minute, with a scheduling time of 293 seconds.
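The arithmetic behind the two multi-job figures can be checked with a short sketch. The function names are hypothetical, and the average throughput assumes that the scheduling time covers the whole batch of jobs, which is one reading of the figures above rather than a documented definition.

```python
def jobs_created(rate_per_second: float, duration_seconds: float) -> int:
    """Number of jobs submitted at a constant rate over a given duration."""
    return int(rate_per_second * duration_seconds)


def average_throughput(jobs: int, scheduling_seconds: float) -> float:
    """Average jobs scheduled per second, assuming the time covers the whole batch."""
    return jobs / scheduling_seconds


staggered_jobs = jobs_created(5, 60)
print(staggered_jobs)                                     # 300
print(round(average_throughput(100, 107), 2))             # 0.93 (concurrent case)
print(round(average_throughput(staggered_jobs, 293), 2))  # 1.02 (5 jobs/s case)
```

Under that assumption, both scenarios average roughly one job scheduled per second.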

For details about the scheduling time of a single job, see Table 1.

**Table 1** Single-job multi-pod scheduling description

| Number of Cluster Nodes | Number of Pods | Scheduling Time (s) |
| --- | --- | --- |
| 100 | 100 | 14 |
| 500 | 500 | 57 |
| 1000 | 1000 | 114 |
| 2000 | 2000 | 228 |
| 3000 | 3000 | 269 |
| 4000 | 4000 | 300 |
| 5000 | 5000 | 400 |

Notes: One YAML file can create multiple pods in the single-job multi-pod scenario. For example, if 100 pods are created by one YAML file, the time for scheduling the 100 pods to 100 nodes is 14 seconds. To optimize the scheduling time for 4,000 or 5,000 nodes, make adjustments by referring to 9. Currently, a maximum of 1,000 nodes can be scheduled for a vcjob.
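
For rough capacity planning, the Table 1 figures can be turned into a per-pod cost and a simple estimate for cluster sizes between the listed rows. This is a minimal sketch: linear interpolation between rows is an assumption made for illustration, not a documented scheduling model, and the function names are hypothetical.

```python
from bisect import bisect_left

# (number of pods, scheduling time in seconds), copied from Table 1
TABLE_1 = [
    (100, 14), (500, 57), (1000, 114), (2000, 228),
    (3000, 269), (4000, 300), (5000, 400),
]


def seconds_per_pod(pods: int, seconds: float) -> float:
    """Average scheduling time per pod for one table row."""
    return seconds / pods


def estimate_seconds(pods: int) -> float:
    """Estimate scheduling time by linear interpolation between Table 1 rows.

    Interpolation is an illustrative assumption; values outside the
    measured range are rejected rather than extrapolated.
    """
    sizes = [p for p, _ in TABLE_1]
    if not sizes[0] <= pods <= sizes[-1]:
        raise ValueError("pod count is outside the range covered by Table 1")
    i = bisect_left(sizes, pods)
    if sizes[i] == pods:
        return float(TABLE_1[i][1])
    (p0, t0), (p1, t1) = TABLE_1[i - 1], TABLE_1[i]
    return t0 + (t1 - t0) * (pods - p0) / (p1 - p0)


for pods, seconds in TABLE_1:
    print(pods, round(seconds_per_pod(pods, seconds), 3))

print(estimate_seconds(1500))  # 171.0, halfway between the 1000 and 2000 rows
```

The per-pod figures show that the cost stays at about 0.114 seconds per pod from 500 to 2,000 pods and is lower for the 3,000 to 5,000 rows.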
