Feature Description
Basic scheduling involves the following features:
- Training job: full NPU scheduling, elastic training, and static vNPU scheduling. For details about how to use resumable training, see Resumable Training.
- Inference job: full NPU scheduling, static vNPU scheduling, dynamic vNPU scheduling, recovery of inference card faults, and rescheduling upon inference card faults.
These features are implemented by different components. For details, see Basic Scheduling.
This document describes how to deploy and run a model training or inference job on NPUs. The examples in this section are for reference only. Because production environments differ, adjust the configurations to match your own environment.
Job Types
Ascend Operator provides the following methods to configure resource information:
- Configuring resource information using environment variables: Environment variables are provided for distributed training jobs of different AI frameworks. For details, see Environment Variables of Ascend Operator. You can use this method to create only Ascend Job (acjob) objects.
- Configuring resource information using a file: The resource information is provided in the collective communication configuration file of the training job (the RankTable file, also referred to as hccl.json). You can use this method to create three types of objects: Volcano Job (vcjob), Ascend Job (acjob), and Deployment (deploy).
- (Recommended) Ascend Job (acjob): a job type customized by MindCluster. You can start a training or inference job by configuring resource information using environment variables or files.
Each acjob YAML file contains some fixed fields, such as apiVersion and kind. For more information about these fields, see Key Fields in acjob.
- Volcano Job (vcjob): applies to batch processing jobs that run to completion, that is, jobs that have a completed status.
- Deployment (deploy): applies to jobs that run continuously in the background and never reach the completed status. Select this type when you need to train continuously, occupy resources for long periods, debug training jobs, or provide inference service APIs.
A Deployment job cannot be updated. To update a Deployment job, delete it and then create another one.
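The relationship between the two configuration methods and the object types they can create can be summarized as a small lookup. The following Python sketch is illustrative only: the keys `env_vars` and `ranktable_file` and the helper `can_create` are hypothetical names chosen for this example and are not part of Ascend Operator.

```python
# Object types that each resource-configuration method can create,
# as described in this section. The keys are hypothetical labels
# used only in this sketch, not Ascend Operator identifiers.
SUPPORTED_KINDS = {
    "env_vars": {"acjob"},                           # environment variables
    "ranktable_file": {"vcjob", "acjob", "deploy"},  # RankTable file (hccl.json)
}


def can_create(method: str, kind: str) -> bool:
    """Return True if the given configuration method can create this object type."""
    if method not in SUPPORTED_KINDS:
        raise ValueError(f"unknown configuration method: {method!r}")
    return kind in SUPPORTED_KINDS[method]


if __name__ == "__main__":
    for method in SUPPORTED_KINDS:
        for kind in ("acjob", "vcjob", "deploy"):
            print(f"{method:15s} {kind:7s} {can_create(method, kind)}")
```

The lookup shows that acjob is the only type that works with both configuration methods.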
Scheduling Time Description
This section provides reference times for scheduling an acjob on the Atlas 800T A2 training server with Volcano, in both the multi-job and single-job scenarios. These reference times assume a CPU frequency of at least 2.60 GHz and an API Server latency of less than 80 ms. The scheduling time is the time from when a job is delivered until its pods are in the Running status.
- Multi-job scheduling time description
  - Concurrent creation: up to 100 single-server single-processor jobs can be created concurrently from 100 YAML files, with a scheduling time of 107 seconds.
  - Continuous creation: when five single-server single-processor jobs are created per second, 300 jobs are created after one minute, with a scheduling time of 293 seconds.
- For details about the scheduling time of a single job, see Table 1.
Table 1 Single-job multi-pod scheduling description

| Number of Cluster Nodes | Number of Pods | Scheduling Time (s) |
| --- | --- | --- |
| 100 | 100 | 14 |
| 500 | 500 | 57 |
| 1000 | 1000 | 114 |
| 2000 | 2000 | 228 |
| 3000 | 3000 | 269 |
| 4000 | 4000 | 300 |
| 5000 | 5000 | 400 |
Notes:
- One YAML file can create multiple pods in the single-job multi-pod scenario. For example, if 100 pods are created by one YAML file, the time for scheduling the 100 pods to 100 nodes is 14 seconds.
- To optimize the scheduling time for 4,000 or 5,000 nodes, make adjustments by referring to 9.
- Currently, a maximum of 1,000 nodes can be scheduled for a vcjob.
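To get a rough feel for pod counts that fall between the rows of Table 1, the measured points can be interpolated linearly. The Python sketch below does this under a loud assumption: linear interpolation is a simplification introduced here, not a documented property of the scheduler, and the figures apply only under the reference conditions stated above (CPU frequency of at least 2.60 GHz, API Server latency below 80 ms). The function name `estimate_scheduling_time` is hypothetical.

```python
from bisect import bisect_left

# (number of pods, scheduling time in seconds), copied from Table 1.
TABLE_1 = [
    (100, 14),
    (500, 57),
    (1000, 114),
    (2000, 228),
    (3000, 269),
    (4000, 300),
    (5000, 400),
]


def estimate_scheduling_time(pods: int) -> float:
    """Linearly interpolate between Table 1 rows (an assumption, not a guarantee)."""
    counts = [p for p, _ in TABLE_1]
    if pods < counts[0] or pods > counts[-1]:
        # No extrapolation: the table gives no data outside 100..5000 pods.
        raise ValueError("pod count is outside the range measured in Table 1")
    i = bisect_left(counts, pods)
    if counts[i] == pods:
        return float(TABLE_1[i][1])
    (x0, y0), (x1, y1) = TABLE_1[i - 1], TABLE_1[i]
    return y0 + (y1 - y0) * (pods - x0) / (x1 - x0)


if __name__ == "__main__":
    for pods, seconds in TABLE_1:
        # Average time per pod at each measured point.
        print(f"{pods:5d} pods: {seconds:4d} s, {seconds / pods * 1000:.0f} ms per pod")
    print(estimate_scheduling_time(750))
```

Treat the result as an order-of-magnitude estimate only; actual scheduling time depends on the cluster and its configuration.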