Feature Description

Basic scheduling covers the deployment and execution of model training or inference jobs on NPUs, as described in this document. The example in this section is for reference only. Because production environments differ, adjust the configurations accordingly.

Job Types

Ascend Operator provides the following methods to configure resource information:

  • Configuring resource information using environment variables: Environment variables are provided for distributed training jobs of different AI frameworks. For details, see Environment Variables of Ascend Operator. You can use this method to create only Ascend Job (acjob) objects.
  • Configuring resource information using a file: A collective communication configuration file (the RankTable file, also referred to as hccl.json) supplies the resource information for a training job. You can use this method to create three types of objects: Volcano Job (vcjob), Ascend Job (acjob), and Deployment (deploy).
    • (Recommended) Ascend Job (acjob): a job type customized by MindCluster. You can start a training or inference job by configuring resource information using environment variables or files.

      Each acjob YAML file contains some fixed fields, such as apiVersion and kind. For more information about these fields, see Key Fields in acjob.

    • Volcano Job (vcjob): applies to batch processing jobs that end in the Completed status.
    • Deployment (deploy): applies to jobs that run continuously in the background and never reach the Completed status. Select this type when you need to train continuously, occupy resources continuously, debug training jobs, or provide inference service APIs.

      A Deployment job cannot be updated. To update a Deployment job, delete it and then create another one.
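
The relationship between the two configuration methods and the job types listed above can be summarized as a small lookup. The following is a minimal sketch for illustration only: the short keys "env" and "file" and both helper functions are hypothetical names, not part of Ascend Operator or MindCluster.

```python
from typing import List, Set

# Mapping taken from the list above. The keys "env" and "file" are
# hypothetical shorthand for the two configuration methods.
SUPPORTED_JOB_TYPES = {
    "env": {"acjob"},                      # environment variables
    "file": {"vcjob", "acjob", "deploy"},  # RankTable file (hccl.json)
}


def job_types_for(method: str) -> Set[str]:
    """Return the job types that can be created with a configuration method."""
    if method not in SUPPORTED_JOB_TYPES:
        raise ValueError(f"unknown configuration method: {method!r}")
    return set(SUPPORTED_JOB_TYPES[method])


def methods_for(job_type: str) -> List[str]:
    """Return the configuration methods that support a given job type."""
    return sorted(
        method
        for method, job_types in SUPPORTED_JOB_TYPES.items()
        if job_type in job_types
    )


if __name__ == "__main__":
    print(job_types_for("env"))
    print(methods_for("acjob"))
    print(methods_for("vcjob"))
```

The lookup makes one point explicit: acjob is the only type available with both methods, whereas vcjob and deploy require the RankTable file.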

Scheduling Time Description

The following reference times apply to scheduling an acjob with Volcano on the Atlas 800T A2 training server, in the multi-job and single-job scenarios. To reach these reference times, ensure that the CPU frequency is at least 2.60 GHz and the API Server latency is less than 80 ms. The scheduling time is the time from when a job is delivered until its pods reach the Running status.

  • Multi-job scheduling time description
    • A maximum of 100 single-server single-processor jobs can be created concurrently from 100 YAML files, with a scheduling time of 107 seconds.
    • When five single-server single-processor jobs are created per second, 300 jobs have been created after one minute, with a scheduling time of 293 seconds.
  • For details about the scheduling time of a single job, see Table 1.
    Table 1 Single-job multi-pod scheduling description

    Number of Cluster Nodes    Number of Pods    Scheduling Time
    100                        100               14 s
    500                        500               57 s
    1000                       1000              114 s
    2000                       2000              228 s
    3000                       3000              269 s
    4000                       4000              300 s
    5000                       5000              400 s

    Notes:

    • One YAML file can create multiple pods in the single-job multi-pod scenario. For example, if 100 pods are created by one YAML file, the time for scheduling the 100 pods to 100 nodes is 14 seconds.
    • To optimize the scheduling time for 4,000 or 5,000 nodes, make adjustments by referring to 9.
    • Currently, a maximum of 1,000 nodes can be scheduled for a vcjob.
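
To compare the rows of Table 1, it can help to convert each scheduling time into an average time per pod. The following is a minimal sketch: the row values are copied from Table 1, while the function names and the idea of treating the next larger row as a rough upper bound for an intermediate pod count are assumptions made for illustration, not documented behavior.

```python
from typing import Optional

# Rows copied from Table 1: (cluster nodes, pods, scheduling time in seconds).
TABLE_1 = [
    (100, 100, 14),
    (500, 500, 57),
    (1000, 1000, 114),
    (2000, 2000, 228),
    (3000, 3000, 269),
    (4000, 4000, 300),
    (5000, 5000, 400),
]


def seconds_per_pod(pods: int, seconds: float) -> float:
    """Average scheduling time per pod for one table row."""
    return seconds / pods


def reference_time(pods: int) -> Optional[int]:
    """Return the Table 1 time of the smallest row that covers `pods`.

    Assumption: a job with fewer pods than a row is scheduled no slower
    than that row. Returns None when `pods` exceeds the largest row.
    """
    for _nodes, row_pods, seconds in TABLE_1:
        if pods <= row_pods:
            return seconds
    return None


if __name__ == "__main__":
    for nodes, pods, seconds in TABLE_1:
        print(f"{nodes:>5} nodes: {seconds_per_pod(pods, seconds):.3f} s per pod")
```

Under these assumptions, `reference_time(800)` returns the 1000-pod figure from Table 1, and any pod count above 5000 returns `None` because the table gives no reference for it.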