Instructions

  • The MindX DL cluster scheduling components are mainly used in data centers to provide basic functions such as training and inference job scheduling and NPU device discovery. They do not include upper-layer user interfaces or user service logic, and they can be used only after secondary development by integrators.
  • It is recommended that the scheduling components be used in the following scenarios:
    • A data center performs training and inference.
    • A device contains Huawei NPUs.
    • Deployment is based on containerization technologies.
    • Kubernetes functions as the basic platform for job scheduling.
  • The following workload resource types of training jobs are supported:
    • Volcano job (recommended): applies to batch processing jobs, which run to completion and report a completed status.
    • Deployment: applies to workloads that run continuously in the background. Select this option when you need to keep training jobs or resources available at all times, debug training jobs, or provide inference service APIs.

      Restrictions on using a deployment: A deployment cannot be updated. To update a deployment, delete it and then create another one.

  • Security statement: Huawei ensures the security of the cluster scheduling components. The code samples, model usage examples, and container images involved in this document are released on Gitee or in the Ascend Community and are for reference only. If they are used for commercial purposes, users must ensure their security (for example, by checking for vulnerabilities) themselves.
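As a minimal sketch of the recommended Volcano job workload type, the manifest below requests one Ascend 910 AI Processor and lets Volcano perform the scheduling. The job name, image, and the `huawei.com/Ascend910` resource key are illustrative assumptions; verify the exact resource name and API version against your MindX DL release.

```yaml
# Minimal Volcano training job sketch (illustrative values, not a verified manifest).
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mindx-train-example          # hypothetical job name
spec:
  minAvailable: 1
  schedulerName: volcano             # let Volcano apply the NPU affinity constraints
  tasks:
    - replicas: 1
      name: trainer
      template:
        spec:
          containers:
            - name: train
              image: train-image:latest        # placeholder training image
              resources:
                requests:
                  huawei.com/Ascend910: 1      # NPU resource key; may differ by version
                limits:
                  huawei.com/Ascend910: 1
          restartPolicy: Never
```

Because a Deployment cannot be updated in place, the Volcano job form above is also the more convenient choice for iterating on job definitions.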

Model Training Job Description

Based on the server type, the restrictions on training jobs are as follows:

  • Atlas 800 training server
    • The number of NPUs allocated to a training job must be 1, 2, 4, 8, or a multiple of 8. If two or four NPUs are allocated, affinity constraints require that they reside in the same area of the same server (processors 0 to 3 form one area; processors 4 to 7 form the other). For example, two NPUs allocated for training must both be in area one (processors 0 to 3) or both in area two (processors 4 to 7) of the same server; they cannot span both areas. Volcano enforces this requirement when it schedules jobs.
    • If the total number of Ascend 910 AI Processors allocated to a training job is less than or equal to 8, only one pod is allocated. If the number is greater than 8, each pod has eight Ascend 910 AI Processors.
  • Atlas 800 training server (half configuration of NPUs)
    • The number of NPUs allocated to a training job must be 1, 2, 4, or a multiple of 4. If 1, 2, or 4 NPUs are allocated, only single-device training is supported, and the processors must be in the same area. Volcano enforces this requirement when it schedules jobs.
    • If the total number of Ascend 910 AI Processors allocated to a training job is less than or equal to 4, only one pod is allocated. If the number is greater than 4, each pod has four Ascend 910 AI Processors.
  • Servers (with Atlas 300T training cards)
    • The number of NPUs allocated to a training job must be 1, 2, or a multiple of 2. If 1 or 2 NPUs are allocated, only single-device training is supported. Volcano enforces this requirement when it schedules jobs.
    • If the total number of Ascend 910 AI Processors allocated to a training job is less than or equal to 2, only one pod is allocated. If the number is greater than 2, each pod has two Ascend 910 AI Processors.
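The pod-splitting rules above can be illustrated for the Atlas 800 full-configuration case: a job that needs 16 NPUs in total is expressed as two pods with eight Ascend 910 AI Processors each. The sketch below assumes a Volcano job with two replicas; the job name, image, and the `huawei.com/Ascend910` resource key are illustrative and should be checked against your MindX DL release.

```yaml
# Distributed training sketch: 16 NPUs on Atlas 800 => 2 pods x 8 NPUs each
# (illustrative values, not a verified manifest).
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mindx-train-distributed      # hypothetical job name
spec:
  minAvailable: 2
  schedulerName: volcano
  tasks:
    - replicas: 2                    # more than 8 NPUs total: one pod per 8 NPUs
      name: worker
      template:
        spec:
          containers:
            - name: train
              image: train-image:latest        # placeholder training image
              resources:
                requests:
                  huawei.com/Ascend910: 8      # 8 NPUs per pod, per the rule above
                limits:
                  huawei.com/Ascend910: 8
          restartPolicy: Never
```

For the half-configuration and Atlas 300T cases, the same pattern applies with 4 and 2 NPUs per pod, respectively.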