Before You Start

Prerequisites

  • Ensure that a suitable storage solution has been configured in the environment. For example, to use a network file system (NFS), follow the steps described in Installing NFS.
  • Before using full NPU scheduling or static vNPU scheduling, ensure that the following components have been installed. If any of them are missing, install them by referring to Installation and Deployment.
    • Volcano or other schedulers
    • Ascend Device Plugin
    • Ascend Docker Runtime
    • Ascend Operator
    • ClusterD
    • NodeD
  • If the training job type is acjob and Volcano is used for full NPU scheduling, batch pod creation and batch scheduling are supported.
    • To create pods in batches, use openFuyao-customized Kubernetes when installing Ascend Operator.
    • To use batch scheduling, use openFuyao-customized Kubernetes and volcano-ext when installing Volcano.
    • Batch scheduling is intended for ultra-large clusters. In this scenario, increase the CPU and memory resources allocated to MindCluster as required, so that Kubernetes does not evict MindCluster because of degraded performance or memory usage that exceeds the allocation.
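
The component checklist above can be tracked with a small helper. The following is a minimal sketch, not part of MindCluster: the function name missing_components and its inputs are hypothetical, and it assumes you have already confirmed by your own means (for example, by inspecting the cluster and nodes) which components are present.

```python
# Component names copied from the prerequisites list above.
REQUIRED_COMPONENTS = (
    "Ascend Device Plugin",
    "Ascend Docker Runtime",
    "Ascend Operator",
    "ClusterD",
    "NodeD",
)


def missing_components(installed, scheduler_installed):
    """Return the prerequisite components that are not yet installed.

    installed: iterable of component names already confirmed as present
        (hypothetical input; how you confirm them is up to you).
    scheduler_installed: True if Volcano or another scheduler is available.
    """
    # Compare names case-insensitively and ignore surrounding whitespace.
    present = {name.strip().lower() for name in installed}
    missing = [c for c in REQUIRED_COMPONENTS if c.lower() not in present]
    if not scheduler_installed:
        missing.insert(0, "Volcano or other schedulers")
    return missing


if __name__ == "__main__":
    found = ["Ascend Device Plugin", "Ascend Operator", "clusterd"]
    # Lists whatever still needs to be installed before scheduling jobs.
    print(missing_components(found, scheduler_installed=True))
```

An empty result means every component in the list is accounted for. Any name that is returned should be installed by referring to Installation and Deployment.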

Usage Modes

  • Use through commands: To use full NPU scheduling or static vNPU scheduling through commands, you need Volcano or another scheduler. Regardless of the scheduler selected, use Ascend Operator to set resource information.
  • Use after integration: Integrate the cluster scheduling components into an existing third-party AI platform or an AI platform developed based on the cluster scheduling components.

Usage Instructions

  • Resource monitoring can be used together with all features in the training scenario.
  • If multiple training jobs are running in a cluster at the same time, the features used by each job can be different.
  • Static vNPU scheduling must be used together with computing power virtualization. For details about static virtualization, see Static Virtualization.

Supported Products

  • Full NPU scheduling is supported by the following products:
    • Atlas training product
    • Atlas A2 training product
    • Atlas A3 training product
  • Static vNPU scheduling is supported by the following products:
    • Atlas training product

Usage Process

Full NPU scheduling and static vNPU scheduling can be enabled either through commands, using Volcano or other schedulers, or after integration.

The CLI process is the same for Volcano and for other schedulers, except for preparing the job YAML file. To prepare the job YAML file for another scheduler, refer to Use on the CLI (Other Schedulers). For all other operations, see Use on the CLI (Volcano).

Figure 1 Process of full NPU scheduling and static vNPU scheduling
  1. During script adaptation, you can configure resource information through environment variables or files as required.
  2. When preparing a YAML file to deliver a job, select the file that matches your NPU model and then modify and adapt it as required. For help with choosing a file, see Preparation of Job YAML Files.
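
Step 1 mentions that a training script can obtain resource information from environment variables or from files. The following minimal sketch shows one way a script might support both sources. It rests on assumptions: the variable name EXAMPLE_RESOURCE_INFO, the JSON encoding, and the function name load_resource_info are hypothetical placeholders, not names defined by Ascend Operator. Use the names and formats given in the script adaptation documentation.

```python
import json
import os


def load_resource_info(env_var="EXAMPLE_RESOURCE_INFO", file_path=None):
    """Load resource information from an environment variable or a file.

    env_var: hypothetical variable name holding a JSON string.
    file_path: hypothetical path to a JSON file, used as a fallback.
    Returns the parsed data, or None if neither source is available.
    """
    raw = os.environ.get(env_var)
    if raw:
        # The environment variable takes precedence when it is set.
        return json.loads(raw)
    if file_path and os.path.isfile(file_path):
        with open(file_path, encoding="utf-8") as handle:
            return json.load(handle)
    return None


if __name__ == "__main__":
    info = load_resource_info(file_path="resource_info.json")
    if info is None:
        print("No resource information found")
    else:
        print("Resource information:", info)
```

In this sketch the environment variable is checked first, so a single script can run unchanged whichever of the two configuration methods a job uses.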