Parameters in the YAML File

Before you configure YAML files for full NPU scheduling or static vNPU scheduling, you need to understand the parameters used in the example YAML files.

  • See Table 1 for Ascend Job (acjob).
  • See Table 2 for Volcano Job (vcjob).

YAML Parameters (acjob)

The following table describes the YAML parameters that can be used in a training acjob.

Each acjob YAML file contains some fixed fields, such as apiVersion and kind. For more information about these fields, see Key Fields in acjob.

Table 1 YAML parameters

Parameter

Value

Description

framework

  • mindspore
  • pytorch
  • tensorflow

Framework type. Currently, only three types are supported.

ring-controller.atlas

  • For Atlas A2 training product, A200T A3 Box8 SuperPoD Server, Atlas 900 A3 SuperPoD, and Atlas 800T A3 SuperPoD Server, the value is ascend-{xxx}b.
  • For an Atlas 800 training server and a server with Atlas 300T training cards, the value is ascend-910.

Type of the processor used by a job.

podgroup-sched-enable

"true"

This parameter is configured only when the openFuyao-customized Kubernetes and volcano-ext are used in the cluster.

  • If the value is set to true, batch scheduling is enabled.
  • If the value is set to any other string, batch scheduling does not take effect and common scheduling is used.

If this parameter is not set, batch scheduling does not take effect and common scheduling is used.

NOTE:
  • This parameter is supported only by Volcano with full NPU scheduling enabled.
  • This parameter can be configured only for the Atlas 900 A3 SuperPoD and Atlas 800T A3 SuperPoD Server.

schedulerName

The default value is volcano. Set this parameter based on your actual requirements.

Scheduler selected when Ascend Operator enables gang scheduling.

minAvailable

The default value is the total number of job replicas.

Total number of job replicas when Ascend Operator enables gang scheduling and Volcano is used as the scheduler.

queue

The default value is default. Set this parameter based on your actual requirements.

Queue to which a job belongs when Ascend Operator enables gang scheduling and Volcano is used as the scheduler.

(Optional) successPolicy

  • null (default): This value is used if you do not set this parameter.
  • AllWorkers

Condition for a job to be considered successful. The null value indicates that the entire job is considered successful even if only one pod succeeds. The AllWorkers value indicates that all pods must succeed for the job to be considered successful.

container.name

ascend

The container name must be ascend.

(Optional) ports

If you do not set corresponding parameters, the system fills in the following values by default:

  • name: ascendjob-port
  • containerPort: 2222

Collective communication port for distributed training. The value of name can only be ascendjob-port. You can set containerPort as required. If containerPort is not set, the default port 2222 is used.

replicas

  • Single-server: 1
  • Distributed: N

N indicates the number of job replicas.

image

-

Training image name. Change it based on your actual requirements. (It is the name of the image created in the image preparation section.)

(Optional) host-arch

ARM environment: huawei-arm

x86_64 environment: huawei-x86

Architecture of the node where a training job is executed. Set this parameter as required.

In a distributed training job, ensure that the nodes running the training job have the same architecture.

huawei.com/recover_policy_path

pod: Only pod-level rescheduling is supported. Rescheduling at the job level is not supported.

Job rescheduling policy.

huawei.com/schedule_minAvailable

Integer

Minimum number of replicas that can be scheduled by a job.

huawei.com/schedule_policy

See Table 3 for its configurations.

Layout of the AI processors on which a job is to be scheduled. Volcano selects a proper scheduling policy based on this field. If this parameter is not set, the scheduling policy is selected based on accelerator-type.

NOTE:

This field can be used only on the Atlas training product, Atlas A2 training product, and Atlas A3 training product.

sp-block

Number of processors in a logical SuperPoD.

  • For a single-server system, the value must be the same as the number of processors requested by a job.
  • For a distributed system, the value must be an integer multiple of the number of processors on a node, and the total number of processors requested by a job must be an integer multiple of the value.

Cluster scheduling components divide logical SuperPoDs on physical SuperPoDs based on the division policy for affinity scheduling of training jobs. If this field is not specified, Volcano sets the size of the logical SuperPoD of a job to the total number of NPUs configured for the job during scheduling.

For details, see UnifiedBus Interconnect Device Network Description.

tor-affinity

  • large-model-schema: foundation model jobs or padding jobs
  • normal-schema: common job
  • null: switch affinity scheduling not used
    NOTE:

    You need to select a job type based on the number of job replicas. If the number of job replicas is less than 4, the job is a padding job. If the number of job replicas is greater than or equal to 4, the job is a foundation model job. The number of replicas of a common job is not limited.

The default value is null, indicating that switch affinity scheduling is not used. You need to set this parameter based on the job type.

NOTE:
  • Switch affinity scheduling 1.0 supports the Atlas training product and Atlas A2 training product under the PyTorch and MindSpore frameworks.
  • Switch affinity scheduling 2.0 supports the Atlas A2 training product under the PyTorch framework.
  • Switch affinity scheduling is supported only for full NPU scheduling. Static vNPU scheduling is not supported.

accelerator-type

The value varies according to the processor type, including:

  • Atlas 800 training server (fully populated with NPUs): module
  • Atlas 800 training server (half populated with NPUs): half
  • Server (with Atlas 300T training cards): card
  • Atlas 800T A2 training server and Atlas 900 A2 PoD cluster basic unit: module-{xxx}b-8
  • Atlas 200T A2 Box16 heterogeneous subrack: module-{xxx}b-16
  • A200T A3 Box8 SuperPoD Server: module-a3-16
  • (Optional) Atlas 800 training server (fully populated with NPUs): This label can be omitted.
  • Atlas 900 A3 SuperPoD: module-a3-16-super-pod

Set this parameter based on the type of the node where a training job is executed.

NOTE:

You can run the npu-smi info command to query the number in the chip model name, which is indicated by the Name field in the returned message. For example, if the chip model name contains 910, the value of {xxx} is 910.

requests

Full NPU scheduling

huawei.com/Ascend910: x

The value of x varies according to the processor type, including:

  • Atlas 800 training server (fully populated with NPUs):
    • Single-server single-processor job: 1
    • Single-server multi-processor job: 2, 4, 8
    • Distributed job: 1, 2, 4, 8
  • Atlas 800 training server (half populated with NPUs):
    • Single-server single-processor job: 1
    • Single-server multi-processor job: 2, 4
    • Distributed job: 1, 2, 4
  • Server (with Atlas 300T training cards):
    • Single-server single-processor job: 1
    • Single-server multi-processor job: 2
    • Distributed job: 2
  • Atlas 800T A2 training server and Atlas 900 A2 PoD cluster basic unit:
    • Single-server single-processor job: 1
    • Single-server multi-processor job: 2, 3, 4, 5, 6, 7, 8
    • Distributed job: 1, 2, 3, 4, 5, 6, 7, 8
  • Atlas 200T A2 Box16 heterogeneous subrack:
    • Single-server single-processor job: 1
    • Single-server multi-processor job: 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16
    • Distributed job: 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16
  • Atlas 900 A3 SuperPoD, A200T A3 Box8 SuperPoD Server, and Atlas 800T A3 SuperPoD Server:
    • Single-server multi-processor job: 2, 4, 6, 8, 10, 12, 14, 16
    • Distributed job: 2, 4, 6, 8, 10, 12, 14, 16
    • Logical SuperPoD affinity for Atlas 900 A3 SuperPoD: 16

Static vNPU scheduling

huawei.com/Ascend910-Y: 1

The value is 1. Only the vNPUs of one NPU can be used.

Example: huawei.com/Ascend910-6c.1cpu.16g: 1

Type and number of requested NPUs or vNPUs. Only one type can be requested. Change the value as required.

NOTE:

For details about the value of Y, see the vNPU type column in the mapping between virtual instance templates and virtual device types in Static Virtualization.

For example, if the vNPU type is Ascend910-6c.1cpu.16g, the value of Y is 6c.1cpu.16g, excluding Ascend910.

For more information about the virtualization template, see "Virtualization Templates" in Virtualization Rules.

limits

Type and number of requested NPUs or vNPUs. Only one type can be requested. Change the value as required.

The processor name and quantity in limits must be the same as those in requests.

metadata.annotations['huawei.com/AscendXXX']

XXX indicates the processor model. The value can be 910, 310, or 310P. The value must be the same as the actual processor type in the environment.

Ascend Docker Runtime obtains the value of this parameter and mounts NPUs of the corresponding type to a container.

NOTE:

This parameter applies only to the full NPU scheduling feature of the Volcano scheduler. If you use static vNPU scheduling or another scheduler, delete the fields of this parameter from the example YAML file.

hostNetwork

  • true: The host IP address is used to create a pod.

    In this case, you need to set the environment variable HCCL_IF_IP to status.hostIP in the YAML file.

  • false: The host IP address is not used to create a pod.

    If this parameter is not specified or is set to false, the preceding environment variable does not need to be configured.

  • If the cluster scale is large (the number of nodes is greater than 1000), you are advised to use the host IP address to create a pod.
  • If this parameter is not specified, the host IP address is not used to create a pod by default.
    NOTE:

    If you use the host IP address to create a pod, both pod creation and communication between pods slow down. To mitigate this, you are advised to mount the RankTable file. By parsing the RankTable file, you can obtain the host IP address of the pod and inject it into the environment variables of the corresponding framework job (for example, MS_SCHED_HOST for the MindSpore framework) to establish a connection.

super-pod-affinity

Affinity scheduling policy used by SuperPoD jobs. You need to declare the policy in the label field of the YAML file.

  • soft: If the cluster resources do not meet the SuperPoD affinity requirements, the job uses the fragment resources in the cluster for scheduling.
  • hard: If the cluster resources do not meet the SuperPoD affinity requirements, the job enters the pending status and waits for resources.
  • Other values or no value: The SuperPoD affinity scheduling is forcibly used.
NOTE:

This parameter is supported only by the Atlas 900 A3 SuperPoD.
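
The sp-block constraints described in Table 1 can be checked before a job is submitted. The following is a minimal sketch of such a check, not part of any cluster scheduling component. The function name and its parameters are hypothetical and exist only for illustration.

```python
def sp_block_is_valid(sp_block: int, npus_per_node: int,
                      total_npus: int, distributed: bool) -> bool:
    """Check an sp-block value against the rules stated for sp-block.

    Hypothetical helper: the name and parameters are assumptions made
    for this example and do not come from Volcano or Ascend Operator.
    """
    if sp_block <= 0 or npus_per_node <= 0 or total_npus <= 0:
        return False
    if not distributed:
        # Single-server: sp-block must equal the number of processors
        # requested by the job.
        return sp_block == total_npus
    # Distributed: sp-block must be an integer multiple of the number of
    # processors on a node, and the total number of requested processors
    # must be an integer multiple of sp-block.
    return sp_block % npus_per_node == 0 and total_npus % sp_block == 0
```

For example, with 16 processors per node and a job requesting 64 processors in total, an sp-block of 32 passes both distributed checks, whereas 24 fails because it is not a multiple of 16.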

YAML Parameters (deploy or vcjob)

The following table describes the YAML parameters that can be used in a deploy job or training vcjob.

Table 2 YAML parameters

Parameter

Value

Description

minAvailable

  • Single-server: 1
  • Distributed: N

N indicates the number of nodes. This parameter is not required for deploy jobs. You are advised to set this parameter to the same value as replicas.

replicas

  • Single-server: 1
  • Distributed: N

N indicates the number of job replicas.

image

-

Training image name. Change it based on your actual requirements. (It is the name of the image created in the image preparation section.)

(Optional) host-arch

ARM environment: huawei-arm

x86_64 environment: huawei-x86

Architecture of the node where a training job is executed. Set this parameter as required.

In a distributed training job, ensure that the nodes running the training job have the same architecture.

huawei.com/recover_policy_path

pod: Only pod-level rescheduling is supported. Rescheduling at the job level is not supported.

Job rescheduling policy.

huawei.com/schedule_minAvailable

Integer

Minimum number of replicas that can be scheduled by a job.

huawei.com/schedule_policy

See Table 3 for its configurations.

Layout of the AI processors on which a job is to be scheduled. Volcano selects a proper scheduling policy based on this field. If this parameter is not set, the scheduling policy is selected based on accelerator-type.

NOTE:

This field can be used only on the Atlas training product, Atlas A2 training product, and Atlas A3 training product.

sp-block

Number of processors in a logical SuperPoD.

  • For a single-server system, the value must be the same as the number of processors requested by a job.
  • For a distributed system, the value must be an integer multiple of the number of processors on a node, and the total number of processors requested by a job must be an integer multiple of the value.

Cluster scheduling components divide logical SuperPoDs on physical SuperPoDs based on the division policy for affinity scheduling of training jobs. If this field is not specified, Volcano sets the size of the logical SuperPoD of a job to the total number of NPUs configured for the job during scheduling.

For details, see UnifiedBus Interconnect Device Network Description.

tor-affinity

  • large-model-schema: foundation model jobs or padding jobs
  • normal-schema: common job
  • null: switch affinity scheduling not used
    NOTE:

    You need to select a job type based on the number of job replicas. If the number of job replicas is less than 4, the job is a padding job. If the number of job replicas is greater than or equal to 4, the job is a foundation model job. The number of replicas of a common job is not limited.

The default value is null, indicating that switch affinity scheduling is not used. You need to set this parameter based on the job type.

NOTE:
  • Switch affinity scheduling 1.0 supports the Atlas training product and Atlas A2 training product under the PyTorch and MindSpore frameworks.
  • Switch affinity scheduling 2.0 supports the Atlas A2 training product under the PyTorch framework.
  • Switch affinity scheduling is supported only for full NPU scheduling. Static vNPU scheduling is not supported.

accelerator-type

The value varies according to the processor type, including:

  • Atlas 800 training server (fully populated with NPUs): module
  • Atlas 800 training server (half populated with NPUs): half
  • Server (with Atlas 300T training cards): card
  • Atlas 800T A2 training server and Atlas 900 A2 PoD cluster basic unit: module-{xxx}b-8
  • Atlas 200T A2 Box16 heterogeneous subrack: module-{xxx}b-16
  • A200T A3 Box8 SuperPoD Server: module-a3-16
  • (Optional) Atlas 800 training server (fully populated with NPUs): This label can be omitted.
  • Atlas 900 A3 SuperPoD: module-a3-16-super-pod

Set this parameter based on the type of the node where a training job is executed.

NOTE:

You can run the npu-smi info command to query the number in the chip model name, which is indicated by the Name field in the returned message. For example, if the chip model name contains 910, the value of {xxx} is 910.

requests

Full NPU scheduling

huawei.com/Ascend910: x

The value of x varies according to the processor type, including:

  • Atlas 800 training server (fully populated with NPUs):
    • Single-server single-processor job: 1
    • Single-server multi-processor job: 2, 4, 8
    • Distributed job: 1, 2, 4, 8
  • Atlas 800 training server (half populated with NPUs):
    • Single-server single-processor job: 1
    • Single-server multi-processor job: 2, 4
    • Distributed job: 1, 2, 4
  • Server (with Atlas 300T training cards):
    • Single-server single-processor job: 1
    • Single-server multi-processor job: 2
    • Distributed job: 2
  • Atlas 800T A2 training server and Atlas 900 A2 PoD cluster basic unit:
    • Single-server single-processor job: 1
    • Single-server multi-processor job: 2, 3, 4, 5, 6, 7, 8
    • Distributed job: 1, 2, 3, 4, 5, 6, 7, 8
  • Atlas 200T A2 Box16 heterogeneous subrack:
    • Single-server single-processor job: 1
    • Single-server multi-processor job: 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16
    • Distributed job: 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16
  • Atlas 900 A3 SuperPoD, A200T A3 Box8 SuperPoD Server, and Atlas 800T A3 SuperPoD Server:
    • Single-server multi-processor job: 2, 4, 6, 8, 10, 12, 14, 16
    • Distributed job: 16

Static vNPU scheduling

huawei.com/Ascend910-Y: 1

The value is 1. Only the vNPUs of one NPU can be used.

Example: huawei.com/Ascend910-6c.1cpu.16g: 1

Type and number of requested NPUs or vNPUs. Only one type can be requested. Change the value as required.

NOTE:

For details about the value of Y, see the vNPU type column in the mapping between virtual instance templates and virtual device types in Static Virtualization.

For example, if the vNPU type is Ascend910-6c.1cpu.16g, the value of Y is 6c.1cpu.16g, excluding Ascend910.

For more information about the virtualization template, see "Virtualization Templates" in Virtualization Rules.

limits

Type and number of requested NPUs or vNPUs. Only one type can be requested. Change the value as required.

The processor name and quantity in limits must be the same as those in requests.

metadata.annotations['huawei.com/AscendXXX']

XXX indicates the processor model. The value can be 910, 310, or 310P. The value must be the same as the actual processor type in the environment.

Ascend Docker Runtime obtains the value of this parameter and mounts NPUs of the corresponding type to a container.

NOTE:

This parameter is supported only by Volcano with full NPU scheduling enabled. If you use static vNPU scheduling or another scheduler, delete the fields of this parameter from the example YAML file.

ring-controller.atlas

The value varies according to the processor type, including:

  • For an Atlas 800 training server and a server with Atlas 300T training cards, the value is ascend-910.
  • For Atlas A2 training product, A200T A3 Box8 SuperPoD Server, Atlas 900 A3 SuperPoD, and Atlas 800T A3 SuperPoD Server, the value is ascend-{xxx}b.

Type of the processor used by a job. You need to set this parameter in both the ConfigMap and the task.

NOTE:

You can run the npu-smi info command to query the number in the processor model name, which is indicated by the Name field in the returned message. For example, if the processor model name contains 910, the value of {xxx} is 910.

super-pod-affinity

Affinity scheduling policy used by SuperPoD jobs. You need to declare the policy in the label field of the YAML file.

  • soft: If the cluster resources do not meet the SuperPoD affinity requirements, the job uses the fragment resources in the cluster for scheduling.
  • hard: If the cluster resources do not meet the SuperPoD affinity requirements, the job enters the pending status and waits for resources.
  • Other values or no value: The SuperPoD affinity scheduling is forcibly used.
NOTE:

This parameter is supported only by the Atlas 900 A3 SuperPoD.
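
The job-type rule given for tor-affinity (fewer than 4 replicas is a padding job, 4 or more is a foundation model job) can be written as a small lookup. This is an illustrative sketch only. The function name and the returned strings are hypothetical and are not part of Volcano or Ascend Operator.

```python
from typing import Optional


def tor_affinity_job_type(tor_affinity: Optional[str], replicas: int) -> str:
    """Map a tor-affinity value and replica count to the job type
    described for the tor-affinity parameter.

    Hypothetical helper for illustration; the return strings are
    labels chosen for this example.
    """
    if tor_affinity is None or tor_affinity == "null":
        # Default: switch affinity scheduling is not used.
        return "switch affinity scheduling not used"
    if tor_affinity == "normal-schema":
        # The number of replicas of a common job is not limited.
        return "common job"
    if tor_affinity == "large-model-schema":
        # Fewer than 4 replicas: padding job; otherwise foundation model job.
        return "padding job" if replicas < 4 else "foundation model job"
    raise ValueError(f"unsupported tor-affinity value: {tor_affinity!r}")
```

A job with tor-affinity set to large-model-schema and 3 replicas is therefore treated as a padding job, and the same label with 4 replicas as a foundation model job.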

Table 3 huawei.com/schedule_policy configuration description

Configuration

Description

chip4-node8

One node has eight processors, and four processors form an interconnection ring, for example, the processor layout of the Atlas 800 training server (model 9000) or Atlas 800 training server (model 9010).

chip1-node2

One node has two processors. For example, one Atlas 300T training card can be equipped with only one processor, and one node can be equipped with a maximum of two Atlas 300T training cards.

chip4-node4

One node has four processors, and four processors form an interconnection ring, for example, the processor layout of the Atlas 800 training server (model 9000) or Atlas 800 training server (model 9010).

chip8-node8

One node has eight processors, and eight processors form one interconnection ring, for example, the processor layout of the Atlas 800T A2 training server.

chip8-node16

One node has 16 processors, and eight processors form one interconnection ring, for example, the processor layout of the Atlas 200T A2 Box16 heterogeneous subrack.

chip2-node16

One node has 16 processors, and two processors form one interconnection ring, for example, the processor layout of the Atlas 800T A3 SuperPoD Server.

chip2-node16-sp

One node has 16 processors, two processors form one interconnection ring, and multiple servers form a SuperPoD, for example, the processor layout of the Atlas 900 A3 SuperPoD.
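
The values in Table 3 appear to follow a chipN-nodeM naming pattern, where N is the number of processors in one interconnection ring and M is the number of processors on one node, with an optional -sp suffix for SuperPoD layouts. The sketch below decodes a value under that reading. The pattern is inferred from the table descriptions rather than from a documented grammar, and the class and function names are hypothetical.

```python
import re
from typing import NamedTuple


class ProcessorLayout(NamedTuple):
    ring_size: int   # processors that form one interconnection ring
    node_size: int   # processors on one node
    super_pod: bool  # True when the value ends with "-sp"


# Pattern inferred from the values listed in Table 3 (an assumption,
# not an official specification of huawei.com/schedule_policy).
_POLICY_RE = re.compile(r"^chip(\d+)-node(\d+)(-sp)?$")


def parse_schedule_policy(value: str) -> ProcessorLayout:
    """Decode a schedule_policy value such as "chip4-node8"."""
    match = _POLICY_RE.match(value)
    if match is None:
        raise ValueError(f"unrecognized schedule_policy value: {value!r}")
    return ProcessorLayout(
        ring_size=int(match.group(1)),
        node_size=int(match.group(2)),
        super_pod=match.group(3) is not None,
    )
```

Under this reading, chip4-node8 decodes to a ring of four processors on an eight-processor node, which matches its description in Table 3.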