Parameters in the YAML File
This section describes how to configure YAML files for full NPU scheduling or static vNPU scheduling. Before you configure them, review the parameters used in the example YAML files.
YAML Parameters (acjob)
The following table describes the YAML parameters that can be used in a training acjob.
Each acjob YAML file contains some fixed fields, such as apiVersion and kind. For more information about these fields, see Key Fields in acjob.
Parameter |
Value |
Description |
|---|---|---|
framework |
|
Framework type. Currently, only three types are supported. |
ring-controller.atlas |
|
Type of the processor used by a job. |
podgroup-sched-enable |
"true" |
This parameter is configured only when the openFuyao-customized Kubernetes and volcano-ext are used in the cluster.
If this parameter is not set, batch scheduling does not take effect and common scheduling is used. NOTE:
|
schedulerName |
The default value is volcano. Set this parameter based on your actual requirements. |
Scheduler selected when Ascend Operator enables gang scheduling. |
minAvailable |
The default value is the total number of job replicas. |
Total number of job replicas when Ascend Operator enables gang scheduling and Volcano is used as the scheduler. |
queue |
The default value is default. Set this parameter based on your actual requirements. |
Queue to which a job belongs when Ascend Operator enables gang scheduling and Volcano is used as the scheduler. |
(Optional) successPolicy |
|
Condition under which a job is considered successful. A null value means that the entire job is considered successful as soon as one pod succeeds. The AllWorkers value means that all pods must succeed for the job to be considered successful. |
container.name |
ascend |
The container name must be ascend. |
(Optional) ports |
If you do not set corresponding parameters, the system fills in the following values by default:
|
Collective communication port for distributed training. The value of name can only be ascendjob-port. You can set containerPort as required. If containerPort is not set, the default port 2222 is used. |
replicas |
|
N indicates the number of job replicas. |
image |
- |
Training image name. Change it based on your actual requirements. (It is the name of the image created in the image preparation section.) |
(Optional) host-arch |
ARM environment: huawei-arm; x86_64 environment: huawei-x86 |
Architecture of the node where a training job is executed. Set this parameter as required. In a distributed training job, ensure that the nodes running the training job have the same architecture. |
huawei.com/recover_policy_path |
pod: Only pod-level rescheduling is supported. Rescheduling at the job level is not supported. |
Job rescheduling policy. |
huawei.com/schedule_minAvailable |
Integer |
Minimum number of replicas that can be scheduled by a job. |
huawei.com/schedule_policy |
See Table 3 for its configurations. |
Job's AI processor layout to be scheduled. Volcano selects a proper scheduling policy based on this field. If this parameter is not set, the scheduling policy is selected based on accelerator-type. NOTE:
This field can be used only on the Atlas training product. |
sp-block |
Number of processors on logical SuperPoDs.
|
Cluster scheduling components divide logical SuperPoDs on physical SuperPoDs based on the division policy for affinity scheduling of training jobs. If this field is not specified, Volcano sets the size of the logical SuperPoD of a job to the total number of NPUs configured for the job during scheduling. For details, see UnifiedBus Interconnect Device Network Description. NOTE:
|
tor-affinity |
|
The default value is null, indicating that switch affinity scheduling is not used. You need to set this parameter based on the job type. NOTE:
|
accelerator-type |
The value varies according to the processor type, including:
|
Set this parameter based on the type of the node where a training job is executed. NOTE:
You can run the npu-smi info command and read the number in the chip model name from the Name field of the output. For example, if the number in the Name field is 910, the value of {xxx} is 910. |
requests |
Full NPU scheduling: huawei.com/Ascend910: x. The value of x varies according to the processor type, including:
Static vNPU scheduling: huawei.com/Ascend910-Y: 1. The value is 1. Only the vNPUs of one NPU can be used. Example: huawei.com/Ascend910-6c.1cpu.16g: 1 |
Type and number of requested NPUs or vNPUs. Only one type can be requested. Change the value as required. NOTE:
For details about the value of Y, see the vNPU type column in the mapping between virtual instance templates and virtual device types in Static Virtualization. For example, if the vNPU type is Ascend910-6c.1cpu.16g, the value of Y is 6c.1cpu.16g, excluding Ascend910. For more information about the virtualization template, see "Virtualization Templates" in Virtualization Rules. |
limits |
Type and number of requested NPUs or vNPUs. Only one type can be requested. Change the value as required. The processor name and quantity in limits must be the same as those in requests. |
|
metadata.annotations['huawei.com/AscendXXX'] |
XXX indicates the processor model. The value can be 910, 310, or 310P. The value must be the same as the actual processor type in the environment. |
Ascend Docker Runtime obtains the value of this parameter and mounts NPUs of the corresponding type to a container. NOTE:
This parameter applies only to the full NPU scheduling feature of the Volcano scheduler. If you use static vNPU scheduling or another scheduler, delete the fields of this parameter from the example YAML file. |
hostNetwork |
|
|
super-pod-affinity |
Affinity scheduling policy used by SuperPoD jobs. You need to declare the policy in the label field of the YAML file.
|
NOTE:
This parameter is supported only by the Atlas 900 A3 SuperPoD. |
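The following fragment is a minimal sketch of how several parameters from the table above might appear in one replica of an acjob. It is not a complete or authoritative example. The fixed fields (such as apiVersion and kind) and the enclosing replica definition are omitted, the image name and the NPU count of 8 are hypothetical, and placing host-arch under nodeSelector is an assumption based on its description as a node architecture selector.

```yaml
# Sketch of one replica definition inside an acjob (enclosing fields omitted).
replicas: 1                          # N: number of job replicas (1 is illustrative)
template:
  spec:
    nodeSelector:
      host-arch: huawei-arm          # assumed placement; use huawei-x86 on x86_64 nodes
    containers:
      - name: ascend                 # the container name must be ascend
        image: training-image:tag    # hypothetical image name
        ports:
          - name: ascendjob-port     # the only name allowed for this port
            containerPort: 2222      # default used when containerPort is not set
        resources:
          requests:
            huawei.com/Ascend910: 8  # assumed count; x depends on the processor type
          limits:
            huawei.com/Ascend910: 8  # must match requests in name and quantity
```

The ports block is optional: if it is left out, the defaults described in the table apply.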
YAML Parameters (deploy or vcjob)
The following table describes the YAML parameters that can be used in a deploy job or training vcjob.
Parameter |
Value |
Description |
|---|---|---|
minAvailable |
|
N indicates the number of nodes. This parameter is not required for deploy jobs. You are advised to set this parameter to the same value as replicas. |
replicas |
|
N indicates the number of job replicas. |
image |
- |
Training image name. Change it based on your actual requirements. (It is the name of the image created in the image preparation section.) |
(Optional) host-arch |
ARM environment: huawei-arm; x86_64 environment: huawei-x86 |
Architecture of the node where a training job is executed. Set this parameter as required. In a distributed training job, ensure that the nodes running the training job have the same architecture. |
huawei.com/recover_policy_path |
pod: Only pod-level rescheduling is supported. Rescheduling at the job level is not supported. |
Job rescheduling policy. |
huawei.com/schedule_minAvailable |
Integer |
Minimum number of replicas that can be scheduled by a job. |
huawei.com/schedule_policy |
See Table 3 for its configurations. |
Job's AI processor layout to be scheduled. Volcano selects a proper scheduling policy based on this field. If this parameter is not set, the scheduling policy is selected based on accelerator-type. NOTE:
This field can be used only on the Atlas training product. |
sp-block |
Number of processors on logical SuperPoDs.
|
Cluster scheduling components divide logical SuperPoDs on physical SuperPoDs based on the division policy for affinity scheduling of training jobs. If this field is not specified, Volcano sets the size of the logical SuperPoD of a job to the total number of NPUs configured for the job during scheduling. For details, see UnifiedBus Interconnect Device Network Description. NOTE:
|
tor-affinity |
|
The default value is null, indicating that switch affinity scheduling is not used. You need to set this parameter based on the job type. NOTE:
|
accelerator-type |
The value varies according to the processor type, including:
|
Set this parameter based on the type of the node where a training job is executed. NOTE:
You can run the npu-smi info command and read the number in the chip model name from the Name field of the output. For example, if the number in the Name field is 910, the value of {xxx} is 910. |
requests |
Full NPU scheduling: huawei.com/Ascend910: x. The value of x varies according to the processor type, including:
Static vNPU scheduling: huawei.com/Ascend910-Y: 1. The value is 1. Only the vNPUs of one NPU can be used. Example: huawei.com/Ascend910-6c.1cpu.16g: 1 |
Type and number of requested NPUs or vNPUs. Only one type can be requested. Change the value as required. NOTE:
For details about the value of Y, see the vNPU type column in the mapping between virtual instance templates and virtual device types in Static Virtualization. For example, if the vNPU type is Ascend910-6c.1cpu.16g, the value of Y is 6c.1cpu.16g, excluding Ascend910. For more information about the virtualization template, see "Virtualization Templates" in Virtualization Rules. |
limits |
Type and number of requested NPUs or vNPUs. Only one type can be requested. Change the value as required. The processor name and quantity in limits must be the same as those in requests. |
|
metadata.annotations['huawei.com/AscendXXX'] |
XXX indicates the processor model. The value can be 910, 310, or 310P. The value must be the same as the actual processor type in the environment. |
Ascend Docker Runtime obtains the value of this parameter and mounts NPUs of the corresponding type to a container. NOTE:
This parameter is supported only by Volcano with full NPU scheduling enabled. If you use static vNPU scheduling or another scheduler, delete the fields of this parameter from the example YAML file. |
ring-controller.atlas |
The value varies according to the processor type, including:
|
Type of the processor used by a job. You need to set this parameter both in ConfigMap and task. NOTE:
You can run the npu-smi info command to query the number in the processor model name, which is indicated by the Name field in the returned message. The value of {xxx} is 910. |
super-pod-affinity |
Affinity scheduling policy used by SuperPoD jobs. You need to declare the policy in the label field of the YAML file.
|
NOTE:
This parameter is supported only by the Atlas 900 A3 SuperPoD. |
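As an illustration of the table above, the following is a minimal sketch of a vcjob that requests a static vNPU. It assumes the standard Volcano Job layout (batch.volcano.sh/v1alpha1). The job, task, container, and image names are hypothetical, and placing host-arch under nodeSelector is an assumption. The ring-controller.atlas setting and the related ConfigMap are omitted because their values depend on the processor type.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: vnpu-train-demo               # hypothetical job name
spec:
  schedulerName: volcano
  minAvailable: 1                     # advised to equal replicas
  tasks:
    - name: default-test              # hypothetical task name
      replicas: 1                     # N: number of job replicas
      template:
        spec:
          restartPolicy: Never
          nodeSelector:
            host-arch: huawei-arm     # assumed placement; huawei-x86 on x86_64 nodes
          containers:
            - name: train             # hypothetical container name
              image: training-image:tag   # hypothetical image name
              resources:
                requests:
                  huawei.com/Ascend910-6c.1cpu.16g: 1   # Y = 6c.1cpu.16g; the value is 1
                limits:
                  huawei.com/Ascend910-6c.1cpu.16g: 1   # must match requests
```

Because this sketch uses static vNPU scheduling, it does not include the metadata.annotations['huawei.com/AscendXXX'] field, which applies only to full NPU scheduling with Volcano.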
Table 3 huawei.com/schedule_policy configurations

| Configuration | Description |
|---|---|
| chip4-node8 | One node has eight processors, and four processors form one interconnection ring. Example: the processor layout of the Atlas 800 training server (model 9000) or Atlas 800 training server (model 9010). |
| chip1-node2 | One node has two processors. For example, one Atlas 300T training card carries only one processor, and one node can hold a maximum of two Atlas 300T training cards. |
| chip4-node4 | One node has four processors, and four processors form one interconnection ring. Example: the processor layout of the Atlas 800 training server (model 9000) or Atlas 800 training server (model 9010). |
| chip8-node8 | One node has eight processors, and eight processors form one interconnection ring. Example: the processor layout of the Atlas 800T A2 training server. |
| chip8-node16 | One node has 16 processors, and eight processors form one interconnection ring. Example: the processor layout of the Atlas 200T A2 Box16 heterogeneous subrack. |
| chip2-node16 | One node has 16 processors, and two processors form one interconnection ring. Example: the processor layout of the Atlas 800T A3 SuperPoD Server. |
| chip2-node16-sp | One node has 16 processors, two processors form one interconnection ring, and multiple servers form a SuperPoD. Example: the processor layout of the Atlas 900 A3 SuperPoD. |