Parameters in the YAML File
This section describes the parameters used in the YAML configuration example for elastic training. Review them before you perform the related operations.
| Parameter | Value | Description |
|---|---|---|
| minAvailable | | N indicates the number of nodes. This parameter is not required for deploy jobs. You are advised to set it to the same value as replicas. |
| replicas | | N indicates the number of job replicas. |
| maxRetry | 0 | Number of pod restarts. Pod restart must be disabled for elastic training, so set this parameter to 0. |
| minReplicas | 1 | Minimum number of replicas. Set this parameter to the minimum number of nodes required by a job. |
| fault-scheduling | grace | Enables graceful deletion mode for the job: the original pod is deleted gracefully during the process. If the deletion has still not succeeded after 15 minutes, the original pod is forcibly deleted. |
| | force | This option is not currently supported. NOTE: Only the grace mode is supported. |
| | off | |
| | None (no fault-scheduling field) | |
| | Other values | |
| elastic-scheduling | on | Enables elastic training. |
| image | - | Training image name. Change it based on your actual requirements. This is the name of the image created or obtained in the training image preparation section. |
| (Optional) host-arch | ARM environment: huawei-arm; x86_64 environment: huawei-x86 | Architecture of the node where a training job is executed. Set this parameter as required. In a distributed training job, ensure that all nodes running the job have the same architecture. |
| accelerator-type | The value varies according to the processor type. Atlas 800 training server (fully populated with NPUs): module | - |
| huawei.com/Ascend910 | The value varies according to the processor type. Atlas 800 training server (fully populated with NPUs): | Number of requested NPUs. Set this parameter as required. A vNPU cannot be requested when an entire NPU is requested. |
| ring-controller.atlas | Atlas 800 training server (fully populated with NPUs): ascend-910 | Type of the processor used by a job. Set this parameter in both the ConfigMap and the job task. |
| metadata.annotations['huawei.com/AscendXXX'] | XXX indicates the processor model: 910, 310, or 310P. The value must match the actual processor type in the environment. | Ascend Docker Runtime reads this value and mounts NPUs of the corresponding type into the container. |
| super-pod-affinity | Affinity scheduling policy used by SuperPoD jobs. Declare the policy in the label field of the YAML file. | NOTE: This parameter is supported only by the Atlas 900 A3 SuperPoD. |
The number of replicas of a new job ranges from the value of minReplicas to the value of replicas. The actual number is determined by the number of nodes available in the cluster at that time. This parameter is valid only for multi-node distributed training.
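
The following fragment is a minimal sketch of how these parameters might fit together in a job definition. It assumes a Volcano Job (`batch.volcano.sh/v1alpha1`) layout. The job name, namespace, task name, container name, image name, replica count, and NPU count are hypothetical placeholders. The exact location of each field, in particular minReplicas and the environment variable that carries the annotation, is an assumption and should be checked against the YAML example shipped with your installation.

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: elastic-train-demo            # hypothetical job name
  namespace: vcjob                    # hypothetical namespace
  labels:
    ring-controller.atlas: ascend-910 # Atlas 800 training server, fully populated
    fault-scheduling: "grace"         # only the grace mode is supported
    elastic-scheduling: "on"          # quoted: a bare `on` is a boolean in YAML 1.1
    minReplicas: "1"                  # assumption: shown as a label; location may differ
spec:
  minAvailable: 2                     # illustrative; advised to equal replicas
  schedulerName: volcano
  maxRetry: 0                         # pod restart must be disabled for elastic training
  tasks:
    - name: default-test              # hypothetical task name
      replicas: 2                     # illustrative number of job replicas
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-910  # also set on the job task
        spec:
          nodeSelector:
            host-arch: huawei-arm     # or huawei-x86; same on every node of the job
            accelerator-type: module
          containers:
            - name: trainer           # hypothetical container name
              image: my-training-image:latest  # hypothetical; use your own image
              env:
                - name: ASCEND_VISIBLE_DEVICES # assumption: variable name may differ
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['huawei.com/Ascend910']
              resources:
                requests:
                  huawei.com/Ascend910: 8  # illustrative NPU count, not from the table
                limits:
                  huawei.com/Ascend910: 8
          restartPolicy: Never
```

With values like these, a new job would start with between 1 (minReplicas) and 2 (replicas) replicas, depending on how many nodes are available. The ConfigMap that must also carry ring-controller.atlas is not shown here.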