Parameters in the YAML File

This section provides a YAML configuration example for elastic training. Before you perform any operation, familiarize yourself with the parameters used in the YAML examples.

Table 1 Parameter description

Parameter

Value

Description

minAvailable

  • Single-server: 1
  • Distributed: N

N indicates the number of nodes. Deployment jobs do not need this parameter. You are advised to set it to the same value as replicas.

replicas

  • Single-server: 1
  • Distributed: N

N indicates the number of job replicas.

maxRetry

0

Number of pod restarts. Elastic training requires pod restart to be disabled, so set this parameter to 0.

minReplicas

1

Minimum number of replicas. Set this parameter to the minimum number of nodes required by a job.

fault-scheduling

grace

Enable graceful deletion mode for the job, so that the original pod is deleted gracefully during the process. If the failure persists after 15 minutes, the original pod is forcibly deleted.

force

This option is not supported currently.

NOTE:

Only the grace mode is supported.

off

None (no fault-scheduling field)

Other values

elastic-scheduling

on

Enable elastic training.

image

-

Training image name. Change it based on your actual requirements. (It is the name of the image created or obtained in the training image preparation section.)

(Optional) host-arch

ARM environment: huawei-arm

x86_64 environment: huawei-x86

Architecture of the node where a training job is executed. Set this parameter as required.

In a distributed training job, ensure that the nodes running the training job have the same architecture.

accelerator-type

The value varies according to the processor type, including:

Atlas 800 training server (fully populated with NPUs): module

-

huawei.com/Ascend910

The value varies according to the processor type, including:

Atlas 800 training server (fully populated with NPUs):
  • Single-server single-processor: 1
  • Single-server multi-processor: 2, 4, 8
  • Distributed: 1, 2, 4, 8

Number of requested NPUs. Set this parameter as required. A vNPU cannot be requested when an entire NPU is requested.

ring-controller.atlas

For an Atlas 800 training server (fully populated with NPUs), the value is ascend-910.

Type of the processor used by a job. Set this parameter in both the ConfigMap and the job task.

metadata.annotations['huawei.com/AscendXXX']

XXX indicates the processor model. The value can be 910, 310, or 310P. The value must be the same as the actual processor type in the environment.

Ascend Docker Runtime obtains the value of this parameter and mounts NPUs of the corresponding type to a container.

super-pod-affinity

Affinity scheduling policy used by SuperPoD jobs. You need to declare the policy in the label field of the YAML file.

  • soft: If the cluster resources do not meet the SuperPoD affinity requirements, the job is scheduled onto fragmented resources in the cluster.
  • hard: If the cluster resources do not meet the SuperPoD affinity requirements, the job enters the Pending state and waits for resources.
  • Other values or no value: The SuperPoD affinity scheduling is forcibly used.
NOTE:

This parameter is supported only by the Atlas 900 A3 SuperPoD.

The number of replicas of a new job ranges from the value of minReplicas to the value of replicas. The actual number depends on how many nodes are available in the cluster at that time. This parameter is valid only for multi-node distributed training.
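
The parameters in Table 1 could be combined as in the following minimal sketch of a two-node elastic training job on Atlas 800 training servers. It is an illustration only, not the reference configuration. It assumes a Volcano Job (batch.volcano.sh/v1alpha1) as the job type. The job name, task name, container name, and image value are hypothetical placeholders. Placing minReplicas under annotations, placing host-arch and accelerator-type under nodeSelector, and using the ASCEND_VISIBLE_DEVICES environment variable are assumptions that Table 1 does not confirm. Check them against the sample YAML shipped with your environment.

```yaml
# Hypothetical sketch: two-node elastic training job.
# Field placement is assumed where Table 1 does not state it.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: elastic-train-demo                  # hypothetical job name
  labels:
    ring-controller.atlas: ascend-910       # Atlas 800 training server
    fault-scheduling: "grace"               # only the grace mode is supported
    elastic-scheduling: "on"                # enable elastic training
  annotations:
    minReplicas: "1"                        # assumed location; minimum nodes the job needs
spec:
  schedulerName: volcano
  minAvailable: 2                           # advised to equal replicas
  maxRetry: 0                               # pod restart must be disabled for elastic training
  tasks:
    - name: default-test                    # hypothetical task name
      replicas: 2                           # distributed: N replicas
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-910   # must also be set on the job task
        spec:
          nodeSelector:                     # assumed location of both keys
            host-arch: huawei-arm           # use huawei-x86 on x86_64 nodes
            accelerator-type: module        # Atlas 800, fully populated with NPUs
          containers:
            - name: train                   # hypothetical container name
              image: <training-image>:<tag> # placeholder; use your prepared training image
              env:
                - name: ASCEND_VISIBLE_DEVICES   # assumed variable name
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['huawei.com/Ascend910']
              resources:
                requests:
                  huawei.com/Ascend910: 8   # distributed: 1, 2, 4, or 8
                limits:
                  huawei.com/Ascend910: 8
          restartPolicy: Never
```

With these values, a newly created job would run with between 1 (minReplicas) and 2 (replicas) replicas, depending on how many nodes are available.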