YAML Selection

The cluster scheduling components provide various YAML examples. Select the example that matches the component, processor type, and job type in use, and modify it as required before using it.

Resource Information Configuration Using Environment Variables

  • If Atlas A2 training products are used in the current environment, refer to Table 1 to obtain the corresponding YAML example.

    Then, modify and adapt the YAML files of the Atlas 800T A2 training server, Atlas 200T A2 Box16 heterogeneous subrack, and A200T A3 Box8 SuperPoD Server based on the parameter description provided in Table 1.

  • If Atlas training products are used in the current environment, refer to Table 2 to obtain the corresponding YAML example.

    Then, modify the YAML files of the servers (with Atlas 300T training cards) based on the YAML file of the Atlas 800 training server and the parameter description provided in Table 1.

  • If Atlas A3 training products are used in the current environment, refer to Table 3 to obtain the corresponding YAML example.

Table 1 YAML files supported by Atlas A2 training products

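| Job Type | Hardware Model | Training Framework | YAML File Name | Description | How to Obtain |
| --- | --- | --- | --- | --- | --- |
| AscendJob | Atlas 900 A2 PoD cluster basic unit | TensorFlow | tensorflow_multinodes_acjob_{xxx}b.yaml | A two-server two-processor job is presented in the example file by default. | Select the corresponding training framework and obtain the YAML file. |
| AscendJob | Atlas 900 A2 PoD cluster basic unit | PyTorch | pytorch_multinodes_acjob_{xxx}b.yaml | A two-server two-processor job is presented in the example file by default. | Select the corresponding training framework and obtain the YAML file. |
| AscendJob | Atlas 900 A2 PoD cluster basic unit | MindSpore | mindspore_multinodes_acjob_{xxx}b.yaml | A two-server 16-processor job is presented in the example file by default. | Select the corresponding training framework and obtain the YAML file. |
| AscendJob | Atlas 900 A2 PoD cluster basic unit | TensorFlow | tensorflow_standalone_acjob_{xxx}b.yaml | A single-server single-processor job is presented in the example file by default. | Select the corresponding training framework and obtain the YAML file. |
| AscendJob | Atlas 900 A2 PoD cluster basic unit | MindSpore | mindspore_standalone_acjob_{xxx}b.yaml | A single-server single-processor job is presented in the example file by default. | Select the corresponding training framework and obtain the YAML file. |
| AscendJob | Atlas 900 A2 PoD cluster basic unit | PyTorch | pytorch_standalone_acjob_{xxx}b.yaml | A single-server single-processor job is presented in the example file by default. | Select the corresponding training framework and obtain the YAML file. |
| AscendJob | Atlas 900 A2 PoD cluster basic unit | PyTorch | pytorch_multinodes_acjob_{xxx}b_with_ranktable.yaml | A single-server two-processor job is presented in the example file by default. Use Ascend Operator to generate a RankTable file. | Select the corresponding training framework and obtain the YAML file. |

NOTE:

You can run the npu-smi info command to query the number in the processor model, which is indicated by the Name field in the returned message. For example, if the number is 910, the value of {xxx} is 910.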
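The file names in Table 1 follow a regular pattern, so resolving the {xxx} placeholder can be sketched in a few lines. This is only an illustration: the helper names are hypothetical, and the sample Name value "910B3" is an assumed string used for demonstration, not output copied from npu-smi info. The _with_ranktable variant is not covered.

```python
import re

# Frameworks listed in Table 1.
FRAMEWORKS = {"tensorflow", "pytorch", "mindspore"}


def processor_number(name_field: str) -> str:
    """Return the first run of digits in a Name value (hypothetical helper)."""
    match = re.search(r"\d+", name_field)
    if match is None:
        raise ValueError(f"no processor number found in {name_field!r}")
    return match.group(0)


def a2_acjob_yaml(framework: str, multinode: bool, name_field: str) -> str:
    """Build a Table 1 file name such as pytorch_multinodes_acjob_910b.yaml."""
    if framework not in FRAMEWORKS:
        raise ValueError(f"unsupported framework: {framework}")
    layout = "multinodes" if multinode else "standalone"
    return f"{framework}_{layout}_acjob_{processor_number(name_field)}b.yaml"


# "910B3" is an assumed sample value for the Name field.
print(a2_acjob_yaml("pytorch", True, "910B3"))
```

With the assumed Name value, the helper yields pytorch_multinodes_acjob_910b.yaml, which matches the pattern pytorch_multinodes_acjob_{xxx}b.yaml with {xxx} set to 910.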

Table 2 YAML files supported by Atlas training products

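| Job Type | Hardware Model | Training Framework | YAML File Name | Description | How to Obtain |
| --- | --- | --- | --- | --- | --- |
| AscendJob | Atlas 800 training server | TensorFlow | tensorflow_multinodes_acjob.yaml | A two-server eight-processor job is presented in the example file by default. | Select the corresponding training framework and obtain the YAML file. |
| AscendJob | Atlas 800 training server | PyTorch | pytorch_multinodes_acjob.yaml | A two-server 16-processor job is presented in the example file by default. | Select the corresponding training framework and obtain the YAML file. |
| AscendJob | Atlas 800 training server | MindSpore | mindspore_multinodes_acjob.yaml | A two-server eight-processor job is presented in the example file by default. See the NOTE below the table. | Select the corresponding training framework and obtain the YAML file. |
| AscendJob | Atlas 800 training server | TensorFlow | tensorflow_standalone_acjob.yaml | A single-server single-processor job is presented in the example file by default. | Select the corresponding training framework and obtain the YAML file. |
| AscendJob | Atlas 800 training server | PyTorch | pytorch_standalone_acjob.yaml | A single-server single-processor job is presented in the example file by default. | Select the corresponding training framework and obtain the YAML file. |
| AscendJob | Atlas 800 training server | MindSpore | mindspore_standalone_acjob.yaml | A single-server single-processor job is presented in the example file by default. | Select the corresponding training framework and obtain the YAML file. |

NOTE:

To deliver a single-server eight-processor MindSpore job, change the value of minAvailable in mindspore_multinodes_acjob.yaml to 2 and the value of replicas in Worker to 1.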
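The change that the NOTE describes for mindspore_multinodes_acjob.yaml (minAvailable set to 2, replicas in Worker set to 1) can be sketched on an in-memory copy of the job. Treat the following as an assumption-laden illustration: the key paths spec.runPolicy.schedulingPolicy.minAvailable and spec.replicaSpecs.Worker.replicas, and the starting values in the sample dictionary, are assumed here and should be checked against the actual example file.

```python
import copy


def to_single_server(job: dict) -> dict:
    """Return a copy of the job patched as the NOTE describes.

    The nested key paths are assumptions about the AscendJob layout.
    """
    patched = copy.deepcopy(job)
    patched["spec"]["runPolicy"]["schedulingPolicy"]["minAvailable"] = 2
    patched["spec"]["replicaSpecs"]["Worker"]["replicas"] = 1
    return patched


# Placeholder structure and values, not copied from the real example file.
example_job = {
    "kind": "AscendJob",
    "spec": {
        "runPolicy": {"schedulingPolicy": {"minAvailable": 3}},
        "replicaSpecs": {
            "Scheduler": {"replicas": 1},
            "Worker": {"replicas": 2},
        },
    },
}

single = to_single_server(example_job)
print(single["spec"]["runPolicy"]["schedulingPolicy"]["minAvailable"])
print(single["spec"]["replicaSpecs"]["Worker"]["replicas"])
```

The function works on a deep copy, so the loaded example stays unchanged and the two variants can be compared before either is submitted.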

Table 3 YAML files supported by Atlas A3 training products

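| Job Type | Hardware Model | Training Framework | YAML File Name | Description | How to Obtain |
| --- | --- | --- | --- | --- | --- |
| AscendJob | Atlas 900 A3 SuperPoD | TensorFlow | tensorflow_standalone_acjob_super_pod.yaml | A single-server single-processor job is presented in the example file by default. | Select the corresponding training framework and obtain the YAML file. |
| AscendJob | Atlas 900 A3 SuperPoD | PyTorch | pytorch_standalone_acjob_super_pod.yaml | A single-server 16-processor job is presented in the example file by default. | Select the corresponding training framework and obtain the YAML file. |
| AscendJob | Atlas 900 A3 SuperPoD | MindSpore | mindspore_standalone_acjob_super_pod.yaml | A two-server 16-processor job is presented in the example file by default. | Select the corresponding training framework and obtain the YAML file. |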

Resource Information Configuration Using Configuration Files

  • If Atlas A2 training products are used in the current environment, refer to Table 4 to obtain the corresponding YAML example.

    Then, modify the YAML files of the Atlas 800T A2 training server, Atlas 200T A2 Box16 heterogeneous subrack, and A200T A3 Box8 SuperPoD Server based on the parameter description provided in Table 2.

  • If Atlas training products are used in the current environment, refer to Table 5 to obtain the corresponding YAML example.

Table 4 YAML files supported by Atlas A2 training products

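| Job Type | Hardware Model | Training Framework | YAML File Name | Description | How to Obtain |
| --- | --- | --- | --- | --- | --- |
| VolcanoJob | Atlas 900 A2 PoD cluster basic unit | TensorFlow | a800_tensorflow_vcjob.yaml | A single-server 16-processor job is presented in the example file by default. | YAML |
| VolcanoJob | Atlas 900 A2 PoD cluster basic unit | PyTorch | a800_pytorch_vcjob.yaml | A single-server 16-processor job is presented in the example file by default. | YAML |
| VolcanoJob | Atlas 900 A2 PoD cluster basic unit | MindSpore | a800_mindspore_vcjob.yaml | A single-server 16-processor job is presented in the example file by default. | YAML |
| Deployment | Atlas 900 A2 PoD cluster basic unit | TensorFlow | a800_tensorflow_deployment.yaml | A single-server 16-processor job is presented in the example file by default. | YAML |
| Deployment | Atlas 900 A2 PoD cluster basic unit | PyTorch | a800_pytorch_deployment.yaml | A single-server 16-processor job is presented in the example file by default. | YAML |
| Deployment | Atlas 900 A2 PoD cluster basic unit | MindSpore | a800_mindspore_deployment.yaml | A single-server 16-processor job is presented in the example file by default. | YAML |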

Table 5 YAML files supported by Atlas training products

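| Job Type | Hardware Model | Training Framework | YAML File Name | Description | How to Obtain |
| --- | --- | --- | --- | --- | --- |
| VolcanoJob | Atlas 800 training server | TensorFlow | a800_tensorflow_vcjob.yaml | A single-server eight-processor job is presented in the example file by default. | YAML |
| VolcanoJob | Atlas 800 training server | PyTorch | a800_pytorch_vcjob.yaml | A single-server eight-processor job is presented in the example file by default. | YAML |
| VolcanoJob | Atlas 800 training server | MindSpore | a800_mindspore_vcjob.yaml | A single-server eight-processor job is presented in the example file by default. | YAML |
| VolcanoJob | Server (with Atlas 300T training cards) | TensorFlow | a300t_tensorflow_vcjob.yaml | A single-server single-processor job is presented in the example file by default. | YAML |
| VolcanoJob | Server (with Atlas 300T training cards) | PyTorch | a300t_pytorch_vcjob.yaml | A single-server single-processor job is presented in the example file by default. | YAML |
| VolcanoJob | Server (with Atlas 300T training cards) | MindSpore | a300t_mindspore_vcjob.yaml | A single-server single-processor job is presented in the example file by default. | YAML |
| Deployment | Atlas 800 training server | TensorFlow | a800_tensorflow_deployment.yaml | A single-server eight-processor job is presented in the example file by default. | YAML |
| Deployment | Atlas 800 training server | PyTorch | a800_pytorch_deployment.yaml | A single-server eight-processor job is presented in the example file by default. | YAML |
| Deployment | Atlas 800 training server | MindSpore | a800_mindspore_deployment.yaml | A single-server eight-processor job is presented in the example file by default. | YAML |
| Deployment | Server (with Atlas 300T training cards) | TensorFlow | a300t_tensorflow_deployment.yaml | A single-server single-processor job is presented in the example file by default. | YAML |
| Deployment | Server (with Atlas 300T training cards) | PyTorch | a300t_pytorch_deployment.yaml | A single-server eight-processor job is presented in the example file by default. | YAML |
| Deployment | Server (with Atlas 300T training cards) | MindSpore | a300t_mindspore_deployment.yaml | A single-server single-processor job is presented in the example file by default. | YAML |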
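The file names in Table 5 combine a hardware prefix, a framework name, and a job-type suffix, so the selection can be sketched as a small lookup. The function and dictionary names below are hypothetical; only the file-name components come from the table.

```python
# Hardware model -> file-name prefix, as listed in Table 5.
HARDWARE_PREFIX = {
    "Atlas 800 training server": "a800",
    "Server (with Atlas 300T training cards)": "a300t",
}

# Job type -> file-name suffix, as listed in Table 5.
JOB_SUFFIX = {"VolcanoJob": "vcjob", "Deployment": "deployment"}

FRAMEWORKS = {"tensorflow", "pytorch", "mindspore"}


def table5_yaml(hardware: str, framework: str, job_type: str) -> str:
    """Return the Table 5 YAML file name for the given combination."""
    if framework not in FRAMEWORKS:
        raise ValueError(f"unsupported framework: {framework}")
    prefix = HARDWARE_PREFIX[hardware]   # KeyError for unknown hardware
    suffix = JOB_SUFFIX[job_type]        # KeyError for unknown job type
    return f"{prefix}_{framework}_{suffix}.yaml"


print(table5_yaml("Atlas 800 training server", "pytorch", "VolcanoJob"))
```

Keeping the mapping in plain dictionaries means an unsupported hardware model or job type fails loudly with a KeyError instead of producing a file name that does not exist.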