Model Training Job Description

When a scheduler other than Volcano is used, training jobs are subject to the following restrictions, which vary by server type. When Volcano is used as the scheduler, these restrictions are satisfied during task scheduling:

Table 1 Training job usage description

Product Name

Training Scenario

Instruction

Atlas 800 training server (fully populated with NPUs)

Single-node scenario

Number of NPUs that can be allocated: 1, 2, 4, or 8.

If two or four NPUs are allocated, the affinity rules require all of them to be in the same area of the same server (NPUs 0 to 3 form one area and NPUs 4 to 7 form the other).

For example, if two NPUs are allocated for training, both must be in area 1 (NPUs 0 to 3) or both in area 2 (NPUs 4 to 7) of the same server. They cannot be split between area 1 and area 2.
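The single-node rule above can be sketched as a small check. This is an illustrative helper only, not part of Volcano or any Ascend tooling; the names `ALLOWED_COUNTS`, `AREA_SIZE`, and `is_valid_single_node` are hypothetical, and NPUs are assumed to be identified by the integers 0 to 7.

```python
# Hypothetical helper for illustration; not a Volcano or Ascend API.
ALLOWED_COUNTS = {1, 2, 4, 8}  # from Table 1, fully populated Atlas 800
AREA_SIZE = 4                  # NPUs 0-3 form one area, NPUs 4-7 the other


def is_valid_single_node(npu_ids: list[int]) -> bool:
    """Check an NPU selection against the single-node rules in Table 1."""
    ids = set(npu_ids)
    # Reject duplicates and IDs outside the assumed 0-7 range.
    if len(ids) != len(npu_ids) or any(i not in range(8) for i in ids):
        return False
    if len(ids) not in ALLOWED_COUNTS:
        return False
    if len(ids) in (2, 4):
        # Two or four NPUs must all sit in the same area.
        return len({i // AREA_SIZE for i in ids}) == 1
    # One NPU is trivially in one area; eight NPUs use the whole server.
    return True
```

For instance, `is_valid_single_node([0, 1])` passes, while `is_valid_single_node([3, 4])` fails because NPU 3 and NPU 4 belong to different areas.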

Distributed scenario

Number of NPUs that can be allocated: 1N, 2N, 4N, or 8N.

N indicates the number of nodes. The NPU scheduling restrictions of each node are the same as those in the single-node scenario.
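As a rough sketch of the arithmetic, the totals 1N, 2N, 4N, and 8N can be computed and checked as follows. The function names are hypothetical, and the code assumes that every node receives the same number of NPUs, which is how the kN notation is read here.

```python
# Hypothetical helpers for illustration; not a Volcano or Ascend API.
PER_NODE_COUNTS = (1, 2, 4, 8)  # from Table 1, fully populated Atlas 800


def valid_totals(num_nodes: int) -> list[int]:
    """Return the total NPU counts allowed for a job spanning num_nodes nodes."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    return [k * num_nodes for k in PER_NODE_COUNTS]


def per_node_share(total_npus: int, num_nodes: int) -> int:
    """Return the NPUs per node, assuming an equal share on every node."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    share, remainder = divmod(total_npus, num_nodes)
    if remainder or share not in PER_NODE_COUNTS:
        raise ValueError(
            f"{total_npus} NPUs on {num_nodes} nodes is not 1N, 2N, 4N, or 8N"
        )
    return share
```

With three nodes, `valid_totals(3)` yields 3, 6, 12, and 24. A request for 12 NPUs on three nodes works out to four NPUs per node, and those four must then satisfy the single-node area rule on each node.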

Atlas 800 training server (half populated with NPUs)

Single-node scenario

Number of NPUs that can be allocated: 1, 2, or 4.

Distributed scenario

Number of NPUs that can be allocated: 1N, 2N, or 4N. N indicates the number of nodes.

Atlas 200T A2 Box16 heterogeneous subrack

Single-node scenario

Number of NPUs that can be allocated: 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, or 16.

  • If fewer than 8 NPUs are allocated, the affinity rules require all of them to be in the same area of the same server (NPUs 0 to 7 form one area and NPUs 8 to 15 form the other). For example, if two NPUs are allocated for training, both must be in area 1 (NPUs 0 to 7) or both in area 2 (NPUs 8 to 15) of the same server. They cannot be split between area 1 and area 2.
  • If 10, 12, or 14 NPUs are allocated, they must be split evenly between the two areas, and their physical addresses must be the same.
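The two rules above can be sketched as a check on a set of NPU IDs. This is an illustrative helper with hypothetical names, not a Volcano or Ascend API. It assumes the 16 NPUs are numbered 0 to 15, so that the second area is NPUs 8 to 15. It does not model the physical-address condition, because the table does not define it precisely, and it applies no area rule to counts of 8 and 16, because the table states none.

```python
# Hypothetical helper for illustration; not a Volcano or Ascend API.
ALLOWED_COUNTS = {1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16}  # from Table 1
AREA_SIZE = 8  # assumed numbering: NPUs 0-7 in area 1, NPUs 8-15 in area 2


def is_valid_box16(npu_ids: list[int]) -> bool:
    """Check an NPU selection against the count and area rules in Table 1."""
    ids = set(npu_ids)
    # Reject duplicates and IDs outside the assumed 0-15 range.
    if len(ids) != len(npu_ids) or any(i not in range(16) for i in ids):
        return False
    count = len(ids)
    if count not in ALLOWED_COUNTS:
        return False
    in_area1 = sum(1 for i in ids if i < AREA_SIZE)
    in_area2 = count - in_area1
    if count < 8:
        # Fewer than 8 NPUs: all of them in a single area.
        return in_area1 == 0 or in_area2 == 0
    if count in (10, 12, 14):
        # 10, 12, or 14 NPUs: split evenly between the two areas.
        return in_area1 == in_area2
    # 8 or 16 NPUs: no area rule is stated in the table.
    return True
```

For example, five NPUs from each area pass the check for 10 NPUs, whereas six from one area and four from the other do not.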

Distributed scenario

Number of NPUs that can be allocated: 1N, 2N, 3N, 4N, 5N, 6N, 7N, 8N, 10N, 12N, 14N, or 16N.

  • N indicates the number of nodes. The NPU scheduling restrictions of each node are the same as those in the single-node scenario.
  • If 10N, 12N, or 14N NPUs are allocated, the required NPUs must be split evenly between the two areas, but their physical addresses can be different.

Atlas 800T A2 training server or Atlas 900 A2 PoD cluster basic unit

Single-node scenario

Number of NPUs that can be allocated: 1, 2, 3, 4, 5, 6, 7, or 8.

Distributed scenario

Number of NPUs that can be allocated: 1N, 2N, 3N, 4N, 5N, 6N, 7N, 8N, or 16N. N indicates the number of nodes.

Atlas 900 A3 SuperPoD

Single-node scenario

Number of NPUs that can be allocated: 1, 2, 4, 6, 8, 10, 12, 14, or 16.

Distributed scenario

Number of NPUs that can be allocated: 2, 4, 6, 8, 10, 12, 14, or 16. If the task is a logical SuperPoD affinity task (that is, the sp-block field in the task YAML file is set to the logical SuperPoD size), the number of NPUs that can be allocated is 16.

Note:

For pods that do not use NPUs, there is no requirement on the NPU quantity.