Atlas 900 A3 SuperPoD
The Atlas 900 A3 SuperPoD is a high-performance AI computing cluster developed by Huawei, consisting of multiple compute nodes. On each compute node, two Ascend AI Processors (for example, Ascend AI Processor 0 and Ascend AI Processor 1) are connected through SIO to form a HiAM module. Each compute node contains eight HiAM modules, that is, 16 Ascend AI Processors. The HiAM modules are connected in HCCS-L1 mode, and compute nodes are connected in HCCS-L2 mode. SuperPoDs of multiple specifications can be built through L1 port cascading and L2 switching interconnection.
The number of Ascend AI Processors that can be allocated to a job is 1, 2, 4, 6, 8, 10, 12, 14, or 16. The allocated Ascend AI Processors preferentially occupy an entire compute node. If an even number of Ascend AI Processors is allocated, they must occupy entire HiAM modules. For example, if the number of Ascend AI Processors allocated to a job is 2 and the remaining Ascend AI Processor IDs of a compute node are 0, 2, 3, and 4, the job can use only Ascend AI Processors 2 and 3, because they are the only free pair in the same HiAM module. The number of Ascend AI Processors that can be allocated to a distributed job is 2, 4, 6, 8, 10, 12, 14, or 16. For a logical SuperPoD affinity job, that is, a job whose sp-block field in the job YAML file is set to the logical SuperPoD size, the number of Ascend AI Processors that can be allocated is 16.
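The allocation rule above can be illustrated with a minimal sketch. It is not the scheduler's actual implementation. The function name pick_processors is hypothetical, and the sketch assumes that HiAM module m holds processors 2m and 2m+1, which matches the 0/1 and 2/3 pairings mentioned above. For a single-processor job, returning the lowest free ID is an arbitrary choice made for illustration.

```python
# Hypothetical helper for illustration only; not part of any Huawei tool.
ALLOWED_COUNTS = {1, 2, 4, 6, 8, 10, 12, 14, 16}
MODULES_PER_NODE = 8  # eight HiAM modules per compute node


def pick_processors(free_ids, count):
    """Return processor IDs for a job on one node, or None if it does not fit."""
    if count not in ALLOWED_COUNTS:
        raise ValueError(f"unsupported processor count: {count}")
    free = set(free_ids)
    if count == 1:
        # Arbitrary choice for the sketch: take the lowest free processor.
        return [min(free)] if free else None
    # Assumption: module m holds processors 2*m and 2*m + 1.
    whole_modules = [
        m for m in range(MODULES_PER_NODE)
        if 2 * m in free and 2 * m + 1 in free
    ]
    needed = count // 2
    if len(whole_modules) < needed:
        return None
    # An even-sized job only ever receives complete HiAM modules.
    return [p for m in whole_modules[:needed] for p in (2 * m, 2 * m + 1)]


print(pick_processors([0, 2, 3, 4], 2))  # [2, 3]
print(pick_processors([0, 2, 3, 4], 4))  # None
```

With free processors 0, 2, 3, and 4, only module 1 (processors 2 and 3) is completely free, so a 2-processor job gets that pair and a 4-processor job cannot be placed on this node.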
UnifiedBus Interconnect Device Network Description
- Compute nodes in the same logical SuperPoD communicate with each other through HCCS, and compute nodes in different logical SuperPoDs communicate with each other through RoCE. If the number of logical SuperPoDs of a job (Number of logical SuperPoDs of a job = Total number of processors of a job/sp-block) is greater than 1, ensure that the RoCE network connectivity between compute nodes is normal.
- Assume that the number of processors on a compute node is 16, the total number of processors of a job is 64, and sp-block is 32. The job is then divided into two logical SuperPoDs (64/32 = 2), each spanning two compute nodes: pod (rank=0) and pod (rank=1) form one logical SuperPoD, and pod (rank=2) and pod (rank=3) form the other.
- In this case, pod (rank=0) and pod (rank=1) communicate with each other over the HCCS network, and pod (rank=2) and pod (rank=3) also communicate with each other over the HCCS network. However, pod (rank=0/1) and pod (rank=2/3) communicate with each other over the RoCE network.
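
The grouping described in the list above can be sketched as follows. This is an illustrative sketch, not the scheduler's code. All function names are hypothetical, and it assumes that each pod occupies one full compute node and that pod ranks are assigned to logical SuperPoDs in contiguous order, as in the example.

```python
# Hypothetical helpers for illustration only.
def num_logical_superpods(total_processors, sp_block):
    """Number of logical SuperPoDs = total processors of the job / sp-block."""
    if sp_block <= 0 or total_processors % sp_block != 0:
        raise ValueError("total_processors must be a positive multiple of sp_block")
    return total_processors // sp_block


def logical_superpod_of(rank, processors_per_node, sp_block):
    """Index of the logical SuperPoD that holds the pod with this rank.

    Assumption: ranks fill logical SuperPoDs contiguously, one pod per node.
    """
    return (rank * processors_per_node) // sp_block


def transport(rank_a, rank_b, processors_per_node, sp_block):
    """HCCS inside one logical SuperPoD, RoCE between different ones."""
    same = (logical_superpod_of(rank_a, processors_per_node, sp_block)
            == logical_superpod_of(rank_b, processors_per_node, sp_block))
    return "HCCS" if same else "RoCE"


print(num_logical_superpods(64, 32))                          # 2
print([logical_superpod_of(r, 16, 32) for r in range(4)])     # [0, 0, 1, 1]
print(transport(0, 1, 16, 32), transport(1, 2, 16, 32))       # HCCS RoCE
```

Because the example job spans more than one logical SuperPoD, at least one pair of pods uses RoCE, which is why RoCE connectivity between compute nodes must be verified.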
