Precautions

Review the following precautions before you start distributed training.

Before starting distributed training across multiple processes, you need to configure the resource information of Ascend AI Processors that participate in the distributed training.

Currently, resource information can be configured through either a configuration file or environment variables. Choose one method; the two cannot be used together.
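
The rule that the two configuration methods are mutually exclusive can be checked before a job is launched. The following is a minimal sketch, not an official tool. The variable names RANK_TABLE_FILE and CM_CHIEF_IP/CM_CHIEF_PORT/CM_CHIEF_DEVICE/CM_WORKER_SIZE are assumptions used for illustration; substitute the names that your framework adapter documents.

```python
import os

# Assumption: this variable points at the resource configuration file.
FILE_MODE_VAR = "RANK_TABLE_FILE"
# Assumption: these variables carry the resource information in environment-variable mode.
ENV_MODE_VARS = ("CM_CHIEF_IP", "CM_CHIEF_PORT", "CM_CHIEF_DEVICE", "CM_WORKER_SIZE")


def resource_config_mode(env=None):
    """Return "file", "env", or "none"; raise if both methods are set at once."""
    env = os.environ if env is None else env
    uses_file = bool(env.get(FILE_MODE_VAR))
    uses_env = any(env.get(name) for name in ENV_MODE_VARS)
    if uses_file and uses_env:
        raise ValueError(
            "Resource information is set through both a configuration file "
            "and environment variables; use only one method."
        )
    if uses_file:
        return "file"
    if uses_env:
        return "env"
    return "none"
```

Calling resource_config_mode() with no argument inspects the environment of the current process, so it can run as the first step of a launch script.
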
Pay attention to the following constraints on the devices that participate in training:
  1. Atlas training products: In single-server scenarios, the number of Ascend AI Processors that participate in collective communication can be 1, 2, 4, or 8. Devices 0 to 3 and devices 4 to 7 form two separate networks, so when two or four devices are used for training, all of them must belong to the same network; a cluster cannot span both networks. In server cluster scenarios, the number of Ascend AI Processors that participate in collective communication must be 1 × n, 2 × n, 4 × n, or 8 × n, where n is the number of servers participating in training. Cluster performance is best when n is a power of 2, so this configuration is recommended for cluster networking.
  2. Atlas A2 training products/Atlas A2 inference products: In single-server scenarios, the number of Ascend AI Processors that participate in collective communication is not limited. In server cluster scenarios, the number of Ascend AI Processors that participate in collective communication must be (1 to 8) × n, where n is the number of servers participating in training. It is recommended that every server contribute the same number of Ascend AI Processors to collective communication; otherwise, performance deteriorates.
  3. Atlas A3 training products/Atlas A3 inference products: It is recommended that every supernode have the same number of servers and every server have the same number of Ascend AI Processors; otherwise, performance deteriorates.
  4. One device corresponds to one training process. Running multiple training processes on a single device is not supported.
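
The device-count rules above can be expressed as a small pre-launch check. The following is an illustrative sketch only; the function names are hypothetical, and it assumes that devices are identified by the IDs 0 to 7 used in the list above.

```python
def check_single_server(device_ids):
    """Atlas training products, single server: return a list of rule violations."""
    problems = []
    if len(set(device_ids)) != len(device_ids):
        problems.append("a device is listed twice; one device runs one training process")
    count = len(set(device_ids))
    if count not in (1, 2, 4, 8):
        problems.append(f"{count} devices requested; allowed counts are 1, 2, 4, or 8")
    # Devices 0-3 and 4-7 are separate networks; 2- and 4-device jobs must stay in one.
    if count in (2, 4) and len({d // 4 for d in device_ids}) > 1:
        problems.append("devices span both networks (0-3 and 4-7)")
    return problems


def check_cluster_total(total_devices, n_servers):
    """Atlas training products, cluster: total must be 1n, 2n, 4n, or 8n.

    Returns (valid, recommended); recommended is True when n is a power of 2.
    """
    valid = n_servers > 0 and total_devices in {k * n_servers for k in (1, 2, 4, 8)}
    power_of_two = n_servers > 0 and n_servers & (n_servers - 1) == 0
    return valid, power_of_two


def check_a2_cluster(counts_per_server):
    """Atlas A2 products, cluster: each server uses 1 to 8 devices.

    Returns (in_range, balanced); an unbalanced layout is allowed but slower.
    """
    in_range = all(1 <= c <= 8 for c in counts_per_server)
    balanced = len(set(counts_per_server)) <= 1
    return in_range, balanced
```

For example, check_single_server([3, 4]) reports a violation because devices 3 and 4 belong to different networks, whereas check_single_server([0, 1]) returns an empty list.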