Precautions

Read the following precautions before performing distributed training.

Before starting distributed training across multiple processes, you need to configure the resource information of Ascend AI Processors that participate in the distributed training.

Currently, resource information can be configured through either a configuration file or environment variables. Choose one method; the two cannot be combined.
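As an illustrative sketch of the two mutually exclusive configuration methods, the snippet below shows a typical shell setup. The variable names (RANK_TABLE_FILE, RANK_SIZE, RANK_ID, DEVICE_ID) and the file path are assumptions based on common Ascend training setups; consult your product's environment variable reference for the exact names.

```shell
# Option A (assumed name): point to a rank table configuration file
# describing the devices that participate in training.
export RANK_TABLE_FILE=/path/to/rank_table.json   # hypothetical path

# Option B (assumed names): configure the same information purely through
# environment variables. Do NOT set these together with Option A.
# export RANK_SIZE=8     # total number of devices in the cluster
# export RANK_ID=0       # rank of this training process
# export DEVICE_ID=0     # device used by this training process

echo "configured via: ${RANK_TABLE_FILE:+rank table file}"
```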
Before performing distributed training, pay attention to the following points:
  1. Atlas Training Series products: In single-server scenarios, the number of Ascend AI Processors participating in collective communication can only be 1, 2, 4, or 8. In addition, devices 0 to 3 and devices 4 to 7 form two separate networks; when two or four devices are used for training, they must all belong to the same network, because clusters cannot be created across the two networks. In server cluster scenarios, the number of Ascend AI Processors participating in collective communication can only be 1 x n, 2 x n, 4 x n, or 8 x n, where n is the number of servers participating in training. Cluster performance is best when n is a power of 2, so this networking mode is recommended.
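The single-network constraint above can be checked before launching training. The sketch below is a hypothetical helper (the device ID list is an assumption, not a real configuration source) that verifies a chosen set of device IDs stays within one network, i.e. entirely inside 0-3 or entirely inside 4-7.

```shell
# Hypothetical device IDs chosen for a 4-device training job.
devices="4 5 6 7"

# low/high record whether any chosen device falls in network 0-3 or 4-7.
low=0; high=0
for d in $devices; do
  if [ "$d" -le 3 ]; then low=1; else high=1; fi
done

# Using devices from both networks at once is invalid for 2- or 4-device jobs.
if [ "$low" -eq 1 ] && [ "$high" -eq 1 ]; then
  echo "invalid: devices span both networks"
else
  echo "ok: devices stay within one network"   # → printed for 4 5 6 7
fi
```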
  2. Each device corresponds to one training process. Running multiple training processes on a single device is not supported.
  3. In distributed training scenarios, HCCL uses certain ports on the host server to collect cluster information, and these ports must be reserved in the operating system. By default, HCCL uses ports 60000 to 60015. If the starting port of the host NIC is changed through the HCCL_IF_BASE_PORT environment variable, reserve the 16 ports starting from that port instead.

    Example of reserving the default ports in the operating system:

        sysctl -w net.ipv4.ip_local_reserved_ports=60000-60015
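When HCCL_IF_BASE_PORT is set, the reserved range shifts accordingly. The sketch below computes the 16-port range from the base port and prints the matching sysctl command (the base port value 61000 is an example; the sysctl line itself must be run as root on the host, so here it is only printed).

```shell
# Example base port; replace with your actual HCCL_IF_BASE_PORT value.
export HCCL_IF_BASE_PORT=61000

# HCCL needs 16 consecutive ports starting from the base port.
first=$HCCL_IF_BASE_PORT
last=$((first + 15))

# Print the reservation command to run as root on the host.
echo "sysctl -w net.ipv4.ip_local_reserved_ports=${first}-${last}"
# → sysctl -w net.ipv4.ip_local_reserved_ports=61000-61015
```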