Configuring Resource Information
Before training, you need to configure the resource information of the Ascend AI Processors that participate in cluster training. You can do this by setting the environment variables described in this section, which are used to initialize the collective communication component.
Resource information configuration using environment variables is supported only by the following products:
Configuration Description
Set the following environment variables on every AI server node where training is performed to configure resource information. The following is an example:
    export CM_CHIEF_IP=192.168.1.1
    export CM_CHIEF_PORT=6000
    export CM_CHIEF_DEVICE=0
    export CM_WORKER_SIZE=8
    export CM_WORKER_IP=192.168.0.1
    export HCCL_SOCKET_FAMILY=AF_INET
- CM_CHIEF_IP: host listening IP address of the master node, that is, the IP address used to communicate with other nodes. The value must be in the IPv4 or IPv6 format.
- CM_CHIEF_PORT: listening port of the master node. The value must be an integer ranging from 0 to 65520. Ensure that the port is not occupied by other processes.
- CM_CHIEF_DEVICE: logical ID of the device that collects server cluster information on the master node.
The value of this environment variable must be an integer within the range of [0, Maximum number of devices in the server - 1].
- CM_WORKER_SIZE: total number of devices involved in cluster training on the network. The value must be an integer ranging from 0 to 32768.
- CM_WORKER_IP: IP address of the NIC used by the current node to communicate with the master node. The value must be in the IPv4 or IPv6 format.
- HCCL_SOCKET_FAMILY: (Optional) IP version used by the communication NIC on the device. AF_INET indicates that IPv4 is used, and AF_INET6 indicates that IPv6 is used. By default, IPv4 is used preferentially.
- If the IP version specified by the environment variable HCCL_SOCKET_FAMILY does not match the NIC information actually obtained, the actual NIC information is used.
For example, if HCCL_SOCKET_FAMILY is set to AF_INET6 but only the NIC using IPv4 exists on the device, the NIC using IPv4 is used.
- When the preceding environment variables are used to configure cluster information, the environment variables RANK_TABLE_FILE, RANK_ID, and RANK_SIZE must not be set.
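The value ranges and the exclusion rule described above can be checked in the shell before training starts. The following is a minimal sketch in POSIX shell, not an official tool: the function names is_int_in_range and check_cm_env are hypothetical, and the sketch assumes that RANK_TABLE_FILE, RANK_ID, and RANK_SIZE must be unset, not merely empty. It does not validate the IP address format or the upper bound of CM_CHIEF_DEVICE, which depends on the server.

```shell
# Hypothetical pre-flight check for the CM_* variables (illustrative only).

# is_int_in_range VALUE MIN MAX
# Succeeds if VALUE is a non-negative decimal integer within [MIN, MAX].
is_int_in_range() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
    esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# check_cm_env
# Prints one line per problem found and returns non-zero if there is any.
check_cm_env() {
    cm_errors=0
    [ -n "${CM_CHIEF_IP-}" ] ||
        { echo "CM_CHIEF_IP is not set"; cm_errors=1; }
    [ -n "${CM_WORKER_IP-}" ] ||
        { echo "CM_WORKER_IP is not set"; cm_errors=1; }
    is_int_in_range "${CM_CHIEF_PORT-}" 0 65520 ||
        { echo "CM_CHIEF_PORT must be an integer in [0, 65520]"; cm_errors=1; }
    is_int_in_range "${CM_WORKER_SIZE-}" 0 32768 ||
        { echo "CM_WORKER_SIZE must be an integer in [0, 32768]"; cm_errors=1; }
    # Only the lower bound is checked here; the upper bound is server-specific.
    case "${CM_CHIEF_DEVICE-}" in
        ''|*[!0-9]*) echo "CM_CHIEF_DEVICE must be a non-negative integer"; cm_errors=1 ;;
    esac
    # ${VAR+x} expands to "x" only if VAR is set, even when it is empty.
    [ -z "${RANK_TABLE_FILE+x}${RANK_ID+x}${RANK_SIZE+x}" ] ||
        { echo "RANK_TABLE_FILE, RANK_ID, and RANK_SIZE must be unset"; cm_errors=1; }
    return "$cm_errors"
}
```

One possible use is to run `check_cm_env || exit 1` in the launch script after the export statements, so that a misconfigured node stops before any training process is started.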
Example
Assume that distributed training runs on two server nodes with eight devices each, 16 devices in total. Before starting the training process on each device, set the following environment variables in the corresponding shell:
- Node 0 is used as the master node, responsible for managing cluster information, resource allocation, and scheduling.
    export CM_CHIEF_IP=192.168.1.1
    export CM_CHIEF_PORT=6000
    export CM_CHIEF_DEVICE=0
    export CM_WORKER_SIZE=16
    export CM_WORKER_IP=192.168.1.1
- Node 1
    export CM_CHIEF_IP=192.168.1.1
    export CM_CHIEF_PORT=6000
    export CM_CHIEF_DEVICE=0
    export CM_WORKER_SIZE=16
    export CM_WORKER_IP=192.168.2.1
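In this example the two nodes differ only in CM_WORKER_IP, and CM_WORKER_SIZE is the node count multiplied by the devices per node (2 x 8 = 16). The sketch below generates the export lines from those inputs so that the same script can be reused on every node. The function name print_cm_exports is a hypothetical helper, and the port 6000 and chief device 0 are simply the values taken from the example above.

```shell
# Hypothetical helper (illustrative only).
# print_cm_exports CHIEF_IP WORKER_IP NODE_COUNT DEVICES_PER_NODE
# Prints the export statements for one node of the cluster.
print_cm_exports() {
    printf 'export CM_CHIEF_IP=%s\n' "$1"
    printf 'export CM_CHIEF_PORT=%s\n' 6000      # value from the example
    printf 'export CM_CHIEF_DEVICE=%s\n' 0       # value from the example
    # Total number of devices in the cluster: nodes x devices per node.
    printf 'export CM_WORKER_SIZE=%s\n' "$(( $3 * $4 ))"
    printf 'export CM_WORKER_IP=%s\n' "$2"
}
```

On Node 1 of the example, `eval "$(print_cm_exports 192.168.1.1 192.168.2.1 2 8)"` would set the same five variables as listed for that node; on Node 0, the second argument would be 192.168.1.1.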