Using MindCluster Ascend Deployer
To synchronize model parameters on each training node for subsequent distributed training, configure the parameter plane network for the NPU network port on each training node. This section describes how to configure the network information of NPU network ports in batches on training nodes.
Prerequisites
- The resources/pylibs directory exists on the MindCluster Ascend Deployer executor after MindCluster Ascend Deployer performs the download operation.
Policy Information
The policies for configuring the IP network segment of the NPU network port vary according to training products. The following table lists the hardware that supports HCCN parameter configuration and the configuration policies.
| Product | Policy Information |
|---|---|
| Atlas 800T A2 training server<br>Atlas 900 A2 PoD cluster basic unit | |
| Atlas 800 training server (model 9000)<br>Atlas 800 training server (model 9010)<br>Atlas 900 PoD (model 9000)<br>Atlas 900T PoD Lite | |
| Atlas 900 A3 SuperPoD<br>Atlas 800I A3 SuperPoD Server | |
Procedure
- Go to ascend-deployer/ascend_deployer, modify the [hccn] configuration area in inventory_file, and save the modification. The following is an example:
    [hccn]
    # Enter the service IP address of a training node, the remote login account, and the IP addresses of the NPU network ports in sequence (one line for each node).
    xx.xx.xx.xx ansible_ssh_user="root" deviceip=xx.xx.xx.xx,xx.xx.xx.xx detectip=xx.xx.xx.xx,xx.xx.xx.xx
    xx.xx.xx.xx ansible_ssh_user="root" deviceip=xx.xx.xx.xx,xx.xx.xx.xx detectip=xx.xx.xx.xx,xx.xx.xx.xx

    [hccn:vars]
    gateways="xx.xx.xx.xx"
    netmask="255.255.255.0"
    roce_port=4791
    bitmap="0,0,0,1,0,0,0,0"
    dscp_tc="xx:x,"
    common_network=""
Table 1 HCCN variable configurations

| Field | Required or Not | Description |
|---|---|---|
| IP | Yes | The xx.xx.xx.xx before ansible_ssh_user is the service IP address of a training node, which is used for SSH remote login. |
| ansible_ssh_user | Yes | Account used for SSH remote login to the training node. Set this parameter to root. |
| deviceip | Yes | IP address of each NPU network port (planned by the user as required). For example, if a training node has eight NPU network ports (NPU0 to NPU7), enter the addresses as follows:<br>IPv4: 10.20.0.2,10.20.0.3,10.20.0.4,10.20.0.5,10.20.0.6,10.20.0.7,10.20.0.8,10.20.0.9<br>IPv6: 2001:0db8:85a3::8a2e:0370:7330,2001:0db8:85a3::8a2e:0370:7331,2001:0db8:85a3::8a2e:0370:7332,2001:0db8:85a3::8a2e:0374:7333,2001:0db8:85a3::8a2e:0370:7334,2001:0db8:85a3::8a2e:0370:7335,2001:0db8:85a3::8a2e:0370:7336,2001:0db8:85a3::8a2e:0370:7337<br>Separate the addresses with English (half-width) commas.<br>Only the Atlas 800T A2 training server, Atlas 900 A2 PoD cluster basic unit, and Atlas 200T A2 Box16 heterogeneous subrack support IPv6. IPv4 and IPv6 cannot be used together. |
| detectip | Yes | IP address of the network detection object for each NPU network port. This parameter is used to check the network status. You can set the detection object to a gateway address in the network segment. The training node periodically checks whether communication between the NPU network port and that address is normal. For example, if a training node has eight NPU network ports (NPU0 to NPU7), enter the addresses as follows:<br>IPv4: 10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1<br>IPv6: 2001:0db8:85a3::8a2e:0370:7330,2001:0db8:85a3::8a2e:0370:7331,2001:0db8:85a3::8a2e:0370:7332,2001:0db8:85a3::8a2e:0374:7333,2001:0db8:85a3::8a2e:0370:7334,2001:0db8:85a3::8a2e:0370:7335,2001:0db8:85a3::8a2e:0370:7336,2001:0db8:85a3::8a2e:0370:7337<br>Separate the addresses with English (half-width) commas.<br>Only the Atlas 800T A2 training server, Atlas 900 A2 PoD cluster basic unit, and Atlas 200T A2 Box16 heterogeneous subrack support IPv6. IPv4 and IPv6 cannot be used together. |
| gateways | Yes | Gateway. Ensure that the gateway is available because it needs to forward packets.<br>Configuration examples:<br>IPv4: 10.20.0.1<br>IPv6: 2001:0db8:85a3::8a2e:0370:1<br>If the NPU network port networking has multiple gateway addresses, separate them with English (half-width) commas. For example:<br>IPv4: 10.20.0.1,10.20.1.1<br>IPv6: 2001:0db8:85a3::8a2e:0370:1,2001:0db8:85a3::8a2e:0371:1 |
| netmask | Yes | Subnet mask, for example, 255.255.255.0 for IPv4 and 64 for IPv6. Only one subnet mask can be entered. The subnet masks of the NPU network port IP addresses of all training nodes must be the same. |
| roce_port | - | Reserved port. You do not need to modify the configuration. |
| bitmap | No | PFC priority queue configuration. By default, PFC with priority queue 4 is enabled for the NPU network port on the NPU parameter plane. It is recommended that the PFC configuration policy of the onsite switch be the same as that of the NPU network port.<br>To retain the default value, leave this parameter empty.<br>If the PFC of the onsite switch is set to another value and cannot be changed, configure the NPU network port accordingly. The PFC policies of the NPU network ports of all training servers on the network must be the same.<br>Each bit in the bitmap string corresponds to one of the eight priority queues. 1 indicates that PFC is enabled, and 0 indicates that PFC is disabled. From left to right, the first bit configures priority queue 0, the second bit configures priority queue 1, and so on. For example, if this parameter is set to 0,0,0,1,0,0,0,0, PFC is enabled for priority queue 3 on the NPU network port. |
| dscp_tc | No | Mapping between DSCP and TC, in the format of DSCP value:TC value,.<br>Note that a comma (,) must be added after the TC value.<br>By default, the NPU network port on the parameter plane maps DSCP 33 to priority queue 4 (corresponding to TC2). It is recommended that the mapping policy of the onsite switch be the same as that of the NPU network port.<br>To retain the default value, leave this parameter empty.<br>If the onsite switch uses another mapping that cannot be changed, configure the NPU network port accordingly.<br>The mapping between priority queues and TCs is as follows:<br>● Priority queues 0, 1, and 2 correspond to TC3.<br>● Priority queues 3 and 4 correspond to TC2.<br>● Priority queue 5 corresponds to TC1.<br>● Priority queues 6 and 7 correspond to TC0.<br>The PFC policy of the NPU network ports on all training servers in the networking must be the same. |
| common_network | - | For the Atlas 800T A2 training server, Atlas 900 A2 PoD cluster basic unit, and Atlas 200T A2 Box16 heterogeneous subrack, leave this parameter empty or set it to 0.0.0.0/0.<br>For the Atlas 800 training server (model 9000), Atlas 800 training server (model 9010), Atlas 900 PoD (model 9000), Atlas A3 training product, and Atlas 900T PoD Lite, set this parameter to 0.0.0.0/0. |
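The deviceip and detectip rules in Table 1 can be checked before the inventory is handed to the deployer. The following is a minimal sketch, not part of MindCluster Ascend Deployer: the function names parse_ip_list and check_host_entry are hypothetical, and the rule that detectip must list one address per deviceip entry is an assumption drawn from the examples in this section.

```python
import ipaddress


def parse_ip_list(value):
    """Split a comma-separated deviceip/detectip value into address objects."""
    if "\uff0c" in value:
        # U+FF0C is the full-width comma; Table 1 requires half-width commas.
        raise ValueError("full-width comma found; use the half-width comma ','")
    return [ipaddress.ip_address(item.strip()) for item in value.split(",")]


def check_host_entry(deviceip, detectip):
    """Return a list of problems found in one [hccn] host line (empty if none)."""
    problems = []
    devices = parse_ip_list(deviceip)
    detects = parse_ip_list(detectip)
    # Assumption: one detection address per NPU network port, as in the examples.
    if len(devices) != len(detects):
        problems.append(
            f"deviceip has {len(devices)} addresses but detectip has {len(detects)}"
        )
    # Table 1 states that IPv4 and IPv6 cannot be used together.
    if len({ip.version for ip in devices + detects}) > 1:
        problems.append("IPv4 and IPv6 addresses are mixed")
    if len(set(devices)) != len(devices):
        problems.append("deviceip contains duplicate addresses")
    return problems


if __name__ == "__main__":
    device = "10.20.0.2,10.20.0.3,10.20.0.4,10.20.0.5,10.20.0.6,10.20.0.7,10.20.0.8,10.20.0.9"
    detect = ",".join(["10.20.0.1"] * 8)
    print(check_host_entry(device, detect))
```

The sketch only validates syntax and consistency within one host line. It does not replace the --hccn --check step of the procedure.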
Concepts involved in the configuration parameters:
- PFC (Priority-based Flow Control): It is a priority-based flow control mechanism of the Ethernet protocol. When the receive buffer of a port is congested, backpressure packets are sent to the peer end, and the peer end stops sending packets in the corresponding priority queue.
- DSCP: a field used to classify the service type and service priority in the IP packet header. The value ranges from 0 to 63.
- Priority queue: QoS queues implemented inside switches and NICs, including eight priority queues from 0 to 7.
- Traffic Class (TC): Network devices classify packets into different service classes and use corresponding scheduling policies. Generally, a switch divides packets into eight TCs ranging from 0 to 7, which correspond to priority queues. The NPU parameter plane network port divides the packet into four TCs ranging from 0 to 3.
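To make the relationship between priority queues, the bitmap string, and the dscp_tc string concrete, the following sketch encodes the rules stated in Table 1. The helper names (pfc_bitmap, queues_from_bitmap, dscp_tc_entry) are hypothetical and exist only for illustration. The queue-to-TC table is copied from the dscp_tc description, and the format for several DSCP mappings in one string is not described in this section, so the sketch formats a single mapping only.

```python
# Priority queue -> TC on the NPU parameter plane port, as listed in Table 1.
QUEUE_TO_TC = {0: 3, 1: 3, 2: 3, 3: 2, 4: 2, 5: 1, 6: 0, 7: 0}


def pfc_bitmap(enabled_queues):
    """Build the bitmap value: position i from the left is priority queue i."""
    enabled = set(enabled_queues)
    if not enabled <= set(range(8)):
        raise ValueError("priority queues must be in the range 0 to 7")
    return ",".join("1" if queue in enabled else "0" for queue in range(8))


def queues_from_bitmap(bitmap):
    """Return the priority queues for which a bitmap value enables PFC."""
    bits = bitmap.split(",")
    if len(bits) != 8 or any(bit not in ("0", "1") for bit in bits):
        raise ValueError("bitmap must contain eight comma-separated 0/1 values")
    return [queue for queue, bit in enumerate(bits) if bit == "1"]


def dscp_tc_entry(dscp, tc):
    """Format one 'DSCP value:TC value,' mapping, including the trailing comma."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0 to 63")
    if not 0 <= tc <= 3:
        raise ValueError("TC on the NPU network port must be in the range 0 to 3")
    return f"{dscp}:{tc},"


if __name__ == "__main__":
    print(pfc_bitmap([3]))                     # 0,0,0,1,0,0,0,0
    print(dscp_tc_entry(25, QUEUE_TO_TC[3]))   # 25:2,
```

For example, a switch that enables PFC on priority queue 3 and marks parameter plane traffic with DSCP 25 would correspond to bitmap="0,0,0,1,0,0,0,0" and dscp_tc="25:2,", because priority queue 3 maps to TC2.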
- For the Atlas 800T A2 training server and Atlas 900 A2 PoD cluster basic unit, see the following example.
In the example, there are two training nodes, and each node has eight NPU network ports sharing the same network segment. These nodes are connected to two switches, with each switch linked to two nodes. Each switch operates on a single network segment, which corresponds to two gateway addresses. The PFC priority queue is 3, and DSCP25 is mapped to TC2.
- IPv4:
    [hccn]
    192.168.10.2 ansible_ssh_user="root" deviceip=10.20.0.2,10.20.0.3,10.20.0.4,10.20.0.5,10.20.0.6,10.20.0.7,10.20.0.8,10.20.0.9 detectip=10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1
    192.168.10.3 ansible_ssh_user="root" deviceip=10.20.0.10,10.20.0.11,10.20.0.12,10.20.0.13,10.20.0.14,10.20.0.15,10.20.0.16,10.20.0.17 detectip=10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1

    [hccn:vars]
    gateways="10.20.0.1,10.20.1.1"
    netmask="255.255.255.0"
    roce_port=4791
    bitmap=""
    dscp_tc=""
    common_network=""
- IPv6:
    [hccn]
    3fff:0050:4400::222 ansible_ssh_user="root" deviceip=2001:0db8:85a3::8a2e:0370:7330,2001:0db8:85a3::8a2e:0370:7331,2001:0db8:85a3::8a2e:0370:7332,2001:0db8:85a3::8a2e:0374:7333,2001:0db8:85a3::8a2e:0370:7334,2001:0db8:85a3::8a2e:0370:7335,2001:0db8:85a3::8a2e:0370:7336,2001:0db8:85a3::8a2e:0370:7337 detectip=2001:0db8:85a3::8a2e:0370:7340,2001:0db8:85a3::8a2e:0370:7341,2001:0db8:85a3::8a2e:0370:7342,2001:0db8:85a3::8a2e:0374:7343,2001:0db8:85a3::8a2e:0370:7344,2001:0db8:85a3::8a2e:0370:7345,2001:0db8:85a3::8a2e:0370:7346,2001:0db8:85a3::8a2e:0370:7347
    3fff:0050:4400::220 ansible_ssh_user="root" deviceip=2001:0db8:85a3::8a2e:0370:7340,2001:0db8:85a3::8a2e:0370:7341,2001:0db8:85a3::8a2e:0370:7342,2001:0db8:85a3::8a2e:0374:7343,2001:0db8:85a3::8a2e:0370:7344,2001:0db8:85a3::8a2e:0370:7345,2001:0db8:85a3::8a2e:0370:7346,2001:0db8:85a3::8a2e:0370:7347 detectip=2001:0db8:85a3::8a2e:0370:7330,2001:0db8:85a3::8a2e:0370:7331,2001:0db8:85a3::8a2e:0370:7332,2001:0db8:85a3::8a2e:0374:7333,2001:0db8:85a3::8a2e:0370:7334,2001:0db8:85a3::8a2e:0370:7335,2001:0db8:85a3::8a2e:0370:7336,2001:0db8:85a3::8a2e:0370:7337

    [hccn:vars]
    gateways="2001:0db8:85a3::8a2e:0370:1"
    netmask="64"
    roce_port=4791
    bitmap=""
    dscp_tc=""
    common_network=""

    [other_build_image]

    [all:vars]
    SCALE="false"
- For the Atlas 900 A3 SuperPoD, see the following example.
- IPv4: Two compute nodes are used, and each compute node has eight NPU modules and 16 devices.
    [hccn]
    192.168.10.2 ansible_ssh_user="root" deviceip=10.20.0.2,10.20.1.2,10.20.0.3,10.20.1.3,10.20.0.4,10.20.1.4,10.20.0.5,10.20.1.5,10.20.0.6,10.20.1.6,10.20.0.7,10.20.1.7,10.20.0.8,10.20.1.8,10.20.0.9,10.20.1.9 detectip=10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1
    192.168.10.3 ansible_ssh_user="root" deviceip=10.20.0.10,10.20.1.10,10.20.0.11,10.20.1.11,10.20.0.12,10.20.1.12,10.20.0.13,10.20.1.13,10.20.0.14,10.20.1.14,10.20.0.15,10.20.1.15,10.20.0.16,10.20.1.16,10.20.0.17,10.20.1.17 detectip=10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1

    [hccn:vars]
    gateways="10.20.0.1,10.20.1.1"
    netmask="255.255.255.0"
    roce_port=4791
    bitmap=""
    dscp_tc=""
    common_network="0.0.0.0/0"
- For the Atlas 800 training server (model 9000), Atlas 800 training server (model 9010), Atlas 900 PoD (model 9000), and Atlas 900T PoD Lite, see the following example.
In the example, two training nodes are used. The IP addresses of the NPU network ports on each node are in four network segments (corresponding to four gateway addresses), the PFC priority queue is 3, and DSCP25 is mapped to TC2.
    [hccn]
    192.168.10.2 ansible_ssh_user="root" deviceip=10.20.1.2,10.20.2.3,10.20.3.4,10.20.4.5,10.20.1.6,10.20.2.7,10.20.3.8,10.20.4.9 detectip=10.20.1.1,10.20.2.1,10.20.3.1,10.20.4.1,10.20.1.1,10.20.2.1,10.20.3.1,10.20.4.1
    192.168.10.3 ansible_ssh_user="root" deviceip=10.20.1.10,10.20.2.11,10.20.3.12,10.20.4.13,10.20.1.14,10.20.2.15,10.20.3.16,10.20.4.17 detectip=10.20.1.1,10.20.2.1,10.20.3.1,10.20.4.1,10.20.1.1,10.20.2.1,10.20.3.1,10.20.4.1

    [hccn:vars]
    gateways="10.20.1.1,10.20.2.1,10.20.3.1,10.20.4.1"
    netmask="255.255.255.0"
    roce_port=4791
    bitmap=""
    dscp_tc=""
    common_network="0.0.0.0/0"
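Because each host line repeats the same pattern, the [hccn] section can be generated from a small address plan instead of being typed by hand. The following is a rough sketch under stated assumptions: consecutive_ips and render_hccn are hypothetical helpers, not deployer functions, and the address plan reproduces the IPv4 values of the Atlas 800T A2 example in this section.

```python
import ipaddress


def consecutive_ips(first, count):
    """Return `count` consecutive addresses starting at `first`, as strings."""
    start = ipaddress.ip_address(first)
    return [str(start + offset) for offset in range(count)]


def render_hccn(nodes, variables):
    """Render the [hccn] and [hccn:vars] sections as inventory text."""
    lines = ["[hccn]"]
    for node in nodes:
        lines.append(
            '{host} ansible_ssh_user="root" deviceip={dev} detectip={det}'.format(
                host=node["host"],
                dev=",".join(node["deviceip"]),
                det=",".join(node["detectip"]),
            )
        )
    lines += ["", "[hccn:vars]"]
    for key, value in variables.items():
        # roce_port is written unquoted in the examples; other values are quoted.
        lines.append(f"{key}={value}" if key == "roce_port" else f'{key}="{value}"')
    return "\n".join(lines)


# Address plan taken from the Atlas 800T A2 IPv4 example.
nodes = [
    {"host": "192.168.10.2", "deviceip": consecutive_ips("10.20.0.2", 8),
     "detectip": ["10.20.0.1"] * 8},
    {"host": "192.168.10.3", "deviceip": consecutive_ips("10.20.0.10", 8),
     "detectip": ["10.20.0.1"] * 8},
]
variables = {
    "gateways": "10.20.0.1,10.20.1.1",
    "netmask": "255.255.255.0",
    "roce_port": 4791,
    "bitmap": "",
    "dscp_tc": "",
    "common_network": "",
}

if __name__ == "__main__":
    print(render_hccn(nodes, variables))
```

The output is meant to be pasted into the [hccn] area of inventory_file and then verified with the check step. Note that Python prints IPv6 addresses in compressed form (for example, 2001:db8:... rather than 2001:0db8:...), so an IPv6 plan generated this way will not match the examples character for character.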
- Run the script to check inventory_file.
- Method 1: Run ascend-deployer.py to perform the check.
ascend-deployer --hccn --check
- Method 2: Run the check through the install.sh script.
bash install.sh --hccn --check
- Run the script to complete the configuration. When the script is executed, Ansible is installed in the current environment by default. If it has been installed, the installation is skipped.
- Method 1: Run ascend-deployer.py to invoke hccn.
ascend-deployer --hccn
- Method 2: Run the hccn script.
bash install.sh --hccn
- (Optional) Query the IP address and gateway of the RoCE NIC, configure the IP address of the network detection object, and query LLDP information as required.
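The check and configuration steps of this procedure can be chained so that the configuration runs only after the inventory check succeeds. The sketch below uses the two ascend-deployer commands shown above. It assumes, without confirmation from this section, that the check command exits with a non-zero status when inventory_file is invalid. The configure_hccn function and its run parameter are hypothetical and exist only to keep the sketch testable.

```python
import subprocess

# Commands as shown in the procedure (Method 1 of each step).
CHECK_CMD = ["ascend-deployer", "--hccn", "--check"]
APPLY_CMD = ["ascend-deployer", "--hccn"]


def configure_hccn(run=subprocess.run):
    """Run the inventory check, then apply the configuration only if it passes.

    `run` must accept a command list and return an object that has a
    `returncode` attribute, as subprocess.run does.
    """
    check = run(CHECK_CMD)
    # Assumption: a non-zero exit status means the check failed.
    if check.returncode != 0:
        return False
    return run(APPLY_CMD).returncode == 0


if __name__ == "__main__":
    ok = configure_hccn()
    print("HCCN configuration applied" if ok else "HCCN check or configuration failed")
```

The same pattern applies to Method 2 by replacing the command lists with the bash install.sh equivalents.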