Modifying inventory_file

To configure the NIC IP address of a device with this method, ensure that Python 3.7.5 or later is installed on the device where the ascend-deployer tool is deployed, and that the resources/pylibs directory exists after the ascend-deployer tool completes the download operation.
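The two prerequisites above can be checked with a short script before you edit inventory_file. This is a minimal sketch, not part of ascend-deployer: the function name check_prerequisites and the deployer_root argument are hypothetical, and the sketch assumes that resources/pylibs is located under the directory you pass in.

```python
import os
import sys

MIN_PYTHON = (3, 7, 5)  # minimum version stated in the prerequisite above


def check_prerequisites(deployer_root, version=None):
    """Return a list of problems; an empty list means both checks passed.

    deployer_root is a hypothetical argument: the directory assumed to
    contain resources/pylibs after the download operation.
    """
    version = tuple(version or sys.version_info)[:3]
    problems = []
    if version < MIN_PYTHON:
        problems.append("Python %d.%d.%d is older than 3.7.5" % version)
    pylibs = os.path.join(deployer_root, "resources", "pylibs")
    if not os.path.isdir(pylibs):
        problems.append("missing directory: " + pylibs)
    return problems
```

Calling check_prerequisites(".") from the directory that holds resources returns an empty list when both conditions are met.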

To synchronize model parameters across training nodes during subsequent distributed training, configure the parameter plane network for the NPU network ports on each training node. This section describes how to configure the network information of NPU network ports on training nodes in batches. For details about the switch and parameter plane network configurations, see "Network Configurations on the Parameter Plane" in the Ascend Training Solution 23.0.0 Networking Guide.

The network segment policy for the NPU network ports depends on the training product:

  • For the Atlas 800T A2 training server and Atlas 900 A2 PoD cluster basic unit, the IP addresses of the eight NPU network ports on each training node must be in the same network segment, for example, 10.20.0.x.

  • For the Atlas 800 training server (model 9000), Atlas 800 training server (model 9010), Atlas 900 PoD (model 9000), and Atlas 900T PoD Lite, four IP network segments need to be planned for the eight NPU network ports on each training node. NPU0 and NPU4 share one network segment, as do NPU1 and NPU5, NPU2 and NPU6, and NPU3 and NPU7.
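The two policies above can be expressed as simple checks on the eight NPU network port addresses of one node. The following is a sketch only, using Python's standard ipaddress module; the function names are hypothetical, and the four-segment check assumes that the four planned segments are distinct.

```python
import ipaddress


def segments(device_ips, netmask):
    """Return the network segment of each port IP, in NPU0..NPU7 order."""
    return [
        ipaddress.ip_interface("%s/%s" % (ip, netmask)).network
        for ip in device_ips
    ]


def follows_single_segment_policy(device_ips, netmask):
    """True if all eight ports are in one segment (A2 products above)."""
    nets = segments(device_ips, netmask)
    return len(nets) == 8 and len(set(nets)) == 1


def follows_four_segment_policy(device_ips, netmask):
    """True if NPUi and NPU(i+4) share a segment for i = 0..3.

    Assumption: the four segments must differ from one another.
    """
    nets = segments(device_ips, netmask)
    if len(nets) != 8:
        return False
    paired = all(nets[i] == nets[i + 4] for i in range(4))
    return paired and len(set(nets[:4])) == 4
```

The netmask argument accepts a dotted IPv4 mask such as 255.255.255.0 or a prefix length such as 112 for IPv6.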

  1. Go to ascend-deployer/ascend_deployer, edit the hccn configuration area in inventory_file, and save the file. Before applying the settings, configure password-free login for all nodes. In the following example, each line under [hccn] describes one training node and contains, in order, the service IP address, the remote login user, the NPU network port IP addresses (deviceip), and the detection object IP addresses (detectip):
    [hccn]
    xx.xx.xx.xx ansible_ssh_user="root" deviceip=xx.xx.xx.xx,xx.xx.xx.xx detectip=xx.xx.xx.xx,xx.xx.xx.xx
    xx.xx.xx.xx ansible_ssh_user="root" deviceip=xx.xx.xx.xx,xx.xx.xx.xx detectip=xx.xx.xx.xx,xx.xx.xx.xx
    [hccn:vars]
    gateways="xx.xx.xx.xx"
    netmask="255.255.255.0"
    roce_port=4791
    bitmap="0,0,0,1,0,0,0,0"
    dscp_tc="xx:x,"
    common_network=""
    Table 1 hccn variable configurations

    Field: IP
    Mandatory/Optional: Mandatory
    Description: The xx.xx.xx.xx value before ansible_ssh_user is the service IP address of a training node, which is used for SSH remote login.

    Field: ansible_ssh_user
    Mandatory/Optional: Mandatory
    Description: Account used for SSH remote login to the training node. Set this parameter to root.

    Field: deviceip
    Mandatory/Optional: Mandatory
    Description: IP address of each NPU network port, separated by ASCII (half-width) commas. For example, if a training node has eight NPU network ports (NPU0 to NPU7), set this parameter as follows:
    IPv4: 10.20.0.2,10.20.0.3,10.20.0.4,10.20.0.5,10.20.0.6,10.20.0.7,10.20.0.8,10.20.0.9
    IPv6: fec0:0090:1c02::3101,fec0:0090:1c02::3102,fec0:0090:1c02::3103,fec0:0090:1c02::3104,fec0:0090:1c02::3105,fec0:0090:1c02::3106,fec0:0090:1c02::3107,fec0:0090:1c02::3108

    Field: detectip
    Mandatory/Optional: Mandatory
    Description: IP address of the network detection object for each NPU network port, separated by ASCII (half-width) commas. This parameter is used to check the network status. You can set the detection object to a gateway address in the network segment. The training node periodically checks whether communication between the NPU network port and the gateway address is normal. For example, if a training node has eight NPU network ports (NPU0 to NPU7), set this parameter as follows:
    IPv4: 10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1
    IPv6: fec0:0090:1c02::2001,fec0:0090:1c02::2001,fec0:0090:1c02::2001,fec0:0090:1c02::2001,fec0:0090:1c02::2001,fec0:0090:1c02::2001,fec0:0090:1c02::2001,fec0:0090:1c02::2001

    Field: gateways
    Mandatory/Optional: Mandatory
    Description: Gateway address, for example:
    IPv4: 10.20.0.1
    IPv6: fec0:0090:1c02::2001
    If the NPU network port networking has multiple gateway addresses, separate them with ASCII (half-width) commas, for example:
    IPv4: 10.20.0.1,10.20.1.1
    IPv6: fec0:0090:1c02::2001,fec0:0090:1c03::2001

    Field: netmask
    Mandatory/Optional: Mandatory
    Description: Subnet mask, for example, 255.255.255.0 for IPv4 and 112 for IPv6. Only one subnet mask can be specified. The NPU network port IP addresses of all training nodes must use the same subnet mask.

    Field: roce_port
    Mandatory/Optional: -
    Description: Reserved.

    Field: bitmap
    Mandatory/Optional: Optional
    Description: PFC priority queue configuration. By default, PFC is enabled for priority queue 4 on the NPU network port on the NPU parameter plane. It is recommended that the PFC configuration policy of the onsite switch be the same as that of the NPU network port.
    To retain the default value (priority queue 4), leave this parameter empty.
    If PFC on the onsite switch is set to another value and cannot be changed, configure the NPU network ports so that the PFC policy of the NPU network ports on all training servers in the networking is the same.
    Each bit in the bitmap string corresponds to one of the eight priority queues. 1 indicates that PFC is enabled, and 0 indicates that PFC is disabled. From left to right, the first bit configures priority queue 0, the second bit configures priority queue 1, and so on. For example, if this parameter is set to 0,0,0,1,0,0,0,0, PFC is enabled for priority queue 3 on the NPU network port.
    For details about PFC configuration principles, see "Network Configurations on the Parameter Plane > Network Configurations > Congestion Control and Error Correction Configuration" in the Ascend Training Solution 23.0.0 Networking Guide.

    Field: dscp_tc
    Mandatory/Optional: Optional
    Description: Mapping between DSCP and TC, in the format DSCP value:TC value,. Note that a comma (,) must follow the TC value.
    By default, the NPU network port on the parameter plane maps DSCP 33 to priority queue 4 (corresponding to TC2). It is recommended that the mapping configuration policy of the onsite switch be the same as that of the NPU network port.
    To retain the default value, leave this parameter empty.
    If the mapping on the onsite switch is set to another value and cannot be changed, configure the NPU network port accordingly.
    The mapping between priority queues and TCs is as follows:
    • Priority queues 0, 1, and 2 correspond to TC3.
    • Priority queues 3 and 4 correspond to TC2.
    • Priority queue 5 corresponds to TC1.
    • Priority queues 6 and 7 correspond to TC0.
    The PFC policy of the NPU network ports on all training servers in the networking must be the same.
    For details about DSCP and TC mapping, see "Network Configurations on the Parameter Plane > Network Configurations > Congestion Control and Error Correction Configuration" in the Ascend Training Solution 23.0.0 Networking Guide.

    Field: common_network
    Mandatory/Optional: -
    Description: For the Atlas 800T A2 training server, Atlas 900 A2 PoD cluster basic unit, and Atlas 200T A2 Box16 heterogeneous subrack, leave this parameter empty.
    For the Atlas 800 training server (model 9000), Atlas 800 training server (model 9010), Atlas 900 PoD (model 9000), and Atlas 900T PoD Lite, set this parameter to 0.0.0.0/0.
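The bitmap rule described in Table 1 (eight positions, leftmost position is priority queue 0) can be sketched as a pair of helper functions. The names pfc_bitmap and queues_from_bitmap are hypothetical and are not provided by ascend-deployer.

```python
def pfc_bitmap(queues_on):
    """Build the bitmap string; position i from the left is priority queue i."""
    queues = set(queues_on)
    if not queues <= set(range(8)):
        raise ValueError("priority queues must be in the range 0-7")
    return ",".join("1" if q in queues else "0" for q in range(8))


def queues_from_bitmap(bitmap):
    """Inverse helper: list the priority queues that a bitmap string enables."""
    bits = bitmap.split(",")
    if len(bits) != 8 or any(b not in ("0", "1") for b in bits):
        raise ValueError("bitmap must hold eight comma-separated 0/1 values")
    return [q for q, b in enumerate(bits) if b == "1"]


print(pfc_bitmap([3]))  # prints 0,0,0,1,0,0,0,0
```

The printed value matches the example in the bitmap description, where PFC is enabled for priority queue 3.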

    Concepts involved in the configuration parameters:

    • PFC (Priority-based Flow Control): a flow control mechanism of the Ethernet protocol that operates per priority. When the receive buffer of a port is congested, backpressure packets are sent to the peer end, and the peer end stops sending packets in the corresponding priority queue.

    • DSCP (Differentiated Services Code Point): a field in the IP packet header used to classify the service type and service priority. The value ranges from 0 to 63.

    • Priority queue: a QoS queue implemented inside switches and NICs. There are eight priority queues, numbered 0 to 7.

    • TC (Traffic Class): Network devices classify packets into different service classes and apply the corresponding scheduling policies. Generally, a switch classifies packets into eight TCs ranging from 0 to 7, which correspond to the priority queues. The NPU parameter plane network port classifies packets into four TCs ranging from 0 to 3.
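To show how the dscp_tc value relates to the priority queue and TC concepts above, the following sketch builds a dscp_tc entry from a DSCP value and a priority queue. The queue-to-TC table is taken from the dscp_tc description; the function name dscp_tc_value is hypothetical.

```python
# Priority queue -> TC mapping, taken from the dscp_tc description above.
QUEUE_TO_TC = {0: 3, 1: 3, 2: 3, 3: 2, 4: 2, 5: 1, 6: 0, 7: 0}


def dscp_tc_value(dscp, priority_queue):
    """Format a dscp_tc entry that maps dscp to the TC of priority_queue."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must be in the range 0-63")
    tc = QUEUE_TO_TC[priority_queue]
    # The trailing comma after the TC value is required by the format.
    return "%d:%d," % (dscp, tc)


print(dscp_tc_value(25, 3))  # prints 25:2,
```

With priority queue 3 and DSCP 25, the result is the value used in the examples in this section.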

    • For the Atlas 800T A2 training server, Atlas 900 A2 PoD cluster basic unit, and Atlas 200T A2 Box16 heterogeneous subrack, see the following example.

      In the example, there are four training nodes, and the eight NPU network ports on each node are in the same network segment. The nodes are connected to two switches, two nodes per switch. Each switch serves one network segment, so the networking has two network segments and two gateway addresses. PFC is enabled for priority queue 3, and DSCP 25 is mapped to TC2.

      • IPv4:
        [hccn]
        192.168.10.2 ansible_ssh_user="root" deviceip=10.20.0.2,10.20.0.3,10.20.0.4,10.20.0.5,10.20.0.6,10.20.0.7,10.20.0.8,10.20.0.9 detectip=10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1
        192.168.10.3 ansible_ssh_user="root" deviceip=10.20.0.10,10.20.0.11,10.20.0.12,10.20.0.13,10.20.0.14,10.20.0.15,10.20.0.16,10.20.0.17 detectip=10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1,10.20.0.1
        192.168.10.4 ansible_ssh_user="root" deviceip=10.20.1.2,10.20.1.3,10.20.1.4,10.20.1.5,10.20.1.6,10.20.1.7,10.20.1.8,10.20.1.9 detectip=10.20.1.1,10.20.1.1,10.20.1.1,10.20.1.1,10.20.1.1,10.20.1.1,10.20.1.1,10.20.1.1
        192.168.10.5 ansible_ssh_user="root" deviceip=10.20.1.10,10.20.1.11,10.20.1.12,10.20.1.13,10.20.1.14,10.20.1.15,10.20.1.16,10.20.1.17 detectip=10.20.1.1,10.20.1.1,10.20.1.1,10.20.1.1,10.20.1.1,10.20.1.1,10.20.1.1,10.20.1.1
        [hccn:vars]
        gateways="10.20.0.1,10.20.1.1"
        netmask="255.255.255.0"
        roce_port=
        bitmap="0,0,0,1,0,0,0,0"
        dscp_tc="25:2,"
        common_network=""
      • IPv6:
        [hccn]
        FEC0:3::180 ansible_ssh_user="root" deviceip=fec0:0090:1c02::3101,fec0:0090:1c02::3102,fec0:0090:1c02::3103,fec0:0090:1c02::3104,fec0:0090:1c02::3105,fec0:0090:1c02::3106,fec0:0090:1c02::3107,fec0:0090:1c02::3108 detectip=fec0:0090:1c02::2001,fec0:0090:1c02::2001,fec0:0090:1c02::2001,fec0:0090:1c02::2001,fec0:0090:1c02::2001,fec0:0090:1c02::2001,fec0:0090:1c02::2001,fec0:0090:1c02::2001
        FEC0:3::181 ansible_ssh_user="root" deviceip=fec0:0090:1c02::3109,fec0:0090:1c02::3110,fec0:0090:1c02::3111,fec0:0090:1c02::3112,fec0:0090:1c02::3113,fec0:0090:1c02::3114,fec0:0090:1c02::3115,fec0:0090:1c02::3116 detectip=fec0:0090:1c02::2001,fec0:0090:1c02::2001,fec0:0090:1c02::2001,fec0:0090:1c02::2001,fec0:0090:1c02::2001,fec0:0090:1c02::2001,fec0:0090:1c02::2001,fec0:0090:1c02::2001
        FEC0:3::182 ansible_ssh_user="root" deviceip=fec0:0090:1c03::3101,fec0:0090:1c03::3102,fec0:0090:1c03::3103,fec0:0090:1c03::3104,fec0:0090:1c03::3105,fec0:0090:1c03::3106,fec0:0090:1c03::3107,fec0:0090:1c03::3108 detectip=fec0:0090:1c03::2001,fec0:0090:1c03::2001,fec0:0090:1c03::2001,fec0:0090:1c03::2001,fec0:0090:1c03::2001,fec0:0090:1c03::2001,fec0:0090:1c03::2001,fec0:0090:1c03::2001
        FEC0:3::183 ansible_ssh_user="root" deviceip=fec0:0090:1c03::3109,fec0:0090:1c03::3110,fec0:0090:1c03::3111,fec0:0090:1c03::3112,fec0:0090:1c03::3113,fec0:0090:1c03::3114,fec0:0090:1c03::3115,fec0:0090:1c03::3116 detectip=fec0:0090:1c03::2001,fec0:0090:1c03::2001,fec0:0090:1c03::2001,fec0:0090:1c03::2001,fec0:0090:1c03::2001,fec0:0090:1c03::2001,fec0:0090:1c03::2001,fec0:0090:1c03::2001
        [hccn:vars]
        gateways="fec0:0090:1c02::2001,fec0:0090:1c03::2001"
        netmask="112"
        roce_port=
        bitmap="0,0,0,1,0,0,0,0"
        dscp_tc="25:2,"
        common_network=""
    • For the Atlas 800 training server (model 9000), Atlas 800 training server (model 9010), Atlas 900 PoD (model 9000), and Atlas 900T PoD Lite, see the following example.

      In the example, two training nodes are used. The IP addresses of the NPU network ports on each node are in four network segments (corresponding to four gateway addresses), the PFC priority queue is 3, and DSCP25 is mapped to TC2.

      [hccn]
      192.168.10.2 ansible_ssh_user="root" deviceip=10.20.1.2,10.20.2.3,10.20.3.4,10.20.4.5,10.20.1.6,10.20.2.7,10.20.3.8,10.20.4.9 detectip=10.20.1.1,10.20.2.1,10.20.3.1,10.20.4.1,10.20.1.1,10.20.2.1,10.20.3.1,10.20.4.1  
      192.168.10.3 ansible_ssh_user="root" deviceip=10.20.1.10,10.20.2.11,10.20.3.12,10.20.4.13,10.20.1.14,10.20.2.15,10.20.3.16,10.20.4.17 detectip=10.20.1.1,10.20.2.1,10.20.3.1,10.20.4.1,10.20.1.1,10.20.2.1,10.20.3.1,10.20.4.1   
      [hccn:vars] 
      gateways="10.20.1.1,10.20.2.1,10.20.3.1,10.20.4.1" 
      netmask="255.255.255.0" 
      roce_port=
      bitmap="0,0,0,1,0,0,0,0" 
      dscp_tc="25:2," 
      common_network="0.0.0.0/0"
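For larger clusters, the [hccn] host lines of the single-segment IPv4 example above can be generated instead of typed by hand. The following is a sketch under assumptions: the function name hccn_host_line is hypothetical, the NPU network ports of a node are assumed to use consecutive addresses, and every port is assumed to use the same gateway as its detection object. For IPv6 input, the ipaddress module prints addresses in compressed form and increments them in hexadecimal, so the output will not match the numbering style of the IPv6 example.

```python
import ipaddress


def hccn_host_line(service_ip, first_device_ip, gateway, ports=8, user="root"):
    """Build one [hccn] host line with consecutive NPU port addresses."""
    start = ipaddress.ip_address(first_device_ip)
    device_ips = [str(start + i) for i in range(ports)]
    detect_ips = [gateway] * ports  # assumption: all ports probe one gateway
    return '%s ansible_ssh_user="%s" deviceip=%s detectip=%s' % (
        service_ip, user, ",".join(device_ips), ",".join(detect_ips))


# Reproduces the first two host lines of the IPv4 single-segment example.
print(hccn_host_line("192.168.10.2", "10.20.0.2", "10.20.0.1"))
print(hccn_host_line("192.168.10.3", "10.20.0.10", "10.20.0.1"))
```

Each call returns a single line, which matches the one-line-per-node layout that the [hccn] section expects.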
  2. Run the script to complete the configuration. By default, the script installs Ansible in the current environment; if Ansible is already installed, the installation is skipped.
    • Method 1: Run ascend-deployer.py to invoke hccn.
      ascend-deployer --hccn
    • Method 2: Run the hccn script.
      bash ./scripts/hccn_set.sh
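Before running either method, it can help to sanity-check the [hccn] host lines in inventory_file. The sketch below is not part of ascend-deployer; check_hccn_hosts is a hypothetical helper that only checks three things drawn from this section: each host line carries both deviceip and detectip, the two lists have the same number of entries, and no full-width comma is present.

```python
def check_hccn_hosts(inventory_text):
    """Return a list of problems found in the [hccn] host lines."""
    problems = []
    section = None
    for raw in inventory_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # blank line or INI comment
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        if section != "hccn":
            continue  # only host lines are checked, not [hccn:vars]
        fields = line.split()
        host = fields[0]
        opts = dict(f.split("=", 1) for f in fields[1:] if "=" in f)
        if "\uff0c" in line:
            problems.append("%s: full-width comma found" % host)
        missing = [k for k in ("deviceip", "detectip") if not opts.get(k)]
        for key in missing:
            problems.append("%s: %s is missing" % (host, key))
        if not missing:
            n_dev = len(opts["deviceip"].split(","))
            n_det = len(opts["detectip"].split(","))
            if n_dev != n_det:
                problems.append(
                    "%s: %d deviceip but %d detectip entries"
                    % (host, n_dev, n_det))
    return problems
```

A host entry that is wrapped across several lines is also reported, because the continuation lines lack the deviceip and detectip variables when read as separate hosts.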