Preparing the Ranktable Resource Configuration File

You can use a ranktable file in JSON format to configure the NPU resources used for collective communication. The file lists every NPU resource involved so that the specified NPUs are used when the process starts.

The comments in the JSON examples in this section are for reference only. Delete them before using the file, because standard JSON does not allow comments.
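If you want to start from the annotated examples, the comments can be stripped by a small script rather than by hand. The following is a minimal sketch, not part of any Ascend tooling: the function names `strip_line_comments` and `load_annotated_ranktable` are hypothetical, and the sketch assumes that only `//` line comments are used, as in this section.

```python
import json


def strip_line_comments(text: str) -> str:
    """Remove // comments that appear outside JSON string literals."""
    out = []
    in_string = False
    escaped = False
    i = 0
    while i < len(text):
        ch = text[i]
        if in_string:
            out.append(ch)
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            i += 1
        elif ch == '"':
            in_string = True
            out.append(ch)
            i += 1
        elif text.startswith("//", i):
            # Skip to the end of the line, keeping the newline itself.
            while i < len(text) and text[i] != "\n":
                i += 1
        else:
            out.append(ch)
            i += 1
    return "".join(out)


def load_annotated_ranktable(text: str) -> dict:
    """Parse a ranktable whose lines may carry // comments."""
    return json.loads(strip_line_comments(text))


annotated = '''
{
"status":"completed",   // Ranktable availability flag.
"version":"1.0",        // Set this option to 1.0.
"server_count":"1",
"server_list":[
    {"server_id":"node_0", "device":[
        {"device_id":"0", "device_ip":"192.168.1.8", "rank_id":"0"}  // one device
    ]}
]
}
'''
table = load_annotated_ranktable(annotated)
print(table["version"])
```

Tracking whether the scanner is inside a string keeps values that contain `//` intact, which a plain regular expression would not guarantee.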

Configuration Description of the Atlas Training Series Product

For the Atlas Training Series Product, configure the information of Ascend AI Processors that participate in training in the ranktable file. Currently, two configuration templates are supported. Template 1 is recommended for development from scratch, and template 2 is compatible with certain existing scenarios.
  • (Recommended) Template 1

    The following is an example of configuring the ranktable file when there are two AI servers and each AI server has two devices:

    {
        "status":"completed",   // Ranktable availability flag. The value completed indicates that the ranktable is available.
        "version":"1.0",        // Ranktable template version information. Set this option to 1.0.
        "server_count":"2",     // Number of AI servers participating in training. In this example, there are two AI servers.
        "server_list":
        [
            {
                "server_id":"node_0",   // AI server ID, which is of the string type. Ensure that the ID is globally unique.
                "device":[              // List of devices on the AI server
                    {
                        "device_id":"0",             // Physical ID of the device
                        "device_ip":"192.168.1.8",   // Actual NIC IP address of the device
                        "rank_id":"0"                // Rank ID, which starts from 0 and must be globally unique.
                    },
                    {
                        "device_id":"1",
                        "device_ip":"192.168.1.9",
                        "rank_id":"1"
                    }
                ]
            },
            {
                "server_id":"node_1",
                "device":[
                    {
                        "device_id":"0",
                        "device_ip":"192.168.2.8",
                        "rank_id":"2"
                    },
                    {
                        "device_id":"1",
                        "device_ip":"192.168.2.9",
                        "rank_id":"3"
                    }
                ]
            }
        ]
    }
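
For larger clusters, writing the template 1 file by hand becomes error-prone. The sketch below generates the same structure as the example above from a mapping of server IDs to device IPs. The helper name `build_ranktable` and its input shape are hypothetical, and the sketch assumes that each server's devices are numbered consecutively from 0 and that ranks are assigned in listing order.

```python
from __future__ import annotations

import json


def build_ranktable(servers: dict[str, list[str]]) -> dict:
    """Build a template 1 ranktable.

    `servers` maps each server_id to its device IPs, listed in device_id
    order (an input shape chosen for this sketch only).
    """
    server_list = []
    next_rank = 0
    for server_id, device_ips in servers.items():
        devices = []
        for device_id, device_ip in enumerate(device_ips):
            devices.append({
                "device_id": str(device_id),
                "device_ip": device_ip,
                "rank_id": str(next_rank),  # ranks run 0..N-1 across all servers
            })
            next_rank += 1
        server_list.append({"server_id": server_id, "device": devices})
    return {
        "status": "completed",
        "version": "1.0",
        "server_count": str(len(server_list)),
        "server_list": server_list,
    }


ranktable = build_ranktable({
    "node_0": ["192.168.1.8", "192.168.1.9"],
    "node_1": ["192.168.2.8", "192.168.2.9"],
})
print(json.dumps(ranktable, indent=4))
```

All values are emitted as strings to match the example file.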
    
    Table 1 Description of the ranktable file

    Configuration Option

    Description

    Required (Yes/No)

    status

    Ranktable availability flag.

    • completed: The ranktable is available.
    • initializing: The ranktable is unavailable.

    Yes

    version

    Version of a ranktable template.

    Set the value to 1.0.

    Yes

    server_count

    Number of AI servers that participate in the training.

    Yes

    server_list

    List of AI servers that participate in the training.

    Yes

    server_id

    AI server ID. The value is a string of fewer than 64 characters and must be globally unique.

    Example: node_0

    Yes

    device_id

    Physical ID of the Ascend AI Processor, that is, the serial number of the device on the AI server.

    You can run the ls /dev/davinci* command to obtain the physical ID of the Ascend AI Processor.

    For example, /dev/davinci0 indicates that the physical ID of the Ascend AI Processor is 0.

    Value range: [0, Actual number of devices - 1]

    NOTICE:

    device_id takes precedence over the environment variable ASCEND_DEVICE_ID.

    Yes

    device_ip

    IP address of the integrated network interface card (NIC) of the Ascend AI Processor, which is globally unique and must be in the IPv4 or IPv6 format.

    You can run the cat /etc/hccn.conf command on the current AI server to obtain the IP address of the NIC.

    address_0=xx.xx.xx.xx
    netmask_0=xx.xx.xx.xx
    netdetect_0=xx.xx.xx.xx
    address_1=xx.xx.xx.xx
    netmask_1=xx.xx.xx.xx
    netdetect_1=xx.xx.xx.xx
    ...

    In address_xx, xx is the physical ID of the Ascend AI Processor (device_id), and the value after the equals sign is the actual NIC IP address of that device. Set device_ip to this address.

    Yes

    rank_id

    ID of a rank. The value must be an integer starting from 0 and must be globally unique. The value range is [0, Total number of devices - 1].

    To facilitate management, you are advised to sort rank_id based on the physical connection sequence of devices, that is, arrange devices by physical connection proximity.

    For example, if device_ip is set in ascending order based on physical connections, you are advised to set rank_id in ascending order.

    Yes
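
The constraints in Table 1 can be checked before a job is launched. The following sketch covers only the rules stated in the table (status, version, server_count, uniqueness of server_id and device_ip, and rank_id covering 0 to the total number of devices minus 1). The function name `check_ranktable` is hypothetical and this is not an official validator.

```python
from __future__ import annotations

import copy


def check_ranktable(table: dict) -> list[str]:
    """Return the problems found; an empty list means no checked rule failed."""
    problems = []
    if table.get("status") != "completed":
        problems.append("status is not 'completed'")
    if table.get("version") != "1.0":
        problems.append("version is not '1.0'")
    servers = table.get("server_list", [])
    if str(len(servers)) != str(table.get("server_count")):
        problems.append("server_count does not match server_list")
    server_ids = [s.get("server_id", "") for s in servers]
    if len(set(server_ids)) != len(server_ids):
        problems.append("server_id values are not unique")
    if any(len(sid) >= 64 for sid in server_ids):
        problems.append("server_id must be shorter than 64 characters")
    devices = [d for s in servers for d in s.get("device", [])]
    ips = [d.get("device_ip") for d in devices]
    if len(set(ips)) != len(ips):
        problems.append("device_ip values are not unique")
    try:
        ranks = sorted(int(d["rank_id"]) for d in devices)
    except (KeyError, ValueError):
        ranks = None  # missing or non-integer rank_id
    if ranks != list(range(len(devices))):
        problems.append("rank_id values must cover 0..N-1 exactly once")
    return problems


good = {
    "status": "completed",
    "version": "1.0",
    "server_count": "1",
    "server_list": [{
        "server_id": "node_0",
        "device": [
            {"device_id": "0", "device_ip": "192.168.1.8", "rank_id": "0"},
            {"device_id": "1", "device_ip": "192.168.1.9", "rank_id": "1"},
        ],
    }],
}
bad = copy.deepcopy(good)
bad["server_list"][0]["device"][1]["rank_id"] = "0"  # duplicate rank

print(check_ranktable(good))  # []
print(check_ranktable(bad))   # ['rank_id values must cover 0..N-1 exactly once']
```

The sketch does not verify that device_id or device_ip match the hardware, which requires access to the AI server itself.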

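  • Template 2 (compatible with some existing scenarios; not recommended in the new version)

    The following is an example of configuring the ranktable file when there are two instances and each instance has one device:
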
    {
        "status":"completed",   // Ranktable availability flag. The value completed indicates that the ranktable is available.
        "group_count":"1",      // Number of groups. The recommended value is 1.
        "group_list":           // List of groups
        [
            {
                "group_name":"hccl_world_group",   // Group name. The recommended value is hccl_world_group.
                "instance_count":"2",              // Number of instances, which can be considered as the number of containers in the container scenario.
                "device_count":"2",                // Number of all devices in the group
                "instance_list":[                  // Instance list
                    {
                        "pod_name":"tf-bae41",     // Instance name, which is generally the container name.
                        "server_id":"node_0",      // AI server ID, which is of the string type. Ensure that the ID is globally unique.
                        "devices":[                // List of devices of the instance
                            {
                                "device_id":"0",             // Physical ID of the Ascend AI Processor
                                "device_ip":"192.168.1.8"    // Actual NIC IP address of the Ascend AI Processor
                            }
                        ]
                    },
                    {
                        "pod_name":"tf-tbdf1",
                        "server_id":"node_1",
                        "devices":[
                            {
                                "device_id":"1",
                                "device_ip":"192.168.1.9"
                            }
                        ]
                    }
                ]
            }
        ]
    }
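
In template 2 the counts are stated separately from the lists they describe, so the two can drift apart when the file is edited. The sketch below checks one group for that kind of mismatch, and for the unique pod_name values that Table 2 requires. The function name `check_group` is hypothetical and the checks are limited to what this section states.

```python
from __future__ import annotations


def check_group(group: dict) -> list[str]:
    """Return the problems found in one entry of group_list."""
    problems = []
    instances = group.get("instance_list", [])
    if str(len(instances)) != str(group.get("instance_count")):
        problems.append("instance_count does not match instance_list")
    total_devices = sum(len(inst.get("devices", [])) for inst in instances)
    if str(total_devices) != str(group.get("device_count")):
        problems.append("device_count does not match the devices listed")
    pod_names = [inst.get("pod_name") for inst in instances]
    if len(set(pod_names)) != len(pod_names):
        problems.append("pod_name values are not unique")
    return problems


# The group from the example above, without comments.
group = {
    "group_name": "hccl_world_group",
    "instance_count": "2",
    "device_count": "2",
    "instance_list": [
        {"pod_name": "tf-bae41", "server_id": "node_0",
         "devices": [{"device_id": "0", "device_ip": "192.168.1.8"}]},
        {"pod_name": "tf-tbdf1", "server_id": "node_1",
         "devices": [{"device_id": "1", "device_ip": "192.168.1.9"}]},
    ],
}
print(check_group(group))  # []
```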
    
    Table 2 Description of the ranktable file

    Configuration Option

    Description

    Required (Yes/No)

    status

    Ranktable availability flag.

    • completed: The ranktable is available.
    • initializing: The ranktable is unavailable.

    Yes

    group_count

    Number of groups that a user applies for. The recommended value is 1.

    Yes

    group_list

    List of groups.

    Yes

    group_name

    Group name. When group_count is set to 1, you are advised to set the value to hccl_world_group or leave it empty. In the current version, a group named hccl_world_group is created regardless of the value of this parameter.

    If multiple groups are created by using this configuration file, the system automatically combines the groups into a group named hccl_world_group.

    No

    instance_count

    Number of instances. The value must equal the number of pod_name entries in instance_list. For example, in the container scenario, the value is the actual number of containers.

    Yes

    device_count

    Number of devices in a group.

    Yes

    instance_list

    Instance list.

    Yes

    pod_name

    User-defined instance name, which is generally the container name. The value must be unique within instance_list.

    Yes

    server_id

    AI server ID. The value is a string of fewer than 64 characters and must be globally unique.

    Example: node_0

    Yes

    devices

    Device list.

    Yes

    device_id

    Physical ID of the Ascend AI Processor, that is, the serial number of the device on the server.

    You can run the ls /dev/davinci* command to obtain the physical ID of the Ascend AI Processor.

    For example, /dev/davinci0 indicates that the physical ID of the Ascend AI Processor is 0.

    Value range: [0, Actual number of devices - 1]

    NOTICE:

    device_id takes precedence over the environment variable ASCEND_DEVICE_ID.

    Yes

    device_ip

    IP address of the integrated network interface card (NIC) of the Ascend AI Processor, which is globally unique and must be in the IPv4 or IPv6 format.

    You can run the cat /etc/hccn.conf command on the current server to obtain the IP address of the NIC.

    Yes
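
Both tables point to /etc/hccn.conf as the source of device_ip. The lookup can be scripted as in the sketch below, which extracts the address_xx entries from text in the format shown for Table 1. The function name `device_ips_from_hccn` is hypothetical, and the netmask and netdetect values in the sample are illustrative placeholders, not real settings.

```python
from __future__ import annotations

import re

# Matches lines such as "address_0=192.168.1.8"; netmask_xx and
# netdetect_xx lines are ignored.
ADDRESS_LINE = re.compile(r"^address_(\d+)=(\S+)$")


def device_ips_from_hccn(conf_text: str) -> dict[str, str]:
    """Map each device_id to the NIC IP address listed for it."""
    ips = {}
    for line in conf_text.splitlines():
        match = ADDRESS_LINE.match(line.strip())
        if match:
            device_id, ip = match.groups()
            ips[device_id] = ip
    return ips


sample = """\
address_0=192.168.1.8
netmask_0=255.255.255.0
netdetect_0=192.168.1.1
address_1=192.168.1.9
netmask_1=255.255.255.0
netdetect_1=192.168.1.1
"""
print(device_ips_from_hccn(sample))
```

On an AI server, the text would come from reading /etc/hccn.conf. The sketch takes a string so that it can be tried without the hardware.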