Preparing the Ranktable Resource Configuration File
The ranktable file is a JSON file that describes the NPU resources used for collective communication. Configure all participating NPU resources in this file so that the specified NPU resources can be used when the process is started.
The comments in the JSON examples in this section are for reference only. Delete them before you use the file.
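Standard JSON parsers reject // comments, so an annotated example copied from this section fails to load until the comments are removed. The following is a minimal sketch of one way to strip them, assuming only // line comments are used (as in the examples in this section). The helper name strip_line_comments and the shortened sample are illustrative and not part of any Ascend tool.

```python
import json


def strip_line_comments(text: str) -> str:
    """Remove // comments that appear outside JSON string literals."""
    cleaned = []
    for line in text.splitlines():
        in_string = False   # True while scanning inside a "..." literal
        escaped = False     # True right after a backslash inside a string
        kept = []
        i = 0
        while i < len(line):
            ch = line[i]
            if in_string:
                kept.append(ch)
                if escaped:
                    escaped = False
                elif ch == "\\":
                    escaped = True
                elif ch == '"':
                    in_string = False
            elif ch == '"':
                in_string = True
                kept.append(ch)
            elif line.startswith("//", i):
                break       # the rest of the line is a comment
            else:
                kept.append(ch)
            i += 1
        cleaned.append("".join(kept).rstrip())
    return "\n".join(cleaned)


# Shortened, hypothetical sample in the style of the examples in this section.
annotated = """{
    "status":"completed",    // Ranktable availability flag
    "version":"1.0",         // Ranktable template version
    "server_count":"1",
    "server_list": [
        {
            "server_id":"node_0",
            "device":[
                {"device_id":"0", "device_ip":"192.168.1.8", "rank_id":"0"}
            ]
        }
    ]
}"""

ranktable = json.loads(strip_line_comments(annotated))
print(ranktable["status"])  # prints: completed
```

Tracking whether the scanner is inside a string keeps values that contain two slashes, such as URLs, intact.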
Configuration Description of the Atlas Training Series Product
- (Recommended) Template 1
The following is an example of configuring the ranktable file when there are two AI servers and each AI server has two devices:
    {
        "status":"completed",                    // Ranktable availability flag. The value completed indicates that the ranktable is available.
        "version":"1.0",                         // Ranktable template version information. Set this option to 1.0.
        "server_count":"2",                      // Number of AI servers participating in training. In this example, there are two AI servers.
        "server_list":
        [
            {
                "server_id":"node_0",            // AI server ID, which is of the string type. Ensure that the ID is globally unique.
                "device":[                       // List of devices on the AI server
                    {
                        "device_id":"0",         // Physical ID of the device
                        "device_ip":"192.168.1.8",  // Actual NIC IP address of the device
                        "rank_id":"0"            // Rank ID, which starts from 0 and must be globally unique.
                    },
                    {
                        "device_id":"1",
                        "device_ip":"192.168.1.9",
                        "rank_id":"1"
                    }
                ]
            },
            {
                "server_id":"node_1",
                "device":[
                    {
                        "device_id":"0",
                        "device_ip":"192.168.2.8",
                        "rank_id":"2"
                    },
                    {
                        "device_id":"1",
                        "device_ip":"192.168.2.9",
                        "rank_id":"3"
                    }
                ]
            }
        ]
    }
Table 1 Description of the ranktable file

| Configuration Option | Description | Required (Yes/No) |
| --- | --- | --- |
| status | Ranktable availability flag. completed: The ranktable is available. initializing: The ranktable is unavailable. | Yes |
| version | Version of the ranktable template. Set the value to 1.0. | Yes |
| server_count | Number of AI servers that participate in the training. | Yes |
| server_list | List of AI servers that participate in the training. | Yes |
| server_id | AI server ID. The value is a string of less than 64 characters. Ensure that the value is globally unique. Example: node_0 | Yes |
| device_id | Physical ID of the Ascend AI Processor, that is, the serial number of the device on the AI server. You can run the ls /dev/davinci* command to obtain the physical ID of the Ascend AI Processor. For example, /dev/davinci0 indicates that the physical ID of the Ascend AI Processor is 0. Value range: [0, Actual number of devices - 1]. NOTICE: device_id takes precedence over the environment variable ASCEND_DEVICE_ID. | Yes |
| device_ip | IP address of the integrated network interface card (NIC) of the Ascend AI Processor, which is globally unique and must be in the IPv4 or IPv6 format. You can run the cat /etc/hccn.conf command on the current AI server to obtain the IP address of the NIC. Example file content: address_0=xx.xx.xx.xx netmask_0=xx.xx.xx.xx netdetect_0=xx.xx.xx.xx address_1=xx.xx.xx.xx netmask_1=xx.xx.xx.xx netdetect_1=xx.xx.xx.xx ... Here, address_xx indicates the NIC IP address, and xx indicates the physical ID of the Ascend AI Processor (device_id). The IP address following address_xx is the actual NIC IP address of the device to be set. | Yes |
| rank_id | ID of a rank. The value must be an integer starting from 0 and must be globally unique. The value range is [0, Total number of devices - 1]. To facilitate management, you are advised to sort rank_id based on the physical connection sequence of devices, that is, arrange devices by physical connection proximity. For example, if device_ip is set in ascending order based on physical connections, you are advised to set rank_id in ascending order. | Yes |
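The constraints in Table 1 can be checked before a job is started. The following is a minimal sketch of such a check, not an official tool. The function name validate_ranktable is hypothetical, and the check that server_count equals the length of server_list is an assumption inferred from the two descriptions. The device_id range is not checked because the actual number of devices on each server cannot be read from the file.

```python
import ipaddress


def validate_ranktable(rt: dict) -> list[str]:
    """Return the problems found; an empty list means every check passed."""
    errors = []
    if rt.get("status") != "completed":
        errors.append("status is not 'completed'")
    if rt.get("version") != "1.0":
        errors.append("version must be '1.0'")

    servers = rt.get("server_list", [])
    # Assumption: server_count is expected to equal the number of listed servers.
    if str(rt.get("server_count")) != str(len(servers)):
        errors.append("server_count does not match the length of server_list")

    server_ids = [s.get("server_id", "") for s in servers]
    if len(set(server_ids)) != len(server_ids):
        errors.append("server_id values are not unique")
    for sid in server_ids:
        if len(sid) >= 64:
            errors.append(f"server_id {sid!r} must be shorter than 64 characters")

    devices = [d for s in servers for d in s.get("device", [])]

    ips = []
    for d in devices:
        try:
            # Accepts both IPv4 and IPv6 notation.
            ips.append(ipaddress.ip_address(d.get("device_ip", "")))
        except ValueError:
            errors.append(f"invalid device_ip: {d.get('device_ip')!r}")
    if len(set(ips)) != len(ips):
        errors.append("device_ip values are not unique")

    try:
        ranks = sorted(int(d["rank_id"]) for d in devices)
    except (KeyError, ValueError, TypeError):
        errors.append("every device needs an integer rank_id")
    else:
        # Unique values in [0, total devices - 1] means exactly 0..N-1.
        if ranks != list(range(len(devices))):
            errors.append("rank_id values must cover 0 to N-1 exactly once")
    return errors


# The two-server example from this section, with comments removed.
example = {
    "status": "completed",
    "version": "1.0",
    "server_count": "2",
    "server_list": [
        {"server_id": "node_0", "device": [
            {"device_id": "0", "device_ip": "192.168.1.8", "rank_id": "0"},
            {"device_id": "1", "device_ip": "192.168.1.9", "rank_id": "1"}]},
        {"server_id": "node_1", "device": [
            {"device_id": "0", "device_ip": "192.168.2.8", "rank_id": "2"},
            {"device_id": "1", "device_ip": "192.168.2.9", "rank_id": "3"}]},
    ],
}

print(validate_ranktable(example))  # prints: []
```

Returning a list of messages instead of raising on the first failure lets all problems in a hand-edited file be reported in one pass.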
- Template 2 (compatible with some existing scenarios and not recommended in the new version)
    {
        "status":"completed",                        // Ranktable availability flag. The value completed indicates that the ranktable is available.
        "group_count":"1",                           // Number of groups. The recommended value is 1.
        "group_list":                                // List of groups
        [
            {
                "group_name":"hccl_world_group",     // Group name. The recommended value is hccl_world_group.
                "instance_count":"2",                // Number of instances, which can be considered as the number of containers in the container scenario.
                "device_count":"2",                  // Number of all devices in the group
                "instance_list":[                    // Instance list
                    {
                        "pod_name":"tf-bae41",       // Instance name, which is generally the container name.
                        "server_id":"node_0",        // AI server ID, which is of the string type. Ensure that the ID is globally unique.
                        "devices":[                  // List of devices of the instance
                            {
                                "device_id":"0",             // Physical ID of the Ascend AI Processor
                                "device_ip":"192.168.1.8"    // Actual NIC IP address of the Ascend AI Processor
                            }
                        ]
                    },
                    {
                        "pod_name":"tf-tbdf1",
                        "server_id":"node_1",
                        "devices":[
                            {
                                "device_id":"1",
                                "device_ip":"192.168.1.9"
                            }
                        ]
                    }
                ]
            }
        ]
    }
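The count fields in a Template 2 file have to agree with the lists they describe: instance_count with the number of pod names, and device_count with the number of devices in the group. The following is a minimal sketch of a consistency check. The function name check_template2_counts is hypothetical, and comparing group_count with the length of group_list is an assumption rather than a documented rule.

```python
def check_template2_counts(rt: dict) -> list[str]:
    """Return count and uniqueness problems found in a Template 2 ranktable."""
    errors = []
    groups = rt.get("group_list", [])
    # Assumption: group_count is expected to equal the number of listed groups.
    if str(rt.get("group_count")) != str(len(groups)):
        errors.append("group_count does not match the number of groups")

    for group in groups:
        instances = group.get("instance_list", [])
        pod_names = [inst.get("pod_name") for inst in instances]
        if str(group.get("instance_count")) != str(len(pod_names)):
            errors.append("instance_count does not match the number of pod names")
        if len(set(pod_names)) != len(pod_names):
            errors.append("pod_name values are not unique")
        device_total = sum(len(inst.get("devices", [])) for inst in instances)
        if str(group.get("device_count")) != str(device_total):
            errors.append("device_count does not match the number of devices")
    return errors


# The Template 2 example from this section, with comments removed.
example = {
    "status": "completed",
    "group_count": "1",
    "group_list": [{
        "group_name": "hccl_world_group",
        "instance_count": "2",
        "device_count": "2",
        "instance_list": [
            {"pod_name": "tf-bae41", "server_id": "node_0",
             "devices": [{"device_id": "0", "device_ip": "192.168.1.8"}]},
            {"pod_name": "tf-tbdf1", "server_id": "node_1",
             "devices": [{"device_id": "1", "device_ip": "192.168.1.9"}]},
        ],
    }],
}

print(check_template2_counts(example))  # prints: []
```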
Table 2 Description of the ranktable file

| Configuration Option | Description | Required (Yes/No) |
| --- | --- | --- |
| status | Ranktable availability flag. completed: The ranktable is available. initializing: The ranktable is unavailable. | Yes |
| group_count | Number of groups that a user applies for. The recommended value is 1. | Yes |
| group_list | Group list. | Yes |
| group_name | Group name. When group_count is set to 1, you are advised to set the value to hccl_world_group or leave it empty. In the current version, a group named hccl_world_group is created regardless of the value of this parameter. If multiple groups are created by using this configuration file, the system automatically combines the groups into a group named hccl_world_group. | No |
| instance_count | The value of this parameter must be the same as the number of pod names in the instance list. For example, in the container scenario, the value is the actual number of containers. | Yes |
| device_count | Number of devices in a group. | Yes |
| instance_list | Instance list. | Yes |
| pod_name | User-defined. The value must be unique within instance_list. | Yes |
| server_id | AI server ID. The value is a string of less than 64 characters. Ensure that the value is globally unique. Example: node_0 | Yes |
| devices | Device list. | Yes |
| device_id | Physical ID of the Ascend AI Processor, that is, the serial number of the device on the server. You can run the ls /dev/davinci* command to obtain the physical ID of the Ascend AI Processor. For example, /dev/davinci0 indicates that the physical ID of the Ascend AI Processor is 0. Value range: [0, Actual number of devices - 1]. NOTICE: device_id takes precedence over the environment variable ASCEND_DEVICE_ID. | Yes |
| device_ip | IP address of the integrated network interface card (NIC) of the Ascend AI Processor, which is globally unique and must be in the IPv4 or IPv6 format. You can run the cat /etc/hccn.conf command on the current server to obtain the IP address of the NIC. | Yes |
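Copying device_ip values from /etc/hccn.conf by hand is error-prone, so the address_xx entries described for device_ip in Table 1 can be read programmatically. The following is a minimal sketch, assuming the file holds one key=value pair per line. The function names parse_hccn_conf and device_entries are hypothetical, and the sample text uses made-up addresses in place of a real file.

```python
import re

# Matches lines such as "address_0=192.168.1.8"; the number is the device_id.
ADDRESS_LINE = re.compile(r"^\s*address_(\d+)\s*=\s*(\S+)\s*$")


def parse_hccn_conf(text: str) -> dict[int, str]:
    """Map each device_id to the NIC IP address from its address_<id> line."""
    nic_ips = {}
    for line in text.splitlines():
        match = ADDRESS_LINE.match(line)
        if match:  # netmask_*, netdetect_* and other lines are ignored
            nic_ips[int(match.group(1))] = match.group(2)
    return nic_ips


def device_entries(nic_ips: dict[int, str]) -> list[dict[str, str]]:
    """Build device entries ordered by device_id; rank_id is left to the caller."""
    return [
        {"device_id": str(device_id), "device_ip": ip}
        for device_id, ip in sorted(nic_ips.items())
    ]


# Hypothetical file content; on a real server, read /etc/hccn.conf instead.
sample = """\
address_0=192.168.1.8
netmask_0=255.255.255.0
address_1=192.168.1.9
netmask_1=255.255.255.0
"""

print(device_entries(parse_hccn_conf(sample)))
```

The entries contain only device_id and device_ip, which is what Template 2 expects. For Template 1, a rank_id would still need to be assigned to each entry.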