global-ranktable Description

ClusterD watches the MS Controller and MS Coordinator job pods, together with the ConfigMap that corresponds to each hccl.json file, and generates global-ranktable in real time as they change. Some fields in global-ranktable match those in the hccl.json file. For details about hccl.json, see hccl.json File Description.

  • Example of global-ranktable of the Atlas A2 training product
    {
        "version": "1.0",
        "status": "completed",
        "server_group_list": [
            {
                "group_id": "2",
                "deploy_server": "0",
                "server_count": "1",
                "server_list": [
                    {
                        "device": [
                            {
                                "device_id": "x",
                                "device_ip": "xx.xx.xx.xx",
                                "device_logical_id": "x",
                                "rank_id": "x"
                            }
                        ],
                        "server_id": "xx.xx.xx.xx",
                        "server_ip": "xx.xx.xx.xx"
                    }
                ]
            }
        ]
    }
  • Example of global-ranktable of the Atlas A3 training product
    {
        "version": "1.2",
        "status": "completed",
        "server_group_list": [
            {
                "group_id": "2",
                "deploy_server": "1",
                "server_count": "1",
                "server_list": [
                    {
                        "device": [
                            {
                                "device_id": "0",
                                "device_ip": "xx.xx.xx.xx",
                                "super_device_id": "xxxxx",
                                "device_logical_id": "0",
                                "rank_id": "0"
                            }
                        ],
                        "server_id": "xx.xx.xx.xx",
                        "server_ip": "xx.xx.xx.xx"
                    }
                ],
                "super_pod_list": [
                    {
                        "super_pod_id": "0",
                        "server_list": [
                            {
                                "server_id": "xx.xx.xx.xx"
                            }
                        ]
                    }
                ]
            }
        ]
    }
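As a minimal sketch of how a consumer might read this structure, the following Python walks server_group_list, server_list, and device and returns one flat record per NPU. The function name flatten_devices and all sample values (IP addresses, IDs) are illustrative assumptions and are not part of ClusterD; the sample only mirrors the shape of the Atlas A3 example above.

```python
import json

# Sample shaped like the Atlas A3 example; every value is a made-up placeholder.
SAMPLE = """
{
  "version": "1.2",
  "status": "completed",
  "server_group_list": [
    {
      "group_id": "2",
      "deploy_server": "1",
      "server_count": "1",
      "server_list": [
        {
          "device": [
            {"device_id": "0", "device_ip": "192.0.2.10",
             "super_device_id": "12345",
             "device_logical_id": "0", "rank_id": "0"}
          ],
          "server_id": "192.0.2.1",
          "server_ip": "192.0.2.1"
        }
      ],
      "super_pod_list": [
        {"super_pod_id": "0", "server_list": [{"server_id": "192.0.2.1"}]}
      ]
    }
  ]
}
"""


def flatten_devices(ranktable: dict) -> list:
    """Return one flat record per NPU, tagged with its group and server.

    Hypothetical helper: .get() with defaults is used so that a group or
    server without the expected list simply contributes no rows.
    """
    rows = []
    for group in ranktable.get("server_group_list", []):
        for server in group.get("server_list", []):
            for dev in server.get("device", []):
                rows.append({
                    "group_id": group["group_id"],
                    "server_id": server["server_id"],
                    "server_ip": server["server_ip"],
                    **dev,  # device_id, device_ip, rank_id, ...
                })
    return rows


table = json.loads(SAMPLE)
devices = flatten_devices(table)
for d in devices:
    print(d["rank_id"], d["server_ip"], d["device_ip"])
```

All values in the examples are JSON strings, including numeric-looking ones such as rank_id and server_count, so a caller that needs to sort or compare them numerically would have to convert them explicitly.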
Table 1 Fields in global-ranktable

Field                 Description
version               Version
status                Status
server_group_list     List of server groups
group_id              Group ID
server_count          Number of servers
server_list           Server list
server_id             AI server ID, which is globally unique
server_ip             Pod IP
device_id             NPU device ID
device_ip             NPU device IP
super_device_id       Unique ID of the NPU within a SuperPoD (Atlas A3 training product)
rank_id               Training rank ID of the NPU
device_logical_id     Logical NPU ID
super_pod_list        SuperPoD list
super_pod_id          Logical SuperPoD ID
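
The relationship between these fields can be illustrated with a short Python sketch that works on a single server group. It builds a server_id to super_pod_id lookup from super_pod_list and checks whether server_count agrees with the length of server_list. Both helper names are hypothetical, the sample values are placeholders, and the expectation that server_count equals the number of server_list entries is an assumption inferred from the examples, not a documented guarantee.

```python
# One server group shaped like the Atlas A3 example; values are placeholders.
group = {
    "group_id": "2",
    "server_count": "1",
    "server_list": [
        {"server_id": "192.0.2.1", "server_ip": "192.0.2.1", "device": []}
    ],
    "super_pod_list": [
        {"super_pod_id": "0", "server_list": [{"server_id": "192.0.2.1"}]}
    ],
}


def server_to_super_pod(group: dict) -> dict:
    """Map each server_id to the super_pod_id that lists it (hypothetical helper).

    Returns an empty dict when super_pod_list is absent, as in the
    Atlas A2 example.
    """
    mapping = {}
    for pod in group.get("super_pod_list", []):
        for server in pod.get("server_list", []):
            mapping[server["server_id"]] = pod["super_pod_id"]
    return mapping


def server_count_matches(group: dict) -> bool:
    """Check the assumed invariant: server_count == len(server_list).

    server_count is a string in global-ranktable, so it is converted first.
    """
    return int(group["server_count"]) == len(group.get("server_list", []))


pods = server_to_super_pod(group)
consistent = server_count_matches(group)
print(pods, consistent)
```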