hccl.json File Description

When a training job starts, Ascend Operator generates the ranktable file (hccl.json) that the job requires for collective communication. The collective communicator is constructed from the device IDs and device IP addresses in the ranktable file and is then used to exchange information between devices.

  • If you use a ConfigMap to mount the ranktable file, create a ConfigMap named rings-config-<task name> in the training YAML file when creating the job, and mount it to the /user/serverid/devindex/config directory of the training container. Ascend Operator builds the ranktable file from the annotations that Ascend Device Plugin writes to the job pods and writes the content to the ConfigMap. The content is then mapped to the /user/serverid/devindex/config/hccl.json file in the training container.
  • If you mount the ranktable file in shared storage mode, configure the shared storage or local storage directory in the training YAML file when creating the job, and mount that directory to the /user/serverid/devindex/config directory of the training container. Ascend Operator builds the ranktable file of the job from the annotations that Ascend Device Plugin or volcano-scheduler writes to the job pods and writes the content to <shared storage or local storage directory>/hccl.json. That file is then mapped to the /user/serverid/devindex/config/hccl.json file in the training container.
  • The content of the hccl.json file varies by product model, as described in the following sections.
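
Both mounting modes end with the same file inside the training container, so a training script can wait for it before it initializes collective communication. The following Python code is a minimal sketch of that wait, not part of Ascend Operator. It assumes the mount path described above and the "status" field shown in the examples in this section. The function name wait_for_ranktable and the timeout values are hypothetical.

```python
import json
import time
from pathlib import Path

# Mount path taken from the description above; adjust it if your job mounts elsewhere.
DEFAULT_RANKTABLE = Path("/user/serverid/devindex/config/hccl.json")


def wait_for_ranktable(path=DEFAULT_RANKTABLE, timeout_s=600.0, poll_s=1.0):
    """Poll `path` until it contains JSON whose "status" is "completed".

    Hypothetical helper: returns the parsed ranktable as a dict, or raises
    TimeoutError if the file is not ready within `timeout_s` seconds.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            table = json.loads(Path(path).read_text(encoding="utf-8"))
            if isinstance(table, dict) and table.get("status") == "completed":
                return table
        except (OSError, ValueError):
            # The file is not mounted yet, or it is still being written.
            pass
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{path} was not completed within {timeout_s} s")
        time.sleep(poll_s)
```

Polling is used here only because it needs no extra dependencies. Any other way of waiting for the file works as long as the script checks the "status" value before it reads the device list.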

Atlas training product, Atlas A2 training product, Atlas 800I A2 inference server, A200I A2 Box heterogeneous component

An example of the hccl.json file is as follows:

hccl.json:
----
{
    "status": "completed",  // Whether operations by Ascend Operator are completed
    "server_list": [{    // Node list
        "device": [{   // NPU list
            "device_id": "0",  // NPU device ID
            "device_ip": "192.168.101.xx",   // Device IP address of the NPU
            "rank_id": "0" // Training rank ID corresponding to the NPU
        }, {
            "device_id": "1",
            "device_ip": "192.168.102.xx",
            "rank_id": "1"
        }, {
            "device_id": "2",
            "device_ip": "192.168.103.xx",
            "rank_id": "2"
        }, {
...
        }],
        "server_id": "xx-xx-xx-xx",   // AI server ID, which is globally unique
        "host_ip": "xx.xx.xx.xx",      // Host IP address of the AI server
        "container_ip": "192.168.149.xx",   // Pod IP address
        "hardware_type": "800I-A2-32G"     // Product model
    }],
    "server_count": "1",   // Total number of servers for the job
    "version": "1.0"
}
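
To illustrate how the fields above relate to each other, the following Python sketch builds a lookup from rank ID to the NPU that serves it. It is only an illustration under stated assumptions: the function name rank_map is hypothetical, the sample values are the placeholders from the example above, and the actual file on disk is assumed to be plain JSON without the explanatory // comments. Note that numeric fields such as rank_id and server_count are stored as strings.

```python
def rank_map(table):
    """Return {rank_id: (server_id, device_id, device_ip)} for a parsed ranktable.

    Hypothetical helper. rank_id is converted to int; other values stay strings.
    """
    if len(table["server_list"]) != int(table["server_count"]):
        raise ValueError("server_count does not match the length of server_list")
    ranks = {}
    for server in table["server_list"]:
        for dev in server["device"]:
            rank = int(dev["rank_id"])
            if rank in ranks:
                raise ValueError(f"duplicate rank_id {rank}")
            ranks[rank] = (server["server_id"], dev["device_id"], dev["device_ip"])
    return ranks


# Placeholder data copied from the example above (first two NPUs only).
SAMPLE = {
    "status": "completed",
    "server_list": [{
        "device": [
            {"device_id": "0", "device_ip": "192.168.101.xx", "rank_id": "0"},
            {"device_id": "1", "device_ip": "192.168.102.xx", "rank_id": "1"},
        ],
        "server_id": "xx-xx-xx-xx",
        "host_ip": "xx.xx.xx.xx",
        "container_ip": "192.168.149.xx",
        "hardware_type": "800I-A2-32G",
    }],
    "server_count": "1",
    "version": "1.0",
}

if __name__ == "__main__":
    for rank, npu in sorted(rank_map(SAMPLE).items()):
        print(rank, npu)
```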

Atlas A3 training product

An example of the hccl.json file is as follows:

hccl.json:
----
{
    "status": "completed",  // Whether operations by Ascend Operator are completed
    "server_list": [    // Node list
        {
            "device": [
                {
                    "device_id": "0",     // NPU device ID
                    "device_ip": "xx.xx.xx.xx",  // Device IP address of the NPU
                    "super_device_id": "37748736",   // ID of the NPU within the SuperPoD
                    "rank_id": "0"             // Training rank ID corresponding to the NPU
                },
...
                {
                    "device_id": "7",
                    "device_ip": "xx.xx.xx.xx",
                    "super_device_id": "38600711",
                    "rank_id": "7"
                }
            ],
            "server_id": "xx-xx-xx-xx",  // AI server ID, which is globally unique
            "host_ip": "xx.xx.xx.xx",      // Host IP address of the AI server
            "container_ip": "192.168.149.xx",   // Pod IP address
            "hardware_type": "800I-A3-64G"       // Product model
        }
    ],
    "server_count": "1",
    "version": "1.2",
    "super_pod_list": [   // SuperPoD list
        {
            "super_pod_id": "0",  // Logical SuperPoD ID
            "server_list": [
                {
                    "server_id": "xx-xx-xx-xx"   // AI server ID, which is globally unique
                }
            ]
        }
    ]
}
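
The super_pod_list entries refer to servers by server_id, so they can be joined with server_list to find which ranks belong to each SuperPoD. The following Python sketch shows one way to do that join. It is an illustration only: the function name ranks_by_super_pod is hypothetical, the sample reuses the placeholder values from the example above, and the actual file is assumed to be plain JSON without the // comments.

```python
def ranks_by_super_pod(table):
    """Return {super_pod_id: sorted list of int rank IDs} for a parsed ranktable.

    Hypothetical helper. Servers that no super_pod_list entry mentions are
    grouped under the key None.
    """
    pod_of_server = {}
    for pod in table.get("super_pod_list", []):
        for server in pod["server_list"]:
            pod_of_server[server["server_id"]] = pod["super_pod_id"]

    groups = {}
    for server in table["server_list"]:
        pod_id = pod_of_server.get(server["server_id"])
        ranks = groups.setdefault(pod_id, [])
        ranks.extend(int(dev["rank_id"]) for dev in server["device"])
    for ranks in groups.values():
        ranks.sort()
    return groups


# Placeholder data copied from the example above (first and last NPU only).
SAMPLE_A3 = {
    "status": "completed",
    "server_list": [{
        "device": [
            {"device_id": "0", "device_ip": "xx.xx.xx.xx",
             "super_device_id": "37748736", "rank_id": "0"},
            {"device_id": "7", "device_ip": "xx.xx.xx.xx",
             "super_device_id": "38600711", "rank_id": "7"},
        ],
        "server_id": "xx-xx-xx-xx",
    }],
    "server_count": "1",
    "version": "1.2",
    "super_pod_list": [
        {"super_pod_id": "0", "server_list": [{"server_id": "xx-xx-xx-xx"}]},
    ],
}

if __name__ == "__main__":
    print(ranks_by_super_pod(SAMPLE_A3))
```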