Failed to Generate the hccl.json File

Symptom

After a training job is started, the hccl.json file in the training job container remains in the initializing state. The default file path is /user/serverid/devindex/config/hccl.json.

To check the file, run the kubectl exec -it XXX bash command to access the container, where XXX is the pod name. If the pod is not in the default namespace, add -n XXX to specify the namespace, for example, kubectl exec -it XXX -n XXX bash.
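
Once inside the container, the state of the file can be read with a small helper such as the one below. This is a minimal sketch: the function name hccl_status is hypothetical, and it assumes the state is recorded as a string field named "status" in hccl.json.

```shell
# Print the value of the "status" field of an hccl.json file.
# Assumption: the state is stored as a string field named "status".
# With no argument, the default path from this section is used.
hccl_status() {
    file="${1:-/user/serverid/devindex/config/hccl.json}"
    # Take the first "status": "<value>" pair and keep only <value>.
    grep -o '"status"[[:space:]]*:[[:space:]]*"[^"]*"' "$file" \
        | head -n 1 \
        | sed 's/.*"\([^"]*\)"$/\1/'
}
```

Running hccl_status without arguments inside the container prints the current state; a value of initializing matches the symptom described above.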

Cause Analysis

  • Cause 1: The Ascend Operator is not started properly.
  • Cause 2: The startup parameter -volcanoType of the Ascend Device Plugin is set to false. You can run the following command to check whether this parameter is set to false:
    ps -ef | grep "device-plugin"
  • Cause 3: The Ascend Device Plugin fails to obtain the correct device IP address. As a result, the pod annotations cannot be written. In this case, the log of the Ascend Device Plugin component contains the following information:
    Get device ip failed
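
The checks for causes 2 and 3 can be scripted roughly as follows. This is a sketch under stated assumptions: the helper names are hypothetical, the flag is assumed to appear on the command line in the form -volcanoType=<value>, and the pod name and namespace in the usage comments are placeholders in the same XXX style used above.

```shell
# Read `ps -ef` output on stdin and print the value of -volcanoType.
# Assumption: the flag is written as -volcanoType=<value>.
# Prints nothing if no device-plugin process or flag is found.
volcano_type() {
    grep 'device-plugin' \
        | grep -v 'grep' \
        | grep -o -- '-volcanoType=[a-z]*' \
        | head -n 1 \
        | cut -d= -f2
}

# Read component log text on stdin; succeed if the cause 3 message is present.
has_device_ip_error() {
    grep -q 'Get device ip failed'
}

# Usage (XXX is a placeholder for the pod name or namespace):
#   ps -ef | volcano_type
#   kubectl logs XXX -n XXX | has_device_ip_error && echo "cause 3 likely"
```

If volcano_type prints false, cause 2 applies; if has_device_ip_error succeeds on the Ascend Device Plugin log, cause 3 applies.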

Solution

For cause 1, reinstall Ascend Operator by referring to Installing Ascend Operator.
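
Before and after reinstalling, it can help to confirm whether the operator pod is actually running. The helper below is a sketch only: the name operator_running is hypothetical, and it assumes the operator pod name contains "ascend-operator", which may differ in a given deployment.

```shell
# Read `kubectl get pods -A` output on stdin. Succeed only if at least one
# pod whose name contains the pattern exists and every such pod has the
# STATUS Running. Columns: NAMESPACE NAME READY STATUS RESTARTS AGE.
# Assumption: the operator pod name contains "ascend-operator".
operator_running() {
    pattern="${1:-ascend-operator}"
    awk -v p="$pattern" '
        index($2, p) { found = 1; if ($4 != "Running") bad = 1 }
        END { exit (found && !bad) ? 0 : 1 }
    '
}

# Usage:
#   kubectl get pods -A | operator_running || echo "Ascend Operator is not running"
```

The check looks only at the STATUS column, so a pod that is Running but not yet ready still passes; inspect the READY column separately if that matters.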

For cause 2, set the startup parameter -volcanoType of the Ascend Device Plugin to true by referring to Ascend Device Plugin, and then apply the corresponding YAML file again.

For cause 3, configure the device IP address correctly. For details, see "Using HCCN Tool" in MindCluster Ascend Deployer User Guide.