Failed to Generate the hccl.json File
Symptom
After a training job is started, the hccl.json file in the training job container remains in the initializing state. The default file path is /user/serverid/devindex/config/hccl.json.
To access the container, run the kubectl exec -it XXX bash command, where XXX is the pod name. If the pod is not in the default namespace, add -n XXX to specify the namespace, for example, kubectl exec -it XXX -n XXX bash.
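The state of the file can also be read with a small helper once you are inside the container. The following is a minimal sketch, not part of the product: the function name hccl_status is hypothetical, and it assumes that hccl.json records its state in a field of the form "status":"<value>".

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value of the "status" field in an hccl.json file.
# Assumption: the file contains a field written as "status":"<value>".
hccl_status() {
    local file="$1"
    # Capture the text between the quotes that follow "status": and print the first match.
    sed -n 's/.*"status"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$file" | head -n 1
}
```

For example, hccl_status /user/serverid/devindex/config/hccl.json prints initializing while the symptom persists.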

Causes
- Cause 1: The HCCL-Controller is not started properly.
- Cause 2: The startup parameter -volcanoType is set to false for the Ascend Device Plugin. Run the following command to check the parameter setting:
ps -ef | grep "device-plugin"
- Cause 3: The Ascend Device Plugin fails to obtain the correct device IP address. As a result, the annotations of the pod cannot be written. To confirm this cause, check the component logs for the following message:
Get device ip failed
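The checks for causes 2 and 3 can be scripted roughly as follows. This is a sketch under stated assumptions: the function names volcano_type and has_device_ip_error are hypothetical, the parameter is assumed to appear on the command line as -volcanoType=<value> or -volcanoType <value>, and the log file path must be supplied by you because it depends on your deployment.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the value passed to -volcanoType in a command line string.
# Assumption: the parameter is written as -volcanoType=<value> or -volcanoType <value>.
volcano_type() {
    printf '%s\n' "$1" | sed -n 's/.*-volcanoType[= ]\([^ ]*\).*/\1/p'
}

# Hypothetical helper: succeed if the given log file contains the cause 3 message.
has_device_ip_error() {
    grep -q "Get device ip failed" "$1"
}
```

For example, volcano_type "$(ps -ef | grep "device-plugin" | grep -v grep)" prints the current setting on the node, and has_device_ip_error followed by the path of the Ascend Device Plugin log file returns success when the message is present.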
Solution
For cause 1, reinstall the HCCL-Controller by referring to the MindX DL Cluster Scheduling User Guide.
For cause 2, set the startup parameter -volcanoType of the Ascend Device Plugin to true by referring to the MindX DL Cluster Scheduling User Guide, and then apply the corresponding YAML file again.
For cause 3, ensure that the device IP address is correctly configured. For details, see "Development Environment Installation (Training) > Changing NPU IP Addresses" in the CANN Software Installation Guide.