Fault Locating
The HCCL-Controller generates the communication configuration file hccl.json for training jobs.
If the HCCL-Controller is not started properly or hccl.json is not generated in the training job pod after the HCCL-Controller is started, perform the following steps to locate the fault:
Figure 1 Process of locating and rectifying the fault


- Check the pod status of the HCCL-Controller.
If the HCCL-Controller pod is running normally and the status in the training job's ConfigMap (the status field of hccl.json) is completed:
Example:
    root@ubuntu:/home# kubectl get cm
    NAME                     DATA   AGE
    resnet1-1-svc            3      24h
    rings-config-resnet1-1   1      24h
    root@ubuntu:/home# kubectl describe cm rings-config-resnet1-1
    Name:         rings-config-resnet1-1
    Namespace:    default
    Labels:       ring-controller.atlas=ascend-910
    Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                    {"apiVersion":"v1","data":{"hccl.json":"{\n \"status\":\"initializing\"\n}\n"},"kind":"ConfigMap","metadata":{"annotations":{},"labels"...

    Data
    ====
    hccl.json:
    ----
    {"group_list":[{"instance_list":[{"devices":[{"device_id":"0","device_ip":"192.168.100.101"}],"pod_name":"0","server_id":"90.xxx"}], "group_name":"default-test","device_count":"1","instance_count":"1"}], "status":"completed","group_count":"1"}
    Events:  <none>

The ConfigMap is generated properly. You need to check whether the ConfigMap is correctly mounted to the fixed path of the container in the startup YAML file. The reference configuration is as follows (unnecessary information is omitted here):
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: rings-config-resnet
      namespace: kube-system
      labels:
        ring-controller.atlas: ascend-910
    data:
      hccl.json: |
        {
            "status":"initializing"
        }
    ---
    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    ...
    spec:
      ...
      tasks:
      - name: "default-test"
        ...
          spec:
            containers:
            ...
              volumeMounts:
              - name: ascend-910-config
                mountPath: /user/serverid/devindex/config
            volumes:
            - name: ascend-910-config
              configMap:
                name: rings-config-resnet

- If the ConfigMap status of the pod of the training job is initializing:
    root@ubuntu:/home# kubectl get cm -A
    NAMESPACE     NAME                     DATA   AGE
    default       resnet1-1-svc            3      13s
    default       rings-config-resnet1-1   1      13s
    kube-public   cluster-info             4      22h
    kube-system   calico-config            4      22h
    kube-system   coredns                  1      22h
    ...
    root@ubuntu:/home# kubectl describe cm rings-config-resnet1-1
    Name:         rings-config-resnet1-1
    Namespace:    default
    Labels:       ring-controller.atlas=ascend-910
    Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                    ...

    Data
    ====
    hccl.json:
    ----
    {"status":"initializing"}
    Events:  <none>

- Check whether the Annotations field of the training job's pod contains ascend.kubectl.kubernetes.io/ascend-910-configuration (the example output that follows shows this key with the atlas.kubectl.kubernetes.io prefix). If it does not, check the Ascend Device Plugin. The usual cause is that the startup parameter -volcanoType is set to false or that the device_ip of the NPU is not configured. See "Failed to Generate the hccl.json File" for a solution.
    root@ubuntu:/home# kubectl describe pod resnet1-1-default-test-0
    Name:         resnet1-1-default-test-0
    Namespace:    default
    Priority:     0
    Node:         ubuntu/90.91.58.xxx
    Start Time:   Thu, 24 Dec 2020 14:12:50 -0500
    Labels:       app=tf
                  ring-controller.atlas=ascend-910
                  volcano.sh/job-name=resnet1-1
                  volcano.sh/job-namespace=default
    Annotations:  atlas.kubectl.kubernetes.io/ascend-910-configuration:
                    {"pod_name":"0","server_id":"90.91.58.153","devices":[{"device_id":"0","device_ip":"192.168.100.102"}]}
                  cni.projectcalico.org/podIP: 192.168.245.221/32
                  cni.projectcalico.org/podIPs: 192.168.245.221/32
                  huawei.com/Ascend910: Ascend910-0
                  predicate-time: 18446744073709551615
                  scheduling.k8s.io/group-name: resnet1-1
                  volcano.sh/job-name: resnet1-1
                  volcano.sh/job-version: 0
                  volcano.sh/task-spec: default-test
    ...

- If the preceding Annotations field exists, you can also view the /var/log/mindx-dl/hccl-controller/hccl-controller.log file or restart the HCCL-Controller and training jobs. If the fault persists, contact Huawei technical support.
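The checks above can be summarized as a small decision helper. The following Python sketch is illustrative only and is not part of the HCCL-Controller: the function name next_step and the returned hint strings are hypothetical. It assumes that you have already copied the hccl.json text from the ConfigMap and the pod's annotation keys from the kubectl describe output, and it matches the annotation by its /ascend-910-configuration suffix so that either key prefix is accepted.

```python
import json

# Assumption: match only the suffix so that both the "ascend." and "atlas."
# key prefixes seen in this topic are accepted.
ANNOTATION_SUFFIX = "kubectl.kubernetes.io/ascend-910-configuration"


def next_step(hccl_json_text, pod_annotation_keys):
    """Return a hint for the next troubleshooting step (hypothetical helper).

    hccl_json_text: content of the hccl.json key in the job's ConfigMap.
    pod_annotation_keys: iterable of annotation keys on the training job's pod.
    """
    try:
        data = json.loads(hccl_json_text)
    except json.JSONDecodeError:
        return "hccl.json is not valid JSON: inspect the ConfigMap"
    if not isinstance(data, dict):
        return "hccl.json is not a JSON object: inspect the ConfigMap"

    status = data.get("status")
    if status == "completed":
        # The ConfigMap is generated; the remaining suspect is the mount.
        return "check the ConfigMap volume mount in the job YAML"
    if status == "initializing":
        has_annotation = any(
            key.endswith(ANNOTATION_SUFFIX) for key in pod_annotation_keys
        )
        if not has_annotation:
            return "check the Ascend Device Plugin (-volcanoType, device_ip)"
        return "check hccl-controller.log or restart HCCL-Controller and the job"
    return "unexpected status: inspect the ConfigMap"


if __name__ == "__main__":
    # Values taken from the example output in this topic.
    print(next_step('{"status":"initializing"}',
                    ["cni.projectcalico.org/podIP"]))
```

With the example values above, the helper reports that the Ascend Device Plugin should be checked, because the ConfigMap is still initializing and the pod carries no ascend-910-configuration annotation.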