Fault Locating

The HCCL-Controller generates the communication configuration file hccl.json for training jobs.

If the HCCL-Controller fails to start properly, or hccl.json is not generated in the training job pod after the HCCL-Controller has started, perform the following steps to locate the fault:

Figure 1 Process of locating and rectifying the fault
  1. Check the pod status of the HCCL-Controller.

    If the pod status of the HCCL-Controller is normal, query the ConfigMap of the training job and check the status field in hccl.json. The following applies when the status is completed.

    Example:

    root@ubuntu:/home# kubectl get cm 
    NAME                     DATA   AGE
    resnet1-1-svc            3      24h
    rings-config-resnet1-1   1      24h
    root@ubuntu:/home# kubectl describe cm rings-config-resnet1-1 
    Name:         rings-config-resnet1-1
    Namespace:    default
    Labels:       ring-controller.atlas=ascend-910
    Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                    {"apiVersion":"v1","data":{"hccl.json":"{\n    \"status\":\"initializing\"\n}\n"},"kind":"ConfigMap","metadata":{"annotations":{},"labels"...
    
    Data
    ====
    hccl.json:
    ----
    {"group_list":[{"instance_list":[{"devices":[{"device_id":"0","device_ip":"192.168.100.101"}],"pod_name":"0","server_id":"90.xxx"}],
    "group_name":"default-test","device_count":"1","instance_count":"1"}],
    "status":"completed","group_count":"1"}
    Events:  <none>

    In this case the ConfigMap has been generated properly. Check whether the startup YAML file of the job mounts the ConfigMap to the fixed path in the container. A reference configuration follows (unrelated fields are omitted):

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: rings-config-resnet
      namespace: kube-system
      labels:
        ring-controller.atlas: ascend-910
    data:
      hccl.json: |
        {
            "status":"initializing"
        }
    ---
    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    ...
    spec:
      ...
      tasks:
      - name: "default-test"
          ...
          spec:
            containers:
              ...
              volumeMounts:
              - name: ascend-910-config
                mountPath: /user/serverid/devindex/config
            volumes:
            - name: ascend-910-config
              configMap:
                name: rings-config-resnet
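  2. If the status in hccl.json of the training job's ConfigMap is still initializing, as in the following example, the file has not been generated yet. Continue with the next step.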
    root@ubuntu:/home# kubectl get cm -A
    NAMESPACE        NAME                                 DATA   AGE
    default          resnet1-1-svc                        3      13s
    default          rings-config-resnet1-1               1      13s
    kube-public      cluster-info                         4      22h
    kube-system      calico-config                        4      22h
    kube-system      coredns                              1      22h
    ...
    root@ubuntu:/home# kubectl describe cm rings-config-resnet1-1
    Name:         rings-config-resnet1-1
    Namespace:    default
    Labels:       ring-controller.atlas=ascend-910
    Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                  ...
    
    Data
    ====
    hccl.json:
    ----
    {"status":"initializing"}
    Events:  <none>
  3. Check whether the Annotations field of the training job's pod contains ascend.kubectl.kubernetes.io/ascend-910-configuration (shown with the atlas. prefix in the example below). If it does not, check the Ascend Device Plugin. The usual cause is that the startup parameter -volcanoType is set to false or that the device_ip of the NPU is not configured. See "Failed to Generate the hccl.json File" for a solution.
    root@ubuntu:/home# kubectl describe pod resnet1-1-default-test-0 
    Name:         resnet1-1-default-test-0
    Namespace:    default
    Priority:     0
    Node:         ubuntu/90.91.58.xxx
    Start Time:   Thu, 24 Dec 2020 14:12:50 -0500
    Labels:       app=tf
                  ring-controller.atlas=ascend-910
                  volcano.sh/job-name=resnet1-1
                  volcano.sh/job-namespace=default
    Annotations:  atlas.kubectl.kubernetes.io/ascend-910-configuration:
                    {"pod_name":"0","server_id":"90.91.58.153","devices":[{"device_id":"0","device_ip":"192.168.100.102"}]}
                  cni.projectcalico.org/podIP: 192.168.245.221/32
                  cni.projectcalico.org/podIPs: 192.168.245.221/32
                  huawei.com/Ascend910: Ascend910-0
                  predicate-time: 18446744073709551615
                  scheduling.k8s.io/group-name: resnet1-1
                  volcano.sh/job-name: resnet1-1
                  volcano.sh/job-version: 0
                  volcano.sh/task-spec: default-test
    ...
  4. If the annotation is present, check the /var/log/mindx-dl/hccl-controller/hccl-controller.log file for errors, or restart the HCCL-Controller and the training job. If the fault persists, contact Huawei technical support.
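
The decision flow in the steps above can be sketched in Python. This is a minimal illustration and not part of MindX DL: the function names and the returned labels are hypothetical. The inputs are assumed to be the parsed output of kubectl get cm <name> -o json and kubectl get pod <name> -o json. Because step 3 names the annotation with the ascend. prefix while the sample output shows atlas., the sketch accepts either.

```python
import json

# Annotation key from step 3. The prefix differs between the step text
# ("ascend.") and the sample output ("atlas."), so both are accepted here.
ANNOTATION_SUFFIX = "kubectl.kubernetes.io/ascend-910-configuration"
ANNOTATION_PREFIXES = ("ascend.", "atlas.")


def hccl_status(configmap):
    """Return the "status" field of hccl.json stored in a ConfigMap dict."""
    hccl_text = configmap["data"]["hccl.json"]
    return json.loads(hccl_text).get("status")


def has_device_annotation(pod):
    """True if the pod carries the Ascend 910 device configuration annotation."""
    annotations = pod.get("metadata", {}).get("annotations", {})
    return any(p + ANNOTATION_SUFFIX in annotations for p in ANNOTATION_PREFIXES)


def configmap_mount_paths(pod, configmap_name):
    """List the container paths where the named ConfigMap is mounted."""
    spec = pod.get("spec", {})
    # Volumes that are backed by the ConfigMap we are looking for.
    volume_names = {
        v["name"]
        for v in spec.get("volumes", [])
        if v.get("configMap", {}).get("name") == configmap_name
    }
    return [
        m["mountPath"]
        for c in spec.get("containers", [])
        for m in c.get("volumeMounts", [])
        if m["name"] in volume_names
    ]


def next_step(configmap, pod):
    """Map the observed state to the step to follow (labels are hypothetical)."""
    status = hccl_status(configmap)
    if status == "completed":
        return "check-volume-mount"       # step 1: verify the mount path
    if status == "initializing":
        if not has_device_annotation(pod):
            return "check-device-plugin"  # step 3: annotation missing
        return "check-controller-log"     # step 4: inspect log or restart
    return "unknown-status"
```

For example, a ConfigMap whose hccl.json is {"status":"initializing"} combined with a pod that has no device annotation yields check-device-plugin, which corresponds to step 3. When the result is check-volume-mount, configmap_mount_paths(pod, name) shows whether the ConfigMap is mounted at all and at which path.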