Troubleshooting Approach

HCCL-Controller is mainly responsible for generating hccl.json, the communication configuration file for training jobs.

If HCCL-Controller does not start properly, or starts but does not generate hccl.json in the training job Pod, troubleshoot as follows.

Figure 1 Fault locating approach
  1. In the "/home" directory, run the following command to view the ConfigMaps, including the one that belongs to the training job.

    kubectl get cm

    If the HCCL-Controller Pod is running normally and the status in the ConfigMap of the training job Pod is completed, the command output is similar to the following:

    NAME                     DATA   AGE
    resnet1-1-svc            3      24h
    rings-config-resnet1-1   1      24h
    root@ubuntu:/home# kubectl describe cm rings-config-resnet1-1 
    Name:         rings-config-resnet1-1
    Namespace:    default
    Labels:       ring-controller.atlas=ascend-910
    Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                    {"apiVersion":"v1","data":{"hccl.json":"{\n    \"status\":\"initializing\"\n}\n"},"kind":"ConfigMap","metadata":{"annotations":{},"labels"...
    
    Data
    ====
    hccl.json:
    ----
    {"group_list":[{"instance_list":[{"devices":[{"device_id":"0","device_ip":"192.168.100.101"}],"pod_name":"0","server_id":"90.xxx"}],
    "group_name":"default-test","device_count":"1","instance_count":"1"}],
    "status":"completed","group_count":"1"}
    Events:  <none>

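    The output above indicates that the ConfigMap was generated correctly. In this case, check whether the startup YAML mounts the ConfigMap at the fixed path in the container. A reference configuration follows, with unrelated parts omitted.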

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: rings-config-resnet
      namespace: kube-system
      labels:
        ring-controller.atlas: ascend-910
    data:
      hccl.json: |
        {
            "status":"initializing"
        }
    ---
    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    ...
    spec:
      ...
      tasks:
      - name: "default-test"
          ...
          spec:
            containers:
              ...
              volumeMounts:
              - name: ascend-910-config
                mountPath: /user/serverid/devindex/config
            volumes:
            - name: ascend-910-config
              configMap:
                name: rings-config-resnet
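
The mount check in step 1 can also be scripted. The following is a minimal Python sketch, assuming the Pod template spec from the job YAML has already been loaded into a plain dictionary. The function name `configmap_mount_paths` and the container name in the example are hypothetical and are not part of MindX DL.

```python
def configmap_mount_paths(pod_spec, configmap_name):
    """Return the container mount paths backed by the given ConfigMap.

    pod_spec is the Pod template "spec" mapping (containers, volumes) as a
    plain dict. An empty list means no container mounts the ConfigMap.
    """
    # Names of the volumes whose source is the ConfigMap we are looking for.
    volume_names = {
        volume["name"]
        for volume in pod_spec.get("volumes", [])
        if (volume.get("configMap") or {}).get("name") == configmap_name
    }
    paths = []
    for container in pod_spec.get("containers", []):
        for mount in container.get("volumeMounts", []):
            if mount.get("name") in volume_names:
                paths.append(mount["mountPath"])
    return paths


# Example mirroring the reference configuration above.
example_spec = {
    "containers": [
        {
            "name": "example-container",  # hypothetical container name
            "volumeMounts": [
                {"name": "ascend-910-config",
                 "mountPath": "/user/serverid/devindex/config"}
            ],
        }
    ],
    "volumes": [
        {"name": "ascend-910-config",
         "configMap": {"name": "rings-config-resnet"}}
    ],
}
print(configmap_mount_paths(example_spec, "rings-config-resnet"))
```

An empty result for the ConfigMap name used by the job suggests that the volume or the volumeMounts entry is missing or refers to a different name.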

  2. If the status in the ConfigMap of the training job Pod is initializing, the output is similar to the following:

    root@ubuntu:/home# kubectl get cm -A
    NAMESPACE        NAME                                 DATA   AGE
    default          resnet1-1-svc                        3      13s
    default          rings-config-resnet1-1               1      13s
    kube-public      cluster-info                         4      22h
    kube-system      calico-config                        4      22h
    kube-system      coredns                              1      22h
    ...
    root@ubuntu:/home# kubectl describe cm  rings-config-resnet1-1       
    Name:         rings-config-resnet1-1
    Namespace:    default
    Labels:       ring-controller.atlas=ascend-910
    Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                  ...
    
    Data
    ====
    hccl.json:
    ----
    {"status":"initializing"}
    Events:  <none>
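
The distinction between steps 1 and 2 comes down to the "status" field of hccl.json. The sketch below, in Python, reads that field from the file content and maps it to the next check described in this section. It assumes the content has already been extracted as text, for example with `kubectl get cm rings-config-resnet1-1 -o jsonpath='{.data.hccl\.json}'`. The function names `hccl_status` and `next_check` are hypothetical.

```python
import json


def hccl_status(hccl_json_text):
    """Return the "status" value from hccl.json content, or None if unreadable."""
    try:
        document = json.loads(hccl_json_text)
    except json.JSONDecodeError:
        return None
    if not isinstance(document, dict):
        return None
    return document.get("status")


def next_check(status):
    """Map a status value to the next check described in this section."""
    if status == "completed":
        return "ConfigMap generated: verify the ConfigMap mount in the job YAML"
    if status == "initializing":
        return "ConfigMap not filled in: inspect the Pod annotations"
    return "Unknown status: inspect the HCCL-Controller log"


# Content taken from the initializing example above.
print(next_check(hccl_status('{"status":"initializing"}')))
```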

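  3. Next, check whether the Annotations field of the training job Pod contains ascend.kubectl.kubernetes.io/ascend-910-configuration. If it does not, troubleshoot Ascend Device Plugin. Typical causes are that the startup parameter "-volcanoType=false" is set, or that device_ip is not configured for the NPU chips. For handling methods, see the section on the hccl.json file not being generated.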

    root@ubuntu:/home# kubectl describe pod resnet1-1-default-test-0 
    Name:         resnet1-1-default-test-0
    Namespace:    default
    Priority:     0
    Node:         ubuntu/90.91.58.xxx
    Start Time:   Thu, 24 Dec 2020 14:12:50 -0500
    Labels:       app=tf
                  ring-controller.atlas=ascend-910
                  volcano.sh/job-name=resnet1-1
                  volcano.sh/job-namespace=default
    Annotations:  atlas.kubectl.kubernetes.io/ascend-910-configuration:
                    {"pod_name":"0","server_id":"xx.xx.xx.xx","devices":[{"device_id":"0","device_ip":"192.168.100.102"}]}
                  cni.projectcalico.org/podIP: 192.168.245.221/32
                  cni.projectcalico.org/podIPs: 192.168.245.221/32
                  huawei.com/Ascend910: Ascend910-0
                  predicate-time: 18446744073709551615
                  scheduling.k8s.io/group-name: resnet1-1
                  volcano.sh/job-name: resnet1-1
                  volcano.sh/job-version: 0
                  volcano.sh/task-spec: default-test
    ...
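
The annotation check in step 3 can be sketched in Python as follows. This assumes the Pod annotations are available as a dictionary (for example from `kubectl get pod <name> -o json`). Matching the key by its "/ascend-910-configuration" suffix rather than by the full key is an assumption made for this illustration, and the function names are hypothetical.

```python
import json

# Assumption for this sketch: match the annotation key by its suffix.
ANNOTATION_SUFFIX = "/ascend-910-configuration"


def find_device_config(annotations):
    """Return the parsed device configuration annotation, or None if absent."""
    for key, value in annotations.items():
        if key.endswith(ANNOTATION_SUFFIX):
            return json.loads(value)
    return None


def devices_missing_ip(config):
    """Return the device_id of every device with no device_ip set."""
    return [
        device.get("device_id")
        for device in config.get("devices", [])
        if not device.get("device_ip")
    ]


# Annotation value taken from the example output above.
annotations = {
    "atlas.kubectl.kubernetes.io/ascend-910-configuration":
        '{"pod_name":"0","server_id":"xx.xx.xx.xx",'
        '"devices":[{"device_id":"0","device_ip":"192.168.100.102"}]}',
    "volcano.sh/job-name": "resnet1-1",
}
config = find_device_config(annotations)
if config is None:
    print("annotation missing: check Ascend Device Plugin")
else:
    print("devices without device_ip:", devices_missing_ip(config))
```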

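  4. If the annotation above is present, check the log file "/var/log/mindx-dl/hccl-controller/hccl-controller.log". You can also try restarting HCCL-Controller and the training job. If the problem persists, contact Huawei engineers.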