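HCCL-Controller is mainly responsible for generating hccl.json, the communication configuration file for training jobs.
If HCCL-Controller does not start properly, or starts but does not generate hccl.json inside the training job's Pod, troubleshoot as follows.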
kubectl get cm
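This case applies when the HCCL-Controller Pod is in a normal state but the ConfigMap corresponding to the training job's Pod already shows the status completed.
Example output: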
NAME                     DATA   AGE
resnet1-1-svc            3      24h
rings-config-resnet1-1   1      24h
root@ubuntu:/home# kubectl describe cm rings-config-resnet1-1
Name:         rings-config-resnet1-1
Namespace:    default
Labels:       ring-controller.atlas=ascend-910
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                {"apiVersion":"v1","data":{"hccl.json":"{\n \"status\":\"initializing\"\n}\n"},"kind":"ConfigMap","metadata":{"annotations":{},"labels"...

Data
====
hccl.json:
----
{"group_list":[{"instance_list":[{"devices":[{"device_id":"0","device_ip":"192.168.100.101"}],"pod_name":"0","server_id":"90.xxx"}], "group_name":"default-test","device_count":"1","instance_count":"1"}], "status":"completed","group_count":"1"}
Events:  <none>
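The status check on the ConfigMap content can also be scripted. The following is a minimal sketch, assuming the hccl.json text has already been obtained (for example, copied from the kubectl describe cm output). The function name summarize_hccl is hypothetical and is not part of HCCL-Controller; the field names are taken from the example output.

```python
import json


def summarize_hccl(hccl_text):
    """Return (status, device_count) for an hccl.json document.

    Hypothetical helper for illustration only; field names follow
    the example ConfigMap output.
    """
    config = json.loads(hccl_text)
    status = config.get("status")
    # A file that is still initializing has no group_list, so default to empty.
    devices = sum(
        len(instance.get("devices", []))
        for group in config.get("group_list", [])
        for instance in group.get("instance_list", [])
    )
    return status, devices


sample = (
    '{"group_list":[{"instance_list":[{"devices":[{"device_id":"0",'
    '"device_ip":"192.168.100.101"}],"pod_name":"0","server_id":"90.xxx"}],'
    '"group_name":"default-test","device_count":"1","instance_count":"1"}],'
    '"status":"completed","group_count":"1"}'
)
print(summarize_hccl(sample))
print(summarize_hccl('{"status":"initializing"}'))
```

A status of completed together with a non-zero device count indicates that the ConfigMap itself is populated, so the remaining suspect is how the file reaches the container.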
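The output above shows that the ConfigMap was generated correctly. Next, check whether the startup YAML mounts the ConfigMap at the fixed path inside the container. A reference configuration follows; unrelated parts are omitted.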
apiVersion: v1
kind: ConfigMap
metadata:
  name: rings-config-resnet
  namespace: kube-system
  labels:
    ring-controller.atlas: ascend-910
data:
  hccl.json: |
    {
        "status":"initializing"
    }
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
...
spec:
  ...
  tasks:
  - name: "default-test"
    ...
      spec:
        containers:
        ...
          volumeMounts:
          - name: ascend-910-config
            mountPath: /user/serverid/devindex/config
        volumes:
        - name: ascend-910-config
          configMap:
            name: rings-config-resnet
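Whether the mount took effect can be verified from inside the training container. The following is a minimal sketch, assuming Python is available in the container image (an assumption, not something the reference configuration guarantees). It relies on the standard Kubernetes behaviour that each ConfigMap key becomes a file under the mountPath, so the key hccl.json is expected at /user/serverid/devindex/config/hccl.json. The function name read_mounted_status is hypothetical.

```python
import json
import os

# Mount path taken from the reference YAML; the file name is the ConfigMap key.
MOUNT_PATH = "/user/serverid/devindex/config"


def read_mounted_status(mount_path=MOUNT_PATH):
    """Return the hccl.json status, or None if the file is not mounted."""
    hccl_path = os.path.join(mount_path, "hccl.json")
    if not os.path.isfile(hccl_path):
        return None
    with open(hccl_path, encoding="utf-8") as handle:
        return json.load(handle).get("status")


if __name__ == "__main__":
    status = read_mounted_status()
    if status is None:
        print("hccl.json not found; check volumeMounts and volumes")
    else:
        print("hccl.json status:", status)
```

If the file is missing, compare the volume name under volumeMounts with the name under volumes, and the configMap name with the ConfigMap that actually exists in the job's namespace.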
root@ubuntu:/home# kubectl get cm -A
NAMESPACE     NAME                     DATA   AGE
default       resnet1-1-svc            3      13s
default       rings-config-resnet1-1   1      13s
kube-public   cluster-info             4      22h
kube-system   calico-config            4      22h
kube-system   coredns                  1      22h
...
root@ubuntu:/home# kubectl describe cm rings-config-resnet1-1
Name:         rings-config-resnet1-1
Namespace:    default
Labels:       ring-controller.atlas=ascend-910
Annotations:  kubectl.kubernetes.io/last-applied-configuration:
                ...

Data
====
hccl.json:
----
{"status":"initializing"}
Events:  <none>
root@ubuntu:/home# kubectl describe pod resnet1-1-default-test-0
Name:         resnet1-1-default-test-0
Namespace:    default
Priority:     0
Node:         ubuntu/90.91.58.xxx
Start Time:   Thu, 24 Dec 2020 14:12:50 -0500
Labels:       app=tf
              ring-controller.atlas=ascend-910
              volcano.sh/job-name=resnet1-1
              volcano.sh/job-namespace=default
Annotations:  atlas.kubectl.kubernetes.io/ascend-910-configuration:
                {"pod_name":"0","server_id":"xx.xx.xx.xx","devices":[{"device_id":"0","device_ip":"192.168.100.102"}]}
              cni.projectcalico.org/podIP: 192.168.245.221/32
              cni.projectcalico.org/podIPs: 192.168.245.221/32
              huawei.com/Ascend910: Ascend910-0
              predicate-time: 18446744073709551615
              scheduling.k8s.io/group-name: resnet1-1
              volcano.sh/job-name: resnet1-1
              volcano.sh/job-version: 0
              volcano.sh/task-spec: default-test
...
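The Pod annotation atlas.kubectl.kubernetes.io/ascend-910-configuration carries the fields pod_name, server_id, and devices, which are the same fields an instance_list entry has in a completed hccl.json. The following is a minimal sketch that checks whether a Pod's annotation entry has made it into hccl.json. It assumes the two are expected to match, which the shared field names suggest but this section does not state; the function name annotation_in_hccl is hypothetical.

```python
import json

ANNOTATION_KEY = "atlas.kubectl.kubernetes.io/ascend-910-configuration"


def annotation_in_hccl(annotation_value, hccl_text):
    """Return True if the Pod's annotation entry appears in hccl.json.

    Hypothetical helper; matches on pod_name and devices only.
    """
    entry = json.loads(annotation_value)
    config = json.loads(hccl_text)
    return any(
        instance.get("pod_name") == entry.get("pod_name")
        and instance.get("devices") == entry.get("devices")
        for group in config.get("group_list", [])
        for instance in group.get("instance_list", [])
    )


# Annotation value copied from the kubectl describe pod output.
annotations = {
    ANNOTATION_KEY: '{"pod_name":"0","server_id":"xx.xx.xx.xx",'
    '"devices":[{"device_id":"0","device_ip":"192.168.100.102"}]}'
}
# With hccl.json still initializing there is no group_list, so this prints False.
print(annotation_in_hccl(annotations[ANNOTATION_KEY], '{"status":"initializing"}'))
```

A result of False while the annotation is present on the Pod means the device information exists on the Pod side but has not been written into the ConfigMap.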