HCCL Communication Times Out and the Job Fails After hostNetwork Is Set to true

Symptom

After hostNetwork is set to true for a training job, an error message is displayed indicating that communication is blocked and times out, and the job fails.

Cause Analysis

hostNetwork is set to true, but the HCCL_IF_IP environment variable is not configured in the job YAML file. As a result, HCCL cannot determine which NIC IP address to use for communication, and HCCL communication times out.

Solution

When hostNetwork is set to true in the job YAML file, also set the HCCL_IF_IP environment variable in the same file and take its value from status.hostIP. This specifies the host IP address as the IP address of the root communication NIC, so that the HCCL link can be established. Configure the variable for every replica type that sets hostNetwork to true (Scheduler and Worker in the following example).

apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: default-test-mindspore
  labels:
    framework: mindspore     # Training framework name
    ring-controller.atlas: ascend-{xxx}b  # Processor type used by the job
spec:
  schedulerName: volcano    # This field is valid when the startup parameter enableGangScheduling of Ascend Operator is set to true.
  runPolicy:
    schedulingPolicy:      # This field is valid when the startup parameter enableGangScheduling of Ascend Operator is set to true.
      minAvailable: 2  # Total number of job replicas
      queue: default      # Queue to which the job belongs
  successPolicy: AllWorkers  # Condition for the job to be considered successful
  replicaSpecs:
    Scheduler:
      replicas: 1   # Number of job replicas
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-{xxx}b  # Processor type used by the job
        spec:
          hostNetwork: true    # Optional. Set this parameter as required. true: the pod is created with the host IP address. false: the pod is not created with the host IP address.
          affinity:                                         # This configuration indicates that pods of a distributed job are scheduled to different nodes.
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: job-name
                        operator: In
                        values:
                          - default-test-mindspore         # The value must be consistent with the preceding job name.
                  topologyKey: kubernetes.io/hostname
          nodeSelector:
            host-arch: huawei-arm              # (Optional) Set it as required.
            accelerator-type: module-{xxx}b-8   # Node type
          containers:
          - name: ascend                                     # The value must be ascend and cannot be changed.
            image: mindspore-test:latest  # Image name
            imagePullPolicy: IfNotPresent
...
            env:
              - name: HCCL_IF_IP                    # Set this variable only when hostNetwork is set to true.
                valueFrom:                          # If hostNetwork is set to true, also configure the HCCL_IF_IP environment variable.
                  fieldRef:                         # If hostNetwork is not configured or is set to false, do not configure the HCCL_IF_IP environment variable.
                    fieldPath: status.hostIP        # IP address of the node where the pod runs
...            
            ports:                          # Collective communication port for distributed training
              - containerPort: 2222         
                name: ascendjob-port
            resources:
              limits:
                huawei.com/Ascend910: 8  # Number of requested processors
              requests:
                huawei.com/Ascend910: 8  # The value must be the same as that of limits.
            volumeMounts:
...            
          volumes:
...            
    Worker:
      replicas: 1   # Number of job replicas
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-{xxx}b  # Processor type used by the job
        spec:
          hostNetwork: true    # Optional. Set this parameter as required. true: the pod is created with the host IP address. false: the pod is not created with the host IP address.
          affinity:            # This configuration indicates that pods of a distributed job are scheduled to different nodes.
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: job-name
                        operator: In
                        values:
                          - default-test-mindspore        # The value must be consistent with the preceding job name.
                  topologyKey: kubernetes.io/hostname
          nodeSelector:
            host-arch: huawei-arm              # (Optional) Set it as required.
            accelerator-type: module-{xxx}b-8  # Node type
          containers:
          - name: ascend                            # The value must be ascend and cannot be changed.
...
            env:
              - name: HCCL_IF_IP                    # Set this variable only when hostNetwork is set to true.
                valueFrom:                          # If hostNetwork is set to true, also configure the HCCL_IF_IP environment variable.
                  fieldRef:                         # If hostNetwork is not configured or is set to false, do not configure the HCCL_IF_IP environment variable.
                    fieldPath: status.hostIP        # IP address of the node where the pod runs
...
              - name: ASCEND_VISIBLE_DEVICES                       # This field is used by Ascend Docker Runtime.
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['huawei.com/Ascend910']                # The value must be consistent with resources.requests below.
...
            ports:                          # Collective communication port for distributed training
              - containerPort: 2222         
                name: ascendjob-port
            resources:
              limits:
                huawei.com/Ascend910: 8  # Number of requested processors
              requests:
                huawei.com/Ascend910: 8  # The value must be the same as that of limits.
            volumeMounts:
...
          volumes:
...
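
Before submitting a job, the rule described in this section can be checked mechanically. The following Python sketch is illustrative only and is not part of Ascend Operator or any MindX DL tool. It assumes the job YAML has already been loaded into a dictionary, and the function names find_missing_hccl_if_ip and _has_host_ip_env are hypothetical. It reports every container whose pod sets hostNetwork to true but does not define HCCL_IF_IP with fieldPath status.hostIP.

```python
from typing import Any, Dict, List

REQUIRED_ENV = "HCCL_IF_IP"
REQUIRED_FIELD_PATH = "status.hostIP"


def _has_host_ip_env(container: Dict[str, Any]) -> bool:
    """Return True if the container defines HCCL_IF_IP from status.hostIP."""
    for env in container.get("env") or []:
        if env.get("name") != REQUIRED_ENV:
            continue
        field_ref = (env.get("valueFrom") or {}).get("fieldRef") or {}
        if field_ref.get("fieldPath") == REQUIRED_FIELD_PATH:
            return True
    return False


def find_missing_hccl_if_ip(job: Dict[str, Any]) -> List[str]:
    """List "ReplicaType/container" entries that use hostNetwork: true
    without the HCCL_IF_IP environment variable (hypothetical helper)."""
    problems: List[str] = []
    replica_specs = (job.get("spec") or {}).get("replicaSpecs") or {}
    for role, replica in replica_specs.items():
        pod_spec = ((replica or {}).get("template") or {}).get("spec") or {}
        if pod_spec.get("hostNetwork") is not True:
            continue  # The rule applies only when hostNetwork is true.
        for container in pod_spec.get("containers") or []:
            if not _has_host_ip_env(container):
                problems.append(f"{role}/{container.get('name', '?')}")
    return problems


if __name__ == "__main__":
    host_ip_env = {
        "name": "HCCL_IF_IP",
        "valueFrom": {"fieldRef": {"fieldPath": "status.hostIP"}},
    }
    # Reduced, hand-written job: Scheduler is configured, Worker is not.
    job = {
        "spec": {
            "replicaSpecs": {
                "Scheduler": {"template": {"spec": {
                    "hostNetwork": True,
                    "containers": [{"name": "ascend", "env": [host_ip_env]}],
                }}},
                "Worker": {"template": {"spec": {
                    "hostNetwork": True,
                    "containers": [{"name": "ascend", "env": []}],
                }}},
            }
        }
    }
    print(find_missing_hccl_if_ip(job))  # ['Worker/ascend']
```

To run the check against a real file, the dictionary could be produced with yaml.safe_load from PyYAML, assuming that package is installed. An empty result means that every replica type using hostNetwork: true also defines HCCL_IF_IP.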