Communication Is Blocked and Times Out and Task Fails After hostNetwork Is Set to true
Symptom
After hostNetwork is set to true for a training job, an error message is displayed indicating that communication is blocked and times out, and the task fails.


Cause Analysis
After hostNetwork is set to true, the HCCL_IF_IP environment variable is not configured in the job YAML file. As a result, HCCL cannot determine the IP address of the NIC to use for communication, and HCCL communication times out.
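One way to confirm this cause is to inspect the environment inside the training container. The following is a minimal sketch, not part of the product: the function name check_hccl_if_ip is hypothetical, and checking that the value parses as an IP address is only an assumed sanity check, not a rule stated by HCCL.

```python
import ipaddress
import os


def check_hccl_if_ip(env=None):
    """Return (ok, message) describing whether HCCL_IF_IP looks usable.

    `env` is any mapping of environment variables; it defaults to os.environ.
    The function name and the IP-format check are illustrative assumptions.
    """
    env = os.environ if env is None else env
    value = env.get("HCCL_IF_IP", "").strip()
    if not value:
        return False, "HCCL_IF_IP is not set"
    try:
        # Accepts IPv4 or IPv6 text; raises ValueError for anything else.
        ipaddress.ip_address(value)
    except ValueError:
        return False, f"HCCL_IF_IP is not a valid IP address: {value!r}"
    return True, f"HCCL_IF_IP is set to {value}"


# Report the state of the current process environment.
ok, message = check_hccl_if_ip()
print(message)
```

If this reports that the variable is not set in a pod that uses hostNetwork: true, the job YAML file is the place to fix it.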
Solution
When hostNetwork is set to true in the job YAML file, also set the HCCL_IF_IP environment variable in the same file, taking its value from status.hostIP. This specifies the host IP address as the IP address of the root communication NIC, so that the HCCL link can be established. An example job YAML file is as follows:
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: default-test-mindspore
  labels:
    framework: mindspore # Training framework name
    ring-controller.atlas: ascend-{xxx}b # Processor type used by the job
spec:
  schedulerName: volcano # This field is valid when the startup parameter enableGangScheduling of Ascend Operator is set to true.
  runPolicy:
    schedulingPolicy: # This field is valid when the startup parameter enableGangScheduling of Ascend Operator is set to true.
      minAvailable: 2 # Total number of job replicas
      queue: default # Queue to which a job belongs
  successPolicy: AllWorkers # Prerequisites for a successful job
  replicaSpecs:
    Scheduler:
      replicas: 1 # Number of job replicas
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-{xxx}b # Processor type used by the job
        spec:
          hostNetwork: true # Optional. Set this parameter as required. true indicates that the hostIP can be used to create a pod, and false indicates that the hostIP cannot be used to create a pod.
          affinity: # This configuration indicates that pods of a distributed job are scheduled to different nodes.
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: job-name
                        operator: In
                        values:
                          - default-test-mindspore # The value must be consistent with the preceding job name.
                  topologyKey: kubernetes.io/hostname
          nodeSelector:
            host-arch: huawei-arm # (Optional) Set it as required.
            accelerator-type: module-{xxx}b-8 # Node type
          containers:
            - name: ascend # The value must be ascend and cannot be changed.
              image: mindspore-test:latest # Image name
              imagePullPolicy: IfNotPresent
              ...
              env:
                - name: HCCL_IF_IP # Optional. Set this parameter as required.
                  valueFrom: # If hostNetwork is set to true, also configure the HCCL_IF_IP environment variable.
                    fieldRef: # If hostNetwork is not configured or is set to false, the HCCL_IF_IP environment variable cannot be configured.
                      fieldPath: status.hostIP
                ...
              ports: # Collective communication port for distributed training
                - containerPort: 2222
                  name: ascendjob-port
              resources:
                limits:
                  huawei.com/Ascend910: 8 # Number of applied processors
                requests:
                  huawei.com/Ascend910: 8 # The value is the same as that of limits.
              volumeMounts:
                ...
          volumes:
            ...
    Worker:
      replicas: 1 # Number of job replicas
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-{xxx}b # Processor type used by the job
        spec:
          hostNetwork: true # Optional. Set this parameter as required. true indicates that the hostIP can be used to create a pod, and false indicates that the hostIP cannot be used to create a pod.
          affinity: # This configuration indicates that pods of a distributed job are scheduled to different nodes.
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: job-name
                        operator: In
                        values:
                          - default-test-mindspore # The value must be consistent with the preceding job name.
                  topologyKey: kubernetes.io/hostname
          nodeSelector:
            host-arch: huawei-arm # (Optional) Set it as required.
            accelerator-type: module-{xxx}b-8 # Node type
          containers:
            - name: ascend # The value must be ascend and cannot be changed.
              ...
              env:
                - name: HCCL_IF_IP # Optional. Set this parameter as required.
                  valueFrom: # If hostNetwork is set to true, also configure the HCCL_IF_IP environment variable.
                    fieldRef: # If hostNetwork is not configured or is set to false, the HCCL_IF_IP environment variable cannot be configured.
                      fieldPath: status.hostIP
                ...
                - name: ASCEND_VISIBLE_DEVICES # This field is used by Ascend Docker Runtime.
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['huawei.com/Ascend910'] # The value must be consistent with resources.requests below.
              ...
              ports: # Collective communication port for distributed training
                - containerPort: 2222
                  name: ascendjob-port
              resources:
                limits:
                  huawei.com/Ascend910: 8 # Number of applied processors
                requests:
                  huawei.com/Ascend910: 8 # The value is the same as that of limits.
              volumeMounts:
                ...
          volumes:
            ...
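The comments in the example state a pairing rule: if hostNetwork is true, HCCL_IF_IP must also be configured from status.hostIP, and if hostNetwork is not configured or is false, HCCL_IF_IP cannot be configured. The following sketch shows one way that rule could be checked before a job is submitted. It is an illustration under stated assumptions: find_hccl_if_ip_problems is a hypothetical helper, and pod_spec is assumed to be the template.spec section of one replica (Scheduler or Worker), already loaded into a Python dict by a YAML parser of your choice.

```python
def find_hccl_if_ip_problems(pod_spec):
    """Return a list of hostNetwork / HCCL_IF_IP mismatches in one pod spec.

    `pod_spec` is a plain dict with the shape of template.spec in the job YAML.
    Hypothetical helper for illustration; it is not part of Ascend Operator.
    """
    problems = []
    # Only the literal boolean true counts as enabled.
    host_network = pod_spec.get("hostNetwork", False) is True

    for container in pod_spec.get("containers") or []:
        name = container.get("name", "<unnamed>")
        # Locate the HCCL_IF_IP entry in this container's env list, if any.
        entry = next(
            (e for e in container.get("env") or [] if e.get("name") == "HCCL_IF_IP"),
            None,
        )
        if host_network:
            if entry is None:
                problems.append(f"{name}: hostNetwork is true but HCCL_IF_IP is missing")
                continue
            field_ref = (entry.get("valueFrom") or {}).get("fieldRef") or {}
            if field_ref.get("fieldPath") != "status.hostIP":
                problems.append(f"{name}: HCCL_IF_IP does not reference status.hostIP")
        elif entry is not None:
            problems.append(f"{name}: HCCL_IF_IP is set but hostNetwork is not true")
    return problems


# Example: a spec that reproduces the fault described in this section.
faulty_spec = {"hostNetwork": True, "containers": [{"name": "ascend", "env": []}]}
for problem in find_hccl_if_ip_problems(faulty_spec):
    print(problem)
```

An empty list means the replica follows the pairing rule. The check would need to be run for both the Scheduler and Worker replicas, because each has its own template.spec.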