Initialization Connection Failed due to Incorrect sp-block Setting
Symptom
In a unified bus device environment, training runs normally when sp-block is set to 32. When sp-block is set to 16, training fails to complete, and the training container reports an initialization connection failure, as shown in the following figure.


Cause Analysis
When sp-block is set to 32, each logical SuperPoD that the unified bus device is partitioned into contains two compute nodes. A job that is allocated 32 processors in total therefore occupies one logical SuperPoD.
When sp-block is set to 16, each logical SuperPoD contains one compute node. A job that is allocated 32 processors in total therefore occupies two logical SuperPoDs.
For a unified bus device, compute nodes in different logical SuperPoDs communicate with each other over the RoCE network, while compute nodes within the same logical SuperPoD use HCCS for communication.
Based on the preceding analysis, the likely cause is as follows. When sp-block is set to 32, the job is allocated a single logical SuperPoD, and its compute nodes communicate over HCCS. When sp-block is set to 16, the job is allocated two logical SuperPoDs, and compute nodes in different logical SuperPoDs must communicate over RoCE. If the underlying RoCE network is not connected, job initialization fails when sp-block is set to 16 but not when it is set to 32.
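The arithmetic behind this analysis can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and the sketch assumes that the number of processors allocated to a job is an exact multiple of sp-block, as in both cases described above.

```python
def logical_superpods(total_processors: int, sp_block: int) -> int:
    """Return how many logical SuperPoDs a job spans.

    Assumption (not stated by the tooling): total_processors is an
    exact multiple of sp_block.
    """
    if sp_block <= 0 or total_processors % sp_block != 0:
        raise ValueError("total_processors must be a positive multiple of sp_block")
    return total_processors // sp_block


def needs_roce(total_processors: int, sp_block: int) -> bool:
    """A job needs RoCE only when it spans more than one logical SuperPoD."""
    return logical_superpods(total_processors, sp_block) > 1


# The two cases from the analysis: a job with 32 processors in total.
for sp_block in (32, 16):
    count = logical_superpods(32, sp_block)
    links = "HCCS and RoCE" if needs_roce(32, sp_block) else "HCCS only"
    print(f"sp-block={sp_block}: {count} logical SuperPoD(s), {links}")
```

With sp-block set to 32 the job stays inside one logical SuperPoD and never touches RoCE, which is why a RoCE fault only surfaces when sp-block is set to 16.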
Solution
Use hccn_tool to check whether the RoCE network between the two compute nodes is connected:
- Obtain the IP address of NPU 0.
hccn_tool -i 0 -ip -g
- Ping a compute node in another logical SuperPoD.
hccn_tool -i 0 -ping -g address {IP address of the NPU on any compute node in another logical SuperPoD}
- If the ping succeeds, the RoCE network is normal. In this case, check the log information for other problems.
- If the message "3 packets transmitted, 0 received, 100.00% packet loss" is displayed, the RoCE network is faulty. In this case, solve the RoCE network connectivity problem between compute nodes.
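When this check has to be repeated across many NPUs, the ping summary can be interpreted programmatically. The following is a minimal sketch, assuming the summary line always has the same shape as the message quoted above. The function name roce_reachable is hypothetical, and the successful sample line is constructed by analogy rather than taken from real output.

```python
import re

# Pattern modelled on the single summary line quoted in this section.
_SUMMARY = re.compile(
    r"(\d+) packets transmitted, (\d+) received, ([\d.]+)% packet loss"
)


def roce_reachable(ping_output: str) -> bool:
    """Return True if at least one ping reply was received."""
    match = _SUMMARY.search(ping_output)
    if match is None:
        raise ValueError("no ping summary line found in output")
    received = int(match.group(2))
    return received > 0


# Failure message quoted in this section.
failed = "3 packets transmitted, 0 received, 100.00% packet loss"
# Assumed shape of a successful summary (illustrative only).
succeeded = "3 packets transmitted, 3 received, 0.00% packet loss"

print(roce_reachable(failed))     # False
print(roce_reachable(succeeded))  # True
```

The sketch treats partial packet loss as reachable. A stricter check could also compare the received count against the transmitted count.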
