Low Peak Bandwidth of Multiple Servers
Symptom
When the HCCL Test tool is used to measure bandwidth across multiple servers, the peak bandwidth is lower than expected.
Possible Cause
- When the HCCL Test tool is used, the HCCL_BUFFSIZE environment variable is not set, or the test data size set in the test command is small.
- Profiling is enabled, or the log level is set to something other than the default ERROR level. Either condition lowers the bandwidth.
- One server has relatively low bandwidth, which lowers the bandwidth of the whole multi-server test.
- The load balancing mode of the switch is incorrectly configured, which causes congestion.
- A port between the spine and leaf switches is abnormal.
Solution
- Check whether HCCL_BUFFSIZE is set.
The HCCL_BUFFSIZE environment variable sets the size, in MB, of the shared data buffer between two NPUs. The default is 200 MB. Performance tests run with the HCCL Test tool transfer large amounts of data, so increasing HCCL_BUFFSIZE can improve communication efficiency and bandwidth.
Configuration example:
export HCCL_BUFFSIZE=2048
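Before running the test, it can help to confirm what value the test process will actually inherit. The following is a minimal sketch; `check_buffsize` is a hypothetical helper name, and the 200 MB fallback is simply the default stated above.

```shell
# Report the HCCL buffer size that a child process would inherit.
# check_buffsize is a hypothetical helper, not part of the HCCL Test tool.
check_buffsize() {
    if [ -z "${HCCL_BUFFSIZE:-}" ]; then
        echo "HCCL_BUFFSIZE is not set; default of 200 MB applies"
    else
        echo "HCCL_BUFFSIZE=${HCCL_BUFFSIZE} MB"
    fi
}

check_buffsize
```

Run it in the same shell session (or job script) that launches the test, because a variable exported in a different session is not visible to the test processes.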
- Check whether the value of -e in the hccl_test command is too small.
-e indicates the end value of the test data size. If the value of -e is small, the bandwidth is low. You are advised to increase the value of -e. For example:
mpirun -n 8 ./bin/all_reduce_test -b 8K -e 4G -f 2 -d fp32 -o sum -p 8
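To see which data sizes such a command covers, the sweep can be enumerated by hand. This sketch assumes that -b is the start size and -f is a multiplication factor applied at each step (only -e is described above, so treat the other two as assumptions); `list_sizes` is a hypothetical helper.

```shell
# Print every data size (in bytes) from start to end, multiplying by factor.
# Assumption: -b is the start size and -f the per-step multiplication factor.
list_sizes() {
    size=$1
    end=$2
    factor=$3
    while [ "$size" -le "$end" ]; do
        echo "$size"
        size=$((size * factor))
    done
}

# Equivalent of "-b 8K -e 4G -f 2"
list_sizes $((8 * 1024)) $((4 * 1024 * 1024 * 1024)) 2
```

Under these assumptions the sweep visits 20 sizes, the largest being 4 GiB. With a small -e value the sweep ends before the largest sizes are reached, and the reported bandwidth is correspondingly lower.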
- Check whether profiling data collection is enabled. If it is, disable it before running the test.
- Check whether the log level is ERROR. Check the log levels of both the host and the device. If either log level is not ERROR (the default), run the following commands to change the log levels of the host and device to ERROR:
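A minimal sketch of setting both levels, assuming the host level is controlled by the `ASCEND_GLOBAL_LOG_LEVEL` environment variable (with 3 assumed to mean ERROR) and the device level by `msnpureport -g error -d <id>`. Verify both against the documentation for your CANN version. `DEVICE_COUNT` is a hypothetical variable introduced here for illustration.

```shell
# Host side. Assumption: value 3 corresponds to the ERROR level.
export ASCEND_GLOBAL_LOG_LEVEL=3

# Device side. Assumption: msnpureport accepts "-g error -d <device id>".
# DEVICE_COUNT is hypothetical; set it to the number of NPUs on the server.
DEVICE_COUNT=${DEVICE_COUNT:-8}
if command -v msnpureport >/dev/null 2>&1; then
    i=0
    while [ "$i" -lt "$DEVICE_COUNT" ]; do
        msnpureport -g error -d "$i"
        i=$((i + 1))
    done
else
    echo "msnpureport not found; device log level left unchanged"
fi
```

The host variable only affects processes started from the shell in which it is exported, so set it in the same session or script that launches the test.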
- The low bandwidth of multiple servers may be caused by the low bandwidth of a certain server or inconsistent network configurations.
In this scenario, use bisection (binary search) to find the slow server: repeatedly split the server group in half and retest until the server is isolated, then locate the cause by referring to Low Bandwidth of a Single Server. If the single-server test shows no fault, run the cat /etc/hccn.conf command on every server and check whether the network configurations are consistent. If the network configuration of one server differs from the others, its single-server test may still pass (a single server does not use the external network) while the multi-server bandwidth is low.
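The bisection procedure can be sketched as follows. `group_is_slow` is a hypothetical stub standing in for a real multi-server HCCL Test run on the given hosts, and `BAD_SERVER` exists only to make the stub runnable. The sketch also assumes that exactly one server is at fault and that the problem still shows up when only a subset of servers is tested.

```shell
# Hypothetical stub: succeeds if the tested group shows low bandwidth.
# Replace the body with a real multi-server test on the hosts passed in.
group_is_slow() {
    case " $* " in
        *" $BAD_SERVER "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Halve the host list until a single suspect remains.
find_slow_server() {
    while [ "$#" -gt 1 ]; do
        half=$(( $# / 2 ))
        first=""
        second=""
        i=0
        for h in "$@"; do
            if [ "$i" -lt "$half" ]; then
                first="$first $h"
            else
                second="$second $h"
            fi
            i=$((i + 1))
        done
        # Word splitting is intentional: host names contain no spaces.
        if group_is_slow $first; then
            set -- $first
        else
            set -- $second
        fi
    done
    echo "$1"
}

BAD_SERVER=node5
find_slow_server node1 node2 node3 node4 node5 node6 node7 node8
```

With eight servers the suspect is isolated after three rounds of testing instead of up to eight individual checks.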
- Check whether the load balancing mode of the switch is properly configured.
Run the following command to view the server statistics:
for i in $(seq 0 15); do echo "==============> $i"; hccn_tool -i $i -stat -g |grep pfc ;done
If the statistics show large "rx pfc" counts, traffic is unevenly balanced on the switch and congestion is occurring.
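To make nonzero counters easier to spot across many devices, the output can be filtered. This is a sketch only: it assumes each counter is printed as one `name:value` line, and the counter names in the sample input are invented for illustration. `nonzero_rx_pfc` is a hypothetical helper.

```shell
# Print only the receive-side PFC counters whose value is greater than zero.
# Assumption: the tool prints one "name:value" pair per line.
nonzero_rx_pfc() {
    awk -F: 'tolower($1) ~ /rx.*pfc/ && ($2 + 0) > 0 { print $1 ": " ($2 + 0) }'
}

# Sample input with invented counter names, for illustration only.
printf 'mac_rx_pfc_pkt_num:12\nmac_tx_pfc_pkt_num:3\nmac_rx_pfc_pri0_pkt_num:0\n' | nonzero_rx_pfc
```

On a real server the filter would replace the grep in the loop above, for example `hccn_tool -i $i -stat -g | nonzero_rx_pfc`, so that devices with no receive-side PFC activity print nothing.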
You can try the following methods to solve the problem:
- First, resolve the PFC issue on the switch so that few or no "pfc" counts are reported.
- Check whether traffic on the switch is unbalanced, for example whether multiple flows leave through a single port. If some ports carry heavy traffic while others carry little, congestion occurs.
- If hash routing is based on the UDP port, check whether any server is missing its UDP port configuration.