Byte Misalignment Between the Source and Destination Addresses

Byte misalignment occurs when the source and destination addresses used in cluster communication are not aligned to a specific byte boundary (for example, 512 bytes). It significantly reduces transmission bandwidth and degrades communication performance. The issue commonly arises in SDMA transmissions (intra-node communication) and often appears with the ZeRO algorithm. Padding the data so that the addresses are aligned resolves the issue and improves communication performance.
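
As a minimal sketch of the padding idea, the following Python helpers round a buffer's element count up so that its byte size is a multiple of the alignment, and test whether an address sits on the boundary. The function names and the 512-byte default are illustrative assumptions taken from the example above, not part of any framework API.

```python
def padded_numel(numel: int, element_size: int, alignment_bytes: int = 512) -> int:
    """Round numel up so that numel * element_size is a multiple of alignment_bytes."""
    if alignment_bytes % element_size != 0:
        raise ValueError("alignment_bytes must be a multiple of element_size")
    elements_per_block = alignment_bytes // element_size
    blocks = -(-numel // elements_per_block)  # ceiling division
    return blocks * elements_per_block


def is_aligned(address: int, alignment_bytes: int = 512) -> bool:
    """Return True if address falls exactly on an alignment_bytes boundary."""
    return address % alignment_bytes == 0
```

For example, 1000 fp16 elements (2 bytes each) occupy 2000 bytes; padding to 1024 elements brings the buffer to 2048 bytes, a multiple of 512.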

Symptom

The performance of a single node with eight cards is lower than expected. The communication matrix analysis in MindStudio Insight shows an average bandwidth of only 1 to 2 GB/s, far below the empirical value, as shown in Figure 1.
Figure 1 Cluster communication matrix analysis

Analysis

A single allGather operator takes more than 300 ms, with a bandwidth of only 0.1 GB/s, as shown in Figure 2.
Figure 2 allGather operator timeline details

According to analysis by HCCL experts, the source and destination addresses of the allGather communication operator in the DP communication domain are not aligned, which severely degrades communication performance.
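
One way such misalignment can arise: when a flat buffer is split evenly across DP ranks, each rank's slice starts at rank times the partition size, so a partition whose byte size is not a multiple of the alignment pushes most slices off the boundary. The sketch below is a hypothetical illustration only. The buffer layout, the function name, and the 512-byte boundary are assumptions, not HCCL or DeepSpeed internals.

```python
def misaligned_ranks(base_address: int, partition_numel: int, element_size: int,
                     world_size: int, alignment_bytes: int = 512) -> list:
    """List the ranks whose slice of a flat buffer starts off the alignment boundary."""
    bad = []
    for rank in range(world_size):
        # Assumed layout: slices are packed back to back, one per rank.
        address = base_address + rank * partition_numel * element_size
        if address % alignment_bytes != 0:
            bad.append(rank)
    return bad
```

With eight ranks and fp16 data, a partition of 1000 elements leaves every rank except rank 0 misaligned, while a partition padded to 1024 elements leaves none.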

Troubleshooting

Apply byte-aligned padding to the allGather bucket in the DP communication domain. AscendSpeed already implements this padding; DeepSpeed and Megatron do not. For DeepSpeed, modify nccl_start_alignment_factor in the DeepSpeed source code, as shown in Figure 3. After the modification, the allGather duration drops from 350 ms to 50 ms.

Figure 3 Modifying nccl_start_alignment_factor to enable byte alignment
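
The exact value used is the one shown in Figure 3 and is not reproduced here. As a rough sketch, assuming the factor is expressed in elements rather than bytes (the default in the DeepSpeed ZeRO stage 1/2 optimizer is 2, commented there as 4-byte alignment divided by sizeof(fp16)), a value for a target byte alignment could be derived as follows. The helper name and the 512-byte target are assumptions for illustration.

```python
def alignment_factor(alignment_bytes: int, element_size: int) -> int:
    """Number of elements per alignment unit (assumed meaning of the factor)."""
    if alignment_bytes % element_size != 0:
        raise ValueError("alignment_bytes must be a multiple of element_size")
    return alignment_bytes // element_size


# Assumed default behaviour: 4-byte alignment with 2-byte fp16 elements.
default_factor = alignment_factor(4, 2)
# Hypothetical target: 512-byte alignment with fp16 elements.
target_factor = alignment_factor(512, 2)
```

Under these assumptions the factor depends on the parameter dtype, so fp32 partitions would need a different value than fp16 partitions for the same byte boundary.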