Case: Small Communication Packets Caused by the ZeRO3 Mode
Symptom
When a model is migrated from a GPU to an NPU, its performance does not meet the requirements.
Analysis
The compare tool described in Quick Analysis for Model Tuning (msprof-analyze CLI) is used to compare the profiling data of the baseline GPU with that of the migrated NPU. The uncovered communication time (communication that is not overlapped with computation) differs greatly between the two, as shown in Figure 1.
A comparison of the baseline GPU timeline with the NPU timeline after the migration shows that the difference lies in the allGather communication at the end of the step: the NPU takes about 100 ms, whereas the GPU takes only about 4 ms, as shown in Figure 2.
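The same comparison can be made numerically once both timelines are exported. The following is a minimal sketch, assuming the timeline is available as a list of events in the Chrome trace event format (complete events marked with "ph": "X" and a "dur" field in microseconds). The function name summarize_collectives and the operator names in the sample are hypothetical and only illustrate the idea.

```python
def summarize_collectives(events, keyword="allgather"):
    """Return (count, total_ms) for complete events whose name contains keyword."""
    count = 0
    total_us = 0.0
    for ev in events:
        # In the Chrome trace event format, "X" marks a complete event
        # and "dur" holds its duration in microseconds.
        if ev.get("ph") != "X":
            continue
        if keyword in ev.get("name", "").lower():
            count += 1
            total_us += float(ev.get("dur", 0.0))
    return count, total_us / 1000.0


# Invented sample events; real operator names depend on the profiler.
sample_events = [
    {"ph": "X", "name": "hcom_allGather_1", "ts": 0, "dur": 500.0},
    {"ph": "X", "name": "hcom_allGather_2", "ts": 600, "dur": 700.0},
    {"ph": "X", "name": "MatMul", "ts": 1400, "dur": 300.0},
    {"ph": "M", "name": "process_name"},  # metadata event, ignored
]

count, total_ms = summarize_collectives(sample_events)
print(f"allGather events: {count}, total time: {total_ms:.3f} ms")
```

Running the function on the GPU and NPU traces for the same step gives two (count, total time) pairs that can be compared directly.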
The bandwidth of the NPU communication operators is only about 0.2 GB/s. There are a large number of allGather communication operators, and each transmits only 256 bytes, so communication link setup accounts for most of the time, as shown in Figure 3. These operators are mainly issued by ZeRO3 operations.
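Why small packets lead to low bandwidth can be illustrated with a simple cost model in which each message pays a fixed setup cost plus a transfer time proportional to its size. This is only a sketch: the setup cost of 10 us and the peak bandwidth of 100 GB/s are hypothetical values chosen for illustration, not measurements from this case.

```python
def effective_bandwidth_gbps(size_bytes, setup_us, peak_gbps):
    """Effective bandwidth (GB/s, 1 GB = 1e9 bytes) of one message.

    Model: time = fixed setup cost + size / peak bandwidth.
    """
    transfer_us = size_bytes / (peak_gbps * 1e9) * 1e6
    total_s = (setup_us + transfer_us) * 1e-6
    return size_bytes / total_s / 1e9


# Hypothetical link parameters, for illustration only.
SETUP_US = 10.0
PEAK_GBPS = 100.0

for size in (256, 64 * 1024, 16 * 1024 * 1024, 256 * 1024 * 1024):
    bw = effective_bandwidth_gbps(size, SETUP_US, PEAK_GBPS)
    print(f"{size:>12d} bytes -> {bw:8.3f} GB/s")
```

Under this model, a 256-byte message spends almost all of its time in setup, so its effective bandwidth is a small fraction of the peak, while a message of hundreds of megabytes approaches the peak.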
The Zero Redundancy Optimizer (ZeRO) mode is used to save memory, but it also changes the communication policy. The main principle of ZeRO is to split data such as the optimizer states, gradients, and weights across devices, and to synchronize the data through collective communication when necessary, which reduces the peak usage of the GPU memory. ZeRO is a typical method of trading time for space.
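The memory side of this trade-off can be estimated with a short calculation. The sketch below assumes mixed-precision training with Adam, using the accounting from the ZeRO paper: 2 bytes per parameter for FP16 weights, 2 bytes for FP16 gradients, and 12 bytes for FP32 optimizer states. Activations and temporary buffers are not included, and the function name is hypothetical.

```python
def zero_model_state_bytes(num_params, num_devices, stage):
    """Per-device model-state memory in bytes for ZeRO stage 0, 1, 2, or 3.

    Assumes mixed-precision Adam: 2 B weights + 2 B gradients
    + 12 B optimizer states per parameter.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    weights = 2.0 * num_params
    grads = 2.0 * num_params
    optim = 12.0 * num_params
    if stage >= 1:  # stage 1 splits the optimizer states
        optim /= num_devices
    if stage >= 2:  # stage 2 also splits the gradients
        grads /= num_devices
    if stage >= 3:  # stage 3 also splits the weights
        weights /= num_devices
    return weights + grads + optim


# Example: 7.5 billion parameters on 64 devices.
for stage in (0, 1, 2, 3):
    gb = zero_model_state_bytes(7.5e9, 64, stage) / 1e9
    print(f"ZeRO stage {stage}: {gb:.1f} GB per device")
```

Stage 3 gives the lowest memory footprint because the weights are split as well, but the weights then have to be gathered through allGather before they are used, which is the source of the extra communication.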
Troubleshooting
Change the ZeRO3 algorithm to the ZeRO2 algorithm so that model weights are no longer split. Although the GPU memory usage increases, the communication overhead between devices is reduced, and the overall performance improves by 2.7%.
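This case does not name the training framework. As an illustration only, assuming the training script uses DeepSpeed, the ZeRO stage is selected by the "stage" field of the "zero_optimization" section of the DeepSpeed configuration. The helper below is a hypothetical sketch that switches such a configuration to stage 2 and drops the options whose names start with "stage3_", which apply only to stage 3.

```python
import copy


def switch_to_zero2(ds_config):
    """Return a copy of a DeepSpeed-style config dict that uses ZeRO stage 2."""
    cfg = copy.deepcopy(ds_config)  # leave the caller's config untouched
    zero = cfg.setdefault("zero_optimization", {})
    zero["stage"] = 2
    # Options prefixed with "stage3_" are meaningful only for stage 3.
    for key in list(zero):
        if key.startswith("stage3_"):
            del zero[key]
    return cfg


# Example configuration (values are illustrative).
zero3_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {
        "stage": 3,
        "overlap_comm": True,
        "stage3_prefetch_bucket_size": 50000000,
    },
}

zero2_config = switch_to_zero2(zero3_config)
print(zero2_config["zero_optimization"])
```

Before applying such a change, check that the model still fits in device memory, because the weights are fully replicated on every device in stage 2.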


