Performance Tuning Suggestions
Availability
Atlas 200/500 A2 Inference Product
Atlas Inference Series Product
Atlas Training Series Product
Atlas A2 Training Series Product/Atlas 800I A2 Inference Product
Atlas A3 Training Series Product
In cluster or multi-rank communication scenarios, performance tuning suggestions are printed to the screen after the profile data export command is executed. The suggestions fall into the following categories:
- Optimization based on communication time analysis
Collective communication operators are executed synchronously. If there are slow nodes in the cluster, the performance of the entire cluster will be affected.
Suggestions:
a. Check whether any rank has a communication operator wait time ratio (Wait Time Ratio) greater than the upper limit (0.2) in an iteration.
  - If yes, the iteration has a communication bottleneck. Go to step b for a further check.
  - If no, it can be preliminarily determined that the iteration does not have a communication bottleneck. You can further check the overall bandwidth usage.
b. Find the rank with the largest communication operator wait time ratio (Wait Time Ratio) and check whether its pre-transmission synchronization time ratio (Synchronization Time Ratio Before Transit) exceeds the upper limit (0.2).
  - If yes, a slow rank exists. It is the rank with the smallest Wait Time Ratio, because the other ranks wait for it. Check its forward and backward propagation time. If that time is much longer than on other ranks, check whether the load is balanced and whether the processor is faulty. If that time is about the same as on other ranks, check the data preprocessing time.
  - If no, the links are abnormal. In this case, check whether a link is faulty or the traffic volume is too small.
- Wait Time Ratio = Wait Time/(Wait Time + Transit Time). A higher ratio means that waiting accounts for a larger share of the rank's total communication time, which reduces efficiency.
- Synchronization Time Ratio Before Transit = Synchronization Time/(Synchronization Time + Transit Time). Synchronization Time is the synchronization duration before the first data transmission. A larger Synchronization Time Ratio Before Transit indicates lower communication efficiency, which may be caused by slow ranks.
Figure 1 Suggestion based on communication time analysis
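The two-step check and the ratio definitions above can be sketched in a few lines of Python. This is only an illustration, not the tool's implementation: the `RankCommTime` class, its field names, and the `diagnose` function are hypothetical, and the per-rank times are assumed to have been read from the exported profile data by some other means.

```python
from dataclasses import dataclass

UPPER_LIMIT = 0.2  # upper limit quoted in the text for both ratios


@dataclass
class RankCommTime:
    """Hypothetical per-rank communication times for one iteration (ms)."""
    rank: int
    wait_ms: float     # Wait Time
    sync_ms: float     # Synchronization Time before the first transmission
    transit_ms: float  # Transit Time

    @property
    def wait_ratio(self) -> float:
        # Wait Time Ratio = Wait Time / (Wait Time + Transit Time)
        total = self.wait_ms + self.transit_ms
        return self.wait_ms / total if total > 0 else 0.0

    @property
    def sync_ratio(self) -> float:
        # Synchronization Time Ratio Before Transit
        #   = Synchronization Time / (Synchronization Time + Transit Time)
        total = self.sync_ms + self.transit_ms
        return self.sync_ms / total if total > 0 else 0.0


def diagnose(ranks):
    """Apply steps a and b to a non-empty list of RankCommTime records."""
    worst = max(ranks, key=lambda r: r.wait_ratio)
    # Step a: does any rank exceed the Wait Time Ratio limit?
    if worst.wait_ratio <= UPPER_LIMIT:
        return "no communication bottleneck; check overall bandwidth usage"
    # Step b: inspect the rank with the largest Wait Time Ratio.
    if worst.sync_ratio > UPPER_LIMIT:
        # The slow rank is the one the others wait for (smallest ratio).
        slow = min(ranks, key=lambda r: r.wait_ratio)
        return f"slow rank suspected: rank {slow.rank}"
    return "links look abnormal; check link faults and traffic volume"
```

For example, with one rank that spends 8 ms waiting and 2 ms transmitting, and another that barely waits, `diagnose` points at the second rank as the likely slow one.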
- Optimization based on communication matrix analysis
There are two types of slow links in a cluster.
- Some slow links increase the communication time between a few ranks, and other ranks need to wait until the communication is complete. This degrades the performance of the entire cluster.
- The bandwidth or communication operator is abnormal. As a result, the network-wide links cannot reach the normal bandwidth, and the communication time of all ranks increases. This case does not have any typical slow rank or slow link.
The communication matrix is used to analyze PCIe, HCCS, and RDMA links and to provide bottleneck analysis and tuning suggestions based on the average status of each link type. If slow links exist, all information about the slow links and the corresponding tuning suggestions is provided.
The analysis and suggestions are as follows:
- Time consumption ratios of the link types
- Details about each link type
- Average link information: includes the total transmission time, average bandwidth, and average large packet transmission rate. Tuning suggestions are provided based on the information.
- Slowest link information: If a link's bandwidth is less than 20% of the average bandwidth, information about the slowest link is output, including the transmission duration, transmission size, transmission bandwidth, bandwidth usage, and large packet ratio. Tuning suggestions are provided based on the information.
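The 20% rule for the slowest link can be illustrated with a short Python sketch. The function name `find_slowest_link` and the input layout (a dictionary from a `(src_rank, dst_rank)` pair to bandwidth in GB/s for a single link type) are assumptions made for this example. The average here includes the slow link itself, which may differ from how the tool computes it.

```python
from statistics import mean

SLOW_LINK_FRACTION = 0.2  # "less than 20% of the average bandwidth"


def find_slowest_link(links):
    """Return (pair, bandwidth, average) for the slowest link if it falls
    below 20% of the average bandwidth of its link type, else None.

    links: dict mapping (src_rank, dst_rank) -> bandwidth in GB/s
           (hypothetical layout for one link type, e.g. HCCS).
    """
    if not links:
        return None
    average = mean(links.values())
    pair, bandwidth = min(links.items(), key=lambda item: item[1])
    if bandwidth < SLOW_LINK_FRACTION * average:
        return pair, bandwidth, average
    return None


if __name__ == "__main__":
    hccs = {(0, 1): 20.0, (1, 2): 20.0, (2, 3): 2.0}
    print(find_slowest_link(hccs))
```

Running the same check separately for each link type keeps a slow RDMA link from being masked by fast HCCS links.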
Suggestions:
a. If the bandwidth usage is greater than 0.8, the bandwidth is normal and no bottleneck exists in the network-wide links. Otherwise, go to step b for a further check.
b. If the large packet ratio is greater than 0.8, the size of each communication packet is normal. In this case, the link configuration is incorrect or link degradation has occurred. Otherwise, go to step c for a further check.
c. If the communication packets are too small, the bandwidth usage is low and a bandwidth bottleneck exists.
Figure 2 Suggestion based on communication matrix analysis
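The suggestion rules for a link type can be read as a small decision chain. The Python sketch below encodes one such reading and is not the tool's actual logic. It assumes each check is reached only when the previous one fails, and that the packet ratio in question is the large packet ratio. The function name `link_suggestion` and the returned strings are hypothetical.

```python
USAGE_THRESHOLD = 0.8         # bandwidth usage threshold from the text
LARGE_PACKET_THRESHOLD = 0.8  # packet ratio threshold from the text


def link_suggestion(bandwidth_usage, large_packet_ratio):
    """Map the two ratios of a link (or link-type average) to a suggestion.

    Assumed reading of the rules: high usage means no bottleneck; low usage
    with mostly large packets points at the link itself; low usage with
    mostly small packets points at packet size.
    """
    if bandwidth_usage > USAGE_THRESHOLD:
        return "bandwidth normal; no network-wide link bottleneck"
    if large_packet_ratio > LARGE_PACKET_THRESHOLD:
        return "packet size normal; check link configuration or degradation"
    return "packets too small; bandwidth usage is low"


if __name__ == "__main__":
    print(link_suggestion(0.35, 0.95))
```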