Case 2: Long Waiting Time of the AllReduce Communication Operator in a Cluster

Symptom

In a 256P cluster, the AllReduce communication operator spends a long time waiting, and the cluster linearity does not meet the requirement.

Analysis

The Notify Wait message in the DP communication domain takes a long time, as shown in Figure 1.

Figure 1 Timeline analysis of the DP communication domain

On the Communication page, locate the communication domain. Notify Wait is long, and Elapse Time (total communication time) varies significantly across cards. The difference between the fastest and slowest cards exceeds 200 ms, as shown in Figure 2.

Figure 2 Communication duration analysis
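
The spread check described above can also be sketched offline. The following is a minimal illustration only: it assumes the per-card Elapse Time values have already been copied into a dictionary by hand, and the function name and sample values are hypothetical, not data from this case.

```python
def elapse_spread(elapse_ms):
    """Return (rank_with_min, rank_with_max, spread_ms) for one communication domain.

    elapse_ms maps a card (rank) ID to its Elapse Time in milliseconds.
    """
    min_rank = min(elapse_ms, key=elapse_ms.get)
    max_rank = max(elapse_ms, key=elapse_ms.get)
    return min_rank, max_rank, elapse_ms[max_rank] - elapse_ms[min_rank]


# Hypothetical values for four cards in one domain (not taken from this case).
sample = {0: 310.0, 1: 325.0, 2: 520.0, 3: 330.0}
min_rank, max_rank, spread = elapse_spread(sample)
print(f"min: card {min_rank}, max: card {max_rank}, spread: {spread} ms")
```

A spread far above the typical variation between cards suggests that at least one card in the domain deserves a closer look on the Timeline page.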

Compare fast card 0 and slow card 208 on the Timeline page. Card 208 is slow in TP domain communication, as shown in Figure 3. In one micro computation task, TP domain communication takes 494 ms on card 208, compared with 340 ms on card 0, a gap of 154 ms. Transmission speeds in the TP domain are similar on both cards, so the performance difference is mainly caused by the wait time.

Figure 3 Timeline page of fast card 0 and slow card 208

Within the same TP domain as card 208, card 214 is the slow card affecting the entire TP domain. Specifically, card 214 delays card 208 in the TP domain, and card 208 subsequently impacts card 0 in the DP domain. Therefore, card 214 is identified as the root cause. Comparing the timelines of cards 208 and 214 shows that card 214 experiences slow host delivery, as shown in Figure 4.

Figure 4 Timeline of cards 208 and 214
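
The chain of reasoning above (card 214 delays card 208, which in turn delays card 0) can be sketched as a small helper. This is an illustration under stated assumptions: the TP size of 8 and the contiguous rank layout are guesses that happen to place cards 208 and 214 in the same group, the "shortest wait" rule is a common heuristic rather than a documented tool behavior, and the wait times are hypothetical.

```python
def tp_group(rank, tp_size=8):
    """Ranks sharing a TP domain with `rank`.

    Assumption: TP groups are contiguous blocks of `tp_size` ranks.
    """
    start = (rank // tp_size) * tp_size
    return list(range(start, start + tp_size))


def least_wait_rank(wait_ms, ranks):
    """Rank with the smallest wait time among `ranks`.

    Heuristic: the card that waits least is usually the one the others wait for.
    """
    return min(ranks, key=lambda r: wait_ms[r])


# Hypothetical wait times (ms) for the TP domain that contains card 208.
group = tp_group(208)
wait_ms = {r: 150.0 for r in group}
wait_ms[214] = 5.0  # hypothetical: card 214 barely waits
print("TP group:", group)
print("suspected straggler:", least_wait_rank(wait_ms, group))
```

The suspected card should still be confirmed on the timeline, because a short wait time alone does not say why the card arrives late.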

Use the timeline selection statistics function to compare the CANN-side APIs that handle task delivery on cards 208 and 214. The results show that the CANN-side APIs on card 214 take significantly longer than those on card 208, as shown in Figure 5.

Figure 5 Selected area statistics of CANN APIs on cards 208 and 214
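
The comparison above can be approximated outside the tool with a short script. This sketch assumes the API events of each card are available as `(name, start_us, duration_us)` tuples; that layout, the API names, and all values are hypothetical. It counts only events that lie fully inside the window, which may differ from how the selection statistics function treats partially covered events.

```python
from collections import defaultdict


def api_totals(events, start_us, end_us):
    """Sum durations per API name for events fully inside [start_us, end_us]."""
    totals = defaultdict(float)
    for name, ts, dur in events:
        if ts >= start_us and ts + dur <= end_us:
            totals[name] += dur
    return dict(totals)


# Hypothetical events: (API name, start in us, duration in us).
events_208 = [("api_a", 0.0, 10.0), ("api_b", 20.0, 5.0), ("api_a", 40.0, 10.0)]
events_214 = [("api_a", 0.0, 30.0), ("api_b", 35.0, 5.0),
              ("api_a", 50.0, 30.0), ("api_a", 95.0, 10.0)]  # last one leaves the window

totals_208 = api_totals(events_208, 0.0, 100.0)
totals_214 = api_totals(events_214, 0.0, 100.0)

# Extra time per API on card 214 relative to card 208.
extra = {name: totals_214[name] - totals_208.get(name, 0.0) for name in totals_214}
print(extra)
```

Sorting the per-API difference in descending order points to the APIs that contribute most to the slower delivery.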

Diagnosis Completed

Comparison of the fast and slow card timelines indicates that the difference is caused by the delivery bottleneck on the host side of the slow card (card 214). For details about how to solve the delivery bottleneck on the host side, see Host Bound Troubleshooting.