Analysis of Exposed Communication Performance Degradation

  1. Check whether the performance of any communication operator has degraded significantly.

    Open the CommunicationCompare sheet in the performance_comparison_result_.xlsx file and compare the following information for each communication operator, as shown in Figure 1.

    • Operator type (such as Broadcast and AllReduce)
    • Time consumption metrics (average, maximum, and minimum time consumption) and call count statistics
    • Information about associated subtasks (such as Reduce_Inline, Notify_Record, Notify_Wait, and Memcpy)
    Figure 1 Comparing the performance metrics of large communication operators
  2. On the OverallMetrics sheet, compare and analyze the metrics in depth by communication domain.

    Pay attention to the differences between transit_time and wait time in the same communication domain, as shown in Figure 2.

    Figure 2 Focusing on the differences of key metrics
  3. Check whether any communication operator shows degraded communication performance. If none does, the degradation comes from poor overlap between communication and computation. In that case, continue by analyzing the cluster performance of the NPU.
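
The decision in steps 1 and 3 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the field names (operator, base_avg_us, comparison_avg_us), the sample values, and the 1.2x threshold are hypothetical and are not taken from the CommunicationCompare sheet or the tool. Adapt them to the columns that your exported sheet actually contains.

```python
# Minimal sketch: flag communication operators whose average time got worse.
# All field names, sample values, and the threshold are hypothetical.

def find_degraded_operators(rows, threshold=1.2):
    """Return (operator, slowdown ratio) pairs with ratio >= threshold,
    sorted from the largest slowdown to the smallest."""
    degraded = []
    for row in rows:
        base = row["base_avg_us"]
        comp = row["comparison_avg_us"]
        # Skip rows without a usable baseline to avoid division by zero.
        if base > 0 and comp / base >= threshold:
            degraded.append((row["operator"], round(comp / base, 2)))
    degraded.sort(key=lambda item: item[1], reverse=True)
    return degraded


def next_step(rows, threshold=1.2):
    """Mirror step 3: pick the follow-up analysis based on the findings."""
    if find_degraded_operators(rows, threshold):
        return "inspect degraded communication operators"
    return "analyze communication/computation overlap and cluster performance"


# Illustrative rows, as they might look after copying values out of the sheet.
sample_rows = [
    {"operator": "AllReduce", "base_avg_us": 100.0, "comparison_avg_us": 180.0},
    {"operator": "Broadcast", "base_avg_us": 50.0, "comparison_avg_us": 52.0},
]

print(find_degraded_operators(sample_rows))
print(next_step(sample_rows))
```

A ratio against the baseline is used here rather than an absolute difference so that short and long operators can be judged on the same scale. The threshold is a judgment call and should be tuned to the noise level of your own measurements.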