Case 1: Continuous Performance Deterioration After Checkpoints Are Saved in a Cluster

Symptom

In a cluster with 4 nodes and 64 cards, training performance deteriorates continuously after checkpoints are saved.

Analysis

The cluster is analyzed with the msprof-analyze tool described in Quick Analysis for Model Tuning (msprof-analyze CLI). The results (Figure 1) show that communication time and free time are negatively correlated across cards: a card with longer communication time has shorter free time, and a card with shorter communication time has longer free time. Card 0 has the shortest free time, so it finishes its computation first and then waits for the other cards during communication. It is therefore the fast card. Card 1 has the longest free time and finishes last, so it is the slow card. In this case, the fast and slow card issue is caused by performance fluctuation on the host side.

Figure 1 cluster_step_trace_time.csv deliverable
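
The comparison above can be sketched in a few lines of Python. This is only an illustration: the column names (Step, Type, Index, Computing, Communication, Free) are assumed here and should be checked against the header of the actual cluster_step_trace_time.csv, and all values are made up for the example. The card with the shortest free time is taken as the fast card and the card with the longest free time as the slow card, as in the analysis above.

```python
import csv
import io

# Hypothetical excerpt of cluster_step_trace_time.csv. The column names and
# every value below are assumptions for illustration, not real profiling data.
SAMPLE_CSV = """Step,Type,Index,Computing,Communication,Free
1,rank,0,800.0,350.0,50.0
1,rank,1,800.0,100.0,300.0
1,rank,2,800.0,220.0,180.0
"""


def classify_cards(csv_text):
    """Return (fast_rank, slow_rank, rows) based on per-rank free time."""
    rows = [
        {
            "rank": int(r["Index"]),
            "communication": float(r["Communication"]),
            "free": float(r["Free"]),
        }
        for r in csv.DictReader(io.StringIO(csv_text))
        if r["Type"] == "rank"
    ]
    # Shortest free time: finishes first and waits in communication (fast card).
    fast = min(rows, key=lambda r: r["free"])["rank"]
    # Longest free time: finishes last (slow card).
    slow = max(rows, key=lambda r: r["free"])["rank"]
    return fast, slow, rows


def negatively_correlated(rows):
    """True if free time never increases as communication time increases."""
    ordered = sorted(rows, key=lambda r: r["communication"])
    frees = [r["free"] for r in ordered]
    return all(a >= b for a, b in zip(frees, frees[1:]))


fast_rank, slow_rank, rank_rows = classify_cards(SAMPLE_CSV)
print(f"fast card: rank {fast_rank}, slow card: rank {slow_rank}")
print("negative trend:", negatively_correlated(rank_rows))
```

With real data, the same check would be applied per step, since the fluctuation may differ from one step to the next.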

Go to the Timeline page. As shown in Figure 2 and Figure 3, card 0 waits on communication in the gradient aggregation phase after backpropagation. Comparing the fast and slow cards on the timeline reveals an abnormal gap at the end of the step on rank 1 (the slow card). Rank 1 stalls during this gap, which forces rank 0 (the fast card) to wait.

Figure 2 Timeline view (rank 0)

Figure 3 Timeline view (rank 1)
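
The visual comparison of the two timelines can also be approximated programmatically. The following is a minimal sketch, assuming the events of one step have already been extracted as (start, duration) pairs in microseconds; the event lists are hypothetical and the helper `largest_gap` is not part of any profiling tool. It reports the longest idle interval, which for the slow card would correspond to the abnormal gap at the end of the step.

```python
def largest_gap(events):
    """Return (gap_start, gap_length) of the longest idle interval.

    events: iterable of (start_us, duration_us) tuples for one rank.
    """
    best_start, best_len = None, 0.0
    busy_until = None
    for start, dur in sorted(events):
        # An idle interval exists if this event starts after all earlier
        # events have finished.
        if busy_until is not None and start - busy_until > best_len:
            best_start, best_len = busy_until, start - busy_until
        end = start + dur
        busy_until = end if busy_until is None else max(busy_until, end)
    return best_start, best_len


# Hypothetical events for one step: rank 1 has a long gap before its last event.
rank0_events = [(0, 400), (400, 300), (705, 200), (910, 90)]
rank1_events = [(0, 400), (400, 300), (705, 200), (1400, 90)]

for name, events in (("rank 0", rank0_events), ("rank 1", rank1_events)):
    gap_start, gap_len = largest_gap(events)
    print(f"{name}: largest gap {gap_len} us starting at {gap_start} us")
```

A gap on one rank that is far larger than the corresponding gap on the other ranks is a candidate for closer inspection on the Timeline page.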

Diagnosis Completed

The root cause is memory that is not released at the end of the step on the slow card. After this memory is released manually, the problem is resolved.
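
The source does not state how the memory was cleared. As one possible illustration, assuming the unreleased memory consists of host-side Python objects held in reference cycles, a training loop could trigger garbage collection explicitly at the end of every step so that the cleanup happens at a predictable point on all ranks. The names `train_with_manual_cleanup`, `Node`, and `leaky_step` are hypothetical.

```python
import gc


def train_with_manual_cleanup(run_step, num_steps):
    """Run num_steps steps and collect garbage explicitly after each one.

    Returns the number of unreachable objects found after each step.
    """
    freed_per_step = []
    for step in range(num_steps):
        run_step(step)
        # Assumption: the leftover memory is reclaimable by Python's
        # garbage collector. gc.collect() returns the number of
        # unreachable objects it found.
        freed_per_step.append(gc.collect())
    return freed_per_step


class Node:
    """Object that references itself, so reference counting cannot free it."""

    def __init__(self):
        self.ref = self


def leaky_step(step):
    # Stand-in for a training step that leaves cyclic garbage behind.
    for _ in range(100):
        Node()


if __name__ == "__main__":
    print(train_with_manual_cleanup(leaky_step, 3))
```

If the unreleased memory is held elsewhere, for example in a framework cache, a different release mechanism would be needed.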
