Tuning the Model Execution Time

Symptom

When the DeepSeek Prefill-Decode disaggregated large-scale expert parallel solution is used, model performance degrades significantly.

Solution

  1. Use msServiceProfiler to collect the serving performance data of the Decode node.
  2. Analyze the collected performance data. The data shows that cards on two different servers of the same Decode node run at different speeds in the Decode phase: a single Decode step takes 260 ms on the card from one server and 170 ms on the card from the other, which is a large gap. A long MIX_AIC block in a card's operator lane means that the card is executing a merged compute and communication operator (MC2 operator) and spends a long time waiting for synchronization. That card is therefore the fast card, because it is waiting for its peer. A short MIX_AIC block on the other card indicates that it is the slow card.
    Figure 1 Screenshot of the fast card performance data
    Figure 2 Screenshot of the slow card performance data
  3. Determine whether the performance degradation is caused by differences between cards in model dispatch and operator dispatch time.

    As shown in the CANN CPU lane (collapsed and displayed in gray in Figure 1 and Figure 2), tasks start at different times on each card. As a result, the first MC2 operator (moedispatch) of the forward pass starts at a different time on each card, and the synchronization wait exceeds 90 ms. When the first moedispatch is excluded, the remaining compute time is about 140 ms, and operator performance on the two cards is similar.

  4. Restore performance. Enable the distributed scheduling feature on the Decode node by enabling the environment variable MINDIE_ENABLE_DP_DISTRIBUTED (the feature is enabled by default for the large-scale expert parallel solution in versions later than MindIE 25.0.RC1). After the feature is enabled, performance returns to normal.
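The reasoning in steps 2 and 3 can be sketched in Python as follows. This is a minimal illustration, not part of msServiceProfiler: the OpRecord type, the function names, the 10 ms tolerance, and the operator durations are all hypothetical, and the sketch assumes the per-operator durations of one Decode step have already been read from the timeline by hand. It separates the first MC2 operator from the rest of the step and checks whether the remaining time is similar on both cards, which would point to a dispatch timing difference rather than slow operators.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class OpRecord:
    """One operator in a single Decode step (hypothetical structure)."""
    name: str           # operator name as read from the timeline
    duration_ms: float  # duration including any synchronization wait
    is_mc2: bool        # True for merged compute and communication operators


def split_first_mc2(ops: List[OpRecord]) -> Tuple[float, float]:
    """Return (first MC2 operator time, total time of all other operators)."""
    first_mc2_ms = 0.0
    remaining_ms = 0.0
    seen_first_mc2 = False
    for op in ops:
        if op.is_mc2 and not seen_first_mc2:
            first_mc2_ms = op.duration_ms
            seen_first_mc2 = True
        else:
            remaining_ms += op.duration_ms
    return first_mc2_ms, remaining_ms


def dispatch_skew_suspected(card_a: List[OpRecord],
                            card_b: List[OpRecord],
                            tolerance_ms: float = 10.0) -> bool:
    """True if the cards differ mainly in the first MC2 operator.

    tolerance_ms is an assumed threshold for "similar"; tune it as needed.
    """
    first_a, rest_a = split_first_mc2(card_a)
    first_b, rest_b = split_first_mc2(card_b)
    rest_similar = abs(rest_a - rest_b) <= tolerance_ms
    first_differs = abs(first_a - first_b) > tolerance_ms
    return rest_similar and first_differs


# Illustrative values only; they are not taken from Figure 1 or Figure 2.
card_fast = [
    OpRecord("moedispatch", 95.0, True),   # long wait for the slower peer
    OpRecord("attention", 60.0, False),
    OpRecord("other_mc2", 80.0, True),
]
card_slow = [
    OpRecord("moedispatch", 5.0, True),    # arrives last, so barely waits
    OpRecord("attention", 62.0, False),
    OpRecord("other_mc2", 81.0, True),
]

suspected = dispatch_skew_suspected(card_fast, card_slow)
print(f"dispatch skew suspected: {suspected}")
```

If the check returns True for real measurements, the operators themselves are likely healthy and the gap comes from when each card starts its work, which is the case that step 4 addresses.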