Overflow or NaN

Before using the tool to locate the fault, resolve the configuration problems described in Checking the Environment and the randomness problems described in Reproducing an Issue.

Case 1

When a visual model is migrated from the GPU to the NPU and trained with MindSpeed LLM, gradient overflow occurs immediately.

Figure 1 Log printed for gradient overflow

The training screenshot shared by the user shows that, during backpropagation in step 0, the gradient grows layer by layer until it overflows.

Locating method:

  1. Use the dump tool to collect the mix-level data of step 0 (overflow step). The config.json file is as follows:
    {
        "task": "statistics",
        "dump_path": "/home/data_dump",
        "rank": [],
        "step": [0],
        "level": "mix",
        "enable_dataloader": false,
        "statistics": {
            "scope": [], 
            "list": [],
            "data_mode": ["all"],
            "summary_mode": "statistics"
        }
    }
    

    As shown in the training screenshot, the gradient increases after each self_attn backward pass. The self_attn code uses the npu_fusion_attention operator, which often causes accuracy issues when misused. The backward data in the dump file confirms this: the norm value rises sharply after the backward pass of every npu_fusion_attention layer.

    Figure 2 npu_fusion_attention results collected by dump

  2. Quick check: disable the npu_fusion_attention fused operator in the MindSpeed LLM training setup.

    Delete the hyperparameter -use_fused_attn.

    The overflow disappears, which confirms that the npu_fusion_attention branch introduces the issue. However, performance deteriorates significantly without the fused operator, so the cause of the npu_fusion_attention accuracy error still needs to be found.

  3. Review how npu_fusion_attention is called in the code.

    This issue occurs in the variable-length scenario. The original input has batch size 2, with seqlen = 3577 for sample 1 and seqlen = 1507 for sample 2. The input is padded to 3577, so the original input shape is [2, 3577, 32, 128].

    Before npu_fusion_attention is called, the batch and sequence dimensions are flattened, giving shape [7154, 32, 128]. The padding is then removed, so the Q and KV inputs have shape [5079, 32, 128].

    atten_mask field requirements:

    atten_mask (optional): tensor on the device. 1 indicates that this bit is not involved (invalid) in the compute; 0 indicates that this bit is involved in the compute. The data type can be BOOL or UINT8. The data format is ND. The input shape supports the BNSS, B1SS, 11SS, and SS formats. In the varlen scenario, only the SS format is supported, where the two S dimensions are maxSq and maxSkv.

    According to the description on the official website, the attention mask should have shape [maxSq, maxSkv], that is, [3577, 3577]. However, the actual code uses [query.shape[0], key.shape[0]], that is, [5079, 5079]. Because the operator reads the mask row by row, the 0 and 1 entries land in the wrong positions, which leads to gradient overflow.
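The row-wise misread described in step 3 can be illustrated with a small pure-Python sketch. The sequence lengths (4 and 2), the causal mask pattern, and the helper names are hypothetical stand-ins chosen for readability. The sketch also assumes that the operator simply consumes the first maxSq x maxSkv entries of the mask buffer in row-major order, which is a simplification of the real kernel.

```python
def causal_mask(n):
    """n x n mask in the operator's convention: 1 = excluded, 0 = included."""
    return [[1 if col > row else 0 for col in range(n)] for row in range(n)]


def read_row_major(mask, rows, cols):
    """Reinterpret the first rows*cols entries of `mask` as a rows x cols matrix."""
    flat = [value for row in mask for value in row]
    return [flat[r * cols:(r + 1) * cols] for r in range(rows)]


seq_lens = [4, 2]             # hypothetical unpadded lengths of the two samples
max_seq = max(seq_lens)       # plays the role of maxSq and maxSkv
total_tokens = sum(seq_lens)  # rows of Q once the padding is removed

expected = causal_mask(max_seq)    # documented shape: [maxSq, maxSkv]
wrong = causal_mask(total_tokens)  # shape actually passed: [query.shape[0], key.shape[0]]
seen = read_row_major(wrong, max_seq, max_seq)

mismatches = sum(
    e != s
    for e_row, s_row in zip(expected, seen)
    for e, s in zip(e_row, s_row)
)
print("expected:", expected)
print("seen:    ", seen)
print("mismatched entries:", mismatches)
```

Even in this toy case, rows after the first no longer line up with the intended mask, so some positions that should be excluded take part in the computation and others are dropped.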

Solution: Correct the attention_mask passed to npu_fusion_attention during training.

Result: The gradient overflow disappears, and the loss converges properly.
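The layer-by-layer check in step 1 of the locating method can be automated once the backward norms have been read out of the dump. The sketch below assumes they are already available as (name, norm) pairs in backward execution order. The layer names, the norm values, and the growth threshold of 5x are hypothetical and are not taken from the actual dump.

```python
def find_norm_jumps(records, ratio=5.0):
    """Return (name, previous_norm, norm) for each entry whose norm exceeds
    `ratio` times the norm of the entry immediately before it."""
    jumps = []
    for (_, prev_norm), (name, norm) in zip(records, records[1:]):
        if prev_norm > 0 and norm / prev_norm > ratio:
            jumps.append((name, prev_norm, norm))
    return jumps


# Hypothetical backward norms, listed in the order backpropagation visits them.
backward_norms = [
    ("layers.2.mlp", 1.2),
    ("layers.2.self_attn", 48.0),
    ("layers.1.mlp", 51.0),
    ("layers.1.self_attn", 2100.0),
    ("layers.0.mlp", 2300.0),
    ("layers.0.self_attn", 9.6e4),
]

jumps = find_norm_jumps(backward_norms)
for name, before, after in jumps:
    print(f"{name}: norm {before:g} -> {after:g}")
```

When every flagged entry belongs to the same kind of layer, as with self_attn here, that layer's operator is the natural first suspect.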

Case 2

After a multimodal model is migrated from the GPU to the NPU and fine-tuned with the FSDP framework, the loss becomes NaN at step 1.

Figure 3 Execution result on the NPU
Figure 4 Execution result on the GPU

Locating method:

  1. Reduce the scale.

    Training the model on 128 cards in the live network is expensive, so reduce the cluster scale and the number of layers. After the reduction, the issue can be reproduced on two cards on a single server.

  2. Use the dump tool to collect the mix-level data of step 1 (the step where NaN first occurs).

    After the tool is added, the NaN issue disappears.

    Remove the tool and enable stream synchronization for further verification.

    export ASCEND_LAUNCH_BLOCKING=1

    The issue disappears after stream synchronization is enabled.

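    Both the dump tool and stream synchronization change the timing of operator execution, and both make the NaN disappear. These two symptoms suggest possible memory corruption during FSDP model training.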

  3. Narrow down the troubleshooting scope.

    The model consists of four parts: vae, dit, denoiser, and conditioner.

    The NaN loss persists when only dit.transformer.layers is trained instead of the full model, which confirms that the issue lies in transformer.layers.

  4. Print the gradients by manually registering a local hook.

    The NaN loss in step 1 is not the first occurrence of NaN. It first appears in the backward gradient of post_attention_layernorm in step 0.

    Figure 5 Backward gradient of post_attention_layernorm

    Compared with the NaN-free gradient data collected with stream synchronization enabled, the gradients of all parameters match except the weight and bias of the input_layernorm and post_attention_layernorm layers.

    Figure 6 Gradient comparison with and without NaN

    The APIs in the dump are Functional.layer_norm.10 and Functional.layer_norm.11.

  5. Analyze the code.

    post_attention_layernorm is launched twice, once for images and once for text.

    Figure 7 post_attention_layernorm code

    Reducing the number of calls to one removes the NaN, so the issue arises when the operator is called multiple times.

    Memory corruption is identified by checking whether the abnormal data forms a regular, contiguous region. Therefore, collect the data first and then analyze it for this pattern.

  6. Use asynchronous dump.

    Adding the dump tool in its default mode removes the NaN. Calculating statistics (such as min and max) and flushing data to disk interfere with how operators execute on the streams, which prevents the NaN from reappearing.

    Asynchronous dump mode avoids triggering synchronization during training. Data is flushed to disk only after the current training step is completed, which minimizes disruption to operator execution and stream synchronization.

    To enable it, add "async_dump": true to the config.json file.

    Collect the data of Functional.layer_norm.10 and Functional.layer_norm.11, as well as the backward data of torch.split.192 in between. The NaN can still be reproduced when a single operator is dumped.

  7. Analyze the asynchronous dump data.

    According to the dump.json file from the run without NaN loss, the input of torch.split.192.backward should be the output of Functional.layer_norm.11. However, with stream synchronization disabled, the input of torch.split.192.backward in the asynchronous dump differs from the output of Functional.layer_norm.11.

    Figure 8 Feature analysis code

    The corrupted memory spans 2,048 bytes (bytes 0–2,047 differ, while bytes 2,048–3,071 match). A single contiguous block of differing data like this matches the pattern of memory corruption.

    Figure 9 Data difference before and after the corruption

  8. Print the operator memory address.

    Modify the torch_npu source code to print the ptr address and shape of the input and output tensors of the operator.

    Figure 10 ptr memory address printing result

    According to the log, across the two consecutive layernorm calls, the output of the cast operator overwrites the input of the concat operator: the two tensors have the same memory address.

    The following figure illustrates how the corruption occurs.

    Figure 11 Logic diagram of the corruption

    Root cause of memory corruption: the backend is missing a record, FSDP runs on multiple streams, and the layernorm operators are launched back to back.
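The checks in steps 7 and 8 can be sketched as follows. This is a minimal illustration and not the analysis code shown in the figures: the helper names, the tensor log format (name, start address, size in bytes), and the sample buffers and addresses are all hypothetical.

```python
def diff_runs(good, bad):
    """Return [start, end) byte ranges where two equal-length buffers differ."""
    if len(good) != len(bad):
        raise ValueError("buffers must have the same length")
    runs, start = [], None
    for i, (g, b) in enumerate(zip(good, bad)):
        if g != b and start is None:
            start = i                 # a differing run begins here
        elif g == b and start is not None:
            runs.append((start, i))   # the run ended at the previous byte
            start = None
    if start is not None:
        runs.append((start, len(good)))
    return runs


def overlapping(tensors):
    """Return pairs of logged tensors whose address ranges intersect."""
    pairs = []
    for i, (name_a, ptr_a, size_a) in enumerate(tensors):
        for name_b, ptr_b, size_b in tensors[i + 1:]:
            if ptr_a < ptr_b + size_b and ptr_b < ptr_a + size_a:
                pairs.append((name_a, name_b))
    return pairs


# Hypothetical 3,072-byte buffers: the first 2,048 bytes were overwritten.
good = bytes([1]) * 3072
bad = bytes(2048) + good[2048:]
runs = diff_runs(good, bad)
print("differing byte ranges:", runs)

# Hypothetical address log in the spirit of the ptr printout in step 8.
tensor_log = [
    ("cast.output", 0x1000, 2048),
    ("concat.input", 0x1000, 3072),
    ("layer_norm.output", 0x4000, 3072),
]
shared = overlapping(tensor_log)
print("overlapping tensors:", shared)
```

One long differing run, rather than scattered single-element differences, points to an overwrite instead of a numerical error, and an address overlap between two live tensors identifies which operator did the overwriting.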

Solution: Add a record to the FSDP unshard stream in torch_npu 2.3 so that the tensor memory cannot be allocated to the next operator before the current operator on the stream has finished executing.

Result: The NaN loss disappears and the model converges properly.
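The local hook used in step 4 of the locating method can be sketched with standard PyTorch tensor hooks. This is a minimal sketch that runs on the CPU: the two-layer toy model, the helper name attach_nan_hooks, and the record format are hypothetical stand-ins for the real transformer layers. Under FSDP, parameters are flattened and sharded, so the hook placement would likely need to be adapted.

```python
import torch
from torch import nn


def attach_nan_hooks(model, records):
    """Register a hook on every trainable parameter that appends
    (name, has_nan, grad_norm) to `records` when its gradient is computed."""
    handles = []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue

        def hook(grad, name=name):
            has_nan = bool(torch.isnan(grad).any())
            records.append((name, has_nan, grad.float().norm().item()))

        handles.append(param.register_hook(hook))
    return handles


torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.LayerNorm(4))  # toy stand-in model
records = []
handles = attach_nan_hooks(model, records)

loss = model(torch.randn(2, 4)).sum()
loss.backward()

for handle in handles:
    handle.remove()  # detach the hooks once the step has been inspected

for name, has_nan, norm in records:
    print(f"{name}: nan={has_nan} norm={norm:.4g}")
```

Because the hooks fire in backward execution order, the first record flagged with NaN shows where the NaN first enters the backward pass.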