Special Cases

If the root cause of a difficult issue still cannot be found after using the tool to locate it, try the following methods:

  • Enable stream synchronization to check whether memory corruption occurs during parallel computing. To do so, set the following environment variable:
    export ASCEND_LAUNCH_BLOCKING=1
  • Disable npu_fusion_attention. This operator is complex, so parameter-passing errors may occur. If NaN values appear, disable npu_fusion_attention to determine whether it is the cause of the fault. For details about the usage specifications, see npu_fusion_attention. For example, to disable npu_fusion_attention in MindSpeed LLM, delete the --use_fused_attn parameter.
  • In Megatron models, the overlap parameters carry a high risk. Try removing the overlap-related hyperparameters first, such as --overlap-param-gather and --overlap-grad-reduce.
  • Check the Matmul staggered policy. If a suspicious Matmul operator has been located but cannot be ruled out by single-operator verification, disable the staggered policy by setting the following environment variable:
    export CLOSE_MATMUL_K_SHIFT=1
  • Check the optimizer. For example, replace the Adam optimizer with SGD, replace the fused Adam operator with smaller operators, or disable the optimizer to determine whether the problem lies in forward and backward propagation.
  • Use the accuracy pre-check tool to check whether there are suspicious operators.
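
The flag-removal and environment-variable steps above can be combined into one debug launch. The following is a minimal sketch, not an official procedure: the helper strip_flags and the sample arguments (--bf16, --seed 1234) are hypothetical placeholders, and only the two environment variables and the three removed flags come from the list above. Enabling one switch at a time usually makes it easier to tell which change removed the fault.

```shell
# Switches taken from the list above; consider enabling them one at a time.
export ASCEND_LAUNCH_BLOCKING=1    # stream synchronization
export CLOSE_MATMUL_K_SHIFT=1      # disable the Matmul staggered policy

# Hypothetical helper: print every argument that is NOT in the drop list,
# one per line. $1 is a space-separated list of flags to remove; the
# remaining arguments are the original launch arguments.
# It only handles flags that take no value (as the three flags here do).
strip_flags() {
    drop=" $1 "
    shift
    for arg in "$@"; do
        case "$drop" in
            *" $arg "*) ;;                 # flag is in the drop list: skip it
            *) printf '%s\n' "$arg" ;;     # keep everything else
        esac
    done
}

# Usage with placeholder launch arguments (replace with your own script's).
strip_flags "--use_fused_attn --overlap-param-gather --overlap-grad-reduce" \
    --bf16 --use_fused_attn --overlap-param-gather --seed 1234 \
    --overlap-grad-reduce | tr '\n' ' '
```

The printed arguments can then be passed to the training launch command in place of the original list.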