Accurate Issue Locating by Scenario

  • Loss difference at the first step

    Perform the following procedure to locate the issue:

    1. Use the accuracy pre-check tool to check suspicious operators.
    2. Use the dump tool for accuracy data collection to collect mix-level data from both the NPU and the benchmark for the step at which the difference first occurs.
    3. Use the hierarchical visualization tool to compare graphs, or use Model Accuracy Analyzer to compare tables. Identify the suspicious operator by following the node colors from dark to light, or by finding the first node whose input is consistent but whose output is inconsistent.
    4. Perform single-operator verification on the NPU and on the benchmark, then compare the Euclidean distance of each result from the CPU result, which serves as the absolute benchmark.
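
The comparison in step 4 can be sketched as follows. This is a minimal illustration, assuming the operator outputs have already been flattened into equal-length lists of floats; the function names and sample values are hypothetical and are not part of any tool mentioned above.

```python
import math


def euclidean_distance(a, b):
    """Euclidean (L2) distance between two flattened outputs of equal length."""
    if len(a) != len(b):
        raise ValueError("outputs must have the same number of elements")
    return math.dist(a, b)


def compare_to_cpu(npu_out, bench_out, cpu_out):
    """Measure how far each device's output is from the CPU output.

    The CPU output is treated as the absolute benchmark (an assumption
    taken from the procedure above).
    """
    d_npu = euclidean_distance(npu_out, cpu_out)
    d_bench = euclidean_distance(bench_out, cpu_out)
    return {"npu": d_npu, "benchmark": d_bench, "npu_is_farther": d_npu > d_bench}


# Made-up values for illustration only.
cpu = [1.0, 2.0, 3.0]
npu = [1.0, 2.0, 3.5]
gpu = [1.0, 2.0, 3.0]
result = compare_to_cpu(npu, gpu, cpu)
```

If the NPU distance is clearly larger than the benchmark distance, the operator is a stronger candidate for the root cause; how large a gap counts as significant depends on the data type and the operator.
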
  • Loss difference after long-term stable training

    Use one or more of the following methods for analysis:

    • Use the training status monitoring tool (applicable to large-scale scenarios where the faulty step is not clear). Decide what to collect based on whether the loss, the gradient, or both are affected.
      • If both the loss and the gradient are affected, collect gradient and communication data during training.

        After collecting the gradients, check the following: abnormal cards or computations before and after gradient reduction, the steps and layers at which gradients differ from the benchmark, and the layers whose gradient trend over steps matches the overall gradient trend.

      • If mainly the loss is affected, collect activations and weights during training.
    • Use the dump collection and comparison tool (applicable to small-scale scenarios where the faulty step is clear) to collect mix-level data for the first step at which the difference occurs. Use the hierarchical visualization tool to compare graphs, or use Model Accuracy Analyzer to compare tables. Identify the suspicious operator by following the node colors from dark to light, or by finding the first node whose input accuracy is normal but whose output accuracy is abnormal.
    • Use the accuracy pre-check tool to check suspicious operators.
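
The rule "first node whose input accuracy is normal but whose output accuracy is abnormal" can be sketched as follows. This is only an illustration: the record layout (`name`, `input_err`, `output_err`), the error metric, and the threshold are all hypothetical and do not reflect the output format of the comparison tools.

```python
def first_suspicious_node(nodes, threshold=1e-3):
    """Return the name of the first node, in execution order, whose input
    error is within the threshold while its output error exceeds it.
    Returns None if no node matches.
    """
    for node in nodes:
        input_ok = node["input_err"] <= threshold
        output_ok = node["output_err"] <= threshold
        if input_ok and not output_ok:
            return node["name"]
    return None


# Made-up comparison records, listed in execution order.
nodes = [
    {"name": "embedding", "input_err": 0.0, "output_err": 1e-5},
    {"name": "layer0.attention", "input_err": 1e-5, "output_err": 2e-2},
    {"name": "layer0.mlp", "input_err": 2e-2, "output_err": 3e-2},
]
suspect = first_suspicious_node(nodes)
```

Nodes after the first mismatch usually also show abnormal outputs because they consume the already-deviating data, which is why only the first match is reported.
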
  • Overflow or NaN

    Typically, overflow or NaN occurs more frequently on the NPU than on the benchmark, and the loss scale decreases continuously.

    There are two main cases:

    • If NaN occurs on the NPU but not on the GPU, the first step at which NaN occurs is treated as the faulty step.
    • If NaN occurs on both the NPU and the GPU, but within 50 steps NaN occurs significantly more often on the NPU than on the GPU, the first step at which NaN occurs on the NPU but not on the GPU is treated as the starting step for fault locating.
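
The two cases can be sketched as a small selection function. This is a rough sketch under stated assumptions: the text does not define "significantly greater" or where the 50-step window starts, so `ratio` and `window_start` are hypothetical parameters chosen for illustration.

```python
def pick_step_to_analyze(npu_nan_steps, gpu_nan_steps,
                         window_start=0, window=50, ratio=2.0):
    """Pick the step at which to start fault locating, or None.

    Case 1: NaN on the NPU only -> first NPU NaN step.
    Case 2: NaN on both, but the NPU count inside the window exceeds
            ratio * GPU count -> first step with NaN on the NPU only.
    """
    npu_all = sorted(set(npu_nan_steps))
    gpu_all = set(gpu_nan_steps)
    if not npu_all:
        return None
    if not gpu_all:
        return npu_all[0]

    end = window_start + window
    npu = [s for s in npu_all if window_start <= s < end]
    gpu = {s for s in gpu_all if window_start <= s < end}
    if len(npu) > ratio * len(gpu):  # "significantly greater" is an assumption
        npu_only = [s for s in npu if s not in gpu]
        return npu_only[0] if npu_only else None
    return None


# Made-up step numbers for illustration only.
case_npu_only = pick_step_to_analyze([12, 30], [])
case_both = pick_step_to_analyze([5, 9, 14, 20, 33], [5, 20])
```

A return value of None means neither case applies under the chosen parameters, for example when NaN occurs about equally often on both devices.
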

    After determining the step to be analyzed, perform the following steps:

    • Ensure that the Inf/NaN mode, or non-saturation mode, is enabled. The mode is enabled if the environment variable INF_NAN_MODE_ENABLE is set to 1.
    • Collect the NPU and GPU data of this step for analysis and comparison.
      • Use the dump tool for accuracy data collection to collect the forward and backward inputs and outputs of the overflow step. If adding the tool makes the overflow disappear, this may indicate memory corruption. In that case, switch to asynchronous dumping by adding async_dump: True to the config.json file, and analyze the parallel computing relationships using the profiling tool.
      • Use the training status monitoring tool to collect gradients of each layer.
      • Use Model Accuracy Analyzer to compare tables or use the hierarchical visualization tool to compare graphs to locate the suspicious operator.
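
The environment check and the asynchronous dump switch described above can be sketched as follows. This is a minimal sketch: the helper names are hypothetical, the sample configuration content is a placeholder, and placing `async_dump` at the top level of config.json is an assumption based on the description above. Note that a JSON file stores the Python value `True` as `true`.

```python
import json
import os


def inf_nan_mode_enabled(env=None):
    """Return True if INF_NAN_MODE_ENABLE is set to "1"."""
    env = os.environ if env is None else env
    return env.get("INF_NAN_MODE_ENABLE") == "1"


def enable_async_dump(config_text):
    """Return the config.json text with async_dump switched on.

    Assumes async_dump is a top-level key; all other keys are kept as they are.
    """
    config = json.loads(config_text)
    config["async_dump"] = True  # serialized as `true` in JSON
    return json.dumps(config, indent=2)


# Placeholder configuration for illustration only.
original = '{"existing_key": "existing_value"}'
updated = enable_async_dump(original)
mode_on = inf_nan_mode_enabled({"INF_NAN_MODE_ENABLE": "1"})
```

In practice, read the real config.json, pass its text to the helper, and write the result back before rerunning the dump.
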