Logits Collection and Comparison

Overview

After selecting an appropriate bad case, it is necessary to collect the intermediate data from its inference process to locate the specific token that introduced the accuracy issue.

When locating accuracy issues, a typical approach is the "bottom-up" layered comparison method: start by comparing the last-layer logits of each output token and identify the first output token whose accuracy falls short of the benchmark.

For example, if Model A performs inference using the PyTorch framework on an external device and the MindIE Ascend Transformer Boost (ATB) framework on the Ascend device, you would first collect the logits results of Model A's inference on both the external device and the Ascend device. Then, compare the two sets of logits results to identify the specific token that introduced the accuracy issue.
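The bottom-up token scan described above can be sketched as a small helper: walk the per-token last-layer logits from both runs in order and report the first token whose cosine similarity drops below a threshold. This is illustrative code, not part of the msit tooling:

```python
import torch
import torch.nn.functional as F

def first_divergent_token(golden_logits, device_logits, cos_threshold=0.999):
    """Given two aligned lists of per-token logits tensors, return the index
    of the first token whose cosine similarity falls below the threshold,
    or None if every token matches."""
    for idx, (g, t) in enumerate(zip(golden_logits, device_logits)):
        cos = F.cosine_similarity(g.float().flatten(),
                                  t.float().flatten(), dim=0)
        if cos.item() < cos_threshold:
            return idx
    return None
```

The token index this returns is the one to investigate further with a full-network comparison.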

Logits Collection

  1. Use the msit llm dump tool to collect the benchmark model logits. The benchmark logits can be dumped by adding dump logic to the PyTorch execution script. For parameter meanings, refer to Accuracy Data Profiling in PyTorch.
    1. Example of adding the dump-related configuration code in Model A's inference script:
      import torch
      from msit_llm import DumpConfig, register_hook
      from transformers import AutoTokenizer, AutoModelForCausalLM
      
      # Before initializing the inference model, fix the random seed and enable deterministic computation.
      from msit_llm import seed_all
      seed_all(seed=2345)
      
      # Configure dump parameters:
      # dump_last_logits=True indicates that only the output (the logits of the output token) of the last layer of the model is collected.
      # token_range=list(range(1000)) indicates the logits of the first 1000 tokens. If the number of output tokens is less than 1000, the logits of all output tokens of the model are collected.
      
      dump_config = DumpConfig(dump_last_logits=True, token_range=list(range(1000)), dump_path="dump data storage path")
      
      # Initialize the inference model.
      model_weight_path="Model A's weight path"
      tokenizer = AutoTokenizer.from_pretrained(model_weight_path)
      model = AutoModelForCausalLM.from_pretrained(model_weight_path).cuda()
      register_hook(model, dump_config)  # model represents the model instance that dumps intermediate tensors, and the code should be added after model initialization.
      
      with torch.no_grad():
          # Inference process code, for example:
          inputs = tokenizer("your prompt", return_tensors="pt").to(model.device)
          outputs = model.generate(**inputs, max_new_tokens=1000)
      

      Deterministic computation ensures that results are reproducible and avoids discrepancies introduced by randomness.

    2. After the script execution is completed, the benchmark model's logits will be saved to the directory specified by dump_path:
      - {DUMP_DIR}/ # Data storage path.
        - msit_dump_{TIMESTAMP}/ # Data dump timestamp directory.
          - torch_tensors/ # Tensor subdirectory.
            - cuda{device_id}_{PID}/ # Device process ID subdirectory.
              - {TID}/ # Token ID.
                - {LayerName}/ # Network layer name.
                  - output.pth # Output tensor.
    3. (Optional) It is recommended that you dump the logits data twice and execute the following command to compare the logits of the same token from both dumps (here, comparing token 0, which corresponds to the output.pth data under folder 0).
      msit llm compare -gp {DUMP_DIR1}/*/0/output.pth -mp {DUMP_DIR2}/*/0/output.pth

      The accuracy metrics for the logits of the same token from both dumps are shown in Figure 1, indicating that deterministic computation has been correctly enabled.

      Figure 1 Accuracy metrics
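The dump-twice check above can also be done directly in Python. A minimal sketch, assuming each output.pth is a tensor saved with torch.save (the file paths are placeholders):

```python
import torch

def check_determinism(path_a: str, path_b: str) -> bool:
    """Return True if two dumped logits tensors are bitwise identical,
    which is expected when deterministic computation is enabled."""
    tensor_a = torch.load(path_a, map_location="cpu")
    tensor_b = torch.load(path_b, map_location="cpu")
    return tensor_a.shape == tensor_b.shape and torch.equal(tensor_a, tensor_b)
```

For example, `check_determinism("dump1/0/output.pth", "dump2/0/output.pth")` should return True when the seed is fixed and deterministic computation is on.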
  2. Use the msit llm dump tool to collect the ATB model logits. MSIT supports dumping ATB model data through command-line instructions. For parameter meanings, refer to Instructions for Dumping ATB Data.
    1. Execute the following command to obtain the model's network structure information, which is required during the first dump.
      msit llm dump --exec "bash run.sh" --type layer -seed 2345
    2. After execution is complete, the data dump directory structure is as follows:
      - {DUMP_DIR}/ # Data storage path.
        - msit_dump_{TIMESTAMP}/ # Data dump timestamp directory.
          - layer/ # Network structure subdirectory.
            - {PID}/ # Process ID subdirectory.
              - {LayerName}.json # Network structure file.

      Open the PID folder of Model A. The file content is shown in Figure 2.

      Figure 2 Files in the PID folder

      If you are using MindIE RC1 or a later version, the ATB model consists of a Prefill model and a Decoder model, so the Prefill logits and the Decoder logits must be collected separately. For example, if Model A has 68 layers in total, where the last layer of the Prefill model is LmHead_33 and the last layer of the Decoder model is LmHead_67, collect the logits from layers 33 and 67.
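To determine which layer ids to pass to -ids, the LmHead indices can be pulled out of the dumped structure file. A minimal sketch that simply scans the raw JSON text for LmHead_{n} names (the exact JSON schema is internal to msit, so this deliberately avoids assuming it):

```python
import re

def find_lmhead_ids(json_path: str) -> list:
    """Scan a dumped layer-structure file for LmHead_{n} layer names and
    return the sorted, de-duplicated indices."""
    with open(json_path, "r", encoding="utf-8") as f:
        text = f.read()
    return sorted({int(n) for n in re.findall(r"LmHead_(\d+)", text)})
```

For Model A this would return [33, 67], the values passed to -ids in the next step.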

    3. During the second dump, collect the last layer outputs of Model A by specifying -ids 33,67.
      msit llm dump --exec "bash run.sh" --type model tensor -ids 33,67 -er 0,1000 -child False -stp 1 -seed 2345
    4. After the collection is complete, the data structure is as follows:
      - {DUMP_DIR}/ # Data storage path.
        - msit_dump_{TIMESTAMP}/ # Data dump timestamp directory.
          - tensors/ # Tensor subdirectory.
            - {device_id}_{PID}/ # Device process ID subdirectory.
              - {TID}/ # Token ID.
                - {LayerName}/ # Network layer name.
                  - after/ # Output tensor directory of the network layer.
                    - outtensor0.bin # Output tensor.
                  - before/ # Input tensor directory of the network layer.

Comparing Logits

  1. Use the msit llm compare tool to compare the logits of the benchmark model with the logits of the ATB model.

    After collecting the logits data for the benchmark model and the ATB model, you can use the msit llm compare tool to automatically compare the logits. For parameter meanings, refer to LLM Accuracy Comparison.

    An example of the comparison commands:

    msit llm compare -gp {GOLDEN_DUMP_DIR}/msit_dump_{TIMESTAMP}/torch_tensors/cuda{device_id}_{PID}/ -mp {ATB_DUMP_DIR}/msit_dump_{TIMESTAMP}/tensors/{device_id}_{PID} -o "Comparison result saving path"
  2. After the comparison is complete, the msit_cmp_report_{TIMESTAMP}.csv file is generated and saved in the comparison result saving path. For details about the results, see Parameters in the Accuracy Comparison Result.

    Different service scenarios have different accuracy requirements. Generally, the following requirements can be used as a reference for key metrics:

    • Kullback-Leibler (KL) divergence: bfloat16 < 0.005, float16 < 0.0001
    • Cosine similarity: > 0.999
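As an illustration of what these two metrics measure, the following sketch computes both from a pair of logits vectors. The exact formulas used by msit llm compare may differ (for example in the softmax reduction), so treat this as a reference implementation of the concepts only:

```python
import torch
import torch.nn.functional as F

def logits_metrics(golden: torch.Tensor, test: torch.Tensor):
    """Compute KL(softmax(golden) || softmax(test)) and the cosine
    similarity between two logits vectors."""
    g = golden.float().flatten()
    t = test.float().flatten()
    # F.kl_div expects log-probabilities as input and probabilities as target.
    kl = F.kl_div(F.log_softmax(t, dim=0), F.softmax(g, dim=0),
                  reduction="sum").item()
    cos = F.cosine_similarity(g, t, dim=0).item()
    return kl, cos
```

Identical logits yield a KL divergence of 0 and a cosine similarity of 1; the further the device logits drift from the benchmark, the larger the KL divergence and the lower the cosine similarity.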
  3. Analyze the comparison result.
    According to the accuracy standard, the following two situations may occur:
    1. If all logits meet the accuracy requirements for the current scenario, it indicates that the issue lies with the "post-processing" output. In this case, you can recheck the "post-processing" parameters such as temperature, topK, topP, and others, or debug the implementation of the "post-processing" code to locate the specific cause.
    2. If there are results in the logits that do not meet the accuracy standards, identify the first token whose accuracy does not meet the standard from the comparison results, and perform a full-network accuracy comparison.

    An example of the accuracy comparison results for Model A is shown in Figure 3. It was found that the logits of token 3 exhibit accuracy degradation, with a cosine similarity of 0.975758 and a KL divergence of 19.457. Therefore, a full-network accuracy comparison will be performed for token 3 to further locate the issue.

    Figure 3 Accuracy comparison result of Model A