AI Core Error Analysis Tool User Guide (msaicerr Tool)

Application Scenarios

If a log file or the screen output contains one of the following AI Core errors during service execution, first use the fault information collection tool asys to collect fault information, and then use the AI Core error analysis tool msaicerr to locate the key cause of the error. This makes troubleshooting AI Core errors more efficient.

# Error message
there is an xx aicore error

# Alternative error message
there is an xx aivec error
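To check a log for these messages programmatically, a minimal sketch could look like the following. It assumes that xx in the messages above stands for arbitrary text between "there is an" and the error type; the exact wording in real logs may differ. The helper name find_aicore_errors is hypothetical and is not part of asys or msaicerr.

```python
import re

# Assumption: "xx" in the documented messages is a placeholder for arbitrary text.
AICORE_ERROR_RE = re.compile(
    r"there is an? .*?\b(aicore|aivec) error", re.IGNORECASE
)


def find_aicore_errors(lines):
    """Return (line_number, error_kind) for each line reporting an AI Core error."""
    hits = []
    for number, line in enumerate(lines, start=1):
        match = AICORE_ERROR_RE.search(line)
        if match:
            hits.append((number, match.group(1).lower()))
    return hits
```

The function accepts any iterable of strings, so an open log file can be passed directly. A non-empty result suggests that fault information should be collected with asys.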

Restrictions

  1. This tool supports only local analysis, so it must be deployed in the same environment where the logs are stored (the operating environment).
  2. This tool requires Python 3.7.5 or later. Python must already be installed in the environment where this tool is installed.
  3. This tool cannot be used in RC mode.
  4. This tool cannot analyze AI Core errors of the following operators:
    • Foreach operators
    • GroupedMatmul
    • NonFiniteCheck
    • GroupedMatMulAllReduce
    • FusedInferAttentionScore
    • ScatterList
    • IncreFlashAttention
    • MemSet
    • NonMaxSuppressionBucketize
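The version and operator restrictions above can be expressed as a short pre-check. This is only a sketch: the function names are hypothetical, and treating every operator whose type name starts with "Foreach" as a Foreach operator is an assumption, because the list above does not enumerate them.

```python
import sys

MIN_PYTHON = (3, 7, 5)

# Operator names copied from the restriction list above.
UNSUPPORTED_OPERATORS = {
    "GroupedMatmul",
    "NonFiniteCheck",
    "GroupedMatMulAllReduce",
    "FusedInferAttentionScore",
    "ScatterList",
    "IncreFlashAttention",
    "MemSet",
    "NonMaxSuppressionBucketize",
}


def python_is_supported(version_info=sys.version_info):
    """Return True if the interpreter meets the minimum version (3.7.5)."""
    return tuple(version_info[:3]) >= MIN_PYTHON


def operator_is_supported(op_type):
    """Return False for operators that msaicerr cannot analyze."""
    # Assumption: Foreach operators are identified by a "Foreach" name prefix.
    if op_type.startswith("Foreach"):
        return False
    return op_type not in UNSUPPORTED_OPERATORS
```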

Prerequisites

You have installed the toolkit package in the CANN operating environment. For details, see CANN Software Installation Guide.

The basic CANN environment variables have been configured. After the CANN software is installed, log in to the environment as the CANN running user and run the source ${install_path}/set_env.sh command to set the environment variables before you build or run your application. ${install_path} indicates the CANN installation path, for example, /usr/local/Ascend/ascend-toolkit.

Before using the msaicerr tool, obtain msaicerr.py from the ${install_path}/tools/msaicerr directory under the Toolkit package installation path.
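As a quick way to confirm the last prerequisite, the following sketch resolves the script location from the installation path. The function name locate_msaicerr is hypothetical, and the installation path is passed in explicitly rather than read from an environment variable, because this guide does not name one.

```python
from pathlib import Path


def locate_msaicerr(install_path):
    """Return the path to msaicerr.py under the given CANN installation path."""
    script = Path(install_path) / "tools" / "msaicerr" / "msaicerr.py"
    if not script.is_file():
        raise FileNotFoundError(f"msaicerr.py not found at {script}")
    return script
```

For example, locate_msaicerr("/usr/local/Ascend/ascend-toolkit") returns the script path if the Toolkit package is installed at the example location, and raises FileNotFoundError otherwise.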

Usage Guidance

  1. Log in to the CANN operating environment as the running user.
  2. Use the fault information collection tool asys to collect fault information.
    • In the offline inference scenario, run the asys launch command to re-run the service and collect fault information.
      asys launch --task="sh ../app_run.sh" [--output="path"]

      In the preceding command, task indicates the task to be re-run, and output indicates the directory for storing collected information. For details about the parameters and restrictions, see Service Re-run and Fault Information Collection.

      Note: In the inference scenario, first run the asys launch command to re-run the ATC model conversion task, and then run the asys launch command again to re-run the inference service on the recompiled model. Place the maintenance and test information collected during both model conversion and inference in the same directory, for example, $HOME/asys_output.

    • In the training scenario, run the asys collect command to collect fault information.
      asys collect [--output="path"]

      output indicates the directory for storing the collected information. For details about the parameters and restrictions, see Fault Information Collection.

      Note: In the cluster scenario, the asys tool cannot collect fault information from all nodes with a single command. Locate the node that reported the error and run the asys tool on that node to collect fault information.
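If the collection step is driven from a script, the two asys invocations above can be assembled as argument lists. This sketch uses only the task and output parameters shown in this guide; the function names are hypothetical.

```python
def asys_launch_command(task, output=None):
    """Build the argument list for re-running a task with asys launch."""
    command = ["asys", "launch", f"--task={task}"]
    if output:
        command.append(f"--output={output}")
    return command


def asys_collect_command(output=None):
    """Build the argument list for collecting fault information with asys collect."""
    command = ["asys", "collect"]
    if output:
        command.append(f"--output={output}")
    return command
```

The returned list can be passed to subprocess.run(command, check=True). Passing a list rather than a single string avoids shell quoting problems when the task contains spaces, as in "sh ../app_run.sh".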

  3. Use the AI Core error analysis tool msaicerr to quickly locate the key causes of AI Core errors.

    Go to the ${install_path}/tools/msaicerr directory under the Toolkit package installation path and run the following command to extract key information about the AI Core error from the fault information collected in step 2.

    python3 msaicerr.py -p $HOME/asys_output
    • Before running the command, check that the fault information collected in step 2 contains dump files in the dfx/data-dump directory and operator compilation information (the *.o and *.json files generated by operator compilation) in the dfx/ops directory. Also check that log files exist in the dfx/log/cann directory. If no log file exists, the msaicerr tool cannot extract AI Core error information.
    • If the error message ModuleNotFoundError: No module named "google" is displayed when you run the msaicerr.py script, the protobuf library (used to serialize structured data) is missing. Run the pip3 install protobuf --user command to install the protobuf library and then run the script again.
    • If the error message ModuleNotFoundError: No module named "chardet" is displayed when you run the msaicerr.py script, the chardet library (used to detect character encoding) is missing. Run the pip3 install chardet --user command to install the chardet library and then run the script again.
    • If the error message ModuleNotFoundError: No module named "bfloat16ext" is displayed when you run the msaicerr.py script, the bfloat16ext library (used to parse data of the bf16 type) is missing. Run the pip3 install bfloat16ext --user command to install the bfloat16ext library and then run the script again.
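The checks in the list above can be collected into one pre-flight sketch. It reports a missing dfx/log/cann log as an error and missing dump or operator compilation files as warnings, which is one reading of the notes above rather than documented tool behavior. The function names are hypothetical.

```python
import importlib.util
from pathlib import Path

REQUIRED_DIR = "dfx/log/cann"
RECOMMENDED_DIRS = ("dfx/data-dump", "dfx/ops")

# Import name reported by ModuleNotFoundError -> package to install with pip3.
PYTHON_DEPENDENCIES = {
    "google": "protobuf",
    "chardet": "chardet",
    "bfloat16ext": "bfloat16ext",
}


def _has_files(directory):
    """Return True if the directory exists and contains at least one file."""
    return directory.is_dir() and any(p.is_file() for p in directory.rglob("*"))


def check_fault_info(root):
    """Check the collected fault information directory before running msaicerr."""
    root = Path(root)
    report = {"errors": [], "warnings": []}
    if not _has_files(root / REQUIRED_DIR):
        report["errors"].append(f"no log files in {REQUIRED_DIR}")
    for relative in RECOMMENDED_DIRS:
        if not _has_files(root / relative):
            report["warnings"].append(f"no files in {relative}")
    return report


def missing_dependencies():
    """Return the pip package names whose modules cannot be found."""
    return [
        package
        for module, package in PYTHON_DEPENDENCIES.items()
        if importlib.util.find_spec(module) is None
    ]
```

Each name returned by missing_dependencies() can then be installed with pip3 install <package> --user, as described above.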

    In addition, you can run the python3 msaicerr.py -h command to view the meaning of each parameter.

    Replace $HOME/asys_output in the preceding command with the directory in which the fault information was stored in step 2. After the command is executed, the info_{timestamp}/aicore_{number}_{timestamp}/info.txt file and the abnormal operator test file test_single_op.py (generated only when an abnormal operator exists) are generated in the directory where the command is executed.

    Use the info.txt file to analyze and locate the problem. Pay particular attention to the key information under Root cause conclusion, as shown in the following table. You can also run the python3 test_single_op.py command to view the error reported when the abnormal operator is executed and analyze its cause.

    For an example of the info.txt file and the analysis methods for various problems, see Using the msaicerr Tool to Analyze AI Core Errors. If the fault information contains multiple AI Core errors, the msaicerr tool parses only the earliest one, as determined by the log timestamps.
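To jump straight to the conclusion in the generated reports, a sketch such as the following can help. It assumes that each info.txt sits one directory level below an info_{timestamp} directory and that the report contains the literal text "Root cause conclusion". The internal format of info.txt is not described here, so the number of context lines is an arbitrary choice, and both function names are hypothetical.

```python
from pathlib import Path


def find_info_files(workdir="."):
    """Return info.txt files under info_*/<subdirectory>/ in the working directory."""
    return sorted(Path(workdir).glob("info_*/*/info.txt"))


def root_cause_lines(info_file, context=5):
    """Return the first line mentioning the conclusion plus the lines that follow."""
    text = Path(info_file).read_text(encoding="utf-8", errors="replace")
    lines = text.splitlines()
    for index, line in enumerate(lines):
        if "root cause conclusion" in line.lower():
            # Assumption: the relevant details follow within a few lines.
            return lines[index:index + 1 + context]
    return []
```

Calling root_cause_lines on each path returned by find_info_files(), from the directory in which msaicerr.py was run, prints a short excerpt per report. The full info.txt file should still be read for the complete analysis.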