Using the msaicerr Tool to Analyze AI Core Errors
Restrictions
- This tool supports local analysis only. It must be deployed in the same environment where the logs are stored (the operating environment).
- This tool requires Python 3.7.5 or later. Install Python in the environment before installing the tool.
- This tool cannot be used in RC mode.
- This tool cannot analyze AI Core errors of the following operators:
  - Foreach operators
  - GroupedMatmul
  - NonFiniteCheck
  - GroupedMatMulAllReduce
  - FusedInferAttentionScore
  - ScatterList
  - IncreFlashAttention
  - MemSet
  - NonMaxSuppressionBucketize
Prerequisites
You have installed the toolkit package in the CANN operating environment. For details, see the CANN Software Installation Guide.
The basic CANN environment variables have been configured. After the CANN software is installed, log in to the environment as the CANN running user and run the source ${install_path}/set_env.sh command to set the environment variables before you build and run your application. ${install_path} indicates the CANN installation path, for example, /usr/local/Ascend/ascend-toolkit.
Before using the msaicerr tool, obtain msaicerr.py from the ${install_path}/tools/msaicerr directory of the Toolkit package installation.
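The prerequisites above can be checked before the first run. The following is a minimal sketch and not part of the msaicerr tool: the function names python_version_ok and find_msaicerr are hypothetical, and the install path in the example is the sample path given above, which may differ from yours.

```python
import sys
from pathlib import Path

# Minimum Python version stated in the restrictions of this document.
MIN_PYTHON = (3, 7, 5)


def python_version_ok(version=None):
    """Return True if the given (or the running) interpreter is 3.7.5 or later."""
    current = tuple(version) if version is not None else tuple(sys.version_info[:3])
    return current >= MIN_PYTHON


def find_msaicerr(install_path):
    """Return the path to msaicerr.py under install_path, or None if it is absent.

    The tools/msaicerr layout is taken from the text above; install_path is
    whatever directory CANN was installed to.
    """
    script = Path(install_path) / "tools" / "msaicerr" / "msaicerr.py"
    return script if script.is_file() else None


if __name__ == "__main__":
    # Example path from the text; replace it with the actual installation path.
    install_path = "/usr/local/Ascend/ascend-toolkit"
    print("Python version OK:", python_version_ok())
    print("msaicerr.py:", find_msaicerr(install_path) or "not found")
```

If either check fails, install a suitable Python version or verify the Toolkit installation before continuing.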
Using the msaicerr Tool for Analysis
- Log in to the host server as the running user.
- Use the msaicerr tool to quickly locate the key causes of AI Core errors.
Go to the ${Toolkit installation path}/tools/msaicerr directory and run the following command to extract key information related to the AI Core error based on the information collected in Collecting AI Core Error Information. In the following command, aic_err_info_timestamp indicates the directory for storing AI Core error information. Replace it with the actual directory.
python3 msaicerr.py -p ${HOME}/aic_err_info_timestamp -out $HOME/result
In the preceding command, the -p parameter specifies the directory that stores the fault information, for example, ${HOME}/aic_err_info_timestamp. The -out parameter specifies the path for storing the parsing result file, for example, $HOME/result. If -out is not specified, the parsing result is stored in the directory where the command is executed.
Note: Do not run the msaicerr tool from the directory specified by the -p parameter or any of its subdirectories. For example, you cannot run the tool from the aic_err_info_timestamp directory or its subdirectories. Likewise, the directory specified by the -out parameter must not be the -p directory or any of its subdirectories. Otherwise, parsing may hang or fail.
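The directory constraints in the note can be verified before the tool is launched. The sketch below is illustrative only: is_inside and check_msaicerr_paths are hypothetical helper names, not part of msaicerr, and the paths in the example mirror the sample command above.

```python
from pathlib import Path


def is_inside(child, parent):
    """Return True if child is the same directory as parent or lies beneath it."""
    child = Path(child).resolve()
    parent = Path(parent).resolve()
    return child == parent or parent in child.parents


def check_msaicerr_paths(info_dir, out_dir, cwd):
    """Return a list of problems with the planned -p, -out, and working directories."""
    problems = []
    if is_inside(cwd, info_dir):
        problems.append("working directory is inside the -p directory")
    if is_inside(out_dir, info_dir):
        problems.append("-out directory is inside the -p directory")
    return problems


if __name__ == "__main__":
    home = Path.home()
    issues = check_msaicerr_paths(
        info_dir=home / "aic_err_info_timestamp",  # directory passed to -p
        out_dir=home / "result",                   # directory passed to -out
        cwd=Path.cwd(),                            # where the tool would be run
    )
    print(issues or "paths look fine")
```

An empty list means that neither the working directory nor the -out directory is nested under the -p directory.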
- If the msaicerr tool fails to be executed:
  - Check whether the prerequisites for using the tool are met and whether the information collected in Collecting AI Core Error Information is complete.
  - Check the operator parameters by referring to The Operator Input Arg Error.
  - If the fault persists, obtain the logs and contact technical support.
- If the error message "ModuleNotFoundError: No module named 'google'" is displayed when you run the msaicerr.py script, the protobuf library (used to serialize structured data) is missing. Run the pip3 install protobuf --user command to install the protobuf library and then run the script again.
- If the error message "ModuleNotFoundError: No module named 'chardet'" is displayed when you run the msaicerr.py script, the chardet library (used to detect character encoding) is missing. Run the pip3 install chardet --user command to install the chardet library and then run the script again.
- If the error message "ModuleNotFoundError: No module named 'bfloat16ext'" is displayed when you run the msaicerr.py script, the bfloat16ext library (used to parse data of the bf16 type) is missing. Run the pip3 install bfloat16ext --user command to install the bfloat16ext library and then run the script again.
In addition, you can run the python3 msaicerr.py -h command to view the meaning of each parameter.
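The three missing-module errors described above can be detected in one pass before msaicerr.py is run. The following is a minimal sketch: is_importable and missing_packages are hypothetical helper names, and the mapping assumes that the protobuf package provides the google.protobuf module and that the other two packages are imported under their package names, as the error messages suggest.

```python
import importlib.util

# Import name -> pip package, following the troubleshooting items above.
REQUIRED_MODULES = {
    "google.protobuf": "protobuf",
    "chardet": "chardet",
    "bfloat16ext": "bfloat16ext",
}


def is_importable(name):
    """Return True if the module can be located without fully importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # Raised when a parent package (for example, google) is itself missing.
        return False


def missing_packages(required=REQUIRED_MODULES):
    """Return the sorted pip package names whose modules cannot be located."""
    return sorted(pkg for mod, pkg in required.items() if not is_importable(mod))


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Install with: pip3 install " + " ".join(missing) + " --user")
    else:
        print("All required modules are importable.")
```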
After the msaicerr.py command is executed, the info_{timestamp}/aicerror_{number}_{timestamp}/info.txt file and the abnormal operator test file test_single_op.py (generated only when an abnormal operator exists) are created in the directory where the command is executed.
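The generated files can be located programmatically from the directory layout described above. This is a sketch under stated assumptions: find_reports and find_single_op_test are hypothetical names, and test_single_op.py is assumed to sit next to info.txt, as in the sample path shown later in this section.

```python
from pathlib import Path


def find_reports(base_dir):
    """Return all info_*/aicerror_*/info.txt files under base_dir, sorted by path."""
    return sorted(Path(base_dir).glob("info_*/aicerror_*/info.txt"))


def find_single_op_test(report):
    """Return the test_single_op.py next to an info.txt file, or None if absent."""
    candidate = Path(report).with_name("test_single_op.py")
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    # Search the current directory, where the tool writes its results by default.
    for report in find_reports(Path.cwd()):
        print("report:", report)
        print("single-operator test:", find_single_op_test(report) or "not generated")
```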
You can use the info.txt file to analyze and locate problems. Pay attention to the key information in Root cause conclusion, as shown in the following table. If the information collected in Collecting AI Core Error Information contains multiple AI Core errors, the msaicerr tool parses the AI Core error that occurs for the first time based on the log time.
Table 1 Key information in Root cause conclusion

Key information: Golden Op run error on your environment.
Possible cause: The operating environment is abnormal.

Key information: Single op test aicore error, please check op.
Possible cause: Errors occur in the single-operator implementation or compilation process.
Typical cases and handling methods: Check whether the operator is a user custom operator or a built-in CANN operator based on the displayed information. If the tool automatically generates a single-operator case script, the following information is displayed in the log. You can run the script to reproduce the single-operator fault and locate the fault as prompted. If the fault cannot be reproduced or located, obtain the logs and contact technical support.
Run 'export PYTHONPATH=/usr/local/Ascend/CANN-7.3/tools/msaicerr/:$PYTHONPATH;cd /usr/local/Ascend/CANN-7.3/tools/msaicerr;python3 /home/xxxxxxx/xxx/info_xxxx/aicerror_xxxx/test_single_op.py' can test op!

Key information: Atomic accumulation exception
Possible cause: Overflow occurs due to precision problems.

Key information: The addr of input/output is used but not alloc
Possible cause: The input and output data addresses of the operator are abnormal.

Key information: The args of op is differrent before and after execute
Possible cause: The input and output parameters of the operator are abnormal.
Typical cases and handling methods: See Inconsistent Dispensation Before And After The Operator Input Arg.

Key information: Dump data failed in exception dump! Address of input or output is error!
Possible cause: A framework memory allocation error occurs.
Typical cases and handling methods: Determine whether the error comes from the GE framework or another framework. Then obtain the logs and contact technical support.

Key information: The number of AI Cores in the environment is less than that required by the operator.
Possible cause: The number of AI Cores in the environment is less than that required by the operator.
Typical cases and handling methods: Check whether the number of AI Cores in the environment where the msaicerr tool is used is the same as that in the actual environment. If yes, obtain the logs and contact technical support.

Key information: The memset or atomic_clean operator is not inserted before this operator in the graph, while memory cleanup is required before operator execution.
Possible cause: Graph build is abnormal.
Typical cases and handling methods: Obtain the logs and contact technical support.

Key information: The set_flag and wait_flag instructions are not used together in the operator code.
Possible cause: The set_flag and wait_flag instructions do not match.
Typical cases and handling methods: For CANN built-in operators, obtain the logs and contact technical support. For custom operators, check the operator code yourself.

Key information: There's no obvious known error, so I can't determine what the error is.
Possible cause: The msaicerr tool is executed successfully, but no fault is found after parsing.
Typical cases and handling methods: Obtain the logs and contact technical support.
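The mapping in Table 1 can also be applied automatically to an info.txt file. The following is a minimal sketch, not the tool's own logic: it assumes that the key information strings appear verbatim in the text of info.txt, and match_conclusions is a hypothetical helper name. The message strings are copied from the table, including their original spelling.

```python
# Key information -> possible cause, copied from Table 1.
KNOWN_CONCLUSIONS = {
    "Golden Op run error on your environment.":
        "The operating environment is abnormal.",
    "Single op test aicore error, please check op.":
        "Errors occur in the single-operator implementation or compilation process.",
    "Atomic accumulation exception":
        "Overflow occurs due to precision problems.",
    "The addr of input/output is used but not alloc":
        "The input and output data addresses of the operator are abnormal.",
    "The args of op is differrent before and after execute":
        "The input and output parameters of the operator are abnormal.",
    "Dump data failed in exception dump!":
        "A framework memory allocation error occurs.",
    "The number of AI Cores in the environment is less than that required":
        "The number of AI Cores in the environment is less than that required by the operator.",
    "The memset or atomic_clean operator is not inserted":
        "Graph build is abnormal.",
    "The set_flag and wait_flag instructions are not used together":
        "The set_flag and wait_flag instructions do not match.",
    "There's no obvious known error":
        "The tool ran successfully, but no fault was found after parsing.",
}


def match_conclusions(report_text):
    """Return (key information, possible cause) pairs found in the report text."""
    return [(key, cause) for key, cause in KNOWN_CONCLUSIONS.items() if key in report_text]


if __name__ == "__main__":
    sample = "Root cause conclusion: Atomic accumulation exception"
    for key, cause in match_conclusions(sample):
        print(key, "->", cause)
```

Longer messages are matched by their leading phrase so that small differences at the end of a line do not prevent a match.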