Single-operator Running Error
Analysis Result
If the info.txt file contains the following conclusion, an error occurred while a single operator was running, or the pass is abnormal.
    **********************Root cause conclusion******************
    Single op test aicore error, please check op.
You can run the python3 test_single_op.py command to view the error reported when the abnormal operator executes and analyze its cause. You can also use the single-operator reproduction method to locate the fault based on the following information in the info.txt file:
- The following is an example of the operator information provided in Basic information:
    ***********************1. Basic information********************
    error time : 2023-06-09-06:55:34.798.772
    device id : 0
    core id : 0
    task id : 6
    stream id : 7
    node name : GatherV2
    kernel name : te_gatherv2_657cb48fa1743a43209d7bc779fe8c294760a5b09b3079a3323fdf18376fc408_1
- The following is an example of the detailed error message provided in AICERROR code:
    ***********************2. AICERROR code***********************
    error code : 0x10
    error bits :
    CCU_ERR_INFO: 0x2c6290000324442
    ccu_err_addr bit[22:8]=011001001000100 meaning:CCU Error Address [17:3] approximate:0x19220
    ccu_illegal_instr, Illegal instruction. 1: instruction binary error; 2: instruction address misalignment
- The following is an example of the CCE line number provided in Instructions:
    ***********************3. Instructions************************
    start pc : 0x1000124080064000
    current pc : 0x124080067d2c
    instruction : Error occured most likely at line: 3d08
    /home/HwHiAiUser/tf/info_20230609065654/aicerror_0_20230609065534/te_gatherv2_657cb48fa1743a43209d7bc779fe8c294760a5b09b3079a3323fdf18376fc408_1.o.txt:3d08
    te_gatherv2_657cb48fa1743a43209d7bc779fe8c294760a5b09b3079a3323fdf18376fc408_1.cce:1364
    /usr/local/Ascend/latest/opp/built-in/op_impl/ai_core/tbe/impl/dynamic/gather_v2.py:1214
    /usr/local/Ascend/latest/opp/built-in/op_impl/ai_core/tbe/impl/dynamic/gather_v2.py:1214
    related instructions (error occured before the mark *):
    3d08: <not available>
    3d0c: <not available>
    3d10: <not available>
    3d14: <not available>
    3d18: <not available>
    3d1c: <not available>
    3d20: <not available>
    3d24: <not available>
    3d28: <not available>
    * 3d2c: <not available>
    For complete instructions, please view /home/tf/info_20230609065654/aicerror_0_20230609065534/te_gatherv2_657cb48fa1743a43209d7bc779fe8c294760a5b09b3079a3323fdf18376fc408_1.o.txt
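Two values in the sample info.txt output above can be cross-checked by hand. The following minimal sketch assumes only what the sample states: bits [22:8] of CCU_ERR_INFO carry bits [17:3] of the error address, so shifting that field left by 3 gives the approximate address. It also assumes that the upper 16 bits of start pc (0x1000 in this sample) are not part of the address and can be masked off; this is inferred from the sample values only. The names decode_ccu_err_addr, pc_offset, and ADDR_MASK are hypothetical and are not part of msaicerr.

```python
# Assumption: only the low 48 bits of a PC value are address bits.
ADDR_MASK = (1 << 48) - 1


def decode_ccu_err_addr(ccu_err_info):
    """Return (bit[22:8] as a 15-character bit string, approximate address)."""
    field = (ccu_err_info >> 8) & 0x7FFF  # extract the 15-bit field [22:8]
    approx_addr = field << 3              # the field holds address bits [17:3]
    return format(field, "015b"), approx_addr


def pc_offset(start_pc, current_pc):
    """Return the offset of current pc from start pc after masking."""
    return (current_pc & ADDR_MASK) - (start_pc & ADDR_MASK)


bits, approx_addr = decode_ccu_err_addr(0x2C6290000324442)
print(bits, hex(approx_addr))  # 011001001000100 0x19220
print(hex(pc_offset(0x1000124080064000, 0x124080067D2C)))  # 0x3d2c
```

With the sample values, the computed offset 0x3d2c equals the address of the instruction marked with * in the related instructions list.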
Fault root causes
The possible causes are as follows:
- The operator logic is incorrect. For example, the operator reads or writes unallocated addresses.
- The compiler is at fault. The operator logic is correct, but the compiler generates incorrect low-level assembly instructions. As a result, the chip reports an AI Core error.
- The input data is incorrect. Typically, the input data is used as an index, and an invalid index causes the operator to access memory out of bounds.
- The TilingData is calculated or selected incorrectly.
The msaicerr tool can parse TilingData with a single command. Go to the ${install_path}/tools/msaicerr directory (${install_path} indicates the Toolkit installation path) and run the following command (replace the plog file name with the actual one):
python3 msaicerr.py -t plog_xxxxxx.log
The tool execution result is as follows:
    root@ubuntu:/home/HwHiAiUser/tf/aicerr/asys_output_20240124171303003/dfx/log/host/cann/debug/plog# python3 /usr/local/Ascend/latest/tools/msaicerr/msaicerr.py -t plog-141054_20240124171303866.log
    2024-04-25 11:30:43 (488) - [INFO] The tool directory will be used to as the output address of the analysis report.
    2024-04-25 11:30:43 (488) - [INFO] Total device count: 1
    2024-04-25 11:30:43 (488) - [INFO] Start to get tiling date.
    2024-04-25 11:30:43 (488) - [INFO] Failed to execute command:['grep', 'exception info dump args data', '-inrE', 'plog-141054_20240124171303866.log'].
    2024-04-25 11:30:43 (488) - [INFO] Unable to get L0 dump log, start to guess tilingptr index.
    2024-04-25 11:30:43 (488) - [INFO] Find tiling_data index: 13
    2024-04-25 11:30:43 (488) - [INFO] get args: ['0x1240413d0000', '0x12404133e0000', '0x12404133c0000', '0x124100090068', '0', '0', '0', '0', '0', '0', '0', '0', '0x1', '0x1', '0x40', '0x1', '0x2', '0x1', '0x1'], tiling_ptr: 0x124100090068, offset: 13
    2024-04-25 11:30:43 (488) - [INFO] Get tiling data success!
    2024-04-25 11:30:43 (488) - [INFO] tiling data in uint32:
    tiling data: [1, 0, 1, 0, 64, 0, 1, 0, 2, 0, 1, 0, 1, 0]
    tiling data in int64:
    tiling data: [1, 1, 64, 1, 2, 1, 1]
    tiling data in float16:
    tiling data: [5.960464477539063e-08, 0.0, 0.0, 0.0, 5.960464477539063e-08, 0.0, 0.0, 0.0, 3.814697265625e-06, 0.0, 0.0, 0.0, 5.960464477539063e-08, 0.0, 0.0, 0.0, 1.1920958955078125e-07, 0.0, 0.0, 0.0, 5.960464477539063e-08, 0.0, 0.0, 0.0, 5.960464477539063e-08, 0.0, 0.0, 0.0]
    2024-04-25 11:30:43 (488) - [INFO] Tiling data saved in tilingdata_1714015843.bin
The tool cannot determine the data type of the tiling data, so it parses the same bytes as uint32, int64, and float16. Operator developers should read the view that matches the actual tiling data type and check whether the values are correct based on the meaning of each tiling parameter. For example, if the value of core_num in the tiling data is not 32 or 48, the tiling data is incorrect.
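The three views in the tool output are reinterpretations of the same bytes. The following minimal sketch shows the idea with the Python standard library. It assumes little-endian byte order and rebuilds the sample bytes from the int64 values shown above; the function name decode_tiling is hypothetical and is not part of msaicerr.

```python
import struct


def decode_tiling(raw):
    """Interpret the same raw bytes as uint32, int64, and float16 values.

    Assumption: the data is little-endian. Trailing bytes that do not fill
    a whole 8-byte word are ignored.
    """
    raw = raw[: len(raw) - len(raw) % 8]
    return {
        "uint32": list(struct.unpack("<%dI" % (len(raw) // 4), raw)),
        "int64": list(struct.unpack("<%dq" % (len(raw) // 8), raw)),
        "float16": list(struct.unpack("<%de" % (len(raw) // 2), raw)),
    }


# Rebuild the sample bytes from the int64 view in the tool output above.
sample = struct.pack("<7q", 1, 1, 64, 1, 2, 1, 1)
views = decode_tiling(sample)
print(views["uint32"])  # [1, 0, 1, 0, 64, 0, 1, 0, 2, 0, 1, 0, 1, 0]
print(views["int64"])   # [1, 1, 64, 1, 2, 1, 1]
print(len(views["float16"]))  # 28
```

To inspect a saved file such as tilingdata_1714015843.bin the same way, read it in binary mode and pass the bytes to decode_tiling, assuming the file holds the raw tiling bytes with no header.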
Solution
This type of problem can be located and handled by reproducing a single operator.
After completing the analysis, the msaicerr.py tool generates a single-operator test script, test_single_op.py (see the "Generate case file" and "Run '...' can test op!" lines in the following output). Developers can run the script to reproduce the AI Core error.
    2023-06-09 06:56:58 (101494) - [INFO] The find single op log /home/HwHiAiUser/tf/single_op_log_20230609065654/debug/plog/plog-101494_20230609065657791.log
    2023-06-09 06:56:58 (101494) - [INFO] Generate case file /home/HwHiAiUser/AicoreError/tools/msaicerr/test_single_op.py
    2023-06-09 06:56:58 (101494) - [INFO] ##################################################
    2023-06-09 06:56:58 (101494) - [INFO] single op test failed! Please check OP or input data!
    2023-06-09 06:56:58 (101494) - [INFO] Run 'export PYTHONPATH=/usr/local/Ascend/CANN-7.3/tools/msaicerr/:$PYTHONPATH;cd /usr/local/Ascend/CANN-7.3/tools/msaicerr;python3 /home/xxxxxxx/xxx/info_xxxx/aicerror_xxxx/test_single_op.py' can test op!
    2023-06-09 06:56:58 (101494) - [INFO] ##################################################
    2023-06-09 06:56:58 (101494) - [INFO] The ai core error info for No.0 is saved in /home/HwHiAiUser/tf/info_20230609065654/aicerror_0_20230609065534/info.txt
    2023-06-09 06:56:58 (101494) - [INFO] Finish to analyze each ai core error.
    2023-06-09 06:56:58 (101494) - [INFO] The summary info is saved in /home/HwHiAiUser/tf/info_20230609065654/README.txt
    2023-06-09 06:56:58 (101494) - [INFO] Analysis finished, please check /home/HwHiAiUser/tf/info_20230609065654, you can view README.txt first.
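When the tool output is long, the generated script path and the reproduction command can be picked out automatically. The following minimal sketch assumes the message wording matches the sample output above ("Generate case file ..." and "Run '...' can test op!"); other tool versions may word these lines differently. The function name find_repro_info is hypothetical and is not part of msaicerr.

```python
import re

# Patterns based on the wording of the sample output; treat them as assumptions.
CASE_FILE_RE = re.compile(r"Generate case file (\S+)")
RUN_CMD_RE = re.compile(r"Run '(.+)' can test op!")


def find_repro_info(log_text):
    """Return the generated test script path and the suggested command."""
    case = CASE_FILE_RE.search(log_text)
    cmd = RUN_CMD_RE.search(log_text)
    return {
        "case_file": case.group(1) if case else None,
        "run_command": cmd.group(1) if cmd else None,
    }


sample = (
    "2023-06-09 06:56:58 (101494) - [INFO] Generate case file "
    "/home/HwHiAiUser/AicoreError/tools/msaicerr/test_single_op.py\n"
    "2023-06-09 06:56:58 (101494) - [INFO] Run 'export PYTHONPATH="
    "/usr/local/Ascend/CANN-7.3/tools/msaicerr/:$PYTHONPATH;"
    "cd /usr/local/Ascend/CANN-7.3/tools/msaicerr;"
    "python3 /home/xxxxxxx/xxx/info_xxxx/aicerror_xxxx/test_single_op.py'"
    " can test op!\n"
)
info = find_repro_info(sample)
print(info["case_file"])
print(info["run_command"])
```

Review the extracted command before running it, because it changes PYTHONPATH and the working directory.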
