Error Code 0x10
Fault Locating
    ***********************2. AICERROR code***********************
    # Gives the AI Core error code and description.
    code : 0x10
This error code indicates that the instruction is invalid. Generally, this error occurs because the UB address is not aligned or some scalar instructions are invalid. In this case, you only need to find the corresponding operator instruction for analysis.
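As a rough illustration of the first cause, the sketch below checks whether a UB access is aligned and stays inside the buffer. The 256 KB size is the UB range this guide quotes for the mini environment. The 32-byte alignment and the helper name `check_ub_access` are assumptions made for this example only, so confirm the real requirement for your chip and instruction.

```python
UB_SIZE_BYTES = 256 * 1024  # UB range quoted in this guide for the mini environment
ASSUMED_ALIGNMENT = 32      # assumption for illustration; not stated in this guide


def check_ub_access(offset, nbytes, alignment=ASSUMED_ALIGNMENT, ub_size=UB_SIZE_BYTES):
    """Return a list of human-readable problems for one UB access (empty if none)."""
    problems = []
    if offset % alignment != 0:
        problems.append(f"offset {offset:#x} is not {alignment}-byte aligned")
    if offset < 0 or offset + nbytes > ub_size:
        problems.append(
            f"access [{offset:#x}, {offset + nbytes:#x}) leaves the {ub_size}-byte UB"
        )
    return problems


# Hypothetical accesses: aligned and in range, misaligned, and past the end of the UB.
for offset, nbytes in [(0x40, 64), (0x41, 64), (UB_SIZE_BYTES - 32, 64)]:
    print(hex(offset), nbytes, check_ub_access(offset, nbytes))
```

A check like this only narrows the search. The faulty instruction still has to be identified from the log as described in the procedure.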
Troubleshooting Procedure
- Locate the faulty operator based on the device ID, core ID, task ID, stream ID, node name, kernel name, and operator address listed in 1. Basic Information of the log.
    ***********************1. Basic information********************
    # Gives the basic information about the device occurred with the AI Core error.
    # kernel name: operator kernel name
    # op address: address of the operator code in the DDR
    # args address: address of the operator arguments in the DDR
    error time : 2020-08-26-11:24:07
    device id : 0
    core id : 0
    task id : 60
    stream id : 517
    node name : trans_TransData_167
    kernel name : te_transdata_16b6e15e2a5cc7f70_33e5fb7ae8478ddb
    op address : 0x101000120000
    args address : 0X101000053000
- View the Python code line number of the faulty operator in 3. Instructions of the log, and use it to locate the error in the corresponding file. If the error cannot be located this way, continue with the following steps.
    ***********************3. Instructions************************
    # Gives the error instructions.
    start pc : 0x101000120000
    current pc : 0x1010001201e0
    Error occured most likely at line: 1d0
    /{IDE path}/aicerror_xxxx/te_transdata_16b6e15e2a5cc7f70_33e5fb7ae8478ddb.o.txt:1d0
    {IDE path}/collection/compile/kernel_meta/te_transdata_16b6e15e2a5cc7f70_33e5fb7ae8478ddb.cce:32 //CCE code line number of the error operator
    /{Python script path}/nz_2_nd.py:4486 //Python code line number of the error operator
    related instructions (error occured before the mark *):
    1bc: <not available>
    1c0: <not available>
    1c4: <not available>
    1c8: <not available>
    1cc: <not available>
    1d0: <not available>
    1d4: <not available>
    1d8: <not available>
    1dc: <not available>
    * 1e0: <not available>
    For complete instructions, see /{IDE path}/aicerror_xxxx/te_transdata_16b6e15e2a5cc7f70_33e5fb7ae8478ddb.o.txt
- Use the difference between start pc and current pc in 3. Instructions of the log to identify the instruction at which execution stopped abnormally. Find that instruction in the decompilation file .o.txt and check the data storage operation it performs.
- Search for other instructions that operate on the same address and check whether any of them overwrites the value in the UB. For example, the mini environment has no separate scalar buffer; a region of the UB is used as the scalar buffer, so the built .o file must stay within the 256 KB range of the UB. In the cloud environment, the scalar buffer is an independent buffer and can exceed 256 KB. It is therefore suspected that the .o file built for the operator is of the cloud version.
- Both TBE and TIK operators are built using ccec.py, which generates the .o, .json, and .cce files. Modify ccec.py to print the build information, rebuild, and check the --cce-aicore-arch field in the printed output. In this case, the earlier operators are built for the mini version, and the build switches to the cloud version at one particular operator.
- Check the operator that causes the version switch. In this case, the operator is implemented in TIK and supports only the cloud version. After this operator is executed, the version changes to the cloud version, so all subsequently built operators are also of the cloud version. This causes UB overwriting in the mini environment and, in turn, the invalid instructions.
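Two steps of the procedure above are mechanical enough to script: computing the offset between start pc and current pc, and finding where --cce-aicore-arch changes in the printed build information. The following is a minimal sketch, not part of any toolkit. The function names are hypothetical, the layout of the printed build lines is assumed, and `mini_arch` / `cloud_arch` are placeholders for whatever values ccec.py prints in your environment.

```python
import re


def parse_pc(log_text, name):
    """Return the hex value logged as '<name> : 0x...' as an integer."""
    match = re.search(rf"{re.escape(name)}\s*:\s*(0[xX][0-9a-fA-F]+)", log_text)
    if match is None:
        raise ValueError(f"{name!r} not found in log")
    return int(match.group(1), 16)


def pc_offset(log_text):
    """Offset of the failing instruction relative to the start of the operator."""
    return parse_pc(log_text, "current pc") - parse_pc(log_text, "start pc")


def find_arch_switch(build_lines):
    """Return (line index, old arch, new arch) at the first arch change, else None."""
    pattern = re.compile(r"--cce-aicore-arch[=\s]+(\S+)")
    previous = None
    for index, line in enumerate(build_lines):
        match = pattern.search(line)
        if match is None:
            continue  # line does not carry the arch field
        arch = match.group(1)
        if previous is not None and arch != previous:
            return index, previous, arch
        previous = arch
    return None


# PC values taken from the 3. Instructions excerpt in this guide.
instructions_log = "start pc : 0x101000120000\ncurrent pc : 0x1010001201e0"
offset = pc_offset(instructions_log)
print(f"look up row {offset:x}: in the .o.txt file")

# Hypothetical build output; real lines depend on what you print from ccec.py.
build_output = [
    "ccec --cce-aicore-arch=mini_arch op_a.cce",
    "ccec --cce-aicore-arch=mini_arch op_b.cce",
    "ccec --cce-aicore-arch=cloud_arch op_c.cce",
]
print(find_arch_switch(build_output))
```

With the values from the log excerpt, the offset is 0x1e0, which matches the row marked with * in the related instructions. The build line returned by `find_arch_switch` points at the first operator built after the version changed, which is where to start looking for the operator that caused the switch.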