AI Core Troubleshooting
Principles
Application scenario: if an AI Core error is reported during model inference (dynamic shape scenarios are not supported), you can call the APIs described below to obtain a description of the faulty operator and then troubleshoot further.
- Define and implement the exception callback function fn (of the aclrtExceptionInfoCallback type). For details about the callback function prototype, see aclrtSetExceptionInfoCallback.
The key steps for implementing the callback function are as follows:
- Call aclrtGetDeviceIdFromExceptionInfo, aclrtGetStreamIdFromExceptionInfo, and aclrtGetTaskIdFromExceptionInfo in fn to obtain the device ID, stream ID, and task ID, respectively.
- Call aclmdlCreateAndGetOpDesc in fn to obtain the operator description.
- Call aclGetTensorDescByIndex in fn to obtain the input/output tensor description of the operator.
- In the exception callback function fn, call the following APIs on the tensor description for further analysis.
For example, call aclGetTensorDescAddress to obtain the tensor data address, aclGetTensorDescType to obtain the data type, aclGetTensorDescFormat to obtain the tensor format, aclGetTensorDescNumDims to obtain the dimension count, and aclGetTensorDescDimV2 to obtain the size of a specified dimension.
- Call aclrtSetExceptionInfoCallback to set the exception callback function.
- Run model inference.
If an AI Core error is reported, the callback function fn is triggered to obtain the operator information for further troubleshooting.
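Inside the callback, the retrieved IDs and operator name are typically written to a log for offline analysis. A minimal formatting helper might look like the sketch below; FormatExceptionRecord is a hypothetical name for illustration, not an AscendCL API, and the IDs stand in for the values returned by aclrtGetDeviceIdFromExceptionInfo and its siblings.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: format the IDs obtained from the exception info
// and the operator name into a single log line.
std::string FormatExceptionRecord(uint32_t deviceId, uint32_t streamId,
                                  uint32_t taskId, const char *opName)
{
    char buf[256];
    std::snprintf(buf, sizeof(buf),
                  "AI Core error: op=%s deviceId=%u streamId=%u taskId=%u",
                  opName, deviceId, streamId, taskId);
    return std::string(buf);
}
```

Keeping the callback itself short (format and write, nothing blocking) avoids interfering with the runtime while the error is being reported.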
Sample Code
After calling each API, add exception handling branches and record error and info logs. The following snippet shows the key steps only; it is not ready to build or run.
//1. Initialize AscendCL.
//2. Allocate runtime resources.
//3. Load your model. After the model is successfully loaded, modelId that identifies the model is returned.
//4. Create data of type aclmdlDataset to describe the inputs and outputs of your model.
//5. Implement an exception callback function.
void callback(aclrtExceptionInfo *exceptionInfo)
{
    uint32_t deviceId = aclrtGetDeviceIdFromExceptionInfo(exceptionInfo);
    uint32_t streamId = aclrtGetStreamIdFromExceptionInfo(exceptionInfo);
    uint32_t taskId = aclrtGetTaskIdFromExceptionInfo(exceptionInfo);
    char opName[256];
    aclTensorDesc *inputDesc = nullptr;
    aclTensorDesc *outputDesc = nullptr;
    size_t inputCnt = 0;
    size_t outputCnt = 0;
    // You can write the obtained operator information to a file, or start another thread
    // whose handler is triggered when an error occurs and prints the operator information
    // to the screen.
    aclmdlCreateAndGetOpDesc(deviceId, streamId, taskId, opName, 256,
                             &inputDesc, &inputCnt, &outputDesc, &outputCnt);
    // Call the tensor APIs provided by AscendCL to obtain the operator information as required.
    for (size_t i = 0; i < inputCnt; ++i) {
        const aclTensorDesc *desc = aclGetTensorDescByIndex(inputDesc, i);
        aclGetTensorDescAddress(desc);
        aclGetTensorDescFormat(desc);
    }
    for (size_t i = 0; i < outputCnt; ++i) {
        const aclTensorDesc *desc = aclGetTensorDescByIndex(outputDesc, i);
        aclGetTensorDescAddress(desc);
        aclGetTensorDescFormat(desc);
    }
    aclDestroyTensorDesc(inputDesc);
    aclDestroyTensorDesc(outputDesc);
}
//6. Set the exception callback.
aclrtSetExceptionInfoCallback(callback);
//7. Execute your model.
aclmdlExecute(modelId, input, output);
//8. Process the model inference result.
//9. Destroy the model input and output descriptions, free up memory, and unload the model.
//10. Deallocate runtime resources.
//11. Deinitialize AscendCL.
// ......
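When logging a tensor description, it is often useful to render the per-dimension sizes (as returned one at a time by aclGetTensorDescDimV2) into a human-readable shape string. The helper below is a plain C++ sketch for illustration; FormatShape is a hypothetical name, not part of AscendCL.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: turn collected dimension sizes into a shape
// string such as "[1, 3, 224, 224]" for log output.
std::string FormatShape(const std::vector<int64_t> &dims)
{
    std::string s = "[";
    for (size_t i = 0; i < dims.size(); ++i) {
        if (i > 0) {
            s += ", ";
        }
        s += std::to_string(dims[i]);
    }
    s += "]";
    return s;
}
```

In the callback you would first query the dimension count with aclGetTensorDescNumDims, collect each size via aclGetTensorDescDimV2 into a vector, and then format it with a helper like this.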