AI Core Troubleshooting

Principles

Application scenario: if an AI Core error is reported during network inference (for example, because dynamic shape scenarios are unsupported), you can use the APIs described below to obtain a description of the failing operator and then troubleshoot further.

The recommended API call sequence is as follows:
  1. Define and implement the exception callback function fn (of the aclrtExceptionInfoCallback type). For details about the callback function prototype, see aclrtSetExceptionInfoCallback.

    The key steps for implementing the callback function are as follows:

    1. Call aclrtGetDeviceIdFromExceptionInfo, aclrtGetStreamIdFromExceptionInfo, and aclrtGetTaskIdFromExceptionInfo in fn to obtain the device ID, stream ID, and task ID, respectively.
    2. Call aclmdlCreateAndGetOpDesc in fn to obtain the operator description.
    3. Call aclGetTensorDescByIndex in fn to obtain the input/output tensor description of the operator.
    4. In fn, call the following APIs on the obtained tensor description for further analysis:

      - aclGetTensorDescAddress: obtain the tensor data address.
      - aclGetTensorDescType: obtain the data type in the tensor description.
      - aclGetTensorDescFormat: obtain the tensor format.
      - aclGetTensorDescNumDims: obtain the tensor dimension count.
      - aclGetTensorDescDimV2: obtain the size of a specified dimension.

  2. Call aclrtSetExceptionInfoCallback to set the exception callback function.
  3. Run model inference.

    If an AI Core error is reported, the callback function fn is triggered to obtain the operator information for further troubleshooting.
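The per-dimension sizes retrieved in step 1.4 (via aclGetTensorDescNumDims and aclGetTensorDescDimV2) are typically formatted into a string and written to the error log, since the callback itself should do minimal work. A minimal sketch of such a shape formatter, in plain C++ with no AscendCL dependency (the function name formatShape is illustrative, not part of AscendCL):

```cpp
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Format a list of dimension sizes as "[d0, d1, ...]" for logging.
// In the real callback, the values would come from aclGetTensorDescDimV2,
// called once per dimension up to the count from aclGetTensorDescNumDims.
std::string formatShape(const std::vector<int64_t> &dims)
{
    std::ostringstream oss;
    oss << "[";
    for (size_t i = 0; i < dims.size(); ++i) {
        if (i > 0) {
            oss << ", ";
        }
        oss << dims[i];
    }
    oss << "]";
    return oss.str();
}
```

For example, formatShape({1, 3, 224, 224}) yields the string "[1, 3, 224, 224]", which can be appended to the operator name and tensor format in the log entry.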

Sample Code

In production code, check the return value of each API call, add exception handling branches, and record error and info logs. The following snippet shows the key steps only; it cannot be built or run as is.

This section describes the code logic for obtaining AI Core exception information. For details about how to initialize and deinitialize AscendCL, see Initializing AscendCL. For details about how to allocate and deallocate runtime resources, see Runtime Resource Allocation and Deallocation. For details about how to load a model, prepare the input/output data of model inference, and execute and unload the model, see Inference with Single-Batch and Static-Shape Inputs.
//1. Initialize AscendCL.

//2. Allocate runtime resources.

//3. Load your model. After the model is successfully loaded, modelId that identifies the model is returned.

//4. Create data of type aclmdlDataset to describe the inputs and outputs of your model.

//5. Implement an exception callback function.
void callback(aclrtExceptionInfo *exceptionInfo)
{
    uint32_t deviceId = aclrtGetDeviceIdFromExceptionInfo(exceptionInfo);
    uint32_t streamId = aclrtGetStreamIdFromExceptionInfo(exceptionInfo);
    uint32_t taskId = aclrtGetTaskIdFromExceptionInfo(exceptionInfo);
    
    char opName[256];
    aclTensorDesc *inputDesc = nullptr;
    aclTensorDesc *outputDesc = nullptr;
    size_t inputCnt = 0;
    size_t outputCnt = 0; 
    // You can write the obtained operator information to a file, or hand it off
    // to another thread that prints it to the screen when an error occurs.
    aclmdlCreateAndGetOpDesc(deviceId, streamId, taskId, opName, 256, &inputDesc, &inputCnt, &outputDesc, &outputCnt);
    //You can call related tensor APIs provided by AscendCL to obtain the operator information as required.
    for (size_t i = 0; i < inputCnt; ++i) {
        const aclTensorDesc *desc = aclGetTensorDescByIndex(inputDesc, i);
        void *addr = aclGetTensorDescAddress(desc);      // tensor data address
        aclFormat format = aclGetTensorDescFormat(desc); // tensor format
        // ... record addr and format, e.g. in the error log
    }
    for (size_t i = 0; i < outputCnt; ++i) {
        const aclTensorDesc *desc = aclGetTensorDescByIndex(outputDesc, i);
        void *addr = aclGetTensorDescAddress(desc);
        aclFormat format = aclGetTensorDescFormat(desc);
        // ... record addr and format, e.g. in the error log
    }
    aclDestroyTensorDesc(inputDesc);
    aclDestroyTensorDesc(outputDesc);
}

//6. Set the exception callback.
aclrtSetExceptionInfoCallback(callback);

//7. Execute your model.
aclmdlExecute(modelId, input, output);

//8. Process the model inference result.

//9. Destroy the model input and output descriptions, free up memory, and unload the model.

//10. Deallocate runtime resources.

//11. Deinitialize AscendCL.

// ......