AI Core Troubleshooting

Principles

Application scenario: If an AI Core error is reported during network inference, you can use the APIs described in this section to obtain the description of the faulty operator and then troubleshoot further. Dynamic-shape scenarios are not supported.

The recommended API call sequence is as follows:
  1. Define and implement the exception callback function fn (of the aclrtExceptionInfoCallback type). For details about the callback function prototype, see aclrtSetExceptionInfoCallback.

    The key logic of the callback function is as follows:

    1. Call aclrtGetDeviceIdFromExceptionInfo, aclrtGetStreamIdFromExceptionInfo, and aclrtGetTaskIdFromExceptionInfo in the exception callback function to obtain the device ID, stream ID, and task ID, respectively.
    2. Call aclmdlCreateAndGetOpDesc in fn to obtain the operator description.
    3. Call aclGetTensorDescByIndex in fn to obtain the input/output tensor description of the operator.
    4. In fn, read the fields of each tensor description for further analysis.

      For example, call aclGetTensorDescAddress to obtain the tensor data address, call aclGetTensorDescType to obtain the data type in the tensor description, call aclGetTensorDescFormat to obtain the tensor format, call aclGetTensorDescNumDims to obtain the tensor dimension count, and call aclGetTensorDescDimV2 to obtain the size of a specified dimension.

  2. Call aclrtSetExceptionInfoCallback to set the exception callback function.
  3. Execute model inference.

    If an AI Core error is reported, fn is triggered to obtain the operator information for further troubleshooting.
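
Once the callback has read the dimension count and dimension sizes from a tensor description, it can help to copy them into a plain structure and format a one-line summary for the log. The following is a minimal sketch under that assumption. TensorSummary, FormatShape, and ElementCount are hypothetical helpers and are not part of AscendCL; in a real callback, the dims values would be filled from aclGetTensorDescNumDims and aclGetTensorDescDimV2.

```cpp
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical plain-data copy of the shape read from one tensor description.
struct TensorSummary {
    std::vector<int64_t> dims;  // One entry per dimension
};

// Formats the shape as "[d0, d1, ...]" for logging.
std::string FormatShape(const TensorSummary &t) {
    std::ostringstream out;
    out << "[";
    for (std::size_t i = 0; i < t.dims.size(); ++i) {
        if (i != 0) {
            out << ", ";
        }
        out << t.dims[i];
    }
    out << "]";
    return out.str();
}

// Returns the product of all dimensions; 1 when there are no dimensions (scalar).
int64_t ElementCount(const TensorSummary &t) {
    int64_t count = 1;
    for (int64_t d : t.dims) {
        count *= d;
    }
    return count;
}
```

Keeping the analysis on plain copies like this means the summary can still be logged after the tensor descriptions themselves have been destroyed.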

Sample Code

After each API call, add an exception handling branch and record error and info logs. The following snippet shows the key steps only and cannot be built or run as is.

This section describes the code logic for obtaining AI Core exception information. For details about initialization and deinitialization, see Initialization and Deinitialization. For details about how to allocate and deallocate runtime resources, see Runtime Resource Allocation and Deallocation. For details about how to load a model, prepare the input/output data of model inference, and execute and unload a model, see Model Inference with Static-Shape Inputs.
// 1. Perform initialization.

// 2. Allocate runtime resources.

// 3. Load your model. After the model is successfully loaded, modelId that identifies the model is returned.

// 4. Create data of type aclmdlDataset to describe the inputs and outputs of your model.

// 5. Implement an exception callback function.
void callback(aclrtExceptionInfo *exceptionInfo)
{
    uint32_t deviceId = aclrtGetDeviceIdFromExceptionInfo(exceptionInfo);
    uint32_t streamId = aclrtGetStreamIdFromExceptionInfo(exceptionInfo);
    uint32_t taskId = aclrtGetTaskIdFromExceptionInfo(exceptionInfo);
    
    char opName[256];
    aclTensorDesc *inputDesc = nullptr;
    aclTensorDesc *outputDesc = nullptr;
    size_t inputCnt = 0;
    size_t outputCnt = 0; 
    // You can write the obtained operator information to a file, or start another thread whose handler is triggered when an error occurs and prints the operator information to the screen.
    aclError ret = aclmdlCreateAndGetOpDesc(deviceId, streamId, taskId, opName, sizeof(opName), &inputDesc, &inputCnt, &outputDesc, &outputCnt);
    if (ret != ACL_SUCCESS) {
        // The operator description could not be obtained. Record an error log and return.
        return;
    }
    // You can call related tensor APIs to obtain the operator information as required.
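    for (size_t i = 0; i < inputCnt; ++i) {
        const aclTensorDesc *desc = aclGetTensorDescByIndex(inputDesc, i);
        void *addr = aclGetTensorDescAddress(desc);       // Tensor data address
        aclFormat format = aclGetTensorDescFormat(desc);  // Tensor format
        // Record addr and format, for example, in a log.
    }
    for (size_t i = 0; i < outputCnt; ++i) {
        const aclTensorDesc *desc = aclGetTensorDescByIndex(outputDesc, i);
        void *addr = aclGetTensorDescAddress(desc);       // Tensor data address
        aclFormat format = aclGetTensorDescFormat(desc);  // Tensor format
        // Record addr and format, for example, in a log.
    }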
    aclDestroyTensorDesc(inputDesc);
    aclDestroyTensorDesc(outputDesc);
}

// 6. Set the exception callback.
aclrtSetExceptionInfoCallback(callback);

// 7. Execute your model.
aclmdlExecute(modelId, input, output);

// 8. Process the model inference result.

// 9. Destroy the model input and output descriptions, free up memory, and unload the model.

// 10. Deallocate runtime resources.

// 11. Perform deinitialization.

// ......
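
The comment in the callback suggests handing the operator information to another thread for printing. One possible way to do this, sketched below, is a small thread-safe queue: the callback pushes a record and returns, and a reporting thread drains the queue. OpErrorRecord, ErrorQueue, FormatRecord, and RunReporter are hypothetical names introduced for illustration and are not AscendCL APIs. The sketch assumes the callback has already copied the IDs and the operator name into the record.

```cpp
#include <condition_variable>
#include <cstdint>
#include <mutex>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Hypothetical record of what the exception callback collected.
struct OpErrorRecord {
    uint32_t deviceId;
    uint32_t streamId;
    uint32_t taskId;
    std::string opName;
};

// Hypothetical thread-safe queue between the callback and a reporting thread.
class ErrorQueue {
public:
    // Called from the exception callback: enqueue the record and return.
    void Push(OpErrorRecord rec) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            queue_.push(std::move(rec));
        }
        cv_.notify_one();
    }

    // Called at shutdown so that a waiting reporter can exit.
    void Close() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            closed_ = true;
        }
        cv_.notify_all();
    }

    // Blocks until a record is available. Returns false once closed and drained.
    bool Pop(OpErrorRecord &out) {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return closed_ || !queue_.empty(); });
        if (queue_.empty()) {
            return false;
        }
        out = std::move(queue_.front());
        queue_.pop();
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::queue<OpErrorRecord> queue_;
    bool closed_ = false;
};

// Formats one record as a single log line.
std::string FormatRecord(const OpErrorRecord &rec) {
    return "op=" + rec.opName + " device=" + std::to_string(rec.deviceId) +
           " stream=" + std::to_string(rec.streamId) +
           " task=" + std::to_string(rec.taskId);
}

// Reporter loop: collects formatted lines until the queue is closed and empty.
void RunReporter(ErrorQueue &q, std::vector<std::string> &lines) {
    OpErrorRecord rec;
    while (q.Pop(rec)) {
        lines.push_back(FormatRecord(rec));
    }
}
```

In an application, RunReporter would typically run on its own std::thread that is started before model execution and joined after Close is called during cleanup, so the callback itself only performs the Push.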