AI Core Troubleshooting
Principles
Application scenario: if an AI Core error is reported during network inference (for example, because an operator does not support dynamic shapes), you can call these APIs to obtain a description of the failing operator and then troubleshoot further.
- Define and implement the exception callback function fn (of the aclrtExceptionInfoCallback type). For details about the callback function prototype, see acl.rt.set_exception_info_callback.
The key steps for implementing the callback function are as follows:
- Call acl.rt.get_device_id_from_exception_info, acl.rt.get_stream_id_from_exception_info, acl.rt.get_task_id_from_exception_info, and acl.rt.get_thread_id_from_exception_info in fn to obtain the device ID, stream ID, task ID, and thread ID, respectively.
- Call acl.mdl.create_and_get_op_desc in fn to obtain the operator description.
- Call acl.get_tensor_desc_by_index in fn to obtain the input/output tensor description of the operator.
- In fn, parse the obtained tensor description for further analysis.
For example, call acl.get_tensor_desc_address to obtain the tensor data address, acl.get_tensor_desc_type to obtain the data type, acl.get_tensor_desc_format to obtain the tensor format, acl.get_tensor_desc_num_dims to obtain the dimension count, and acl.get_tensor_desc_dim_v2 to obtain the size of a specified dimension.
- Call acl.rt.set_exception_info_callback to set the exception callback function.
- Perform model inference.
If an AI Core error is reported, fn is triggered to obtain the operator information for further troubleshooting.
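The dimension-reading step above (acl.get_tensor_desc_num_dims plus acl.get_tensor_desc_dim_v2 per axis) can be sketched as a small helper. This is a minimal sketch, not part of pyACL: the getters are passed in as parameters so the snippet stays self-contained, and the stub getters at the bottom stand in for the real calls. In an actual callback you would pass acl.get_tensor_desc_num_dims and acl.get_tensor_desc_dim_v2 (the latter returns a (dim_size, ret) pair).

```python
def read_tensor_shape(desc, get_num_dims, get_dim_v2):
    """Collect the full shape of a tensor description as a Python list.

    desc         -- tensor description handle
    get_num_dims -- getter returning the dimension count
                    (e.g. acl.get_tensor_desc_num_dims)
    get_dim_v2   -- getter returning (dim_size, ret) for one axis
                    (e.g. acl.get_tensor_desc_dim_v2)
    """
    shape = []
    for axis in range(get_num_dims(desc)):
        dim, ret = get_dim_v2(desc, axis)
        if ret != 0:  # a nonzero return code indicates failure
            return None
        shape.append(dim)
    return shape

# Stub getters standing in for the pyACL calls, for illustration only.
fake_dims = [8, 3, 224, 224]
shape = read_tensor_shape(
    "desc-handle",
    lambda desc: len(fake_dims),
    lambda desc, axis: (fake_dims[axis], 0),
)
print(shape)  # → [8, 3, 224, 224]
```

Collecting the shape into a list once, instead of reading single dimensions ad hoc, makes the error record easier to log and compare against the model definition.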
Sample Code
In production code, add an error-handling branch after each API call. The following snippet shows the key steps only and cannot be run as-is.
```python
import acl
import numpy as np

# ......
# 1. Allocate runtime resources.
# ......
# 2. Load a model. After the model is successfully loaded, model_id that identifies the model is returned.
# ......
# 3. Create data of type aclmdlDataset to describe the inputs and outputs of the model.
# ......
# 4. Implement the exception callback function.
def exception_callback(info):
    stream_id = acl.rt.get_stream_id_from_exception_info(info)
    thread_id = acl.rt.get_thread_id_from_exception_info(info)
    device_id = acl.rt.get_device_id_from_exception_info(info)
    task_id = acl.rt.get_task_id_from_exception_info(info)
    # You can write the obtained operator information to a file, or start another
    # thread to listen to the exception callback. When an error occurs, the thread
    # handling function is triggered, and the operator information is printed to the screen.
    op_name, input_desc, num_inputs, output_desc, num_outputs, ret = \
        acl.mdl.create_and_get_op_desc(device_id, stream_id, task_id, 256)
    # You can call related tensor APIs provided by pyACL to obtain the operator information as required.
    for i in range(num_inputs):
        desc = acl.get_tensor_desc_by_index(input_desc, i)
        address = acl.get_tensor_desc_address(desc)
        num_dims = acl.get_tensor_desc_num_dims(desc)
        dim_0, ret = acl.get_tensor_desc_dim_v2(desc, 0)
    for i in range(num_outputs):
        desc = acl.get_tensor_desc_by_index(output_desc, i)
        address = acl.get_tensor_desc_address(desc)
        num_dims = acl.get_tensor_desc_num_dims(desc)
        dim_0, ret = acl.get_tensor_desc_dim_v2(desc, 0)
    acl.destroy_tensor_desc(input_desc)
    acl.destroy_tensor_desc(output_desc)

# 5. Set the exception callback function.
ret = acl.rt.set_exception_info_callback(exception_callback)

# 6. Execute the model.
ret = acl.mdl.execute(model_id, input, output)

# 7. Process the model inference result.
# ......
# 8. Destroy allocations such as the model inputs and outputs, free memory, and unload the model.
# ......
# 9. Destroy runtime allocations.
# ......
```
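The comment in step 4 suggests writing the collected operator information to a file for offline analysis. A minimal sketch of such a dump helper follows; the function name and record layout are our own, not part of pyACL, and inside the real callback you would fill the fields from the acl.rt.get_*_from_exception_info calls and the tensor descriptions.

```python
import json

def dump_exception_record(path, device_id, stream_id, task_id, op_name, input_shapes):
    """Append one AI Core exception record to a JSON-lines file.

    All arguments except `path` are values already extracted inside the
    exception callback; keeping the callback itself short and doing the
    analysis offline avoids heavy work in the error path.
    """
    record = {
        "device_id": device_id,
        "stream_id": stream_id,
        "task_id": task_id,
        "op_name": op_name,
        "input_shapes": input_shapes,  # e.g. shape lists read from the tensor descriptions
    }
    with open(path, "a") as f:  # append, so repeated errors accumulate
        f.write(json.dumps(record) + "\n")
    return record

# Usage with made-up values:
rec = dump_exception_record("aicore_errors.jsonl", 0, 2, 17, "Conv2D", [[8, 3, 224, 224]])
print(rec["op_name"])  # → Conv2D
```

One record per line (JSON lines) keeps the file appendable from a callback and easy to scan with standard tools afterwards.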