Sample Code for Executing a Dynamic-Shape Operator (Operator Selector Registered)
This section describes the key APIs and sample code for calling dynamic-shape operators in single-operator model execution mode. Because registering an operator selector applies only to a limited set of cases, this section is mainly relevant to TIK custom operators.
Prerequisite
Before loading and executing a dynamic-shape operator, develop a custom operator and generate the corresponding binary file. For details, see "Special Topics > TIK Custom Operator with Dynamic Shape" in TBE&AI CPU Operator Developer Guide.
Principles
The procedure of loading and executing a dynamic-shape operator (with operator selector registered) is as follows:
- Initialize resources, including initializing AscendCL, setting the directory of your single-operator model file and setting the compute device.
- Call aclInit to initialize AscendCL.
- Call the AscendCL APIs to register your custom operator.
- Call aclopRegisterCompileFunc to register an operator selector (that is, the function for tiling policy selection). The selection of the tiling policy depends on the operator's input shape.
The operator selector needs to be defined and implemented by the user in advance.
- Function prototype
typedef aclError (*aclopCompileFunc)(int numInputs, const aclTensorDesc *const inputDesc[], int numOutputs, const aclTensorDesc *const outputDesc[], const aclopAttr *opAttr, aclopKernelDesc *aclopKernelDesc);
- Function implementation
Implement the logic that selects a tiling policy and generates the tiling arguments. Use aclopSetKernelArgs to set the tiling arguments and the number of AI Cores to launch for concurrent execution.
- Call aclopCreateKernel to register the operator with the system so that the operator implementation can be located during operator execution.
- Call aclrtSetDevice to specify the compute device.
- Call aclrtCreateStream to create a stream explicitly.
The aclrtSetDevice call also implicitly creates a default stream. If no stream is created explicitly, the default stream is used. To pass the default stream to any API call, pass NULL directly.
- Construct the operator description information (including the input and output tensor descriptions and operator attributes) and allocate memory for storing the input and output data of the operator.
- Transfer the operator input data to the device memory.
- Compile the single operator.
Call aclopUpdateParams to compile the operator. This call triggers the operator selector.
- Execute the single operator.
Call aclopExecuteV2 to load and execute the operator.
- Obtain the operator output data.
- Destroy the stream and context and reset the device in sequence.
- Call aclFinalize to deinitialize AscendCL.
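The operator selector's core job, choosing a tiling policy from the input shape, can be sketched independently of AscendCL. The following is a minimal sketch and not the guide's implementation: TilingArgs, TilingChoice, SelectBatchNormTiling, the kernel ID tiling_mode_2__kernel0, and both thresholds are hypothetical. In a real selector, the chosen kernel ID, core count, and tiling arguments would be handed to aclopSetKernelArgs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical tiling arguments; the real layout is dictated by the TIK kernel.
struct TilingArgs {
    int32_t tilingMode;  // which tiling policy was selected
    int32_t n, c, h, w;  // input shape forwarded to the kernel
};

// Hypothetical result of the selection step.
struct TilingChoice {
    std::string kernelId;  // ID of a kernel registered via aclopCreateKernel
    uint32_t blockDim;     // number of AI Cores to launch concurrently
    TilingArgs args;
};

constexpr int64_t kMaxCores = 32;       // assumed number of available AI Cores
constexpr int64_t kPlaneBudget = 4096;  // assumed max H*W elements for mode 1

// Selects a tiling policy from an NCHW input shape.
// Mode 1: one H*W plane fits the budget, so work is split by channel.
// Mode 2: planes are too large, so all cores are used.
TilingChoice SelectBatchNormTiling(const std::vector<int64_t>& nchw) {
    assert(nchw.size() == 4 && "expected an NCHW shape");
    const int64_t n = nchw[0], c = nchw[1], h = nchw[2], w = nchw[3];

    TilingChoice choice;
    if (h * w <= kPlaneBudget) {
        choice.kernelId = "tiling_mode_1__kernel0";
        choice.args.tilingMode = 1;
        choice.blockDim = static_cast<uint32_t>(std::min(c, kMaxCores));
    } else {
        choice.kernelId = "tiling_mode_2__kernel0";  // hypothetical kernel ID
        choice.args.tilingMode = 2;
        choice.blockDim = static_cast<uint32_t>(kMaxCores);
    }
    choice.args.n = static_cast<int32_t>(n);
    choice.args.c = static_cast<int32_t>(c);
    choice.args.h = static_cast<int32_t>(h);
    choice.args.w = static_cast<int32_t>(w);
    return choice;
}
```

Keeping the policy in a plain function like this makes it possible to test shape boundaries on the host before wiring the result into the registered selector.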
Sample Code
After each API call in the sample, add exception-handling branches and log messages at the error and information levels. The following snippet shows the key steps only and is not ready to be built or run.
For the sample about building and running a dynamic-shape operator (with an operator selector registered), see "Sample Usage" in TBE&AI CPU Operator Developer Guide.
#include "acl/acl.h"
// ......

// 1. Initialize resources.
// No configuration file is passed to aclInit here.
aclError ret = aclInit(NULL);

// Specify a compute device.
int32_t deviceId = 0;
ret = aclrtSetDevice(deviceId);

// Explicitly create a stream to preserve the execution order of asynchronous tasks.
aclrtStream stream;
ret = aclrtCreateStream(&stream);

// Register the operator selector (defined and implemented in advance).
aclopRegisterCompileFunc("BatchNorm", SelectAclopBatchNorm);

// Build the .o file of the operator kernel in advance and call a user-defined
// function to load the file into the memory buffer. length indicates the memory size.
// To load multiple .o files, make a separate call for each .o file.
aclopCreateKernel("BatchNorm", "tiling_mode_1__kernel0", "tiling_mode_1__kernel0",
                  buffer, length, ACL_ENGINE_AICORE, Deallocator);

// -----User-defined function BatchNormTest(n, c, h, w) start-----
// 2. Construct the input/output tensor descriptions of the BatchNorm operator and
//    allocate memory for storing the input and output data of the operator.
aclTensorDesc *input_desc[3];
aclTensorDesc *output_desc[1];
input_desc[0] = aclCreateTensorDesc(ACL_FLOAT16, 4, shape_input, ACL_FORMAT_NCHW);
input_desc[1] = aclCreateTensorDesc(ACL_FLOAT16, 1, shape_gamma, ACL_FORMAT_ND);
input_desc[2] = aclCreateTensorDesc(ACL_FLOAT16, 1, shape_beta, ACL_FORMAT_ND);
output_desc[0] = aclCreateTensorDesc(ACL_FLOAT16, 4, shape_out, ACL_FORMAT_NCHW);

// Fill the host-side input data.
for (int i = 0; i < n * c * h * w; ++i) {
    input[i] = aclFloatToFloat16(1.0f);
}
for (int i = 0; i < c; ++i) {
    gamma[i] = aclFloatToFloat16(0.5f);
    beta[i] = aclFloatToFloat16(0.1f);
}

aclrtMalloc(&devInput, size_input, ACL_MEM_MALLOC_HUGE_FIRST);
aclrtMalloc(&devInput_gamma, size_gamma, ACL_MEM_MALLOC_HUGE_FIRST);
aclrtMalloc(&devInput_beta, size_beta, ACL_MEM_MALLOC_HUGE_FIRST);
aclrtMalloc(&devOutput, size_output, ACL_MEM_MALLOC_HUGE_FIRST);

// Wrap the device memory in data buffers for aclopExecuteV2.
aclDataBuffer *inputs[3];
aclDataBuffer *outputs[1];
inputs[0] = aclCreateDataBuffer(devInput, size_input);
inputs[1] = aclCreateDataBuffer(devInput_gamma, size_gamma);
inputs[2] = aclCreateDataBuffer(devInput_beta, size_beta);
outputs[0] = aclCreateDataBuffer(devOutput, size_output);

// 3. Transfer the operator input data to the device memory.
aclrtMemcpy(devInput, size_input, input, size_input, ACL_MEMCPY_HOST_TO_DEVICE);
aclrtMemcpy(devInput_gamma, size_gamma, gamma, size_gamma, ACL_MEMCPY_HOST_TO_DEVICE);
aclrtMemcpy(devInput_beta, size_beta, beta, size_beta, ACL_MEMCPY_HOST_TO_DEVICE);

// 4. Call aclopUpdateParams to compile the operator.
aclopUpdateParams("BatchNorm", 3, input_desc, 1, output_desc, nullptr,
                  ACL_ENGINE_AICORE, ACL_COMPILE_UNREGISTERED, nullptr);

// 5. Call aclopExecuteV2 to load and execute the operator.
aclopExecuteV2("BatchNorm", 3, input_desc, inputs, 1, output_desc, outputs, nullptr, stream);
// -----User-defined function BatchNormTest(n, c, h, w) end-----

// 6. Obtain the operator output data.
// The operator runs asynchronously, so wait for the stream before copying the result.
aclrtSynchronizeStream(stream);
aclrtMemcpy(output, size_output, devOutput, size_output, ACL_MEMCPY_DEVICE_TO_HOST);

// 7. Destroy allocations in sequence.
// 7.1 Destroy the input and output tensor descriptions and data buffers.
for (auto desc : input_desc) {
    aclDestroyTensorDesc(desc);
}
for (auto desc : output_desc) {
    aclDestroyTensorDesc(desc);
}
for (auto buf : inputs) {
    aclDestroyDataBuffer(buf);
}
for (auto buf : outputs) {
    aclDestroyDataBuffer(buf);
}

// 7.2 Free unused host memory in a timely manner.
delete[] input;
delete[] gamma;
delete[] beta;
delete[] output;

// 7.3 Free device memory.
aclrtFree(devInput);
aclrtFree(devInput_gamma);
aclrtFree(devInput_beta);
aclrtFree(devOutput);

// 7.4 Deallocate the stream and device resources in sequence.
// If the stream was not created explicitly, you do not need to destroy it.
aclrtDestroyStream(stream);
aclrtResetDevice(deviceId);
aclFinalize();
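The snippet leaves out two supporting pieces: checking each return code with error-level and information-level logging, and computing buffer sizes such as size_input. The following is a minimal sketch of both that compiles without the AscendCL headers. Status and kSuccess are stand-ins assumed to correspond to aclError and ACL_SUCCESS, and CheckStatus and TensorBytes are hypothetical helper names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

using Status = int;             // stand-in assumed to match aclError
constexpr Status kSuccess = 0;  // stand-in assumed to match ACL_SUCCESS

// Logs an error-level message on failure and an info-level message on success.
// Returns true when the call succeeded so the caller can branch on it.
bool CheckStatus(Status ret, const char* call) {
    if (ret != kSuccess) {
        std::fprintf(stderr, "[ERROR] %s failed, error code %d\n", call, ret);
        return false;
    }
    std::fprintf(stdout, "[INFO] %s succeeded\n", call);
    return true;
}

// Returns the number of bytes needed for a densely packed tensor with the
// given shape and element size (2 bytes per element for FP16).
std::size_t TensorBytes(const std::vector<int64_t>& shape, std::size_t elemSize) {
    std::size_t count = 1;
    for (int64_t dim : shape) {
        assert(dim >= 0 && "dimensions must be non-negative");
        count *= static_cast<std::size_t>(dim);
    }
    return count * elemSize;
}
```

With helpers like these, size_input for the FP16 input could be computed as TensorBytes({n, c, h, w}, 2), and each call in the snippet could be wrapped in a branch such as if (!CheckStatus(ret, "aclrtSetDevice")) that releases resources and returns.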