Single-Operator Call Sequence

This section describes the single-operator calling mode and API call sequence.

If you need to execute a single operator during app development, first see API Call Sequence to understand the overall process, and then follow the procedure described in this section.

For details about the operators supported by the system, see "Ascend IR Operator Specifications" in Operator Library API Reference.

For operators that are not supported by the system, develop custom operators by referring to the Ascend C Operator Development Guide.

Single-Operator Calling: Single-Operator API Execution, Single-Operator Model Execution, and Kernel Loading and Execution

Single-operator API execution: provides a set of C APIs for calling operators directly. You do not need to provide an Intermediate Representation (IR) definition. These C APIs are generally defined as two-phase APIs, for example:

        
    aclnnStatus aclxxXxxGetWorkspaceSize(const aclTensor *src, ..., aclTensor *out, ..., uint64_t *workspaceSize, aclOpExecutor **executor);
    aclnnStatus aclxxXxx(void *workspace, uint64_t workspaceSize, aclOpExecutor *executor, aclrtStream stream);

Call the first-phase API aclxxXxxGetWorkspaceSize first to calculate the workspace size required for the current API call. Then allocate NPU memory based on the returned workspace size and call the second-phase API aclxxXxx to execute the operator. Here, aclxx is the operator API prefix (for example, aclnn), and Xxx is the operator type (for example, Add).
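The two-phase pattern can be sketched on the host without an NPU. The following is a minimal illustration, not the aclnn implementation: demoAddGetWorkspaceSize, demoAdd, and DemoExecutor are hypothetical stand-ins, and malloc takes the place of device memory allocation (aclrtMalloc in a real app). It only shows the order of calls: query the workspace size, allocate that much memory, and then execute.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins: these are NOT real aclnn APIs. They mimic only the
 * shape of the two-phase pattern, using host memory so the sketch runs anywhere. */
typedef struct {
    const float *a;   /* first input */
    const float *b;   /* second input */
    float *out;       /* output */
    size_t n;         /* element count */
} DemoExecutor;

/* Phase 1: check the arguments, report the workspace size, create an executor. */
static int demoAddGetWorkspaceSize(const float *a, const float *b, float *out,
                                   size_t n, uint64_t *workspaceSize,
                                   DemoExecutor **executor)
{
    if (!a || !b || !out || !workspaceSize || !executor) return 1;
    DemoExecutor *e = malloc(sizeof *e);
    if (!e) return 1;
    e->a = a; e->b = b; e->out = out; e->n = n;
    *workspaceSize = (uint64_t)(n * sizeof(float)); /* scratch for one temporary vector */
    *executor = e;
    return 0;
}

/* Phase 2: compute using the caller-provided workspace. In this sketch the
 * executor is always released here, whether or not the call succeeds. */
static int demoAdd(void *workspace, uint64_t workspaceSize, DemoExecutor *executor)
{
    if (!executor) return 1;
    size_t bytes = executor->n * sizeof(float);
    int ret = 1;
    if (bytes == 0) {
        ret = 0;                                  /* nothing to compute */
    } else if (workspace && workspaceSize >= bytes) {
        float *tmp = workspace;
        for (size_t i = 0; i < executor->n; ++i) tmp[i] = executor->a[i] + executor->b[i];
        memcpy(executor->out, tmp, bytes);
        ret = 0;
    }
    free(executor);
    return ret;
}

/* Caller side: phase 1, allocate the workspace, phase 2, free the workspace. */
static int demo_two_phase(float out[3])
{
    const float a[3] = {1.0f, 2.0f, 3.0f};
    const float b[3] = {10.0f, 20.0f, 30.0f};
    uint64_t wsSize = 0;
    DemoExecutor *exec = NULL;
    if (demoAddGetWorkspaceSize(a, b, out, 3, &wsSize, &exec) != 0) return 1;
    void *ws = wsSize > 0 ? malloc((size_t)wsSize) : NULL; /* real app: device memory */
    int ret = demoAdd(ws, wsSize, exec);  /* fails cleanly if the allocation failed */
    free(ws);
    return ret;
}
```

Calling demo_two_phase from main fills out with the element-wise sums of the two inputs. Having the second-phase call release the executor is a choice made for this example only.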

The operator APIs that can be called are classified into the following categories (for details, see Single-Operator API Execution):

Math operators: mathematical calculation operators, such as Add and Abs. The API prefix is aclnn.
NN operators: neural network operators, such as matmul. The API prefix is aclnn.
CV operators: computer vision operators, such as GridSample. The API prefix is aclnn.
Transformer operators: operators for foundation model computing, such as FlashAttention, MC2 (merged compute and communication), and MoE (Mixture of Experts) operators. The API prefix is aclnn.
DVPP operators: Digital Vision Pre-Processing operators, such as operators for high-performance video/image encoding and decoding, image cropping, and resizing. The API prefix is acldvpp.

Single-operator model execution: Operator execution is based on graph IR. First, compile the operator (for example, use the ATC tool to compile the single-operator description file defined by Ascend IR into an operator .om model file). Then, call an acl API (for example, aclopSetModelDir) to load the operator model. Finally, call an acl API (for example, aclopExecuteV2) to execute the operator.
Kernel loading and execution: Operator execution is based on the operator kernel. Call aclrtBinaryLoadFromFile to load the operator binary file (*.o file), call aclrtLaunchKernelWithConfig to launch the kernel, and call aclrtBinaryUnLoad to unload the operator binary.

API Call Sequence of Single-Operator API Execution

Figure 1 API call sequence of single-operator API execution

The key APIs are described as follows:

Perform initialization.
Call aclInit for initialization.
Allocate runtime resources.
Allocate runtime resources in sequence. For details, see Runtime Resource Allocation and Deallocation.
Allocate and transfer data memory.
1. Call aclrtMalloc to allocate device memory to store the input and output data of the operator.
2. Call APIs such as aclCreateTensor and aclCreateIntArray to construct the operator's input and output objects, such as aclTensor and aclIntArray. For details about the APIs, see Single-Operator API Execution.
To transfer data from the host to the device, call aclrtMemcpy (synchronous mode) or aclrtMemcpyAsync (asynchronous mode) to copy the data.
Calculate the workspace and execute the operator.
1. Call aclxxXxxGetWorkspaceSize to pass in the operator's input arguments and calculate the workspace size required for executing the operator.
2. Call aclrtMalloc to allocate device memory based on the workspace size.
3. Call aclxxXxx to perform calculation and obtain the result.
Call aclrtSynchronizeStream to block the app until all tasks in the specified stream are complete.
Call aclrtFree to free the memory.
To transfer data from the device to the host, call aclrtMemcpy (synchronous mode) or aclrtMemcpyAsync (asynchronous mode) to copy the data, and then free the memory.
Deallocate runtime resources.
1. Call APIs such as aclDestroyTensor and aclDestroyIntArray to destroy the operator's input and output objects. For details about the APIs, see Single-Operator API Execution.
2. After all data is destroyed, deallocate runtime resources in sequence. For details, see Runtime Resource Allocation and Deallocation.
Perform deinitialization.
Call aclFinalize for deinitialization.
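One way to picture the synchronization step in the sequence above is to treat a stream as an ordered queue of tasks whose completion is only guaranteed once synchronization returns. The sketch below is a deliberately simplified host-only model under that assumption: DemoStream, streamEnqueue, and streamSynchronize are hypothetical names, and every task is deferred until synchronization, whereas real tasks run asynchronously on the device and may finish earlier.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_MAX_TASKS 16

/* Hypothetical host-only model of a stream: an ordered queue of deferred tasks. */
typedef void (*DemoTask)(void *arg);
typedef struct {
    DemoTask fn[DEMO_MAX_TASKS];
    void    *arg[DEMO_MAX_TASKS];
    size_t   head;   /* next task to run */
    size_t   tail;   /* next free queue entry */
} DemoStream;

static void streamInit(DemoStream *s) { s->head = 0; s->tail = 0; }

/* Submitting work only queues it; the caller gets control back immediately. */
static int streamEnqueue(DemoStream *s, DemoTask fn, void *arg)
{
    if (s->tail >= DEMO_MAX_TASKS) return 1;
    s->fn[s->tail] = fn;
    s->arg[s->tail] = arg;
    ++s->tail;
    return 0;
}

/* Stand-in for stream synchronization: returns only after every queued
 * task has run, in submission order. */
static void streamSynchronize(DemoStream *s)
{
    while (s->head < s->tail) {
        s->fn[s->head](s->arg[s->head]);
        ++s->head;
    }
}

static void addOne(void *arg)   { ++*(int *)arg; }
static void timesTen(void *arg) { *(int *)arg *= 10; }

/* Queues two dependent tasks, samples the value before synchronizing,
 * and returns the value after synchronizing. Returns -1 if queuing fails. */
static int demoStream(int *beforeSync)
{
    DemoStream s;
    int value = 1;
    streamInit(&s);
    if (streamEnqueue(&s, addOne, &value) != 0) return -1;
    if (streamEnqueue(&s, timesTen, &value) != 0) return -1;
    *beforeSync = value;      /* tasks are queued but not yet complete */
    streamSynchronize(&s);
    return value;             /* both tasks have run, in order */
}
```

The point of the model is the ordering: result data should be copied back and memory freed only after the synchronization call has returned.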

API Call Sequence of Single-Operator Model Execution

Figure 2 API call sequence of single-operator model execution

The key APIs are described as follows:

Compile an operator.
Operators can be compiled in either of the following modes:
- After an operator is compiled, the operator data is saved in the .om file.
  In this mode, use the ATC tool to compile the single-operator definition file (*.json) into an offline model (*.om file) adapted to the Ascend AI Processor. For details, see ATC Instructions.
  After the compilation, perform steps 2 to 7 in sequence (from loading the operator model file through freeing the memory).
- After an operator is compiled, the operator data is saved in memory.
  In this mode, you need to call acl APIs as required.
  - For operators that will be executed multiple times, you are advised to call aclopCompile to compile them. After the compilation, perform steps 3 to 7 in sequence (from allocating device memory for the input and output data through freeing the memory).
  - For operators that are compiled and executed the same number of times, you are advised to perform step 3 (allocate device memory and transfer the input data) and then call aclopCompileAndExecute. After that, perform steps 6 and 7 in sequence (synchronize the stream and free the memory).
Load the operator model file.
You can use either of the following methods:
- Call aclopSetModelDir to set the directory of the single-operator .om model file.
- Call aclopLoad to load the single-operator model data from memory. The memory is managed by the user. The "single-operator model data" refers to the data that is loaded to the memory from the single-operator .om file.
Call aclrtMalloc to allocate device memory to store the input and output data of the operator.
To transfer data from the host to the device, call aclrtMemcpy (synchronous mode) or aclrtMemcpyAsync (asynchronous mode) to copy the memory.
In the dynamic-shape scenario, if the output shape of an operator cannot be determined in advance, infer or estimate it before executing the operator.
Call aclopInferShape, aclGetTensorDescNumDims, aclGetTensorDescDimV2, and aclGetTensorDescDimRange to infer or estimate the output shape, and then pass the result as input to the operator execution API aclopExecuteV2.
Execute the operator.
- Operators encapsulated as acl APIs (for details, see CBLAS APIs), including the GEMM operator and Cast operator, can be executed in either of the following ways:
  - Non-handle mode: Call APIs whose names do not contain the keyword "Handle", for example, aclblasGemmEx (with the GEMM operator encapsulated) and aclopCast (with the Cast operator encapsulated).
  - Handle mode: Call APIs whose names contain the keyword "Handle", for example, aclblasCreateHandleForGemmEx and aclopCreateHandleForCast, and then call aclopExecWithHandle to execute the operator.
- Operators that are not encapsulated as acl APIs can be executed in either of the following ways:
  - Non-handle mode: Call aclopExecuteV2.
  - Handle mode: Call aclopCreateHandle to create a handle, and then call aclopExecWithHandle.
In non-handle mode, the system matches the operator description against the models in memory on every execution.
In handle mode, the system matches the operator description against the models in memory once and caches the result in the handle, so the match is not repeated on each execution. This makes handle mode more efficient when the same operator is executed multiple times. However, handle mode does not support dynamic-shape operators. After you finish using a handle, call aclopDestroyHandle to release it.
Call aclrtSynchronizeStream to block the app until all tasks in the specified stream are complete.
Call aclrtFree to free the memory.
To transfer data from the device to the host, call aclrtMemcpy (synchronous mode) or aclrtMemcpyAsync (asynchronous mode) to copy the data, and then free the memory.
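The difference between non-handle mode and handle mode in the operator execution step above comes down to when the operator description is matched against the loaded models. The sketch below is a host-only model of that idea, not the acl implementation: OpDesc, LoadedModel, OpHandle, matchModel, and the other names are hypothetical, and a counter records how often matching runs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical types for illustration only; they do not mirror acl structures. */
typedef struct { const char *opType; int dim; } OpDesc;
typedef struct { OpDesc desc; int (*run)(int); } LoadedModel;
typedef struct { const LoadedModel *model; } OpHandle;

static int runDouble(int x) { return 2 * x; }
static int runNegate(int x) { return -x; }

/* "Models in memory": one entry per operator description. */
static const LoadedModel g_models[] = {
    { { "Double", 8 }, runDouble },
    { { "Negate", 8 }, runNegate },
};
static int g_matchCount = 0;   /* how many times description-to-model matching ran */

static const LoadedModel *matchModel(const OpDesc *d)
{
    ++g_matchCount;
    for (size_t i = 0; i < sizeof g_models / sizeof g_models[0]; ++i) {
        if (strcmp(g_models[i].desc.opType, d->opType) == 0 &&
            g_models[i].desc.dim == d->dim) {
            return &g_models[i];
        }
    }
    return NULL;
}

/* Non-handle mode: the description is matched on every execution. */
static int executeOp(const OpDesc *d, int in, int *out)
{
    const LoadedModel *m = matchModel(d);
    if (!m) return 1;
    *out = m->run(in);
    return 0;
}

/* Handle mode: match once at creation and cache the result in the handle. */
static int createHandle(const OpDesc *d, OpHandle *h)
{
    h->model = matchModel(d);
    return h->model ? 0 : 1;
}

static int execWithHandle(const OpHandle *h, int in, int *out)
{
    if (!h || !h->model) return 1;
    *out = h->model->run(in);   /* no matching here */
    return 0;
}

/* Runs the same operator `runs` times and returns how often matching ran,
 * or -1 if any call fails. */
static int countMatches(int useHandle, int runs)
{
    OpDesc d = { "Double", 8 };
    OpHandle h;
    int out = 0;
    g_matchCount = 0;
    if (useHandle && createHandle(&d, &h) != 0) return -1;
    for (int i = 0; i < runs; ++i) {
        int ret = useHandle ? execWithHandle(&h, i, &out) : executeOp(&d, i, &out);
        if (ret != 0) return -1;
    }
    return g_matchCount;
}
```

In this model, five executions trigger five matches in non-handle mode and a single match in handle mode, which is the saving the text describes for repeatedly executed operators.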

API Call Sequence for Kernel Loading and Execution


The major steps are as follows:

Call aclInit for initialization.
For details, see Initialization and Deinitialization.
Allocate runtime resources. Call aclrtSetDevice to specify the compute device and call aclrtCreateStream to create a stream.
For details, see Runtime Resource Allocation and Deallocation.
Call aclrtBinaryLoadFromFile to load the operator binary file.
For AI CPU operators, the operator binary data can also be loaded from memory by calling aclrtBinaryLoadFromData. After the operator binary data is loaded, call aclrtRegisterCpuFunc to register the AI CPU operators.
Call aclrtBinaryGetFunctionByEntry or aclrtBinaryGetFunction to obtain the kernel function handle.
(Optional) Build the parameter list based on the kernel function handle, as follows:
1. Initialize the parameter list.
  The memory for the parameter list can be managed by the system (by calling aclrtKernelArgsInit) or by the user (by calling aclrtKernelArgsInitByUserMem).
2. Append parameters and update parameter values.
  The kernel function parameter list contains parameters of different types, such as pointer, placeholder, and uint8_t parameters.
  - Pointer parameter: Its value is a device memory address. Generally, the input and output of an operator are parameters of this type. You need to call the device memory allocation API (for example, aclrtMalloc) in advance to allocate memory and copy data to the device.
  - Placeholder: A placeholder is also a pointer parameter. The difference is that you do not need to manually copy the parameter data to the device. Instead, this operation is completed by the Runtime. The Runtime does not fill the actual device address when a parameter is appended, but fills it only during kernel launch. That is where the placeholder comes in. For non-input and non-output parameters of an operator, you can use a placeholder to combine the host-to-device copies of small data (< 2 KB recommended) into one copy during kernel launch, thus reducing the number of copy operations and improving performance.
  You can call different parameter appending APIs for different types of parameters.
  - For a placeholder parameter, the associated memory must be placed after all parameters. Therefore, when appending a placeholder parameter, call aclrtKernelArgsAppendPlaceHolder to set a placeholder. After all parameters are appended, call aclrtKernelArgsGetPlaceHolderBuffer to obtain the memory address to which the placeholder points. You can manage the data in the memory based on the obtained memory address.
  - For a non-placeholder parameter (such as a pointer parameter or a uint8_t parameter), call aclrtKernelArgsAppend to copy the user-defined parameter value to the parameter data area to which argsHandle points. To update the parameter value, call aclrtKernelArgsParaUpdate.
  Note that the kernel function parameter list may contain multiple parameters, and parameters of different types may appear alternately. Therefore, you need to append parameters from left to right according to the parameter sequence in the parameter list. A maximum of 128 parameters can be appended.
3. End the parameter appending and parameter value update.
  After all parameters are appended, call aclrtKernelArgsFinalize to indicate that the parameters are assembled. Parameter values can still be updated after aclrtKernelArgsFinalize is called, but aclrtKernelArgsFinalize must then be called again.
Call a kernel launch API to start the compute task of the corresponding operator.
If the parameter list handle (aclrtArgsHandle) is used to assemble the input data of the kernel function, call aclrtLaunchKernelWithConfig. In this mode, you only need to append parameters to the parameter list in sequence and do not need to handle how they are laid out in memory or deal with internal parameters.
If the input data of the kernel function is stored in host or device memory, call aclrtLaunchKernel, aclrtLaunchKernelV2, or aclrtLaunchKernelWithHostArgs.
Call aclrtBinaryUnLoad to unload the operator binary file.
Deallocate runtime resources. Call aclrtDestroyStream to destroy streams and call aclrtResetDevice to release resources on the device.
For details, see Runtime Resource Allocation and Deallocation.
Call aclFinalize for deinitialization.
For details, see Initialization and Deinitialization.
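The parameter list rules described in the optional step above (append in declaration order, place placeholder data after all parameters, fill the placeholder address only at launch) can be modeled on the host. The sketch below is a conceptual model under stated assumptions, not the Runtime's actual layout: DemoArgs and the args* functions are hypothetical, each parameter is assumed to occupy one 8-byte slot, and argsResolve stands in for what happens at kernel launch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MAX_PARAMS 128   /* the text states that at most 128 parameters can be appended */
#define SLOT_BYTES 8     /* assumption: one 8-byte slot per parameter */
#define TAIL_WORDS 256   /* assumption: 2 KB reserved for placeholder data */

typedef struct {
    uint64_t words[MAX_PARAMS + TAIL_WORDS]; /* parameter slots, then placeholder data */
    size_t   count;                          /* parameters appended so far */
    size_t   tailWords;                      /* placeholder data words handed out */
    size_t   phOffset[MAX_PARAMS];           /* data offset per placeholder, 0 = none */
    int      sealed;                         /* set once placeholder data is handed out */
} DemoArgs;

static void argsInit(DemoArgs *a) { memset(a, 0, sizeof *a); }

/* Append a plain value (pointer or small scalar) by copying it into the next slot. */
static int argsAppend(DemoArgs *a, const void *value, size_t size)
{
    if (a->sealed || a->count >= MAX_PARAMS || size > SLOT_BYTES) return -1;
    memcpy(&a->words[a->count], value, size);
    return (int)a->count++;
}

/* Append a placeholder: reserve the slot now, leave the address unfilled. */
static int argsAppendPlaceholder(DemoArgs *a)
{
    if (a->sealed || a->count >= MAX_PARAMS) return -1;
    a->words[a->count] = 0;
    return (int)a->count++;
}

/* After all parameters are appended, hand out memory located behind the last slot. */
static void *argsGetPlaceholderBuffer(DemoArgs *a, int index, size_t size)
{
    size_t words = (size + 7) / 8;
    if (index < 0 || (size_t)index >= a->count) return NULL;
    if (a->tailWords + words > TAIL_WORDS) return NULL;
    a->sealed = 1;                                   /* no further appends */
    a->phOffset[index] = a->count + a->tailWords;    /* always >= 1 */
    a->tailWords += words;
    return &a->words[a->phOffset[index]];
}

/* Stand-in for launch: fill each placeholder slot with the address of its data. */
static void argsResolve(DemoArgs *a)
{
    for (size_t i = 0; i < a->count; ++i) {
        if (a->phOffset[i] != 0) {
            a->words[i] = (uint64_t)(uintptr_t)&a->words[a->phOffset[i]];
        }
    }
}

typedef struct { uint32_t blockLen; uint32_t blockNum; } DemoTiling;

/* Builds the list for a hypothetical kernel(float *x, uint8_t flag, tiling *t). */
static int demoBuildArgs(DemoArgs *a, float *x)
{
    const uint8_t flag = 1;
    const DemoTiling tiling = { 256, 4 };
    argsInit(a);
    if (argsAppend(a, &x, sizeof x) != 0) return 1;        /* parameter 0: pointer */
    if (argsAppend(a, &flag, sizeof flag) != 1) return 1;  /* parameter 1: uint8_t */
    int ph = argsAppendPlaceholder(a);                     /* parameter 2: placeholder */
    if (ph != 2) return 1;
    void *buf = argsGetPlaceholderBuffer(a, ph, sizeof tiling);
    if (!buf) return 1;
    memcpy(buf, &tiling, sizeof tiling);   /* small host data, copied along with the list */
    argsResolve(a);
    return 0;
}
```

In this model the placeholder data lands directly behind the three parameter slots, and the placeholder slot receives its address only when argsResolve runs, which mirrors the append-then-fill-at-launch behavior described in the text.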
