Operator Development Based on Custom Operator Project
This section uses a simple operator as an example to describe the entire development process, including operator project creation, code implementation, build and deployment, and runtime verification. The example is an Add operator with dynamic shape input; to distinguish it from the built-in Add operator, its operator type is defined as AddCustom.
Project Creation
The CANN package provides the project creation tool msOpGen. You pass it the operator prototype definition file to generate an Ascend C operator development project.
- Write the prototype definition JSON file of the AddCustom operator. Assume that the file is named add_custom.json and stored in $HOME/sample. The file content is as follows:
[
    {
        "op": "AddCustom",
        "input_desc": [
            {
                "name": "x",
                "param_type": "required",
                "format": ["ND"],
                "type": ["fp16"]
            },
            {
                "name": "y",
                "param_type": "required",
                "format": ["ND"],
                "type": ["fp16"]
            }
        ],
        "output_desc": [
            {
                "name": "z",
                "param_type": "required",
                "format": ["ND"],
                "type": ["fp16"]
            }
        ]
    }
]
- Generate a development project of the AddCustom operator using msOpGen.
${INSTALL_DIR}/python/site-packages/bin/msopgen gen -i $HOME/sample/add_custom.json -c ai_core-<soc_version> -lan cpp -out $HOME/sample/AddCustom
- ${INSTALL_DIR} is the file storage path after the CANN software is installed. Replace it with the actual path.
- -i: path of the operator prototype definition file add_custom.json.
- -c: ai_core-<soc_version> indicates that the operator is executed on the AI Core. <soc_version> indicates the model of the Ascend AI Processor.
The AI processor model <soc_version> can be obtained in the following ways:
- Run the npu-smi info command on the server where the Ascend AI Processor is installed to obtain the Chip Name information. The actual value is Ascend concatenated with Chip Name. For example, if Chip Name is xxxyy, the actual value is Ascendxxxyy.
Basic functions (operator development, build, and deployment based on the project) are applicable across operator projects created for the same AI processor series.
- -lan: cpp indicates that the operator is developed based on the Ascend C programming framework and using the C++ programming language.
- After the command is executed, the operator project directory is generated in the $HOME/sample directory. The project contains the operator implementation template file and build script:
AddCustom
├── build.sh                  // Build entry script
├── cmake                     // Directory of scripts and common files used for operator project build
├── CMakeLists.txt            // Build script of the operator project
├── CMakePresets.json         // Build configuration items
├── framework                 // Directory of the operator plugin implementation files for AI framework adaptation
├── op_host                   // Host-side implementation files
│   ├── add_custom_tiling.h   // Operator tiling definition file
│   ├── add_custom.cpp        // Operator prototype registration, shape inference, information library, and tiling implementation
│   ├── CMakeLists.txt
├── op_kernel                 // Kernel-side implementation files
│   ├── CMakeLists.txt
│   ├── add_custom.cpp        // Implementation of the operator kernel function
├── scripts                   // Directory of scripts used for custom operator project packaging
During subsequent development, you only need to pay attention to the add_custom_tiling.h and add_custom.cpp implementation files in the op_host and op_kernel directories.
Operator Kernel Function Implementation
Implement the kernel function of the operator in the AddCustom/op_kernel/add_custom.cpp file in the project storage directory. You can view the complete sample code in add_custom.cpp. The following describes the key implementation code.
[Figure: internal call flow of the operator kernel function implementation]
The Init function performs initialization, and the Process function implements the three pipeline stages: data move-in, computation, and data move-out. Focus on the implementation of these functions.
- First, define the kernel function and call the Init and Process functions of the operator class in the kernel function.
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR workspace, GM_ADDR tiling)
{
    // Obtain the tiling parameters passed from the host.
    GET_TILING_DATA(tiling_data, tiling);
    // Instantiate the operator class.
    KernelAdd op;
    // Initialization function of the operator class, which initializes the memory.
    op.Init(x, y, z, tiling_data.totalLength, tiling_data.tileNum);
    // Execute the core logic of the operator.
    op.Process();
}
- Define the KernelAdd operator class. The members and member functions are implemented as follows:
#include "kernel_operator.h"

constexpr int32_t BUFFER_NUM = 2;

class KernelAdd {
public:
    __aicore__ inline KernelAdd() {}
    // Initialization function, which is used to initialize the memory.
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, uint32_t totalLength, uint32_t tileNum)
    {
        // Use the obtained TilingData to compute variables such as blockLength (total size of the data computed
        // on each core), tileNum (number of tiles on each core), and tileLength (size of each tile).
        this->blockLength = totalLength / AscendC::GetBlockNum();
        this->tileNum = tileNum;
        this->tileLength = this->blockLength / tileNum / BUFFER_NUM;
        // Obtain the start address of the current core.
        xGm.SetGlobalBuffer((__gm__ DTYPE_X*)x + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
        yGm.SetGlobalBuffer((__gm__ DTYPE_Y*)y + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
        zGm.SetGlobalBuffer((__gm__ DTYPE_Z*)z + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
        // The pipe memory management object allocates memory for the input and output queues.
        pipe.InitBuffer(inQueueX, BUFFER_NUM, this->tileLength * sizeof(DTYPE_X));
        pipe.InitBuffer(inQueueY, BUFFER_NUM, this->tileLength * sizeof(DTYPE_Y));
        pipe.InitBuffer(outQueueZ, BUFFER_NUM, this->tileLength * sizeof(DTYPE_Z));
    }
    // Core processing function, which implements the operator logic and calls the private member functions
    // CopyIn, Compute, and CopyOut to complete the three-stage pipeline of the vector operator.
    __aicore__ inline void Process()
    {
        int32_t loopCount = this->tileNum * BUFFER_NUM;
        for (int32_t i = 0; i < loopCount; i++) {
            CopyIn(i);
            Compute(i);
            CopyOut(i);
        }
    }

private:
    // Move-in function, called by Process to complete the CopyIn stage.
    __aicore__ inline void CopyIn(int32_t progress)
    {
        // Allocate input tensors from the queues.
        AscendC::LocalTensor<DTYPE_X> xLocal = inQueueX.AllocTensor<DTYPE_X>();
        AscendC::LocalTensor<DTYPE_Y> yLocal = inQueueY.AllocTensor<DTYPE_Y>();
        // Copy data from the GlobalTensor to the LocalTensor.
        AscendC::DataCopy(xLocal, xGm[progress * this->tileLength], this->tileLength);
        AscendC::DataCopy(yLocal, yGm[progress * this->tileLength], this->tileLength);
        // Enqueue the LocalTensors to the VECIN queue (the logical storage location of the moved-in data in vector programming).
        inQueueX.EnQue(xLocal);
        inQueueY.EnQue(yLocal);
    }
    // Compute function, called by Process to complete the Compute stage.
    __aicore__ inline void Compute(int32_t progress)
    {
        // Dequeue the input tensors for subsequent computation.
        AscendC::LocalTensor<DTYPE_X> xLocal = inQueueX.DeQue<DTYPE_X>();
        AscendC::LocalTensor<DTYPE_Y> yLocal = inQueueY.DeQue<DTYPE_Y>();
        // Allocate the output tensor from the queue.
        AscendC::LocalTensor<DTYPE_Z> zLocal = outQueueZ.AllocTensor<DTYPE_Z>();
        // Call the Add API to compute.
        AscendC::Add(zLocal, xLocal, yLocal, this->tileLength);
        // Enqueue the result LocalTensor to the VECOUT queue.
        outQueueZ.EnQue<DTYPE_Z>(zLocal);
        // Release the input tensors.
        inQueueX.FreeTensor(xLocal);
        inQueueY.FreeTensor(yLocal);
    }
    // Move-out function, called by Process to complete the CopyOut stage.
    __aicore__ inline void CopyOut(int32_t progress)
    {
        // Dequeue the output tensor from the VECOUT queue.
        AscendC::LocalTensor<DTYPE_Z> zLocal = outQueueZ.DeQue<DTYPE_Z>();
        // Copy the output tensor to the GlobalTensor.
        AscendC::DataCopy(zGm[progress * this->tileLength], zLocal, this->tileLength);
        // Release the LocalTensor that is no longer used.
        outQueueZ.FreeTensor(zLocal);
    }

private:
    // Pipe memory management object.
    AscendC::TPipe pipe;
    // Queue management objects of the input data. The QuePosition is VECIN.
    AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueueX, inQueueY;
    // Queue management object of the output data. The QuePosition is VECOUT.
    AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> outQueueZ;
    // Objects for managing the input and output Global Memory addresses. xGm and yGm are inputs; zGm is the output.
    AscendC::GlobalTensor<DTYPE_X> xGm;
    AscendC::GlobalTensor<DTYPE_Y> yGm;
    AscendC::GlobalTensor<DTYPE_Z> zGm;
    // Total size of the data computed on each core.
    uint32_t blockLength;
    // Number of tiles on each core.
    uint32_t tileNum;
    // Size of each tile.
    uint32_t tileLength;
};
Operator Implementation on the Host
After the kernel function is developed and verified, implement the host-side functions by modifying the add_custom_tiling.h and add_custom.cpp files in the AddCustom/op_host directory. The following describes the key implementation of these two files. For the complete sample code, see add_custom_tiling.h and add_custom.cpp.
- Modify the add_custom_tiling.h file by adding the following code to define the tiling parameters:
#ifndef ADD_CUSTOM_TILING_H
#define ADD_CUSTOM_TILING_H
#include "register/tilingdata_base.h"

namespace optiling {
BEGIN_TILING_DATA_DEF(TilingData)
// The AddCustom operator uses two tiling parameters: totalLength and tileNum.
TILING_DATA_FIELD_DEF(uint32_t, totalLength);  // Total amount of data to compute
TILING_DATA_FIELD_DEF(uint32_t, tileNum);      // Number of tiles on each core
END_TILING_DATA_DEF;
// Register the TilingData struct to the corresponding operator.
REGISTER_TILING_DATA_CLASS(AddCustom, TilingData)
}
#endif // ADD_CUSTOM_TILING_H
- Modify the add_custom.cpp file to implement tiling. Modify the TilingFunc function to obtain the tiling context, obtain the input and output shapes from the context, set TilingData based on the shape information, serialize TilingData to the tiling buffer, and set the block dimension.
namespace optiling {
const uint32_t BLOCK_DIM = 8;
const uint32_t TILE_NUM = 8;

static ge::graphStatus TilingFunc(gert::TilingContext* context)
{
    TilingData tiling;
    uint32_t totalLength = context->GetInputShape(0)->GetOriginShape().GetShapeSize();
    context->SetBlockDim(BLOCK_DIM);
    tiling.set_totalLength(totalLength);
    tiling.set_tileNum(TILE_NUM);
    tiling.SaveToBuffer(context->GetRawTilingData()->GetData(), context->GetRawTilingData()->GetCapacity());
    context->GetRawTilingData()->SetDataSize(tiling.GetDataSize());
    size_t *currentWorkspace = context->GetWorkspaceSizes(1);
    currentWorkspace[0] = 0;
    return ge::GRAPH_SUCCESS;
}
} // namespace optiling
- Implement the shape inference of the AddCustom operator in the add_custom.cpp file.
The output shape of the Add operator equals the input shape, so the input shape is assigned directly to the output shape. The InferShape function in the code generated by msOpGen does not need to be modified.
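Conceptually, this identity shape inference just copies the input dimensions to the output. A framework-independent sketch in plain C++ (the Shape alias and function name are hypothetical stand-ins, not the actual gert API):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the graph-engine shape type.
using Shape = std::vector<int64_t>;

// Identity shape inference: the output shape of Add equals the input shape.
Shape InferAddCustomShape(const Shape& x) {
    return x;  // copy the input dims to the output dims
}
```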
- Modify the operator prototype registration in the add_custom.cpp file, which serves as the entry point of the operator definition.
namespace ops {
class AddCustom : public OpDef {
public:
    explicit AddCustom(const char* name) : OpDef(name)
    {
        // First input of the Add operator
        this->Input("x")
            .ParamType(REQUIRED)                                      // The input is required.
            .DataType({ ge::DT_FLOAT16, ge::DT_FLOAT, ge::DT_INT32 }) // Data types supported by the input
            .Format({ ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND }); // Data formats supported by the input
        // Second input of the Add operator
        this->Input("y")
            .ParamType(REQUIRED)
            .DataType({ ge::DT_FLOAT16, ge::DT_FLOAT, ge::DT_INT32 })
            .Format({ ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND });
        // Output of the Add operator
        this->Output("z")
            .ParamType(REQUIRED)
            .DataType({ ge::DT_FLOAT16, ge::DT_FLOAT, ge::DT_INT32 })
            .Format({ ge::FORMAT_ND, ge::FORMAT_ND, ge::FORMAT_ND });
        // Associate the InferShape function.
        this->SetInferShape(ge::InferShape);
        // Associate the tiling function.
        this->AICore().SetTiling(optiling::TilingFunc);
        // Register the AI processor model supported by the operator. Replace it with the actual model.
        this->AICore().AddConfig("ascendxxx");
    }
};
// Register the operator.
OP_ADD(AddCustom);
} // namespace ops
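The SaveToBuffer call in TilingFunc serializes the tiling fields into a raw byte buffer that the kernel later reads back through GET_TILING_DATA. The round trip can be sketched in plain C++; note that the two-consecutive-uint32_t layout below is an assumption made for illustration, not the documented binary format of TilingData.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical mirror of the TilingData fields defined in add_custom_tiling.h.
struct TilingFields {
    uint32_t totalLength;
    uint32_t tileNum;
};

// Host side: serialize the fields into a raw buffer (what SaveToBuffer does conceptually).
std::vector<uint8_t> SerializeTiling(const TilingFields& t) {
    std::vector<uint8_t> buf(sizeof(TilingFields));
    std::memcpy(buf.data(), &t, sizeof(TilingFields));
    return buf;
}

// Kernel side: deserialize the buffer (what GET_TILING_DATA does conceptually).
TilingFields DeserializeTiling(const std::vector<uint8_t>& buf) {
    TilingFields t{};
    std::memcpy(&t, buf.data(), sizeof(TilingFields));
    return t;
}
```

The point of the sketch is the contract: whatever the host writes with set_totalLength/set_tileNum and SaveToBuffer, the kernel recovers unchanged on the device side.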
Operator Project Build and Deployment
Build the AddCustom project to generate the custom operator installation package, and install it into the operator library.
- Build the custom operator project to generate the custom operator package. Run the following command in the operator project directory AddCustom:
./build.sh
After the project is built successfully, an OPP runfile named custom_opp_<target os>_<target architecture>.run is generated in the build_out directory, for example, custom_opp_ubuntu_x86_64.run.
- Deploy the custom operator package.
In the directory of the custom operator package, run the following command to install the OPP:
./custom_opp_<target os>_<target architecture>.run
After the command is executed successfully, related files in the custom operator package are deployed to the vendors/customize directory of the OPP in the current environment. If multiple custom operator packages are deployed, run the following command to specify a path for installing them:
./custom_opp_<target os>_<target architecture>.run --install-path=<path>
Note: If the installation directory of the OPP is specified by configuring the --install-path parameter, run the source <path>/vendors/<vendor_name>/bin/set_env.bash command before using the custom operator. In the set_env.bash script, add the installation path of the custom OPP to the environment variable ASCEND_CUSTOM_OPP_PATH for the custom operator to take effect in the current environment.
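In effect, the set_env.bash script makes the custom OPP visible by adding its installation path to the ASCEND_CUSTOM_OPP_PATH environment variable. The path-list manipulation can be sketched in portable C++; the helper function is illustrative, and whether the script prepends or appends the new entry is an assumption here.

```cpp
#include <string>

// Add a custom OPP installation path to a colon-separated path list,
// mimicking what set_env.bash does for ASCEND_CUSTOM_OPP_PATH.
// Assumption: the new entry is prepended so it takes precedence.
std::string AddOppPath(const std::string& current, const std::string& installPath) {
    if (current.empty()) {
        return installPath;
    }
    return installPath + ":" + current;
}
```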
Directory structure after the deployment:
├── opp                                  // Operator library directory
│   ├── built-in                         // Directory of the built-in operators
│   ├── vendors                          // Directory of custom operators
│   │   ├── config.ini
│   │   ├── vendor_name1                 // Directory of the custom operators. If no vendor name is specified, the default value customize is used.
│   │   │   ├── framework                // Custom operator plugin library
│   │   │   ├── op_impl
│   │   │   │   └── ai_core
│   │   │   │       └── tbe
│   │   │   │           ├── config
│   │   │   │           │   └── ${soc_version}                       // Model of the Ascend AI Processor
│   │   │   │           │       └── aic-${soc_version}-ops-info.json // Custom operator information library file
│   │   │   │           ├── vendor_name1_impl                        // Implementation code of the custom operators
│   │   │   │           │   └── dynamic
│   │   │   │           │       ├── xx.cpp
│   │   │   │           │       └── xx.py
│   │   │   │           ├── kernel                                   // Custom operator binary files
│   │   │   │           │   └── ${soc_version}                       // Model of the Ascend AI Processor
│   │   │   │           │       └── config
│   │   │   │           └── op_tiling
│   │   │   │               ├── lib
│   │   │   │               └── liboptiling.so
│   │   │   └── op_proto                 // Custom operator prototype library
│   │   │       ├── inc
│   │   │       │   └── op_proto.h
│   │   │       └── lib
│   │   ├── vendor_name2                 // Custom operators deployed by vendor vendor_name2
Operator ST
The msOpST tool provided in the CANN Toolkit generates system test (ST) cases and runs them against your operators in the hardware environment.
This section uses the AddCustom operator as an example to describe the key execution process of the ST tool.
- Create the operator ST case definition file AddCustom_case.json and save it, for example, to the AddCustom_st directory at the same level as the AddCustom directory. The following is an example of the AddCustom_case.json file, which you can customize.
[
    {
        "case_name": "Test_AddCustom_001",
        "op": "AddCustom",
        "input_desc": [
            {
                "format": ["ND"],
                "type": ["float16"],
                "shape": [8, 2048],
                "data_distribute": ["uniform"],
                "value_range": [[0.1, 1.0]],
                "name": "x"
            },
            {
                "format": ["ND"],
                "type": ["float16"],
                "shape": [8, 2048],
                "data_distribute": ["uniform"],
                "value_range": [[0.1, 1.0]],
                "name": "y"
            }
        ],
        "output_desc": [
            {
                "format": ["ND"],
                "type": ["float16"],
                "shape": [8, 2048],
                "name": "z"
            }
        ]
    }
]
- Configure the environment variables on which the ST case execution depends. The following is an example. ${INSTALL_DIR} indicates the CANN software installation directory, for example, $HOME/Ascend/ascend-toolkit/latest. {arch-os} indicates the architecture and OS of the operating environment, for example, x86_64-linux or aarch64-linux.
export DDK_PATH=${INSTALL_DIR}
export NPU_HOST_LIB=${INSTALL_DIR}/{arch-os}/devlib
Note: Modify these environment variables based on the actual installation path of the CANN package.
- Go to the directory where msOpST is located and run the following command to generate and execute test cases.
- Go to the directory where msOpST is stored.
cd $HOME/Ascend/ascend-toolkit/latest/python/site-packages/bin
- Generate a test case file and execute it.
./msopst run -i $HOME/AddCustom_st/AddCustom_case.json -soc <soc_version> -out $HOME/AddCustom_st
- -i: path of the operator test case definition file (*.json). The path can be absolute or relative.
- -soc: model of the Ascend AI Processor. Replace it with the actual model version.
- -out: path of the generated file.
After the command is executed, information similar to the following is displayed:
------------------------------------------------------------------------
- test case count: 1
- success count: 1
- failed count: 0
------------------------------------------------------------------------
2023-08-28 20:20:40 (25058) - [INFO] Process finished!
2023-08-28 20:20:40 (25058) - [INFO] The st report saved in: xxxx/AddCustom_st/20230828202015/st_report.json.
You can also open the st_report.json file mentioned in the output to view the detailed running result.
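The ST run compares the device output against golden data generated on the host. For this Add case, a minimal host-side reference can be sketched as follows; plain float arithmetic is used for illustration, whereas real fp16 golden generation would also round each result to half precision.

```cpp
#include <cstddef>
#include <vector>

// Host-side reference for the AddCustom ST case: z[i] = x[i] + y[i].
std::vector<float> GoldenAdd(const std::vector<float>& x, const std::vector<float>& y) {
    std::vector<float> z(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        z[i] = x[i] + y[i];
    }
    return z;
}
```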