pyBind Calling
Introduction
When a model is trained or used for inference through the PyTorch framework, many operators are invoked for computation, and how these operators are invoked depends on how their kernels are built. For a custom operator project, the OP-Plugin operator plugin in the PyTorch Ascend Adapter is used to extend functionality so that Torch can directly call the operators in the custom operator package; for details, see PyTorch Framework. For operators developed with the open kernel launch programming method, the operator kernel implementation can be called from the PyTorch framework through pyBind adaptation.
pyBind is a library for integrating C++ code with the Python interpreter. Its working principle is to compile C++ code into a dynamic link library (DLL) or shared object (SO) file and use the APIs provided by pyBind to bind C++ functions, classes, and variables to the Python interpreter so that they can be used directly in Python code. In the kernel launch scenario, pyBind binds the operator kernel function and encapsulates it into a Python module, allowing Python code to interact with the operator kernel implementation.
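To illustrate this principle in isolation from the NPU sample, the following minimal pybind11 sketch binds an ordinary C++ function to a Python module. The module name demo and the function add are hypothetical and are not part of the sample project.

#include <pybind11/pybind11.h>

// Ordinary C++ function to be exposed to Python.
int add(int a, int b)
{
    return a + b;
}

// Defines a Python module named "demo". After this file is compiled into a
// shared object, Python code can run `import demo` and call demo.add(1, 2).
PYBIND11_MODULE(demo, m) {
    m.doc() = "minimal pybind11 example";        // optional module docstring
    m.def("add", &add, "Add two integers");      // bind the C++ function as demo.add
}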
In the pyBind calling method, the following APIs are used:
- c10_npu::getCurrentNPUStream: Obtains the current NPU stream. The return type is NPUStream. For details, see c10_npu::getCurrentNPUStream.
- ACLRT_LAUNCH_KERNEL: Same as the ACLRT_LAUNCH_KERNEL API in Kernel Launch Method.
Obtain the operator sample through the link provided in Project Directory below.
Environment Setup
Based on Environment Setup, you also need to install the following dependencies:
- Install PyTorch (version 2.1.0 for example).
# Install PyTorch in the AArch64 environment.
pip3 install torch==2.1.0
# Install PyTorch in the x86 environment.
pip3 install torch==2.1.0+cpu --index-url https://download.pytorch.org/whl/cpu
- Install torch-npu (PyTorch 2.1.0, Python 3.9, CANN 8.0.RC1.alpha002).
git clone https://gitee.com/ascend/pytorch.git -b v6.0.rc1.alpha002-pytorch2.1.0
cd pytorch/
bash ci/build.sh --python=3.9
pip3 install dist/*.whl
- Install pyBind11.
pip3 install pybind11
Project Directory
Click vector operator sample to obtain a complete sample of kernel function development and runtime verification. Structure of the sample directory:
├── CppExtensions
│   ├── add_custom_test.py    // Python calling script
│   ├── add_custom.cpp        // Operator implementation
│   ├── CMakeLists.txt        // Build project file
│   ├── pybind11.cpp          // pyBind11 function encapsulation
│   └── run.sh                // Script for compiling and running the operator
The operator development procedure based on this project is as follows:
- Complete the kernel implementation of the operator.
- Write the operator calling program and define the pyBind module in pybind11.cpp.
- Write the Python calling script add_custom_test.py, which generates the input data and ground-truth data, calls the encapsulated module, and verifies the result.
- Modify the run.sh script for compiling and running the operator as required, and execute it to compile and run the operator and verify the result.
Operator Kernel Implementation
Write the Ascend C operator kernel implementation file by referring to Vector Programming and the operator kernel implementation in the project directory.
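For orientation, the following condensed sketch shows the typical shape of such a kernel, modeled on the canonical Ascend C vector add example. It assumes the kernel entry takes (GM_ADDR x, GM_ADDR y, GM_ADDR z, uint32_t totalLength), matching the arguments passed to ACLRT_LAUNCH_KERNEL later in pybind11.cpp, and it omits the tiling and double buffering used by the real sample; exact API spellings may vary across CANN versions. Refer to add_custom.cpp in the sample for the authoritative implementation.

#include "kernel_operator.h"

class KernelAdd {
public:
    __aicore__ inline KernelAdd() {}
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, uint32_t totalLength)
    {
        // Each core handles a contiguous slice of the inputs. This sketch assumes
        // the slice fits in unified buffer (UB) memory in one pass.
        blockLength = totalLength / AscendC::GetBlockNum();
        xGm.SetGlobalBuffer((__gm__ half *)x + blockLength * AscendC::GetBlockIdx(), blockLength);
        yGm.SetGlobalBuffer((__gm__ half *)y + blockLength * AscendC::GetBlockIdx(), blockLength);
        zGm.SetGlobalBuffer((__gm__ half *)z + blockLength * AscendC::GetBlockIdx(), blockLength);
        pipe.InitBuffer(inQueueX, 1, blockLength * sizeof(half));
        pipe.InitBuffer(inQueueY, 1, blockLength * sizeof(half));
        pipe.InitBuffer(outQueueZ, 1, blockLength * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        // CopyIn: move this core's slice of x and y from global memory to UB.
        AscendC::LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
        AscendC::LocalTensor<half> yLocal = inQueueY.AllocTensor<half>();
        AscendC::DataCopy(xLocal, xGm, blockLength);
        AscendC::DataCopy(yLocal, yGm, blockLength);
        inQueueX.EnQue(xLocal);
        inQueueY.EnQue(yLocal);
        // Compute: element-wise addition on the vector unit.
        xLocal = inQueueX.DeQue<half>();
        yLocal = inQueueY.DeQue<half>();
        AscendC::LocalTensor<half> zLocal = outQueueZ.AllocTensor<half>();
        AscendC::Add(zLocal, xLocal, yLocal, blockLength);
        outQueueZ.EnQue(zLocal);
        inQueueX.FreeTensor(xLocal);
        inQueueY.FreeTensor(yLocal);
        // CopyOut: move the result back to global memory.
        zLocal = outQueueZ.DeQue<half>();
        AscendC::DataCopy(zGm, zLocal, blockLength);
        outQueueZ.FreeTensor(zLocal);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueX, inQueueY;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueZ;
    AscendC::GlobalTensor<half> xGm, yGm, zGm;
    uint32_t blockLength = 0;
};

// Kernel entry. The parameter list matches the arguments passed by
// ACLRT_LAUNCH_KERNEL(add_custom)(...) in pybind11.cpp below.
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, uint32_t totalLength)
{
    KernelAdd op;
    op.Init(x, y, z, totalLength);
    op.Process();
}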
Operator Calling Program and pyBind Module Definition
The following code uses the add_custom operator as an example to describe how to write the pybind11.cpp file. When implementing your own calling program, adjust for the differences in your operator kernel function, including its name and its input and output parameters; the way the relevant APIs are called can be reused directly.
- Include the required header files. Note that the header file aclrtlaunch_{kernel_name}.h (automatically generated by the project framework), which declares the corresponding kernel launch API, must be included. kernel_name indicates the name of the operator kernel function.
#include <pybind11/pybind11.h>
#include <torch/extension.h>
#include "aclrtlaunch_add_custom.h"
#include "torch_npu/csrc/core/npu/NPUStream.h"
- Write the application framework. Note that the memory of x and y in this example is allocated in the Python calling script add_custom_test.py.
namespace my_add {
at::Tensor run_add_custom(const at::Tensor &x, const at::Tensor &y)
{
}
}
- Implement the NPU-side execution. Call ACLRT_LAUNCH_KERNEL to launch the operator kernel function and complete the specified operation. The following code forms the body of run_add_custom.
// Allocate resources and obtain the stream on the current NPU by calling the c10_npu::getCurrentNPUStream() function.
auto acl_stream = c10_npu::getCurrentNPUStream().stream(false);
// Allocate the output buffer on the device.
at::Tensor z = at::empty_like(x);
uint32_t blockDim = 8;
uint32_t totalLength = 1;
for (uint32_t size : x.sizes()) {
    totalLength *= size;
}
// Call ACLRT_LAUNCH_KERNEL to use the kernel function to complete the specified operation.
ACLRT_LAUNCH_KERNEL(add_custom)(blockDim, acl_stream,
                                const_cast<void *>(x.storage().data()),
                                const_cast<void *>(y.storage().data()),
                                const_cast<void *>(z.storage().data()),
                                totalLength);
// Copy the compute result from the device to the host and free the allocated resources.
return z;
- Define the pyBind module to encapsulate C++ functions into Python functions. PYBIND11_MODULE is a macro in the pyBind11 library and is used to define a Python module. It takes two parameters. The first parameter is the encapsulated module name, and the second parameter is a pyBind11 module object, which is used to define functions, classes, and constants in the module. By calling the m.def() method, you can convert the my_add::run_add_custom() function in step 2 into the Python function run_add_custom so that it can be called in Python code.
PYBIND11_MODULE(add_custom, m) {                            // add_custom: module name; m: module object
    m.doc() = "add_custom pybind11 interfaces";             // optional module docstring
    m.def("run_add_custom", &my_add::run_add_custom, "");   // Bind the run_add_custom function to the pyBind module.
}
Python Calling Script
import torch
import torch_npu
from torch_npu.testing.testcase import TestCase, run_tests
import sys, os
sys.path.append(os.getcwd())
import add_custom

torch.npu.config.allow_internal_format = False


class TestCustomAdd(TestCase):
    def test_add_custom_ops(self):
        # Allocate the input buffers on the host and initialize the data.
        length = [8, 2048]
        x = torch.rand(length, device='cpu', dtype=torch.float16)
        y = torch.rand(length, device='cpu', dtype=torch.float16)
        # Allocate the input buffers on the device and copy data from the host to the device.
        x_npu = x.npu()
        y_npu = y.npu()
        output = add_custom.run_add_custom(x_npu, y_npu)
        cpuout = torch.add(x, y)
        self.assertRtolEqual(output, cpuout)


if __name__ == "__main__":
    run_tests()
Compiling the CMake Build Configuration File
Generally, you do not need to modify the compilation configuration files, but understanding these files can help you better understand the principles of compilation and customize CMake as needed. For details, see Compiling the CMake Build Configuration File.
Modifying and Executing the Script for One-Click Compilation and Running
You can refer to the one-click script run.sh provided in the sample to quickly compile and run the Ascend C operator on the NPU; the script completes compilation and execution in a single step.
The one-click compilation and running script provided in the sample does not apply to all operator runtime verification scenarios. Modify the script based on the actual situation.
- Write your own input-data and ground-truth generation logic according to the algorithm of your Ascend C operator.
After writing the preceding files, run the script to compile and run the operator in one step.
bash run.sh --soc-version=<soc_version>
bash run.sh -v <soc_version>
| Parameter | Abbreviation | Description |
|---|---|---|
| --soc-version | -v | Model of the AI processor on which the operator runs. Specify one of the supported AI processor models; if you are unsure of the model, query it in your environment before running the script. |
