pyBind Calling

Introduction

When training or running inference with the PyTorch framework, many operators are called for computation, and the way these operators are called depends on how the kernels are compiled. For custom operator projects, the OP-Plugin operator plugin in PyTorch Ascend Adapter is used to extend functionality so that Torch can directly call the operators in the custom operator package. For details, see PyTorch Framework. For the kernel launch open programming method, the operator kernel implementation can be called by the PyTorch framework through pyBind adaptation.

pyBind is a library for integrating C++ code with the Python interpreter. Its working principle is to compile C++ code into a dynamic link library (DLL) or shared object (SO) file and use the APIs provided by pyBind to bind C++ functions, classes, and variables to the Python interpreter, so that they can be used from Python code. In the kernel launch scenario, pyBind binds the operator kernel function and encapsulates it into a Python module, enabling Python code to call the operator kernel function directly.

In the pyBind calling method, the ACLRT_LAUNCH_KERNEL API is used to launch the operator kernel function, and c10_npu::getCurrentNPUStream() is used to obtain the current NPU stream.

A complete operator sample is provided; see Project Directory below for how to obtain it.

Environment Setup

Based on Environment Setup, you also need to install the following dependencies:

  • Install PyTorch (version 2.1.0 for example).
    # Install PyTorch in the AArch64 environment.
    pip3 install torch==2.1.0
    # Install PyTorch in the x86 environment.
    pip3 install torch==2.1.0+cpu --index-url https://download.pytorch.org/whl/cpu
  • Install torch-npu (using PyTorch 2.1.0, Python 3.9, and CANN 8.0.RC1.alpha002 as an example).
     git clone https://gitee.com/ascend/pytorch.git -b v6.0.rc1.alpha002-pytorch2.1.0
     cd pytorch/
     bash ci/build.sh --python=3.9
     pip3 install dist/*.whl
  • Install pyBind11.
     pip3 install pybind11
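After installing the dependencies, a quick way to confirm that they are visible to the current interpreter is a small check like the following (a convenience sketch, not part of the sample project; torch_npu additionally requires a matching CANN installation at runtime):

```python
import importlib.util

# Check whether each dependency can be located by the current interpreter.
status = {
    mod: importlib.util.find_spec(mod) is not None
    for mod in ("torch", "torch_npu", "pybind11")
}
for mod, found in status.items():
    print(f"{mod}: {'found' if found else 'missing'}")
```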

Project Directory

Click vector operator sample to obtain a complete sample of kernel function development and runtime verification. Structure of the sample directory:

├── CppExtensions 
│   ├── add_custom_test.py      // Python calling script
│   ├── add_custom.cpp          // Operator implementation
│   ├── CMakeLists.txt          // Build project file
│   ├── pybind11.cpp            // pyBind11 function encapsulation
│   └── run.sh                  // Script for compiling and running the operator

Operator development procedure based on the operator project:

  • Complete the kernel implementation of the operator.
  • Write the operator calling application and pyBind module definition file pybind11.cpp.
  • Write the Python calling script add_custom_test.py, which generates the input and golden data, calls the encapsulated module, and verifies the result.
  • Write the CMake build configuration file CMakeLists.txt.
  • Modify the run.sh script for compiling and running the operator as required, then execute it to compile and run the operator and verify the result.

Operator Kernel Implementation

Write the Ascend C operator kernel implementation file by referring to Vector Programming and the operator kernel implementation in the project directory.

Operator Calling Program and pyBind Module Definition

The following description uses the add_custom operator as an example to explain how to write the pybind11.cpp file. When implementing your own application, pay attention to the differences arising from your operator kernel function, including its name and its input and output parameters. The way the relevant APIs are called can be reused directly.

  1. Include the header files as required. Note that the header file aclrtlaunch_{kernel_name}.h (automatically generated by the project framework), which declares the corresponding kernel function calling API, must be included. {kernel_name} indicates the name of the operator kernel function.
    #include <pybind11/pybind11.h>
    #include <torch/extension.h>
    
    #include "aclrtlaunch_add_custom.h"
    #include "torch_npu/csrc/core/npu/NPUStream.h"
    
  2. Write the application framework. Note that the memory of x and y in this example is allocated in the Python calling script add_custom_test.py.
    namespace my_add {
    at::Tensor run_add_custom(const at::Tensor &x, const at::Tensor &y) {
        // The NPU launch logic described in step 3 goes here.
    }
    }
    
  3. Run the operator on the NPU. Call ACLRT_LAUNCH_KERNEL to launch the operator kernel function to complete the specified operation.
      // Allocate resources and obtain the streams on the current NPU by calling the c10_npu::getCurrentNPUStream() function.
      auto acl_stream = c10_npu::getCurrentNPUStream().stream(false);
      // Allocate the output buffer on the device.
      at::Tensor z = at::empty_like(x);
      uint32_t blockDim = 8;
      uint32_t totalLength = 1;
      for (uint32_t size : x.sizes()) {
        totalLength *= size;
      }
      // Call ACLRT_LAUNCH_KERNEL to use the kernel function to complete the specified operation.
      ACLRT_LAUNCH_KERNEL(add_custom)(blockDim, acl_stream, 
                                      const_cast<void *>(x.storage().data()),
                                      const_cast<void *>(y.storage().data()),
                                      const_cast<void *>(z.storage().data()), 
                                      totalLength);
      // Return the output tensor. It resides on the device; the Python caller
      // copies the result back to the host if needed.
      return z;
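The totalLength value above is simply the total element count of the input tensor, obtained by multiplying the sizes of all its dimensions. For the sample's [8, 2048] input, the same flattening done by the C++ loop over x.sizes() works out as follows (illustrative Python arithmetic):

```python
from math import prod

shape = (8, 2048)           # shape of the sample's input tensors
total_length = prod(shape)  # same flattening as the C++ loop over x.sizes()
print(total_length)         # 16384
```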
    
  4. Define the pyBind module to encapsulate C++ functions into Python functions. PYBIND11_MODULE is a macro in the pyBind11 library and is used to define a Python module. It takes two parameters. The first parameter is the encapsulated module name, and the second parameter is a pyBind11 module object, which is used to define functions, classes, and constants in the module. By calling the m.def() method, you can convert the my_add::run_add_custom() function in step 2 into the Python function run_add_custom so that it can be called in Python code.
    PYBIND11_MODULE(add_custom, m) { // add_custom: module name; m: module object.
      m.doc() = "add_custom pybind11 interfaces";  // optional module docstring
      m.def("run_add_custom", &my_add::run_add_custom, ""); // Bind the run_add_custom function to the pyBind module.
    }
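Once compiled, the bound functions behave like ordinary attributes of a Python module. As a rough illustration of the resulting interface (a pure-Python stand-in, not the real compiled extension, and using plain lists instead of tensors):

```python
import types

# Hypothetical stand-in that mimics the shape of the module produced by
# PYBIND11_MODULE(add_custom, m); the real module wraps the NPU kernel launch.
add_custom = types.ModuleType("add_custom")
add_custom.__doc__ = "add_custom pybind11 interfaces"

def run_add_custom(x, y):
    # Mirrors only the bound signature: two inputs in, one output out.
    return [a + b for a, b in zip(x, y)]

add_custom.run_add_custom = run_add_custom

print(add_custom.run_add_custom([1.0, 2.0], [3.0, 4.0]))  # [4.0, 6.0]
```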
    

Python Calling Script

In the Python calling script, use the Torch API to generate random input data and allocate memory, import the encapsulated custom module add_custom, call the run_add_custom function in add_custom, and execute the operator on the NPU. For details about how the operator kernel function is verified on the NPU, see Figure 1 NPU-side operation verification principle.
Figure 1 NPU-side operation verification principle
import torch
import torch_npu
from torch_npu.testing.testcase import TestCase, run_tests
import sys, os
sys.path.append(os.getcwd())
import add_custom
torch.npu.config.allow_internal_format = False
class TestCustomAdd(TestCase):
    def test_add_custom_ops(self):
        # Allocate the input buffer on the host and initialize the data.
        length = [8, 2048]
        x = torch.rand(length, device='cpu', dtype=torch.float16)
        y = torch.rand(length, device='cpu', dtype=torch.float16)
        # Allocate the input buffer on the device and copy data from the host to the device.
        x_npu = x.npu()
        y_npu = y.npu()
        output = add_custom.run_add_custom(x_npu, y_npu)
        cpuout = torch.add(x, y)
        self.assertRtolEqual(output, cpuout)
if __name__ == "__main__":
    run_tests()
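assertRtolEqual checks that the NPU output matches the CPU reference within a relative tolerance. Conceptually, the comparison is similar to the following simplified sketch (not torch_npu's actual implementation; the 1e-3 tolerance and the tolerance floor are illustrative):

```python
def rtol_equal(actual, expected, rtol=1e-3):
    # Element-wise relative-tolerance comparison (simplified: flat lists,
    # with a floor of 1.0 on the scale so tiny expected values don't blow up).
    return all(abs(a - e) <= rtol * max(abs(e), 1.0)
               for a, e in zip(actual, expected))

print(rtol_equal([1.0001, 2.0], [1.0, 2.0]))  # True
print(rtol_equal([1.1, 2.0], [1.0, 2.0]))     # False
```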

Compiling the CMake Build Configuration File

Generally, you do not need to modify the compilation configuration files, but understanding these files can help you better understand the principles of compilation and customize CMake as needed. For details, see Compiling the CMake Build Configuration File.

Modifying and Executing the Script for One-Click Compilation and Running

You can refer to the one-click script run.sh provided in the sample to quickly compile and run the Ascend C operator on the NPU. The one-click compilation and running script provides the following functions.

Figure 2 Process of operator compiling and running in one-click mode

The one-click compilation and running script provided in the sample does not apply to all operator runtime verification scenarios. Modify the script based on the actual situation.

  • Write your own script for generating the input and golden data based on the algorithm of your Ascend C operator.

After preparing the preceding files, you can run the script for one-click compilation and running.

Execute the script in either of the following ways; Table 1 describes the parameters:
bash run.sh --soc-version=<soc_version>
bash run.sh -v <soc_version>
Table 1 Script parameters

  Parameter        Abbreviation   Description
  --soc-version    -v             Model of the AI processor where the operator runs.
                                  NOTE: Obtain the AI processor model by running the
                                  npu-smi info command on the server where the Ascend AI
                                  Processor is installed and checking the Chip Name field.
                                  The actual value is "Ascend" followed by Chip Name; for
                                  example, if Chip Name is xxxyy, the value is Ascendxxxyy.
                                  The following models are supported:
                                    • Atlas Training Series Product