PyTorch
When PyTorch is used for model training and inference, many operators are called for computation. To integrate and deploy your custom operators in PyTorch, the following methods are available:
- Kernel launch: By registering custom operators through torch.library or Pybind, you enable PyTorch to call the operator kernel program.
- Single-operator API calling: For the adaptation plugin development process and examples for this mode, see "OpPlugin-based Operator Adaptation" in Ascend Extension for PyTorch Feature Guide.
- Graph mode calling: For the guide to adapting custom operators to PyTorch graph mode, see "Integrating Custom Operators into a Graph" in PyTorch Graph Mode User Guide (TorchAir).

This section describes how to register a custom operator through torch.library and Pybind and enable PyTorch to call the operator kernel program.
- torch.library provides a set of APIs for extending PyTorch's core operator library. It allows you to create new operators and provide custom implementations for them.
- Pybind is an open-source tool that bridges C++ and Python, enabling seamless integration of C++ code into the Python environment.
Pybind quickly exposes C++ functions to Python, which makes API binding efficient. However, functions exposed this way are not registered with PyTorch's operator system: they have no schema and cannot be graph-traced, so they do not work with torch.compile. In contrast, torch.library integrates deeply with PyTorch's core operator system and supports operator registration, schema definition, and graph tracing, which are prerequisites for torch.compile. Choose the method that best fits your needs.
torch.library
The following code uses the add_custom operator as an example to describe how to call the operator kernel program through torch.library. Only the core steps are described here. For details about the complete sample, see torch.library sample.
- Set up the environment. In addition to installing the CANN software package by referring to Environment Setup, you also need to install the following dependencies:
- Implement the custom operator on the NPU.
Implement the operator on the kernel side, then launch its kernel function with the <<<>>> kernel call operator to perform the computation. The sample calls the c10_npu::getCurrentNPUStream API to obtain the current NPU stream; the return type is NPUStream. For details about how to use this API, see section "(beta) c10_npu::getCurrentNPUStream" in Ascend Extension for PyTorch Custom API Reference.
Note that the memory for inputs x and y in this sample is allocated in the outer Python calling script.
    namespace ascendc_ops {
    at::Tensor ascendc_add(const at::Tensor &x, const at::Tensor &y)
    {
        // Allocate runtime resources and obtain the current NPU stream by calling c10_npu::getCurrentNPUStream().
        auto aclStream = c10_npu::getCurrentNPUStream().stream(false);
        // Allocate the output buffer on the device.
        at::Tensor z = at::empty_like(x);
        uint32_t blockDim = 8;
        uint32_t totalLength = 1;
        for (uint32_t size : x.sizes()) {
            totalLength *= size;
        }
        // Call the kernel function using <<<>>> to complete the specified computation.
        auto xGm = static_cast<uint8_t *>(const_cast<void *>(x.storage().data()));
        auto yGm = static_cast<uint8_t *>(const_cast<void *>(y.storage().data()));
        auto zGm = static_cast<uint8_t *>(const_cast<void *>(z.storage().data()));
        add_custom<<<blockDim, nullptr, aclStream>>>(xGm, yGm, zGm, totalLength);
        // Return the output tensor. It stays on the device; no device-to-host copy happens here.
        return z;
    }
    } // namespace ascendc_ops
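The wrapper above passes the kernel a flat element count taken from x alone and stores it in a uint32_t, so it implicitly assumes that y has the same shape and that the count fits in 32 bits. The following is a minimal sketch of a caller-side pre-check in Python. The helper name checked_total_length is hypothetical and is not part of the sample.

```python
import math

UINT32_MAX = 2**32 - 1  # totalLength is a uint32_t in the C++ wrapper


def checked_total_length(x_shape, y_shape):
    """Hypothetical pre-check that mirrors the wrapper's element-count loop."""
    if tuple(x_shape) != tuple(y_shape):
        # The wrapper derives the length from x only, so y must match it.
        raise ValueError(f"shape mismatch: {tuple(x_shape)} vs {tuple(y_shape)}")
    total = math.prod(x_shape)  # same product the C++ loop accumulates
    if total > UINT32_MAX:
        raise OverflowError(f"{total} elements do not fit in uint32_t")
    return total


print(checked_total_length((8, 2048), (8, 2048)))  # 16384
```

Because the wrapper reads raw storage pointers, such a check is most useful when combined with passing contiguous tensors from the calling script.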
- Register the custom operator. PyTorch provides the TORCH_LIBRARY macro as the core API for registering custom operators. This macro is used to create and initialize a custom operator library. After registration, the custom operator can be called in Python using torch.ops.namespace.op_name. TORCH_LIBRARY_IMPL is used to bind the operator logic to a specific DispatchKey (PyTorch's device scheduling identifier). For NPU devices, the operator implementation needs to be registered with the DispatchKey PrivateUse1.
    // Register the operator with torch.library.
    TORCH_LIBRARY(ascendc_ops, m)
    {
        m.def("ascendc_add(Tensor x, Tensor y) -> Tensor");
    }

    // Register the implementation with PrivateUse1 on the NPU device.
    TORCH_LIBRARY_IMPL(ascendc_ops, PrivateUse1, m)
    {
        m.impl("ascendc_add", TORCH_FN(ascendc_ops::ascendc_add));
    }
- Compile and generate a dynamic operator library.
- Use the Python test script to perform the test.
In add_custom_test.py, the generated custom operator library is loaded using torch.ops.load_library, the registered ascendc_add function is called, and the output on the NPU is compared with the CPU's standard addition result to verify the numerical correctness of the custom operator.
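The test flow described above can be sketched as follows. This is an illustrative outline, not the contents of add_custom_test.py: the function name verify_ascendc_add, the library path, the float16 dtype, the tensor shape, and the tolerances are all assumptions. The function returns None when the library file or torch_npu is not available, so it can be run safely on a machine without an NPU.

```python
import importlib.util
import os


def verify_ascendc_add(lib_path, shape=(8, 2048)):
    """Return True/False for the comparison, or None if it cannot run here."""
    if not os.path.exists(lib_path) or importlib.util.find_spec("torch_npu") is None:
        return None  # library not built, or no Ascend software stack installed

    import torch
    import torch_npu  # noqa: F401  (importing it enables the NPU device)

    # Load the dynamic library so that TORCH_LIBRARY registration runs.
    torch.ops.load_library(lib_path)

    # Inputs are created on the CPU; float16 is an assumed element type.
    x = torch.rand(shape, dtype=torch.float16)
    y = torch.rand(shape, dtype=torch.float16)

    # Call the registered operator on NPU copies and bring the result back.
    out = torch.ops.ascendc_ops.ascendc_add(x.npu(), y.npu()).cpu()

    # Compare with the CPU reference addition (tolerances are assumptions).
    return bool(torch.allclose(out, x + y, rtol=1e-3, atol=1e-3))


if __name__ == "__main__":
    # Hypothetical output path of the build step.
    print(verify_ascendc_add("./libascendc_ops.so"))
```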
Pybind
The following code uses the add_custom operator as an example to describe how to call a custom operator in a PyTorch script using Pybind. Only the core steps are described here. For details about the complete sample, see Pybind sample.
- Set up the environment. In addition to installing the CANN software package by referring to Environment Setup, you also need to install the following dependencies:
- pybind11
pip3 install pybind11
- Implement the custom operator on the NPU.
Implement the operator on the kernel side, then launch its kernel function with the <<<>>> kernel call operator to perform the computation. The sample calls the c10_npu::getCurrentNPUStream API to obtain the current NPU stream; the return type is NPUStream. For details about how to use this API, see section "(beta) c10_npu::getCurrentNPUStream" in Ascend Extension for PyTorch Custom API Reference.
Note that the memory for inputs x and y in this sample is allocated in the Python calling script.
    // Header files required for calling Pybind and PyTorch
    #include <pybind11/pybind11.h>
    #include <torch/extension.h>
    #include "torch_npu/csrc/core/npu/NPUStream.h"
    // Header file required for implementation on the kernel
    #include "kernel_operator.h"
    ...
    namespace my_add {
    at::Tensor run_add_custom(const at::Tensor &x, const at::Tensor &y)
    {
        // Allocate runtime resources and obtain the current NPU stream by calling c10_npu::getCurrentNPUStream().
        auto aclStream = c10_npu::getCurrentNPUStream().stream(false);
        // Allocate the output buffer on the device.
        at::Tensor z = at::empty_like(x);
        uint32_t blockDim = 8;
        uint32_t totalLength = 1;
        for (uint32_t size : x.sizes()) {
            totalLength *= size;
        }
        // Call the kernel function using <<<>>> to complete the specified computation.
        auto xGm = static_cast<uint8_t *>(const_cast<void *>(x.storage().data()));
        auto yGm = static_cast<uint8_t *>(const_cast<void *>(y.storage().data()));
        auto zGm = static_cast<uint8_t *>(const_cast<void *>(z.storage().data()));
        add_custom<<<blockDim, nullptr, aclStream>>>(xGm, yGm, zGm, totalLength);
        // Return the output tensor. It stays on the device; no device-to-host copy happens here.
        return z;
    }
    } // namespace my_add
- Define a Pybind module to encapsulate C++ functions into Python functions. PYBIND11_MODULE is a macro in pybind11 and is used to define a Python module. It takes two parameters. The first parameter is the encapsulated module name, and the second parameter is a pybind11 module object, which is used to define functions, classes, constants, and other entities in the module. By calling the m.def() method, you can convert the my_add::run_add_custom() function from the previous step into the Python function run_add_custom, so that it can be called in Python code.
    PYBIND11_MODULE(add_custom, m) { // add_custom: module name; m: module object.
        m.doc() = "add_custom pybind11 interfaces"; // optional module docstring
        m.def("run_add_custom", &my_add::run_add_custom, ""); // Bind the run_add_custom function to the Pybind module.
    }
- Compile and generate a dynamic operator library.
- In the Python calling script, use the Torch API to generate random input data and allocate memory. Then import the encapsulated custom module add_custom, call the run_add_custom function in add_custom, and execute the operator on the NPU.
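A calling script along these lines might look like the sketch below. It is not the script from the Pybind sample: the function name run_pybind_add, the module directory, the float16 dtype, the tensor shape, and the tolerances are assumptions. Only the module name add_custom and the function run_add_custom come from the binding code above. The function returns None when the compiled module or torch_npu cannot be found.

```python
import importlib.util
import sys


def run_pybind_add(module_dir, shape=(8, 2048)):
    """Call add_custom.run_add_custom on the NPU and compare with CPU addition."""
    # Make the directory that holds the compiled add_custom module importable.
    if module_dir not in sys.path:
        sys.path.insert(0, module_dir)
    if (importlib.util.find_spec("add_custom") is None
            or importlib.util.find_spec("torch_npu") is None):
        return None  # module not built, or no Ascend software stack installed

    import torch  # import torch before the extension that links against it
    import torch_npu  # noqa: F401  (importing it enables the NPU device)
    import add_custom  # module name defined by PYBIND11_MODULE(add_custom, m)

    # Generate random inputs with the Torch API; float16 is an assumed dtype.
    x = torch.rand(shape, dtype=torch.float16)
    y = torch.rand(shape, dtype=torch.float16)

    # Execute the operator on the NPU and copy the result back to the host.
    out = add_custom.run_add_custom(x.npu(), y.npu()).cpu()

    # Tolerances are assumptions chosen for half precision.
    return bool(torch.allclose(out, x + y, rtol=1e-3, atol=1e-3))


if __name__ == "__main__":
    # Hypothetical build output directory.
    print(run_pybind_add("./build"))
```

Unlike the torch.library route, the call goes through the imported module rather than torch.ops, which reflects that the function is not registered with PyTorch's operator system.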