Sample Code for Calling NN or Fused Operator APIs

This section describes how to call, compile, and run the NN operator and fusion operator APIs in single-operator API execution mode.

Principles

For details about the NN operator and fusion operator in single-operator API execution mode, see Single-Operator Calling Modes.

Each operator API of this type is split into two phases, as shown below.

aclnnStatus aclnnXxxGetWorkspaceSize(const aclTensor *src, ..., aclTensor *out, ..., uint64_t *workspaceSize, aclOpExecutor **executor);
aclnnStatus aclnnXxx(void *workspace, uint64_t workspaceSize, aclOpExecutor *executor, aclrtStream stream);

Call the first API, aclnnXxxGetWorkspaceSize, to query the workspace size required by this call. Allocate NPU memory of that size, and then call the second API, aclnnXxx, to perform the computation. Xxx is the operator name, for example, Add.

For details about all NN and fusion operator APIs provided by CANN, see Group Management.

For details about the two-segment API call sequence, see API Call Sequence of Single-Operator API Execution.

Sample Code

The following uses the Add operator as an example to describe the basic logic of calling the two-phase operator APIs. The process of calling other operators is similar. Modify the code based on your actual needs.

The sample code is for reference only. For details about the operator calling example, see Group Management.

Note that the operator must be built and run on a supported product model; otherwise, the call fails.

The Add operator performs element-wise tensor addition. The calculation formula is y = x1 + α × x2. You can use the following sample code for reference and name the file test_add.cpp:

#include <iostream>
#include <vector>
#include "acl/acl.h"
#include "aclnnop/aclnn_add.h"

#define CHECK_RET(cond, return_expr) \
  do {                               \
    if (!(cond)) {                   \
      return_expr;                   \
    }                                \
  } while (0)

#define LOG_PRINT(message, ...)     \
  do {                              \
    printf(message, ##__VA_ARGS__); \
  } while (0)

int64_t GetShapeSize(const std::vector<int64_t>& shape) {
  int64_t shape_size = 1;
  for (auto i : shape) {
    shape_size *= i;
  }
  return shape_size;
}

int Init(int32_t deviceId, aclrtStream* stream) {
  // (Fixed writing) Initialize AscendCL.
  auto ret = aclInit(nullptr);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclInit failed. ERROR: %d\n", ret); return ret);
  ret = aclrtSetDevice(deviceId);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtSetDevice failed. ERROR: %d\n", ret); return ret);
  ret = aclrtCreateStream(stream);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtCreateStream failed. ERROR: %d\n", ret); return ret);
  return 0;
}

template <typename T>
int CreateAclTensor(const std::vector<T>& hostData, const std::vector<int64_t>& shape, void** deviceAddr,
                    aclDataType dataType, aclTensor** tensor) {
  auto size = GetShapeSize(shape) * sizeof(T);
  // Call aclrtMalloc to allocate memory on the device.
  auto ret = aclrtMalloc(deviceAddr, size, ACL_MEM_MALLOC_HUGE_FIRST);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMalloc failed. ERROR: %d\n", ret); return ret);
  // Call aclrtMemcpy to copy the data on the host to the memory on the device.
  ret = aclrtMemcpy(*deviceAddr, size, hostData.data(), size, ACL_MEMCPY_HOST_TO_DEVICE);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMemcpy failed. ERROR: %d\n", ret); return ret);
  // Compute the strides of consecutive tensors.
  std::vector<int64_t> strides(shape.size(), 1);
  for (int64_t i = shape.size() - 2; i >= 0; i--) {
    strides[i] = shape[i + 1] * strides[i + 1];
  }
  // Call aclCreateTensor to create an ACL tensor.
  *tensor = aclCreateTensor(shape.data(), shape.size(), dataType, strides.data(), 0, aclFormat::ACL_FORMAT_ND,
                            shape.data(), shape.size(), *deviceAddr);
  return 0;
}

int main() {
  // 1. (Fixed writing) Initialize the device and stream. For details, see the list of external AscendCL APIs.
  // Set deviceId based on the actual device.
  int32_t deviceId = 0;
  aclrtStream stream;
  auto ret = Init(deviceId, &stream);
  // Use check as required.
  CHECK_RET(ret == 0, LOG_PRINT("Init acl failed. ERROR: %d\n", ret); return ret);

  // 2. Construct the input and output based on the API.
  std::vector<int64_t> selfShape = {4, 2};
  std::vector<int64_t> otherShape = {4, 2};
  std::vector<int64_t> outShape = {4, 2};
  void* selfDeviceAddr = nullptr;
  void* otherDeviceAddr = nullptr;
  void* outDeviceAddr = nullptr;
  aclTensor* self = nullptr;
  aclTensor* other = nullptr;
  aclScalar* alpha = nullptr;
  aclTensor* out = nullptr;
  std::vector<float> selfHostData = {0, 1, 2, 3, 4, 5, 6, 7};
  std::vector<float> otherHostData = {1, 1, 1, 2, 2, 2, 3, 3};
  std::vector<float> outHostData = {0, 0, 0, 0, 0, 0, 0, 0};
  float alphaValue = 1.2f;
  // Create a self aclTensor.
  ret = CreateAclTensor(selfHostData, selfShape, &selfDeviceAddr, aclDataType::ACL_FLOAT, &self);
  CHECK_RET(ret == ACL_SUCCESS, return ret);
  // Create other aclTensor.
  ret = CreateAclTensor(otherHostData, otherShape, &otherDeviceAddr, aclDataType::ACL_FLOAT, &other);
  CHECK_RET(ret == ACL_SUCCESS, return ret);
  // Create the alpha aclScalar.
  alpha = aclCreateScalar(&alphaValue, aclDataType::ACL_FLOAT);
  CHECK_RET(alpha != nullptr, return ret);
  // Create out aclTensor.
  ret = CreateAclTensor(outHostData, outShape, &outDeviceAddr, aclDataType::ACL_FLOAT, &out);
  CHECK_RET(ret == ACL_SUCCESS, return ret);

  // 3.  Call the CANN operator library API, which needs to be changed to a specific operator API.
  uint64_t workspaceSize = 0;
  aclOpExecutor* executor;
  // Call the first-phase API of aclnnAdd.
  ret = aclnnAddGetWorkspaceSize(self, other, alpha, out, &workspaceSize, &executor);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclnnAddGetWorkspaceSize failed. ERROR: %d\n", ret); return ret);
  // Allocate device memory based on workspaceSize calculated by the first-phase API.
  void* workspaceAddr = nullptr;
  if (workspaceSize > 0) {
    ret = aclrtMalloc(&workspaceAddr, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("allocate workspace failed. ERROR: %d\n", ret); return ret);
  }
  // Call the second-phase API of aclnnAdd.
  ret = aclnnAdd(workspaceAddr, workspaceSize, executor, stream);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclnnAdd failed. ERROR: %d\n", ret); return ret);

  // 4. Wait until the task execution is complete. This code is written in a fixed format.
  ret = aclrtSynchronizeStream(stream);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtSynchronizeStream failed. ERROR: %d\n", ret); return ret);

  // 5. Obtain the output value and copy the result from the device memory to the host. Modify the configuration based on the API definition.
  auto size = GetShapeSize(outShape);
  std::vector<float> resultData(size, 0);
  ret = aclrtMemcpy(resultData.data(), resultData.size() * sizeof(resultData[0]), outDeviceAddr, size * sizeof(float),
                    ACL_MEMCPY_DEVICE_TO_HOST);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("copy result from device to host failed. ERROR: %d\n", ret); return ret);
  for (int64_t i = 0; i < size; i++) {
    LOG_PRINT("result[%ld] is: %f\n", i, resultData[i]);
  }

  // 6. Release aclTensor and aclScalar. Modify the configuration based on the API definition.
  aclDestroyTensor(self);
  aclDestroyTensor(other);
  aclDestroyScalar(alpha);
  aclDestroyTensor(out);
 
  // 7. Release device resources. Modify the configuration based on the API definition.
  aclrtFree(selfDeviceAddr);
  aclrtFree(otherDeviceAddr);
  aclrtFree(outDeviceAddr);
  if (workspaceSize > 0) {
    aclrtFree(workspaceAddr);
  }
  aclrtDestroyStream(stream);
  aclrtResetDevice(deviceId);
  aclFinalize();
  return 0;
}

CMakeLists File (DLL)

Take the compilation of the Add operator as an example. The CMake file is defined as follows. For details about the dependent dynamic library files, see Dependent Header Files and Library Files. Modify the files as required.

  • The operators that implement fusion and parallelism of collective communication and MatMul computation are called MC2 operators, such as AllGatherMatmul, AlltoAllAllGatherBatchMatMul, BatchMatMulReduceScatterAlltoAll, MatMulAllReduce, MatMulAllReduceAddRmsNorm, and MatMulReduceScatter.
  • When an MC2 operator API is called, multithreading and the Huawei Collective Communication Library (HCCL) are involved. Therefore, the following content must be added to the CMakeLists file; otherwise, the build fails.
    # Set the link library file paths.
    find_package(Threads REQUIRED)                        
    target_link_libraries(opapi_test PRIVATE
                          ${ASCEND_PATH}/lib64/libascendcl.so
                          ${ASCEND_PATH}/lib64/libnnopbase.so
                          ${ASCEND_PATH}/lib64/libopapi.so
                          ${ASCEND_PATH}/lib64/libhccl.so      # The HCCL file
                          ${CMAKE_THREAD_LIBS_INIT})            # The library file on which multithreading depends

    find_package(Threads REQUIRED) is a command used by CMake to search for the thread library. It can automatically link the header files that the thread library depends on and other library files that the thread library indirectly depends on.

# CMake lowest version requirement
cmake_minimum_required(VERSION 3.14)

# Set the project name.
project(ACLNN_EXAMPLE)

# Compile options
add_compile_options(-std=c++11)

# Set compilation options.
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY  "./bin")    
set(CMAKE_CXX_FLAGS_DEBUG "-fPIC -O0 -g -Wall")
set(CMAKE_CXX_FLAGS_RELEASE "-fPIC -O2 -Wall")

# Set the executable file name (for example, opapi_test) and specify the directory where the operator file (*.cpp) is stored.
add_executable(opapi_test
               test_add.cpp) 

# Set ASCEND_PATH (CANN package directory) and INCLUDE_BASE_DIR (header file directory).
if(NOT "$ENV{ASCEND_CUSTOM_PATH}" STREQUAL "")      
    set(ASCEND_PATH $ENV{ASCEND_CUSTOM_PATH})
else()
    set(ASCEND_PATH "/usr/local/Ascend/ascend-toolkit/latest")
endif()
set(INCLUDE_BASE_DIR "${ASCEND_PATH}/include")
include_directories(
    ${INCLUDE_BASE_DIR}
    ${INCLUDE_BASE_DIR}/aclnn
)

# Set the link library file paths.
target_link_libraries(opapi_test PRIVATE
                      ${ASCEND_PATH}/lib64/libascendcl.so
                      ${ASCEND_PATH}/lib64/libnnopbase.so
                      ${ASCEND_PATH}/lib64/libopapi.so)

# The executable file is in the bin folder of the directory where the CMakeLists file is located.
install(TARGETS opapi_test DESTINATION ${CMAKE_RUNTIME_OUTPUT_DIRECTORY})

CMakeLists File (Static Library)

Take the compilation of the Add operator as an example. The CMake file is defined as follows. For details about the dependent static library files, see Dependent Header Files and Library Files. Modify the files as required.

  • The operators that implement fusion and parallelism of collective communication and MatMul computation are called MC2 operators, such as AllGatherMatmul, AlltoAllAllGatherBatchMatMul, BatchMatMulReduceScatterAlltoAll, MatMulAllReduce, MatMulAllReduceAddRmsNorm, and MatMulReduceScatter.
  • When an MC2 operator API is called in single-operator API execution mode, multithreading and HCCL are involved. Therefore, the following content must be added to the CMakeLists file; otherwise, the build fails.
    # Set the link library file paths.
    find_package(Threads REQUIRED)   
    target_link_libraries(opapi_test PRIVATE
                          aclnn_rand_static aclnn_math_static aclnn_ops_infer_static aclnn_ops_train_static
                          opmaster_static c_sec platform error_manager ascendalog profapi ascendcl ge_common_base
                          graph_base exe_graph graph register ascend_dump nnopbase hccl ${CMAKE_THREAD_LIBS_INIT} dl runtime)   # Add library files on which collective communication and multithreading depend.

    find_package(Threads REQUIRED) is a command used by CMake to search for the thread library. It can automatically link the header files that the thread library depends on and other library files that the thread library indirectly depends on.

# CMake lowest version requirement
cmake_minimum_required(VERSION 3.14)

# Set the project name.
project(ACLNN_EXAMPLE)

# Compile options
add_compile_options(-std=c++11)

# Set compilation options.
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY  "./bin")    
set(CMAKE_CXX_FLAGS_DEBUG "-fPIC -O0 -g -Wall")
set(CMAKE_CXX_FLAGS_RELEASE "-fPIC -O2 -Wall")

# Set the executable file name (for example, opapi_test) and specify the directory where the operator file (*.cpp) is stored.
add_executable(opapi_test
               test_add.cpp) 

# Set ASCEND_PATH (CANN package directory) and INCLUDE_BASE_DIR (header file directory).
if(NOT "$ENV{ASCEND_CUSTOM_PATH}" STREQUAL "")      
    set(ASCEND_PATH $ENV{ASCEND_CUSTOM_PATH})
else()
    set(ASCEND_PATH "/usr/local/Ascend/ascend-toolkit/latest")
endif()
set(INCLUDE_BASE_DIR "${ASCEND_PATH}/include")
include_directories(
    ${INCLUDE_BASE_DIR}
    ${INCLUDE_BASE_DIR}/aclnn
)

# Set the link library file paths.
#Note 1: opmaster_static.a and aclnn_math_static.a are mandatory. Set other .a files as required. You can set one or more .a files.
#Note 2: The .so file is the dynamic library file on which the .a file of the static library depends and must be set.
target_link_directories(opapi_test PRIVATE ${ASCEND_PATH}/lib64/)
target_link_libraries(opapi_test PRIVATE
                      aclnn_rand_static aclnn_math_static aclnn_ops_infer_static aclnn_ops_train_static
                      opmaster_static c_sec platform error_manager ascendalog profapi ascendcl ge_common_base
                      graph_base exe_graph graph register ascend_dump nnopbase)

# The executable file is in the bin folder of the directory where the CMakeLists file is located.
install(TARGETS opapi_test DESTINATION ${CMAKE_RUNTIME_OUTPUT_DIRECTORY})

Build and Run

  • This section uses the scenario where the development and operating environments are co-deployed as an example. That is, the machine with Ascend AI Processor is used as both the development environment and operating environment. In this scenario, code development and code running are performed on the same machine. For details about environment setup, see Development and Operating Environment Setup.
    • Build phase: The compilation of the dynamic or static library depends on the development kit package Ascend-cann-toolkit and operator binary software package Ascend-cann-kernels.
    • Running phase: The dynamic-library and static-library approaches cannot be mixed. When linking against the static library, you only need to install the offline inference engine package Ascend-cann-nnrt in the operating environment.
  • For details about how to build and run apps, see "App Build and Run" in App Debugging.
  1. Prepare the operator calling code (*.cpp) and compilation script (CMakeLists.txt) in advance based on Sample Code, CMakeLists File (DLL), or CMakeLists File (Static Library).
  2. Set the environment variable.

    After installing the CANN software, log in to the environment as the CANN running user and run the following command to make the environment variables take effect:

    source ${INSTALL_DIR}/set_env.sh

    Replace ${INSTALL_DIR} with the actual CANN installation directory. For example, if the Ascend-cann-toolkit package is installed as the root user, the directory is /usr/local/Ascend/ascend-toolkit/latest.

  3. Build and run the script.
    1. Go to the directory where CMakeLists.txt is stored and run the following command to create a build folder for the generated build files.
      mkdir -p build 
    2. Go to the directory where build is located, run the cmake command for build, and then run the make command to generate an executable file.
      cd build
      cmake ../ -DCMAKE_CXX_COMPILER=g++ -DCMAKE_SKIP_RPATH=TRUE
      make

      After the build is successful, the opapi_test executable file is generated in the bin folder of the build directory.

    3. Go to the bin folder and run the executable file opapi_test.
      cd bin
      ./opapi_test

      Take the running result of the Add operator as an example. Given the sample inputs above (y = x1 + 1.2 × x2), the output should be similar to the following:

      result[0] is: 1.200000
      result[1] is: 2.200000
      result[2] is: 3.200000
      result[3] is: 5.400000
      result[4] is: 6.400000
      result[5] is: 7.400000
      result[6] is: 9.600000
      result[7] is: 10.600000