Sample Code for Calling aclnn APIs

This section provides sample code for calling CANN operator APIs in single-operator API execution mode, and describes how to build and run it.

Prerequisites

  • Environment preparation: The app development environment has been deployed, and the dependency header files and library files have been correctly referenced. If not, the APIs cannot be used. For details, see Dependent Header Files and Library Files.
  • Basic knowledge: You have learned the execution principles of single-operator APIs and the call sequence of CANN operator APIs (aclnnXxx). For details, see Single-Operator Call Sequence.
  • API reference: You have learned the functions, parameters, and constraints of CANN operator APIs (aclnnXxx). For details, see Single-Operator API Execution.

Sample Code

The following uses the Add operator as an example to describe the basic logic of calling a two-phase operator API. The process of calling other operators is similar. Adapt the code to your needs.

  • The sample code is for reference only. For a specific operator, refer to the sample code provided in its API document.
  • Build and run the operator on a supported product model. Otherwise, the operator call fails.

The Add operator implements tensor addition. The calculation formula is y = x1 + α × x2. Save the following sample code as test_add.cpp:

#include <cstdint>
#include <cstdio>
#include <iostream>
#include <vector>
#include "acl/acl.h"
#include "aclnnop/aclnn_add.h"

#define CHECK_RET(cond, return_expr) \
  do {                               \
    if (!(cond)) {                   \
      return_expr;                   \
    }                                \
  } while (0)

#define LOG_PRINT(message, ...)     \
  do {                              \
    printf(message, ##__VA_ARGS__); \
  } while (0)

int64_t GetShapeSize(const std::vector<int64_t>& shape) {
  int64_t shape_size = 1;
  for (auto i : shape) {
    shape_size *= i;
  }
  return shape_size;
}

int Init(int32_t deviceId, aclrtStream* stream) {
  // Initialize AscendCL, set the device, and create a stream. This sequence is fixed.
  auto ret = aclInit(nullptr);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclInit failed. ERROR: %d\n", ret); return ret);
  ret = aclrtSetDevice(deviceId);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtSetDevice failed. ERROR: %d\n", ret); return ret);
  ret = aclrtCreateStream(stream);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtCreateStream failed. ERROR: %d\n", ret); return ret);
  return 0;
}

template <typename T>
int CreateAclTensor(const std::vector<T>& hostData, const std::vector<int64_t>& shape, void** deviceAddr,
                    aclDataType dataType, aclTensor** tensor) {
  auto size = GetShapeSize(shape) * sizeof(T);
  // Call aclrtMalloc to allocate memory on the device.
  auto ret = aclrtMalloc(deviceAddr, size, ACL_MEM_MALLOC_HUGE_FIRST);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMalloc failed. ERROR: %d\n", ret); return ret);
  // Call aclrtMemcpy to copy the data on the host to the memory on the device.
  ret = aclrtMemcpy(*deviceAddr, size, hostData.data(), size, ACL_MEMCPY_HOST_TO_DEVICE);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtMemcpy failed. ERROR: %d\n", ret); return ret);
  // Compute the strides of the contiguous tensor.
  std::vector<int64_t> strides(shape.size(), 1);
  for (int64_t i = shape.size() - 2; i >= 0; i--) {
    strides[i] = shape[i + 1] * strides[i + 1];
  }
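  // Call aclCreateTensor to create an aclTensor.
  *tensor = aclCreateTensor(shape.data(), shape.size(), dataType, strides.data(), 0, aclFormat::ACL_FORMAT_ND,
                            shape.data(), shape.size(), *deviceAddr);
  // aclCreateTensor returns nullptr on failure, so report it instead of returning success.
  CHECK_RET(*tensor != nullptr, LOG_PRINT("aclCreateTensor failed.\n"); return -1);
  return 0;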
}

int main() {
  //1. Initialize the device and stream. This step is fixed.
  // Set deviceId based on the actual device.
  int32_t deviceId = 0;
  aclrtStream stream;
  auto ret = Init(deviceId, &stream);
  // Check the return value as required.
  CHECK_RET(ret == 0, LOG_PRINT("Init acl failed. ERROR: %d\n", ret); return ret);

  //2. Construct the input and output based on the API.
  std::vector<int64_t> selfShape = {4, 2};
  std::vector<int64_t> otherShape = {4, 2};
  std::vector<int64_t> outShape = {4, 2};
  void* selfDeviceAddr = nullptr;
  void* otherDeviceAddr = nullptr;
  void* outDeviceAddr = nullptr;
  aclTensor* self = nullptr;
  aclTensor* other = nullptr;
  aclScalar* alpha = nullptr;
  aclTensor* out = nullptr;
  std::vector<float> selfHostData = {0, 1, 2, 3, 4, 5, 6, 7};
  std::vector<float> otherHostData = {1, 1, 1, 2, 2, 2, 3, 3};
  std::vector<float> outHostData = {0, 0, 0, 0, 0, 0, 0, 0};
  float alphaValue = 1.2f;
  // Create a self aclTensor.
  ret = CreateAclTensor(selfHostData, selfShape, &selfDeviceAddr, aclDataType::ACL_FLOAT, &self);
  CHECK_RET(ret == ACL_SUCCESS, return ret);
  // Create other aclTensor.
  ret = CreateAclTensor(otherHostData, otherShape, &otherDeviceAddr, aclDataType::ACL_FLOAT, &other);
  CHECK_RET(ret == ACL_SUCCESS, return ret);
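  // Create the alpha aclScalar.
  alpha = aclCreateScalar(&alphaValue, aclDataType::ACL_FLOAT);
  // ret is still ACL_SUCCESS here, so return a nonzero value to signal the failure.
  CHECK_RET(alpha != nullptr, LOG_PRINT("aclCreateScalar failed.\n"); return -1);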
  // Create out aclTensor.
  ret = CreateAclTensor(outHostData, outShape, &outDeviceAddr, aclDataType::ACL_FLOAT, &out);
  CHECK_RET(ret == ACL_SUCCESS, return ret);

  //3. Call the CANN operator library API. Replace aclnnAdd with the API of the operator you want to call.
  uint64_t workspaceSize = 0;
  aclOpExecutor* executor = nullptr;
  // Call the first-phase API of aclnnAdd.
  ret = aclnnAddGetWorkspaceSize(self, other, alpha, out, &workspaceSize, &executor);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclnnAddGetWorkspaceSize failed. ERROR: %d\n", ret); return ret);
  // Allocate device memory based on workspaceSize calculated by the first-phase API.
  void* workspaceAddr = nullptr;
  if (workspaceSize > 0) {
    ret = aclrtMalloc(&workspaceAddr, workspaceSize, ACL_MEM_MALLOC_HUGE_FIRST);
    CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("allocate workspace failed. ERROR: %d\n", ret); return ret);
  }
  // Call the second-phase API of aclnnAdd.
  ret = aclnnAdd(workspaceAddr, workspaceSize, executor, stream);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclnnAdd failed. ERROR: %d\n", ret); return ret);

  //4. Wait until the task execution is complete. This step is fixed.
  ret = aclrtSynchronizeStream(stream);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("aclrtSynchronizeStream failed. ERROR: %d\n", ret); return ret);

  //5. Copy the result from the device memory to the host and print it. Modify this step based on the API definition.
  auto size = GetShapeSize(outShape);
  std::vector<float> resultData(size, 0);
  ret = aclrtMemcpy(resultData.data(), resultData.size() * sizeof(resultData[0]), outDeviceAddr, size * sizeof(float),
                    ACL_MEMCPY_DEVICE_TO_HOST);
  CHECK_RET(ret == ACL_SUCCESS, LOG_PRINT("copy result from device to host failed. ERROR: %d\n", ret); return ret);
  for (int64_t i = 0; i < size; i++) {
    LOG_PRINT("result[%ld] is: %f\n", i, resultData[i]);
  }

  //6. Release aclTensor and aclScalar. Modify this step based on the API definition.
  aclDestroyTensor(self);
  aclDestroyTensor(other);
  aclDestroyScalar(alpha);
  aclDestroyTensor(out);
 
  //7. Release device resources. Modify this step based on the API definition.
  aclrtFree(selfDeviceAddr);
  aclrtFree(otherDeviceAddr);
  aclrtFree(outDeviceAddr);
  if (workspaceSize > 0) {
    aclrtFree(workspaceAddr);
  }
  aclrtDestroyStream(stream);
  aclrtResetDevice(deviceId);
  aclFinalize();
  return 0;
}

CMakeLists File

Take the Add operator as an example. The CMake file is defined below. Modify the file to suit your environment.

  • The operators that implement fusion and parallelism of collective communication and MatMul computation are called MC2 operators, such as AllGatherMatmul, AlltoAllAllGatherBatchMatMul, BatchMatMulReduceScatterAlltoAll, MatMulAllReduce, MatMulAllReduceAddRmsNorm, and MatMulReduceScatter.
  • Calling an MC2 operator API involves multithreading and the Huawei Collective Communication Library (HCCL). Therefore, add the following content to the CMake file. Otherwise, the build fails.
    #Set the link library file paths.
    find_package(Threads REQUIRED)                        
    target_link_libraries(opapi_test PRIVATE
                          ${ASCEND_PATH}/lib64/libascendcl.so
                          ${ASCEND_PATH}/lib64/libnnopbase.so
                          ${ASCEND_PATH}/lib64/libopapi.so
                          ${ASCEND_PATH}/lib64/libhccl.so      #The HCCL file
                          ${CMAKE_THREAD_LIBS_INIT})            #The library file on which multithreading depends
    

    find_package(Threads REQUIRED) is the CMake command that locates the system thread library. It sets the CMAKE_THREAD_LIBS_INIT variable, which the target_link_libraries command above uses to link the thread library.

# Minimum required CMake version
cmake_minimum_required(VERSION 3.14)

# Set the project name.
project(ACLNN_EXAMPLE)

# Compile with the C++11 standard.
add_compile_options(-std=c++11)

# Set the output directory of the executable file and the compiler flags of the Debug and Release builds.
set(CMAKE_RUNTIME_OUTPUT_DIRECTORY  "./bin")
set(CMAKE_CXX_FLAGS_DEBUG "-fPIC -O0 -g -Wall")
set(CMAKE_CXX_FLAGS_RELEASE "-fPIC -O2 -Wall")

# Set the executable file name (for example, opapi_test) and specify the directory where the operator file (*.cpp) is stored.
add_executable(opapi_test
               test_add.cpp) 

# Set ASCEND_PATH (CANN package directory) and INCLUDE_BASE_DIR (header file directory).
if(NOT "$ENV{ASCEND_CUSTOM_PATH}" STREQUAL "")      
    set(ASCEND_PATH $ENV{ASCEND_CUSTOM_PATH})
else()
    set(ASCEND_PATH "/usr/local/Ascend/cann")
endif()
set(INCLUDE_BASE_DIR "${ASCEND_PATH}/include")
include_directories(
    ${INCLUDE_BASE_DIR}
    ${INCLUDE_BASE_DIR}/aclnn
)

#Set the link library file paths.
target_link_libraries(opapi_test PRIVATE
                      ${ASCEND_PATH}/lib64/libascendcl.so
                      ${ASCEND_PATH}/lib64/libnnopbase.so
                      ${ASCEND_PATH}/lib64/libopapi.so)

# Install the executable file. After the build, the file is in the bin folder of the build directory.
install(TARGETS opapi_test DESTINATION ${CMAKE_RUNTIME_OUTPUT_DIRECTORY})

${ASCEND_PATH}/lib64/libopapi.so is the library file that all aclnn APIs depend on. To improve the operator compilation and execution efficiency, you can reference the library file by operator type as needed: libopapi_math.so for Math operators, libopapi_nn.so for NN operators, libopapi_cv.so for CV operators, and libopapi_transformer.so for Transformer operators.

Build and Run

  • In this example, the development and operating environments are co-deployed, where the server with the Ascend AI Processor is used as both the development environment and operating environment. In this scenario, code development and code running are performed on the same machine.
  • Environment requirements: The build depends on Toolkit (the CANN development kit) and the ops operator package. Ensure that both have been installed. For installation details, see Environment Setup.
  • For details about how to build and run apps, see "App Build and Run" in App Debugging.
  1. Prepare the operator calling code (*.cpp) and the build script (CMakeLists.txt) in advance based on Sample Code and CMakeLists File.
  2. Set the environment variables.

    After installing the CANN software, log in to the environment as the CANN running user and run the following command to make the environment variables take effect:

    source /usr/local/Ascend/cann/set_env.sh
  3. Build and run the sample.
    1. Go to the directory where CMakeLists.txt is stored and run the following command to create the build folder, which stores the generated build files.
      mkdir -p build 
    2. Go to the build folder, run the cmake command to generate the build files, and then run the make command to generate an executable file.
      cd build
      cmake ../ -DCMAKE_CXX_COMPILER=g++ -DCMAKE_SKIP_RPATH=TRUE
      make
      

      After the build is successful, the opapi_test executable file is generated in the bin folder of the current directory.

    3. Go to the bin folder and run the executable file opapi_test.
      cd bin
      ./opapi_test

      The following is the running result of the Add operator: