Sample Code for Operator Execution by CBLAS API Call

This section describes the key APIs and sample code for calling CBLAS operators in single-operator model execution mode.

Principles

For the API call sequence, see Single-Operator Call Sequence.

The GEMM operator (used for matrix-vector multiplication and matrix-matrix multiplication) and the Cast operator (used for data type conversion) have been encapsulated into AscendCL CBLAS APIs. For details, see CBLAS APIs. You can execute the operators in either of the following modes:

• Non-handle mode: each time the operator is executed, the system matches the operator description against the models in memory.

• Handle mode: the system matches the operator description against the models in memory once and caches the result in a handle. Later executions reuse the handle and skip the matching step, so executing the same operator multiple times is more efficient. However, this mode does not support dynamic-shape operators. When the handle is no longer needed, call aclopDestroyHandle to release it.
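The difference between the two modes can be pictured with a small, self-contained C++ sketch. It is a conceptual model only and does not use AscendCL: OpDesc, ModelRegistry, Handle, and the three functions are hypothetical names, and "matching" is reduced to a linear search that counts how often it runs.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for an operator description (type plus one shape).
struct OpDesc {
    std::string opType;
    std::vector<int> shape;
    bool operator==(const OpDesc &other) const {
        return opType == other.opType && shape == other.shape;
    }
};

// Hypothetical stand-in for the single-operator models loaded in memory.
struct ModelRegistry {
    std::vector<OpDesc> models;
    std::size_t matchCount = 0;  // number of description-to-model matches performed

    // Returns the index of the matching model, or -1 if none matches.
    int Match(const OpDesc &desc) {
        ++matchCount;
        for (std::size_t i = 0; i < models.size(); ++i) {
            if (models[i] == desc) {
                return static_cast<int>(i);
            }
        }
        return -1;
    }
};

// Non-handle mode: every execution matches the description again.
inline int ExecuteWithoutHandle(ModelRegistry &registry, const OpDesc &desc) {
    return registry.Match(desc);
}

// Handle mode: the match is done once at creation and cached in the handle.
struct Handle {
    int modelIndex;
};

inline Handle CreateHandle(ModelRegistry &registry, const OpDesc &desc) {
    return Handle{registry.Match(desc)};
}

inline int ExecuteWithHandle(const Handle &handle) {
    assert(handle.modelIndex >= 0);  // the handle must refer to a matched model
    return handle.modelIndex;        // no matching is needed here
}
```

In this toy model, three executions without a handle perform three matches, while three executions through a handle perform only the single match made when the handle was created.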

Sample Code

This section uses the aclblasGemmEx API as an example. This API encapsulates the GEMM operator and computes C = αAB + βC: matrix A is multiplied by matrix B, the product is scaled by the scalar α, and the result is added to the original matrix C scaled by the scalar β. The result is written back to matrix C. For the complete sample, see Matrix-Matrix Multiplication.
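To make the formula concrete, the following is a minimal CPU reference for C = αAB + βC. It is not part of AscendCL: ReferenceGemm is a hypothetical helper name, the matrices are assumed to be dense and row-major, and float is used instead of float16 only because standard C++ has no portable half-precision type.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for C = alpha * A * B + beta * C.
// A is m x k, B is k x n, C is m x n; all are dense and row-major
// (an assumption of this sketch).
inline void ReferenceGemm(int m, int n, int k, float alpha,
                          const std::vector<float> &a,
                          const std::vector<float> &b,
                          float beta, std::vector<float> &c) {
    assert(m > 0 && n > 0 && k > 0);
    assert(a.size() == static_cast<std::size_t>(m) * k);
    assert(b.size() == static_cast<std::size_t>(k) * n);
    assert(c.size() == static_cast<std::size_t>(m) * n);

    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < n; ++j) {
            // Dot product of row i of A with column j of B.
            float acc = 0.0f;
            for (int p = 0; p < k; ++p) {
                acc += a[i * k + p] * b[p * n + j];
            }
            // Scale the product by alpha and the old C entry by beta.
            c[i * n + j] = alpha * acc + beta * c[i * n + j];
        }
    }
}
```

A reference like this can be used to check the result that the device returns, within the tolerance expected for float16 arithmetic.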

To call the CBLAS API (encapsulating the GEMM operator), perform the following steps:

  1. Prepare the model file of the GEMM operator.
    1. Construct the description file (*.json file) of the GEMM operator, which describes the input and output tensors and operator attributes.

      Example description file of the GEMM operator:

      [
      {
        "op": "GEMM",
        "input_desc": [
          {
            "format": "ND",
            "shape": [16, 16],
            "type": "float16"
          },
          {
            "format": "ND",
            "shape": [16, 16],
            "type": "float16"
          },
          {
            "format": "ND",
            "shape": [16, 16],
            "type": "float16"
          },
          {
            "format": "ND",
            "shape": [],
            "type": "float16"
          },
          {
            "format": "ND",
            "shape": [],
            "type": "float16"
          }
        ],
        "output_desc": [
          {
            "format": "ND",
            "shape": [16, 16],
            "type": "float16"
          }
        ],
        "attr": [
        {
          "name": "transpose_a",
          "type": "bool",
          "value": false
        },
        {
          "name": "transpose_b",
          "type": "bool",
          "value": false
          }
        ]
      }
      ]
    2. Use the ATC tool to compile the operator description file into a single-operator model file (*.om file), and then call the AscendCL APIs to load the OM model file and execute the operator.

      The following is a command example of the ATC tool:

      atc --singleop=$HOME/singleop/gemm.json --output=$HOME/singleop/out/op_model --soc_version=<soc_version>

      The key parameters are described as follows. For details about the parameters, see ATC Instructions.

      • --singleop: path of the single-operator description file (JSON format).
      • --output: directory for storing the single-operator model file.
      • --soc_version: version of the Ascend AI Processor.
  2. Write the code that calls the CBLAS API.
    The following snippet shows the key steps only and is not ready to be built or run. In real code, check the return value of each API call and log error-level and information-level messages. The complete code is available in the Matrix-Matrix Multiplication sample.
    //1. Initialize AscendCL.
    aclError aclRet = aclInit(nullptr);
    
    //2. Allocate runtime resources. (The default context and stream are used. When the default stream is used as the argument of another API, a null pointer can be passed.)
    aclRet = aclrtSetDevice(0);
    //Obtain the run mode of the software stack. Different run modes lead to different API call sequences (for example, whether data transfer is required).
    aclrtRunMode runMode;
    aclRet = aclrtGetRunMode(&runMode);
    bool g_isDevice = (runMode == ACL_DEVICE);
    
    //3. Set the directory of the single-operator model files.
    //This directory is relative to the directory of the executable file. For example, if the executable file is stored in the run/out directory, the directory is run/out/op_models.
    aclopSetModelDir("op_models");
    
    //4. Allocate memory.
    //Allocate device memory to store the operator inputs.
    //In this matrix-matrix multiplication example, allocate memory for storing data of matrix A, matrix B, matrix C, scalar α, and scalar β in sequence.
    aclrtMalloc((void **) &devMatrixA_, sizeA_, ACL_MEM_MALLOC_HUGE_FIRST);
    aclrtMalloc((void **) &devMatrixB_, sizeB_, ACL_MEM_MALLOC_HUGE_FIRST);
    aclrtMalloc((void **) &devMatrixC_, sizeC_, ACL_MEM_MALLOC_HUGE_FIRST);
    aclrtMalloc((void **) &devAlpha_, sizeAlphaBeta_, ACL_MEM_MALLOC_HUGE_FIRST);
    aclrtMalloc((void **) &devBeta_, sizeAlphaBeta_, ACL_MEM_MALLOC_HUGE_FIRST);
    
    //Allocate host memory. Whether host memory is needed depends on the run mode of the software stack.
    //If ACL_DEVICE is returned (g_isDevice is true), the software stack runs on the device and no host-to-device data transfer is involved. Host memory allocation is not needed, so the host pointers simply alias the device buffers.
    //If ACL_HOST is returned (g_isDevice is false), the software stack runs on the host and data must be transferred from the host to the device. Host memory allocation is needed.
    if (g_isDevice) {
        hostMatrixA_ = devMatrixA_;
        hostMatrixB_ = devMatrixB_;
        hostMatrixC_ = devMatrixC_;
    } else {
        aclrtMallocHost((void **) &hostMatrixA_, sizeA_);
        aclrtMallocHost((void **) &hostMatrixB_, sizeB_);
        aclrtMallocHost((void **) &hostMatrixC_, sizeC_);
    }
    
    //5. Prepare the input data. ReadFile is a user-defined function, which is used to load data from files to the memory.
    size_t fileSize;
    // Read matrix A
    char *fileData = ReadFile("test_data/data/matrix_a.bin", fileSize, hostMatrixA_, sizeA_);
    // Read matrix B
    fileData = ReadFile("test_data/data/matrix_b.bin", fileSize, hostMatrixB_, sizeB_);
    // Read matrix C
    fileData = ReadFile("test_data/data/matrix_c.bin", fileSize, hostMatrixC_, sizeC_);
    //Determine whether data transfer between the host and device is involved based on the run mode of the software stack.
    if (!g_isDevice) {
        aclRet = aclrtMemcpy(devMatrixA_, sizeA_, hostMatrixA_, sizeA_, ACL_MEMCPY_HOST_TO_DEVICE);
        aclRet = aclrtMemcpy(devMatrixB_, sizeB_, hostMatrixB_, sizeB_, ACL_MEMCPY_HOST_TO_DEVICE);
        aclRet = aclrtMemcpy(devMatrixC_, sizeC_, hostMatrixC_, sizeC_, ACL_MEMCPY_HOST_TO_DEVICE);
    }
    
    //Copy the scalars α and β to device memory. The copy direction depends on the run mode.
    aclrtMemcpyKind kind = g_isDevice ? ACL_MEMCPY_DEVICE_TO_DEVICE : ACL_MEMCPY_HOST_TO_DEVICE;
    aclRet = aclrtMemcpy(devAlpha_, sizeAlphaBeta_, hostAlpha_, sizeAlphaBeta_, kind);
    aclRet = aclrtMemcpy(devBeta_, sizeAlphaBeta_, hostBeta_, sizeAlphaBeta_, kind);
    
    //6. Execute the single operator.
    //In this example, aclblasGemmEx (asynchronous mode) is called to implement matrix-matrix multiplication.
    //The default stream is used, so a null pointer is passed as the stream argument.
    aclRet = aclblasGemmEx(ACL_TRANS_N, ACL_TRANS_N, ACL_TRANS_N, m_, n_, k_,
                           devAlpha_, devMatrixA_, k_, inputType_, devMatrixB_, n_, inputType_,
                           devBeta_, devMatrixC_, n_, outputType_, ACL_COMPUTE_HIGH_PRECISION,
                           nullptr);
    //Call aclrtSynchronizeStream to wait for the stream tasks to complete.
    aclRet = aclrtSynchronizeStream(nullptr);
    
    //7. Transfer the operator execution result. Determine whether data transfer between the host and device is involved based on the run mode of the software stack.
    if (!g_isDevice) {
        auto ret = aclrtMemcpy(hostMatrixC_, sizeC_, devMatrixC_, sizeC_, ACL_MEMCPY_DEVICE_TO_HOST);
    }
    
    //8. (Optional) Print the operator execution result to the screen.
    
    //9. Release runtime resources. (By default, the context and stream resources are automatically released using the aclrtResetDevice call.)
    aclRet = aclrtResetDevice(0);
    
    //10. Deinitialize AscendCL.
    aclRet = aclFinalize();
    
    // ......
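The snippet above uses sizeA_, sizeB_, sizeC_, and sizeAlphaBeta_ without showing how they are derived. The following sketch shows one plausible way to compute them, together with the leading dimensions passed to aclblasGemmEx (k_, n_, and n_ in the snippet). GemmLayout and ComputeGemmLayout are hypothetical names, the matrices are assumed to be dense and row-major, and the element size is passed in explicitly (2 bytes for float16).

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper type: buffer sizes in bytes and leading dimensions
// for a dense row-major GEMM with A (m x k), B (k x n), and C (m x n).
struct GemmLayout {
    std::size_t sizeA;
    std::size_t sizeB;
    std::size_t sizeC;
    std::size_t sizeAlphaBeta;  // one scalar element each for alpha and beta
    int lda;
    int ldb;
    int ldc;
};

inline GemmLayout ComputeGemmLayout(int m, int n, int k, std::size_t elemSize) {
    assert(m > 0 && n > 0 && k > 0 && elemSize > 0);
    GemmLayout layout{};
    layout.sizeA = static_cast<std::size_t>(m) * k * elemSize;
    layout.sizeB = static_cast<std::size_t>(k) * n * elemSize;
    layout.sizeC = static_cast<std::size_t>(m) * n * elemSize;
    layout.sizeAlphaBeta = elemSize;
    // For row-major storage the leading dimension is the row length,
    // which mirrors the values the snippet passes (k_, n_, n_).
    layout.lda = k;
    layout.ldb = n;
    layout.ldc = n;
    return layout;
}
```

For the 16 x 16 float16 matrices in the example description file, this gives 512 bytes per matrix and 2 bytes for each scalar.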