Sample Code for Operator Execution by CBLAS API Call

This section describes the key APIs and sample code for calling CBLAS operators in single-operator model execution mode.

Principles

For details about the API call sequence, see Single-Operator Call Sequence.

The GEMM operator (used for matrix-vector multiplication and matrix-matrix multiplication) and the Cast operator (used for data type conversion) have been encapsulated into aclblas APIs. For details, see CBLAS APIs. You can execute the operators in either of the following modes:

- Non-handle mode: Call APIs whose names do not contain the keyword "Handle", for example, aclblasGemmEx (which encapsulates the GEMM operator) and aclopCast (which encapsulates the Cast operator).
- Handle mode: Call APIs whose names contain the keyword "Handle", for example, aclblasCreateHandleForGemmEx and aclopCreateHandleForCast, and then call aclopExecWithHandle to execute the operator.

In non-handle mode, the system matches the operator description against the models in memory on every execution.

In handle mode, the system matches the operator description against the models in memory once and caches the result in the handle, so the match is not repeated on each execution. This makes handle mode more efficient when the same operator is executed multiple times. However, handle mode does not support dynamic-shape operators. When the handle is no longer needed, call aclopDestroyHandle to release it.
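
The difference between the two modes can be pictured with a toy model. The sketch below is not ACL code and does not describe how the runtime is implemented internally; `ModelRegistry`, `OpDesc`, `Model`, and `Handle` are hypothetical names. It only illustrates that the non-handle path performs a match on every execution, whereas the handle path performs the match once and reuses the cached result.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-ins; none of these types exist in ACL.
using OpDesc = std::string;  // for example "GEMM"
struct Model { std::string name; };

class ModelRegistry {
public:
    void Load(const OpDesc &desc, const Model &model) { models_[desc] = model; }

    // Matches a description against the loaded models and counts each match.
    const Model &Match(const OpDesc &desc) {
        ++matchCount_;
        auto it = models_.find(desc);
        if (it == models_.end()) {
            throw std::runtime_error("no model for " + desc);
        }
        return it->second;
    }

    int matchCount() const { return matchCount_; }

private:
    std::map<OpDesc, Model> models_;
    int matchCount_ = 0;
};

// Non-handle style: the description is matched on every execution.
inline const Model &ExecuteWithoutHandle(ModelRegistry &registry, const OpDesc &desc) {
    return registry.Match(desc);
}

// Handle style: the match happens once, when the handle is created.
class Handle {
public:
    Handle(ModelRegistry &registry, const OpDesc &desc) : model_(&registry.Match(desc)) {}
    const Model &Execute() const { return *model_; }  // no further matching
private:
    const Model *model_;
};
```

In this toy model, three calls to `ExecuteWithoutHandle` cause three matches, while creating one `Handle` and calling `Execute` three times causes a single match.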

Sample Code

This section uses aclblasGemmEx as an example. This API encapsulates the GEMM operator. The matrix multiplication formula is C = αAB + βC: matrix A is multiplied by matrix B, the product is scaled by α, and the result is added to the existing matrix C scaled by β. α and β are scalar coefficients. For the complete sample, see Matrix-Matrix Multiplication.
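
As a plain CPU reference for the formula (not the aclblas implementation), the following sketch computes C = αAB + βC for row-major matrices without transposition. The function name `ReferenceGemm` and the use of `float` instead of float16 are assumptions made to keep the example self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// CPU reference for C = alpha * A * B + beta * C (row-major, no transposition).
// A is m x k, B is k x n, C is m x n. C is read and then overwritten in place.
inline void ReferenceGemm(std::size_t m, std::size_t n, std::size_t k,
                          float alpha, const std::vector<float> &a,
                          const std::vector<float> &b,
                          float beta, std::vector<float> &c) {
    assert(a.size() == m * k && b.size() == k * n && c.size() == m * n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += a[i * k + p] * b[p * n + j];
            }
            // The previous value of C contributes through the beta term.
            c[i * n + j] = alpha * acc + beta * c[i * n + j];
        }
    }
}
```

For example, with A = [[1, 2], [3, 4]], B = [[5, 6], [7, 8]], C initialized to all ones, α = 2, and β = 3, the result is C = [[41, 47], [89, 103]]. Because the input value of C enters the result, C must hold valid data before the operator runs.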

To call the CBLAS API (encapsulating the GEMM operator), perform the following steps:

Prepare the model file of the GEMM operator.
1. Construct the description file (*.json file) of the GEMM operator, which describes the input and output tensors and operator attributes.
  Example description file of the GEMM operator:
```
[
{
  "op": "GEMM",
  "input_desc": [
    {
      "format": "ND",
      "shape": [16, 16],
      "type": "float16"
    },
    {
      "format": "ND",
      "shape": [16, 16],
      "type": "float16"
    },
    {
      "format": "ND",
      "shape": [16, 16],
      "type": "float16"
    },
    {
      "format": "ND",
      "shape": [],
      "type": "float16"
    },
    {
      "format": "ND",
      "shape": [],
      "type": "float16"
    }
  ],
  "output_desc": [
    {
      "format": "ND",
      "shape": [16, 16],
      "type": "float16"
    }
  ],
  "attr": [
    {
      "name": "transpose_a",
      "type": "bool",
      "value": false
    },
    {
      "name": "transpose_b",
      "type": "bool",
      "value": false
    }
  ]
}
]
```
2. Use the ATC tool to compile the operator description file into a single-operator model file (*.om file), and then call acl APIs to load the OM model file and execute the operator.
  The following is a command example of the ATC tool:
```
atc --singleop=$HOME/singleop/gemm.json --output=$HOME/singleop/out/op_model --soc_version=<soc_version>
```
  The key parameters are described below. For details about the parameters, see ATC Instructions.
  - --singleop: path of the single-operator description file (JSON format).
  - --output: directory for storing the single-operator model file.
  - --soc_version: version of the Ascend AI Processor. Replace <soc_version> with the actual value.
    To determine the soc_version of a device:
    - For the following products, run the npu-smi info command on the server where the Ascend AI Processor is installed to obtain the Name information. The actual value is Ascend followed by Name. For example, if Name is xxxyy, the actual value is Ascendxxxyy.
      - Atlas A2 training products / Atlas A2 inference products
      - Atlas 200I/500 A2 inference products
      - Atlas inference products
      - Atlas training products
    - For the following products, run the npu-smi info -t board -i id -c chip_id command on the server where the Ascend AI Processor is installed to obtain the Chip Name and NPU Name information. The actual value is Chip Name_NPU Name. For example, if the value of Chip Name is Ascendxxx and the value of NPU Name is 1234, the actual value is Ascendxxx_1234. In the command:
      - id: device ID, which is the NPU ID obtained by running the npu-smi info -l command.
      - chip_id: chip ID, which is obtained by running the npu-smi info -m command.

      This method applies to:
      - Atlas A3 training products / Atlas A3 inference products
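
Returning to the description file from step 1: the shapes in input_desc and output_desc need to be consistent with the transpose attributes for the multiplication to be defined. The sketch below applies the standard matrix-multiplication dimension rule; `GemmShapesAgree` and `Shape2D` are hypothetical names, and the source does not state whether ATC performs an equivalent check.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Shape2D = std::array<std::int64_t, 2>;  // {rows, cols}

// Returns true when op(A) * op(B) is defined and has the shape of C,
// where op(X) is X or its transpose depending on the flag.
inline bool GemmShapesAgree(Shape2D a, bool transposeA,
                            Shape2D b, bool transposeB,
                            Shape2D c) {
    const std::int64_t m  = transposeA ? a[1] : a[0];
    const std::int64_t ka = transposeA ? a[0] : a[1];
    const std::int64_t kb = transposeB ? b[1] : b[0];
    const std::int64_t n  = transposeB ? b[0] : b[1];
    return ka == kb && c[0] == m && c[1] == n;
}
```

For the example description file, all three matrices are [16, 16] with both transpose attributes set to false, so the check passes.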

Write the code that calls the CBLAS API.

The following snippet shows only the key steps and cannot be built or run as is. After each API call, add exception handling branches and log messages at the error and information levels. The complete code is available in the sample in Matrix-Matrix Multiplication.
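
One common way to add such exception handling branches is a small checking macro. The sketch below is an assumption about style, not part of the sample: `CHECK_ACL`, `Status`, `kSuccess`, `FakeAclCall`, and `RunSteps` are hypothetical names, and a plain `int` stands in for `aclError` (where 0 denotes success) so that the example builds without the ACL headers.

```cpp
#include <cassert>
#include <cstdio>

using Status = int;             // stand-in for aclError in this sketch
constexpr Status kSuccess = 0;  // stand-in for the ACL success code

// Evaluates a call once; on failure, logs the failing expression at error
// level and returns the status code from the enclosing function.
#define CHECK_ACL(call)                                              \
    do {                                                             \
        const Status status_ = (call);                               \
        if (status_ != kSuccess) {                                   \
            std::fprintf(stderr, "[ERROR] %s failed, code = %d\n",   \
                         #call, status_);                            \
            return status_;                                          \
        }                                                            \
    } while (0)

// Hypothetical stand-in for an ACL API call.
inline Status FakeAclCall(Status result) { return result; }

inline Status RunSteps(Status second) {
    CHECK_ACL(FakeAclCall(kSuccess));
    CHECK_ACL(FakeAclCall(second));  // returns early if this step fails
    std::fprintf(stdout, "[INFO] all steps succeeded\n");
    return kSuccess;
}
```

In real code, each call such as aclInit or aclrtSetDevice could be wrapped in the same way, with resource cleanup added before the early return where needed.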

```
// 1. Perform initialization.
aclError aclRet = aclInit(nullptr);

// 2. Allocate runtime resources. (The default context and stream are used. When the default stream is used as the argument of another API, a null pointer can be passed.)
aclRet = aclrtSetDevice(0);
aclrtStream stream = nullptr;  // A null pointer selects the default stream.
// Obtain the run mode of the software stack. Different run modes lead to different API call sequences (for example, whether data transfer is required).
aclrtRunMode runMode;
aclRet = aclrtGetRunMode(&runMode);
bool g_isDevice = (runMode == ACL_DEVICE);

// 3. Set the directory of the single-operator model files.
// This directory is relative to the directory of the executable file. For example, if the executable file is stored in the run/out directory, the directory is run/out/op_models.
aclopSetModelDir("op_models");

// 4. Allocate memory.
// Allocate device memory to store the operator inputs.
// In this matrix-matrix multiplication example, allocate memory for matrix A, matrix B, matrix C, scalar α, and scalar β in sequence.
aclrtMalloc((void **) &devMatrixA_, sizeA_, ACL_MEM_MALLOC_HUGE_FIRST);
aclrtMalloc((void **) &devMatrixB_, sizeB_, ACL_MEM_MALLOC_HUGE_FIRST);
aclrtMalloc((void **) &devMatrixC_, sizeC_, ACL_MEM_MALLOC_HUGE_FIRST);
aclrtMalloc((void **) &devAlpha_, sizeAlphaBeta_, ACL_MEM_MALLOC_HUGE_FIRST);
aclrtMalloc((void **) &devBeta_, sizeAlphaBeta_, ACL_MEM_MALLOC_HUGE_FIRST);

// Allocate host memory only if the run mode of the software stack requires it.
// If ACL_DEVICE is returned, g_isDevice is true and the software stack runs on the device. No data transfer between the host and device is involved, so host memory allocation is not needed.
// If ACL_HOST is returned, g_isDevice is false and the software stack runs on the host. Data must be transferred from the host to the device, so host memory allocation is needed.
if (g_isDevice) {
    hostMatrixA_ = devMatrixA_;
    hostMatrixB_ = devMatrixB_;
    hostMatrixC_ = devMatrixC_;
} else {
    aclrtMallocHost((void **) &hostMatrixA_, sizeA_);
    aclrtMallocHost((void **) &hostMatrixB_, sizeB_);
    aclrtMallocHost((void **) &hostMatrixC_, sizeC_);
}

// 5. Prepare the input data. ReadFile is a user-defined function that loads data from files to the memory.
size_t fileSize;
// Read matrix A.
char *fileData = ReadFile("test_data/data/matrix_a.bin", fileSize, hostMatrixA_, sizeA_);
// Read matrix B.
fileData = ReadFile("test_data/data/matrix_b.bin", fileSize, hostMatrixB_, sizeB_);
// Read matrix C.
fileData = ReadFile("test_data/data/matrix_c.bin", fileSize, hostMatrixC_, sizeC_);
// Transfer the matrices from the host to the device only if the software stack runs on the host.
if (!g_isDevice) {
    aclRet = aclrtMemcpy(devMatrixA_, sizeA_, hostMatrixA_, sizeA_, ACL_MEMCPY_HOST_TO_DEVICE);
    aclRet = aclrtMemcpy(devMatrixB_, sizeB_, hostMatrixB_, sizeB_, ACL_MEMCPY_HOST_TO_DEVICE);
    aclRet = aclrtMemcpy(devMatrixC_, sizeC_, hostMatrixC_, sizeC_, ACL_MEMCPY_HOST_TO_DEVICE);
}

// Copy scalar α and scalar β to the device memory.
aclrtMemcpyKind kind = g_isDevice ? ACL_MEMCPY_DEVICE_TO_DEVICE : ACL_MEMCPY_HOST_TO_DEVICE;
aclRet = aclrtMemcpy(devAlpha_, sizeAlphaBeta_, hostAlpha_, sizeAlphaBeta_, kind);
aclRet = aclrtMemcpy(devBeta_, sizeAlphaBeta_, hostBeta_, sizeAlphaBeta_, kind);

// 6. Run the single operator.
// In this example, aclblasGemmEx (asynchronous mode) is called to implement matrix-matrix multiplication.
aclRet = aclblasGemmEx(ACL_TRANS_N, ACL_TRANS_N, ACL_TRANS_N, m_, n_, k_,
                       devAlpha_, devMatrixA_, k_, inputType_, devMatrixB_, n_, inputType_,
                       devBeta_, devMatrixC_, n_, outputType_, ACL_COMPUTE_HIGH_PRECISION,
                       stream);
// Call aclrtSynchronizeStream to block the host until all tasks in the specified stream are complete.
aclRet = aclrtSynchronizeStream(stream);

// 7. Transfer the operator execution result. Data is copied back to the host only if the software stack runs on the host.
if (!g_isDevice) {
    aclRet = aclrtMemcpy(hostMatrixC_, sizeC_, devMatrixC_, sizeC_, ACL_MEMCPY_DEVICE_TO_HOST);
}

// 8. (Optional) Print the operator execution result to the screen.

// 9. Release runtime resources. (By default, the context and stream resources are automatically released by the aclrtResetDevice call.)
aclRet = aclrtResetDevice(0);

// 10. Perform deinitialization.
aclRet = aclFinalize();

// ......
```

