Operator Implementation on the Kernel

Operator Implementation describes kernel-side operator implementation in general. This section focuses on the programming model and API usage when connecting to the CANN framework.

Automatically Generating the Operator Implementation Template on the Kernel

Implement the operator's kernel function in the op_kernel/xxx.cpp file in the operator project directory. The kernel function definition template is automatically generated by msOpGen. An example is as follows. Note that the parameters are arranged in the sequence input, output, workspace, and tiling; do not change this order.

#include "kernel_operator.h"
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR workspace, GM_ADDR tiling) {
    GET_TILING_DATA(tiling_data, tiling); // Obtain tiling parameters. For details, see the following description.
    // TODO: user kernel impl
}
If an input and an output in the operator prototype definition have the same name, the _ref suffix is appended to the output parameter in the automatically generated kernel function to differentiate them. The following is an example:
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR x_ref, GM_ADDR workspace, GM_ADDR tiling) {
    ...
}

GET_TILING_DATA (Obtaining Tiling Parameters)

GET_TILING_DATA obtains the tiling information passed in through the kernel entry function of the operator and fills it into the registered tiling structure. It is implemented as a macro and expanded at compile time. Note that the TilingData structure must be defined in the corresponding host-side operator implementation, which also implements and registers the tiling function that computes TilingData. For details, see Tiling Implementation on the Host.

The following is an example of calling GET_TILING_DATA in the kernel function to obtain TilingData:
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR workspace, GM_ADDR tiling)
{
    GET_TILING_DATA(tilingData, tiling);
    KernelAdd op;
    op.Init(x, y, z, tilingData.totalLength, tilingData.tileNum);
    if (TILING_KEY_IS(1)) {
        op.Process();
    }
}

Deriving Input Data Types and Formats in Kernel Functions

The operator project provides three macros for use in the kernel function: DTYPE_<Arg>, ORIG_DTYPE_<Arg>, and FORMAT_<Arg>. They resolve to the data type, original data type, and data format of the corresponding kernel function parameter. <Arg> is the parameter name, automatically converted to uppercase. The sample code is as follows:
template <class T> void func() {}
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR workspace, GM_ADDR tiling)
{
    DTYPE_X temp;
    func<DTYPE_Z>();
    if (FORMAT_Y == FORMAT_ND) {
        ...
    }
}

Implementing Operator Whose Output Shape Depends on Computation on the Kernel

For operators such as NonZero (whose result depends on the number of non-zero values in a tensor), the output shape cannot be obtained until the computation is complete. When defining the prototype of this type of operator, use the OutputShapeDependOnCompute API to mark the operator, and write the actual output shape to the designated output parameter in the kernel function so that the framework can manage the output memory based on this information.

Append an output parameter of the GM_ADDR type after all existing outputs of the kernel function. After the kernel computation completes, write the output shape information to this parameter in the following format: the buffer contains n * (8 + 1) elements of type uint64_t, where n is the number of outputs whose shape needs to be updated. For each such output, the first element stores the actual number of dimensions (dim), and the subsequent eight elements store the size of each dimension.

  • The output sequence is the same as that defined in the prototype.
  • For outputs whose tensor data type is uint64_t, the flag 0x0000000010000000 must be ORed into the dim value, indicating that the tensor is parsed as the uint64_t type.
  • In the following example, an output of the operator is computed. The data type of the output tensor is uint32_t. After the computation is complete, the output shape is (32, 64). The output parameter shape_out is used to store the shape information and its value is (2, 32, 64). A code example is as follows:
    extern "C" __global__ __aicore__ void xxx_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR shape_out, GM_ADDR workspace, GM_ADDR tiling) {
    ...
        constexpr uint32_t SHAPEOUT_SIZE = 9;
        // The output data is 2-dimensional ([32, 64]), and the tensor type is uint32_t.
        GlobalTensor<uint64_t> shapeoutGlobal_uint32;
        shapeoutGlobal_uint32.SetGlobalBuffer((__gm__ uint64_t*)shape_out, SHAPEOUT_SIZE);
        shapeoutGlobal_uint32.SetValue(0, 2);
        shapeoutGlobal_uint32.SetValue(1, 32);
        shapeoutGlobal_uint32.SetValue(2, 64);
    ...
    }
    
  • In the following example, one output of the operator depends on computation. The data type of the output tensor is uint64_t. After the computation is complete, the output shape is (1, 64, 128, 128). The output parameter shape_out is used to store the shape information and its value is (0x0000000010000000 | 4, 1, 64, 128, 128). A code example is as follows:
    extern "C" __global__ __aicore__ void xxx_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR shape_out, GM_ADDR workspace, GM_ADDR tiling) {
    ...
        constexpr uint32_t SHAPEOUT_SIZE = 9;
        // The output data is 4-dimensional ([1, 64, 128, 128]) and the tensor type is uint64_t.
        GlobalTensor<uint64_t> shapeoutGlobal_uint64;
        shapeoutGlobal_uint64.SetGlobalBuffer((__gm__ uint64_t*)shape_out, SHAPEOUT_SIZE);
        shapeoutGlobal_uint64.SetValue(0, 0x0000000010000000 | 4);
        shapeoutGlobal_uint64.SetValue(1, 1);
        shapeoutGlobal_uint64.SetValue(2, 64);
        shapeoutGlobal_uint64.SetValue(3, 128);
        shapeoutGlobal_uint64.SetValue(4, 128);
    ...
    }
    
  • In the following example, two outputs of the operator depend on computation. The data type of both output tensors is uint64_t. After the computation is complete, the output shapes are (16, 32) and (1, 16, 16, 32). The output parameter shape_out is used to store the shape information. A code example is as follows:
    extern "C" __global__ __aicore__ void xxx_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR shape_out, GM_ADDR workspace, GM_ADDR tiling) {
        ...
        // The shape of two outputs needs to be updated, a two-dimensional [16, 32] and a four-dimensional [1, 16, 16, 32].
        // The tensor type is uint64_t.
        constexpr uint32_t SHAPEOUT_SIZE_2 = 18;
        GlobalTensor<uint64_t> shapeoutGlobal_uint64_2;
        shapeoutGlobal_uint64_2.SetGlobalBuffer((__gm__ uint64_t*)shape_out, SHAPEOUT_SIZE_2);
        shapeoutGlobal_uint64_2.SetValue(0, 0x0000000010000000 | 2);
        shapeoutGlobal_uint64_2.SetValue(1, 16);
        shapeoutGlobal_uint64_2.SetValue(2, 32);
        // index[3] to index[8] are placeholders.
        shapeoutGlobal_uint64_2.SetValue(9, 0x0000000010000000 | 4);
        shapeoutGlobal_uint64_2.SetValue(10, 1);
        shapeoutGlobal_uint64_2.SetValue(11, 16);
        shapeoutGlobal_uint64_2.SetValue(12, 16);
        shapeoutGlobal_uint64_2.SetValue(13, 32);
        ...
    }