Tiling Implementation on the Host

Operator Implementation has described the tiling implementation on the host. This section focuses on the programming mode and API usage when the CANN framework is used.

In most cases, the local memory cannot completely store the operator input and output. Therefore, you need to move some inputs for computation multiple times until the complete final result is obtained. This process of data tiling and block computation is called tiling. Tiling implementation is a compute program that determines parameters related to the data tiling algorithm (such as the block size to be moved each time and the total number of cycles) based on information such as the shape of an operator.

After tiling is implemented, the obtained parameters related to the tiling algorithm are passed to the kernel to guide parallel data tiling. The AI Core is not good at Scalar computation in the tiling implementation. Therefore, the AI Core is executed on the host CPU independently.

Figure 1 Input and output of tiling implementation

As shown in the preceding figure, tiling implementation is a process of determining parameters related to the tiling algorithm based on information such as the operator shape. The information such as the operator shape can be considered as the input of tiling implementation, and the parameters related to the tiling algorithm can be considered as the output of tiling implementation. The input and output are carried by the parameters (TilingContext* context structure) of the tiling function. You can obtain the input of tiling implementation: input, output, and attribute information of the operator from the context structure. After tiling, you can obtain the output of the tiling implementation: TilingData structure (parameters related to the tiling algorithm), blockDim variable, TilingKey that is used to select different kernel implementation branches, and operator workspace, and then set the output to the context structure.

The concepts of TilingData, blockDim, TilingKey and workspace are described as follows:

  • TilingData: parameter related to the tiling algorithm, such as the block size passed each time and the total number of cycles. This parameter is stored in a structure and designed by developers.

    The TilingData structure can be defined in a single structure or nested structures.

    • Single-structure definition method, which is defined in tile mode.
      1
      2
      3
      4
      5
      6
      7
      8
      9
      namespace optiling {
      BEGIN_TILING_DATA_DEF(MyAddTilingData)  // Declare the name of the tiling structure.
        TILING_DATA_FIELD_DEF(uint32_t, field1);   // Types and names of structure members.
        TILING_DATA_FIELD_DEF(uint32_t, field2);
        TILING_DATA_FIELD_DEF(uint32_t, field3);
      END_TILING_DATA_DEF;
       
      REGISTER_TILING_DATA_CLASS(MyAdd, MyAddTilingData)  // Register the tiling structure with the operator.
      }
      

      The method of assigning values to tiling structure members in the tiling implementation function is as follows:

      1
      2
      3
      MyAddTilingData myTiling;
      myTiling.set_field1(1);
      myTiling.set_field2(2);
      
    • Structure nesting is supported.
       1
       2
       3
       4
       5
       6
       7
       8
       9
      10
      11
      12
      13
      14
      15
      16
      17
      18
      19
      namespace optiling {
      BEGIN_TILING_DATA_DEF(MyStruct1)  // Declare the name of structure 1.
        TILING_DATA_FIELD_DEF(uint32_t, field1);   // Types and names of structure members.
        TILING_DATA_FIELD_DEF(uint32_t, field2);   // Types and names of structure members.
      END_TILING_DATA_DEF; 
      REGISTER_TILING_DATA_CLASS(MyStruct1Op, MyStruct1) // Register the structure with <op_type>Op.
      
      BEGIN_TILING_DATA_DEF(MyStruct2)  // Declare the name of structure 2.
        TILING_DATA_FIELD_DEF(uint32_t, field3);   // Types and names of structure members.
        TILING_DATA_FIELD_DEF(uint32_t, field4);   // Types and names of structure members.
      END_TILING_DATA_DEF;
      REGISTER_TILING_DATA_CLASS(MyStruct2Op, MyStruct2) // Register the structure with <op_type>Op.
      
      BEGIN_TILING_DATA_DEF(MyAddTilingData)  // Declare the name of the tiling structure.
        TILING_DATA_FIELD_DEF_STRUCT(MyStruct1, st1);   // Reference structure of the structure member.
        TILING_DATA_FIELD_DEF_STRUCT(MyStruct2, st2);   // Reference structure of the structure member.
      END_TILING_DATA_DEF;
      REGISTER_TILING_DATA_CLASS(MyAdd, MyAddTilingData)  // Register the tiling structure with the operator.
      }
      

      The method of assigning values to tiling structure members in the tiling implementation function is as follows:

      1
      2
      3
      4
      5
      MyAddTilingData myTiling;
      myTiling.st1.set_field1(1);
      myTiling.st1.set_field2(2);
      myTiling.st2.set_field3(3);
      myTiling.st2.set_field4(4);
      
  • blockDim: number of cores on which the kernel function will be executed. For example, if 8 MB data needs to be computed with 1 MB data on each core, blockDim is set to 8. However, to make full use of hardware resources, blockDim is generally set to the number of cores of the hardware platform, and data is tiled based on the number of cores.

    blockDim is a concept about logical cores, and its value range is [1, 65535]. To fully utilize hardware resources, set this parameter to the number of physical cores or a multiple of the number of physical cores. For the coupled architecture and separated architecture, the meaning and setting rules of blockDim during running are different. The details are as follows:

    • Coupled architecture: Because the Vector and Cube Units are integrated, blockDim is used to start multiple AI Core instances, without differentiating Vector Units and Cube Units. The number of AI Cores can be obtained by calling GetCoreNumAiv or GetCoreNumAic.
    • Separated architecture
      • For operators that contain only Vector Units, blockDim is used to set the number of vector (AIV) instances to be started. For example, if an AI processor has 40 Vector cores, set blockDim to 40.
      • For operators that contain only Cube Units, blockDim is used to set the number of cube (AIC) instances to be started. For example, if an AI processor has 20 Cube cores, set blockDim to 20.
      • Operators for Vector/Cube fusion computing are started by groups of AIVs and AICs. blockDim is used to set the number of groups to be started. For example, if an AI processor has 40 Vector cores and 20 Cube cores, a group consists of two Vector cores and one Cube core. Set the number of groups to 20. In this case, 20 groups are started, including 40 Vector cores and 20 Cube cores. Note: In this scenario, the number of blockDim (logical cores) cannot exceed the number of physical cores (a physical core contains two Vector cores and one Cube core).
      • The number of AIC and AIV cores can be obtained by calling GetCoreNumAic and GetCoreNumAiv, respectively.
  • TilingKey (optional): TilingKey is a method used within an operator to distinguish different implementations by tiling the kernel code. This method helps optimize the performance of a single kernel call and reduces unnecessary instruction cache (icache) misses and scalar processing time, similar to the template mechanism in C++. Different kernel implementation branches can be identified by TilingKey. After the TilingKey is set on the host, choose the corresponding selection. For example, an operator has different algorithm logic in different shapes. The kernel can use the tiling key to select different algorithm logic. However, the tiling algorithm on the host can be different. As a result, the host and kernel can use the same tiling key for association.

    Assume that the following kernel code exists:

    1
    2
    3
    4
    5
    if (condition) {
      ProcessA();
    } else {
      ProcessB();
    }
    

    If ProcessA and ProcessB are large functions, the preceding code will be larger after compilation. However, only one selection is chosen each time the kernel runs. The condition judgment and jump are as large as 16 KB to 32 KB, which varies according to chips. Then, iCache miss occurs. You can use TilingKey to set different TilingKey 1 and TilingKey 2 for the processing functions of the two kernels.

    1
    2
    3
    4
    5
    if (TILING_KEY_IS(1)) {
      ProcessA();
    } else if (TILING_KEY_IS(2)) {
      ProcessB();
    }
    

    In this way, the device kernel automatically identifies two TilingKeys during compilation, compiles two kernel entry functions, and folds constants for condition judgment. In addition, this function needs to work with the host tiling function to set TilingKey to 1 for the scenario where ProcessA is used and set TilingKey to 2 for the scenario where ProcessB is used.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    static ge::graphStatus TilingFunc(gert::TilingContext* context)
    {
        // some code
        if (condition) {
            context->SetTilingKey(1);
        } else {
            context->SetTilingKey(2);
        }
        return ge::GRAPH_SUCCESS;
    }
    

    You can set --tiling_keys to specify TilingKey, and compile the kernel code related to the TilingKey, to accelerate compilation.

  • WorkspaceSize (optional): A workspace is a memory block on the global memory on the device. You can set the workspace size in the Tiling function. The framework allocates the global memory of the corresponding size on the device for the workspace. The workspace memory can be used when the operator is implemented on the kernel.

    The workspace memory consists of two parts: workspace memory required by Ascend C APIs and workspace memory used by operator implementation (on-demand).

    • Workspace memory needs to be reserved for Ascend C APIs.

      During API compute, some workspace memory is required as the cache. Therefore, the operator tiling function needs to reserve workspace memory for the API. The reserved memory size can be obtained by calling GetLibApiWorkSpaceSize.

    • Workspace memory used for operator implementation (on-demand)

      The memory needs to be allocated only when extra device memory is required for data exchange or caching in the operator. The memory is allocated based on the space computed by the operator.

    The overall workspace memory is the sum of the preceding two parts. To set the workspace memory in the tiling function, perform the following steps:

    1
    2
    auto workspaceSizes = context->GetWorkspaceSizes(1); // Use only one workspace.
    workspaceSizes[0] = sysWorkspaceSize + usrWorkspaceSize;
    

Basic Process of Tiling

The following figure shows the development process of Tiling.

Figure 2 Tiling development process

The following uses a simple Add operator as an example to describe the implementation process of tiling. In this sample, the shape size of the data to be processed can be evenly allocated to each core and aligned to the size of a data block (32 bytes).

Compile the header file for defining the TilingData structure of the operator. The file is named operator name_tiling.h and stored in the op_host directory of the operator project. The sample code is as follows:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
#ifndef ADD_CUSTOM_TILING_H
#define ADD_CUSTOM_TILING_H
#include "register/tilingdata_base.h"

namespace optiling {
BEGIN_TILING_DATA_DEF(TilingData)               // Register a tiling class and uses the tiling name as the argument.
  TILING_DATA_FIELD_DEF(uint32_t, totalLength); // Add the tiling field to compute the total data volume.
  TILING_DATA_FIELD_DEF(uint32_t, tileNum);     // Add the tiling field that specifies the total number of data blocks to be computed on each core.
END_TILING_DATA_DEF;
// Register the operator TilingData class with the corresponding AddCustom operator.
REGISTER_TILING_DATA_CLASS(AddCustom, TilingData)
}
#endif // ADD_CUSTOM_TILING_H

The procedure is as follows:

  1. Compile the code framework, add the judgment condition of #ifndef... to prevent repeated inclusion of header files, and include the register/tilingdata_base.h header file. Multiple macros for TilingData registration are defined in tilingdata_base.h. The sample code is as follows:
    1
    2
    3
    4
    5
    6
    7
    8
    9
    #ifndef ADD_CUSTOM_TILING_H
    #define ADD_CUSTOM_TILING_H
    #include "register/tilingdata_base.h"
    
    namespace optiling {
    // Tiling structure definition and registration code
    // ...
    }
    #endif // ADD_CUSTOM_TILING_H
    
  2. Design TilingData parameters, which are essentially related to parallel data tiling. In this example, the operator uses two tiling parameters: totalLength and tileNum. totalLength indicates the amount of data to be computed, and tileNum indicates the total number of computed data blocks on each core. For example, after totalLength is passed to the kernel, the compute amount of each core can be obtained by dividing totalLength by the number of cores involved in the compute. In this way, multi-core data is tiled.
  3. Define the TilingData structure. Define a TilingData class by calling BEGIN_TILING_DATA_DEF, add the totalLength and tileNum fields by calling TILING_DATA_FIELD_DEF, and end the TilingData definition by calling END_TILING_DATA_DEF. For details about the APIs, see TilingData Structure Definition.
    1
    2
    3
    4
    BEGIN_TILING_DATA_DEF(TilingData)               // Register a tiling class and uses the tiling name as the argument.
      TILING_DATA_FIELD_DEF(uint32_t, totalLength); // Add the tiling field to compute the total data volume.
      TILING_DATA_FIELD_DEF(uint32_t, tileNum);     // Add the tiling field that specifies the total number of data blocks to be computed on each core.
    END_TILING_DATA_DEF;
    
  4. Register the TilingData structure, call REGISTER_TILING_DATA_CLASS to register the TilingData class, and associate the TilingData class with custom operator. The first parameter of REGISTER_TILING_DATA_CLASS is op_type (operator type). In this example, AddCustom is passed in, and the second parameter is the class name of TilingData. For details about the REGISTER_TILING_DATA_CLASS API, see TilingData Structure Registration.
    1
    2
    // Register the operator TilingData class with the corresponding AddCustom operator.
    REGISTER_TILING_DATA_CLASS(AddCustom, TilingData)
    

Implement the Tiling function in the .cpp file of the operator host. The file is named operator name.cpp and stored in the op_host directory of the operator project. The sample code is as follows:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
namespace optiling {
const uint32_t BLOCK_DIM = 8;
const uint32_t TILE_NUM = 8;
static ge::graphStatus TilingFunc(gert::TilingContext *context)
{
    TilingData tiling;
    uint32_t totalLength = context->GetInputShape(0)->GetOriginShape().GetShapeSize();
    context->SetBlockDim(BLOCK_DIM);
    tiling.set_totalLength(totalLength);
    tiling.set_tileNum(TILE_NUM);
    tiling.SaveToBuffer(context->GetRawTilingData()->GetData(), context->GetRawTilingData()->GetCapacity());
    context->GetRawTilingData()->SetDataSize(tiling.GetDataSize());
    size_t *currentWorkspace = context->GetWorkspaceSizes(1);
    currentWorkspace[0] = 0;
    return ge::GRAPH_SUCCESS;
}
} // namespace optiling

The procedure is as follows:

  1. Obtain TilingContext, that is, the argument gert::TilingContext* context of the Tiling function.
  2. Set TilingData. In the preceding 3, the TilingData class is defined. In this case, you can use the TilingData class to define a specific instance and call the set_+field_name API of the TilingData class to set the field value of the TilingData class, the SaveToBuffer API of the TilingData class is called to serialize and save TilingData.
    1. Obtain the input and output shape information based on the context. In this sample, the GetInputShape API of TilingContext is used to obtain the input shape size.
      1
      2
      // Obtain the input shape information.
      uint32_t totalLength = context->GetInputShape(0)->GetOriginShape().GetShapeSize();
      
    2. Set TilingData. Call the set_+field_name method to set the field value of TilingData.
      1
      2
      3
      4
      5
      // Use TilingData to define a specific instance.
      TilingData tiling;
      // Set TilingData.
      tiling.set_totalLength(totalLength);
      tiling.set_tileNum(TILE_NUM);
      
    3. Call SaveToBuffer of the TilingData class to complete serialization and save the serialization to TilingContext. The first parameter of SaveToBuffer is the start address of the storage buffer, and the second parameter is the length of the buffer. Call GetRawTilingData to obtain the address of the typeless TilingData, and then call GetData to obtain the data pointer as the start address of the buffer. Call GetRawTilingData to obtain the address of the typeless TilingData, and then call GetCapacity to obtain the length of the TilingData as the length of the buffer. After the SaveToBuffer operation is complete, you need to set the length of TilingData by calling SetDataSize. The length is obtained by calling GetDataSize of the TilingData class.
      1
      2
      3
      // Serialize and save the file.
      tiling.SaveToBuffer(context->GetRawTilingData()->GetData(), context->GetRawTilingData()->GetCapacity());
      context->GetRawTilingData()->SetDataSize(tiling.GetDataSize());
      
  3. Call SetBlockDim to set blockDim.
    1
    context->SetBlockDim(BLOCK_DIM);
    
  4. (Optional) Call SetTilingKey to set TilingKey.
    1
    context->SetTilingKey(1);
    
  5. (Optional) Call GetWorkspaceSizes to obtain the workspace size pointer and set the size. This is only an example. Set the size of workspace to 0.
    1
    2
    size_t *currentWorkspace = context->GetWorkspaceSizes(1);
    currentWorkspace[0] = 0;
    

More Examples of Tiling Parameter Design - Passing Attribute Information Through TilingData

If an operator contains attribute information, the attribute information can be passed to the kernel through TilingData to participate in the compute of the operator kernel function in the kernel. The ReduceMaxCustom operator is used as an example. This operator is used to return the maximum value of the input data by dim and return the index. The ReduceMaxCustom operator has two attributes: reduceDim and isKeepDim. reduceDim indicates the dimension based on which the reduce operation is performed. isKeepDim indicates whether to keep the output dimension the same as the input dimension. In this sample, the reduce operation can be performed only on the last dimension, and the input data type is half.

  1. The definition of TilingData of the ReduceMaxCustom operator is as follows (reduceAxisLen is the focus): The reduceAxisLen parameter indicates the length of the reduceDim axis, that is, the length of the last dimension. This parameter will be passed to the kernel through TilingData for computation.
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    #ifndef REDUCE_MAX_CUSTOM_TILING_H
    #define REDUCE_MAX_CUSTOM_TILING_H
    #include "register/tilingdata_base.h"
    namespace optiling {
    BEGIN_TILING_DATA_DEF(ReduceMaxTilingData)
      TILING_DATA_FIELD_DEF(uint32_t, reduceAxisLen); // Add the tiling field and reduce the length of the dim axis.
    // Definitions of other TilingData parameters.
      ...
    END_TILING_DATA_DEF;
    // Register the operator TilingData class with the corresponding ReduceMaxCustom operator.
    REGISTER_TILING_DATA_CLASS(ReduceMaxCustom, ReduceMaxTilingData)
    }
    #endif // REDUCE_MAX_CUSTOM_TILING_H
    
  2. Tiling of the ReduceMaxCustom operator is implemented as follows: The process of passing attribute information through TilingData is as follows: Obtain the reduceDim attribute value from attr through TilingContext, obtain the reduceDim axis length based on the reduceDim attribute value, and set the length to TilingData.
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    namespace optiling {
    static ge::graphStatus TilingFunc(gert::TilingContext* context)
    {
        ReduceMaxTilingData tiling;
        // Obtain the reduceDim attribute value from attr. Because reduceDim is the first attribute, the index value passed by GetAttrPointer is 0.
        const gert::RuntimeAttrs* attrs = context->GetAttrs();
        const uint32_t* reduceDim = attrs->GetAttrPointer<uint32_t>(0);
        // Obtain the length of the reduceDim axis.
        const gert::StorageShape* xShapePtr = context->GetInputShape(0);
        const gert::Shape& xShape = xShapePtr->GetStorageShape();
        const uint32_t reduceAxisLen = xShape.GetDim(*reduceDim);
        // Compute the values of member variables except reduceAxisLen in TilingData.
        ...
        // Set reduceAxisLen to the tiling structure and pass it to the kernel function.
        tiling.set_reduceAxisLen(reduceAxisLen);
        // Set the values of member variables except reduceAxisLen in TilingData.
        ...
        // Serialize and save TilingData.
        tiling.SaveToBuffer(context->GetRawTilingData()->GetData(), context->GetRawTilingData()->GetCapacity());
        context->GetRawTilingData()->SetDataSize(tiling.GetDataSize());
        ...
        return ge::GRAPH_SUCCESS;
    }} // namespace optiling
    

More Examples of Tiling Parameter Design - Matching Tiling When Using High-Level APIs

  1. Define the tiling structure.
    1
    2
    3
    4
    5
    6
    7
    namespace optiling {
    BEGIN_TILING_DATA_DEF(MyAddTilingData)  // Declare the name of the tiling structure.
      TILING_DATA_FIELD_DEF_STRUCT(TCubeTiling, cubeTilingData);   // References of the tiling structure of the high-level API.
      TILING_DATA_FIELD_DEF(uint32_t, field);   // Reference structure of the structure member.
    END_TILING_DATA_DEF;
    REGISTER_TILING_DATA_CLASS(MyAdd, MyAddTilingData)  // Register the tiling structure with the operator.
    }
    
  2. Use the tiling function of the high-level API to initialize the tiling structure.
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    static ge::graphStatus TilingFunc(gert::TilingContext* context) {
        int32_t M = 1024;
        int32_t N = 640;
        int32_t K = 256;
        int32_t baseM = 128;
        int32_t baseN = 128;
        auto ascendcPlatform = platform_ascendc::PlatformAscendC(context->GetPlatformInfo());
        MultiCoreMatmulTiling cubeTiling(ascendcPlatform);
        cubeTiling.SetDim(2);
        cubeTiling.SetAType(TPosition::GM, CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT16);
        cubeTiling.SetBType(TPosition::GM, CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT16);
        cubeTiling.SetCType(TPosition::LCM, CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT);
        cubeTiling.SetBiasType(TPosition::GM, CubeFormat::ND, matmul_tiling::DataType::DT_FLOAT);
        cubeTiling.SetShape(M, N, K);
        cubeTiling.SetOrgShape(M, N, K);
        cubeTiling.SetFixSplit(baseM, baseN, -1);
        cubeTiling.SetBias(true);
        cubeTiling.SetBufferSpace(-1, -1, -1);
        MyAddTilingData tiling;
        if (cubeTiling.GetTiling(tiling.cubeTilingData) == -1){
            return ge::GRAPH_FAILED;
        }
        // some code
    }
    

Using the Standard C++ Syntax to Define TilingData

The TilingData structure can be defined using the standard C++ syntax. Compared with macros such as BEGIN_TILING_DATA_DEF, this method is more flexible and complies with the development habits of developers. Features such as structure arrays, custom TilingData assignment functions, and structures with the same name are supported. The procedure is as follows:

  1. Use the C++ syntax to define the TilingData structure.

    The header file of the structure definition must be stored in the op_kernel directory of the operator project.

    1
    2
    3
    4
    5
    6
    7
    8
    9
    #ifndef ADD_CUSTOM_TILING_H
    #define ADD_CUSTOM_TILING_H
    #include <cstdint>
    class TilingData{
    public:
      uint32_t totalLength;
      uint32_t tileNum;
    };
    #endif // ADD_CUSTOM_TILING_H
    
  2. Assign values to TilingData in the tiling function on the host.
    • Contain the TilingData definition header file.
    • Obtain TilingData through GetTilingData and assign values to its member variables.
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    #include "../op_kernel/add_custom_tiling.h" // Contain the header file of the TilingData definition.
    namespace optiling {
    const uint32_t BLOCK_DIM = 8;
    const uint32_t TILE_NUM = 8;
    static ge::graphStatus TilingFunc(gert::TilingContext* context)
    {
        TilingData *tiling = context->GetTilingData<TilingData>(); // Obtain the TilingData structure.
        uint32_t totalLength = context->GetInputTensor(0)->GetShapeSize();
        context->SetBlockDim(BLOCK_DIM);
        tiling->totalLength = totalLength;  // Assign values to the member variables of TilingData.
        tiling->tileNum = TILE_NUM;         // Assign values to the member variables of TilingData.
        size_t *currentWorkspace = context->GetWorkspaceSizes(1);
        currentWorkspace[0] = 0;
        return ge::GRAPH_SUCCESS;
    }
    } // namespace optiling
    
  3. Use REGISTER_TILING_DEFAULT or REGISTER_TILING_FOR_TILINGKEY on the kernel to register the TilingData structure and parse the tiling data. REGISTER_TILING_DEFAULT is also used to identify the TilingData structure defined using the standard C++ syntax.

    Register the TilingData structure to notify the framework user to use the standard C++ syntax for defining TilingData, and also notify the framework of the TilingData structure type for parsing tiling data.

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    #include "kernel_operator.h"
    #include "add_custom_tiling.h"  // Contain the header file of the TilingData definition.
    extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR workspace, GM_ADDR tiling)
    {
        REGISTER_TILING_DEFAULT(TilingData);    // Register the TilingData structure by default.
        GET_TILING_DATA(tilingData, tiling);    // Parse the tiling data on the kernel and assigns values to the TilingData structure.
        KernelAdd op;
        op.Init(x, y, z, tilingData.totalLength, tilingData.tileNum);
        ......
    }
    

When the standard C++ syntax is used to define TilingData, the restrictions are as follows:

  • The TilingData member variables do not support the pointer and reference types.
    1
    2
    3
    4
    5
    class TilingData {
    public:
      uint32_t* totalLength; // The pointer type is not supported. The host cannot pass the pointer to the device.
      uint32_t& tileNum;       // The reference type is not supported. The host cannot pass the pointer to the device.
    };
    
  • The TilingData structure does not support dynamic class scheduling and the template class. For example, a base class cannot be converted into a derived class.
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    class A {
    public:
      uint32_t totalLength;
      uint32_t tileNum;
    };
    class B: public A {
    public:
      uint32_t xxx;
      uint32_t xxx;
    };
    static ge::graphStatus TilingFunc(gert::TilingContext* context)
    {
        // Incorrect use
        B *tiling = context->GetTilingData<A>(); // Not supported. Unknown issues will be triggered.
        // Correct use
        B *tiling = context->GetTilingData<B>();
        ......
        return ge::GRAPH_SUCCESS;
    }
    
  • The TilingData obtained by GetTilingData does not contain the initial value. You need to explicitly assign a value or define and call the assignment function in the TilingData class.
    1
    2
    3
    4
    5
    6
    7
    8
    9
    static ge::graphStatus TilingFunc(gert::TilingContext* context)
    {
        TilingData *tiling = context->GetTilingData<TilingData>(); // Obtain the TilingData structure. In this case, the values of totalLength and tileNum are 0, without the initial values.
        ......
        // Display values.
        tiling->totalLength = totalLength;  // Assign values to the member variables of TilingData.
        tiling->tileNum = TILE_NUM;         // Assign values to the member variables of TilingData.
        ......
        return ge::GRAPH_SUCCESS;
    

Tiling Template Programming

As described in section "TilingKey programming", TilingKey is difficult to remember and understand because it is usually a long number without clear meaning.

If multiple TilingKeys are involved, developers rely on the TilingKeys to manage the kernel implementation, which is complex in both management and usage. To simplify this process, use template programming instead of traditional TilingKey programming to reduce the dependency on TilingKey values and achieve more intuitive and efficient kernel management. The procedure is as follows:

  1. In the op_kernel folder of the custom operator project, add a header file that contains the template argument declaration and template argument selection. In this example, the header file is named tiling_key_add_custom.h.
    • Include the template header file ascendc/host_api/tiling/template_argument.h.
    • Define the template argument declaration ASCENDC_TPL_ARGS_DECL and template argument selection ASCENDC_TPL_ARGS_SEL (available template). For details about the API, see Template Argument Definition.
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    #include "ascendc/host_api/tiling/template_argument.h"
    
    #define ADD_TPL_FP16 1 // Data type definition
    #define ADD_TPL_FP32 0
    
    #define ADD_TPL_ND 2 // Data format definition
    #define ADD_TPL_NZ 29
    
    // Template argument declaration
    ASCENDC_TPL_ARGS_DECL(AddTemplateCustom, // Operator type
    ASCENDC_TPL_DTYPE_DECL(D_T_X, ADD_TPL_FP16, ADD_TPL_FP32), // Template argument declaration of DataType: Data type of argument x. The value can be float16 or float32.
    ASCENDC_TPL_DTYPE_DECL(D_T_Y, ADD_TPL_FP16, ADD_TPL_FP32), // Template argument declaration of DataType: Data type of argument y. The value can be float16 or float32.
    ASCENDC_TPL_DTYPE_DECL(D_T_Z, ADD_TPL_FP16, ADD_TPL_FP32), // Template argument declaration of DataType: Data type of argument z. The value can be float16 or float32.
    ASCENDC_TPL_UINT_DECL(TILE_NUM, ASCENDC_TPL_8_BW, ASCENDC_TPL_UI_MIX, 2, 0, 2, 3, 5, 10, 12, 13, 9, 8),// Template argument declaration of the custom UINT type: The template arguments include the number of tiles, encoding bit width ASCENDC_TPL_8_BW ( 8 bits), which indicates that the number of template arguments does not exceed the 8-bit range, and ASCENDC_TPL_UI_MIX, which indicates that the value range is expressed in mixed mode. There are two groups of data {0–2} and {3–5} and exhaustive values 10, 12, 13, 9, and 8. The final result is {0, 1, 2, 3, 4, 5, 10, 12, 13, 9, 8}.
    ASCENDC_TPL_BOOL_DECL(IS_SPLIT, 0, 1), // Template argument declaration of the custom bool type: The template argument is a flag bit indicating whether to split. 1: split; 0: not split.
    );
    
    // Template argument selection
    // Used to check whether the TilingKey is valid when GET_TPL_TILING_KEY is called to obtain the TilingKey.
    ASCENDC_TPL_SEL(
        ASCENDC_TPL_ARGS_SEL(
        ASCENDC_TPL_DTYPE_SEL(D_T_X, ADD_TPL_FP16),
        ASCENDC_TPL_DTYPE_SEL(D_T_Y, ADD_TPL_FP16),
        ASCENDC_TPL_DTYPE_SEL(D_T_Z, ADD_TPL_FP16),
        ASCENDC_TPL_UINT_SEL(TILE_NUM, ASCENDC_TPL_UI_LIST, 1, 8),
        ASCENDC_TPL_BOOL_SEL(IS_SPLIT, 0, 1),
        ),
        ASCENDC_TPL_ARGS_SEL(
        ASCENDC_TPL_DTYPE_SEL(D_T_X, ADD_TPL_FP32),
        ASCENDC_TPL_DTYPE_SEL(D_T_Y, ADD_TPL_FP32),
        ASCENDC_TPL_DTYPE_SEL(D_T_Z, ADD_TPL_FP32),
        ASCENDC_TPL_UINT_SEL(TILE_NUM, ASCENDC_TPL_UI_LIST, 1, 8),
        ASCENDC_TPL_BOOL_SEL(IS_SPLIT, 0, 1),
        ),
    );
    
  2. Call the GET_TPL_TILING_KEY API on the host to generate a TilingKey.
    • In the host implementation file, include the header file that defines the template argument declaration and template argument selection in Step 1.
    • Call the GET_TPL_TILING_KEY API to generate a TilingKey. The input arguments of GET_TPL_TILING_KEY are the specific values of the template arguments, and must be in the same sequence as the template arguments in the header file that defines the template argument declaration and template argument selection.
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    #include "tiling_key_add_custom.h"
    static ge::graphStatus TilingFunc(gert::TilingContext *context)
    {
        TilingData tiling;
        uint32_t totalLength = context->GetInputShape(0)->GetOriginShape().GetShapeSize();
        ge::DataType dtype_x = context->GetInputDesc(0)->GetDataType();
        ge::DataType dtype_y = context->GetInputDesc(1)->GetDataType();
        ge::DataType dtype_z = context->GetOutputDesc(1)->GetDataType();
        uint32_t D_T_X = ADD_TPL_FP32, D_T_Y=ADD_TPL_FP32, D_T_Z=ADD_TPL_FP32, TILE_NUM=1, IS_SPLIT=0;
        if(dtype_x == ge::DataType::DT_FLOAT){
            D_T_X = ADD_TPL_FP32;
        }else if(dtype_x == ge::DataType::DT_FLOAT16){
            D_T_X = ADD_TPL_FP16;
        }
        if(dtype_y == ge::DataType::DT_FLOAT){
            D_T_Y = ADD_TPL_FP32;
        }else if(dtype_y == ge::DataType::DT_FLOAT16){
            D_T_Y = ADD_TPL_FP16;
        }
        if(dtype_z == ge::DataType::DT_FLOAT){
            D_T_Z = ADD_TPL_FP32;
        }else if(dtype_z == ge::DataType::DT_FLOAT16){
            D_T_Z = ADD_TPL_FP16;
        }
        if(totalLength< MIN_LENGTH_FOR_SPLIT){
            IS_SPLIT = 0;
            TILE_NUM = 1;
        }else{
            IS_SPLIT = 1;
            TILE_NUM = DEFAULT_TILE_NUM;
        }
        context->SetBlockDim(BLOCK_DIM);
        tiling.set_totalLength(totalLength);
        tiling.SaveToBuffer(context->GetRawTilingData()->GetData(), context->GetRawTilingData()->GetCapacity());
        context->GetRawTilingData()->SetDataSize(tiling.GetDataSize());
        const uint64_t tilingKey = GET_TPL_TILING_KEY(D_T_X, D_T_Y, D_T_Z, TILE_NUM, IS_SPLIT);
        context->SetTilingKey(tilingKey);
        size_t *currentWorkspace = context->GetWorkspaceSizes(1);
        currentWorkspace[0] = 0;
        return ge::GRAPH_SUCCESS;
    }
    
  3. Implement on the kernel.
    • In the host implementation file, include the header file that defines the template argument declaration and template argument selection in Step 1.
    • Add a template to the kernel function to support the passing of template arguments. These arguments must be in the same sequence as those in the header file that defines the template argument declaration and template argument selection.
    • Select different kernels based on the branch judgment of template arguments.
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    #include "tiling_key_add_custom.h"
    ...
    ...
    template<int D_T_X, int D_T_Y, int D_T_Z, int TILE_NUM, int IS_SPLIT>
     __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR workspace, GM_ADDR tiling)
    {
        GET_TILING_DATA(tiling_data, tiling);
        if(D_T_X == ADD_TPL_FP32 && D_T_Y == ADD_TPL_FP32 && D_T_Z == ADD_TPL_FP32){
            KernelAdd<float, float, float> op;
            op.Init(x, y, z, tiling_data.totalLength, TILE_NUM, IS_SPLIT);
            op.Process1();
        }else if(D_T_X == ADD_TPL_FP16 && D_T_Y == ADD_TPL_FP16 && D_T_Z == ADD_TPL_FP16){
            KernelAdd<half, half, half> op;
            if(IS_SPLIT == 0){
                op.Init(x, y, z, tiling_data.totalLength, TILE_NUM, IS_SPLIT);
                op.Process1();
            }else if(IS_SPLIT==1){
                op.Init(x, y, z, tiling_data.totalLength, TILE_NUM, IS_SPLIT);
                op.Process2();
            }
        }
    }
    

For details about the complete sample, see the tiling template programming sample.