Basic Process
SIMD Operator Implementation has described the tiling implementation on the host. This section focuses on the programming mode and API usage when the CANN framework is used.
In most cases, the local memory cannot completely store the operator input and output. Therefore, you need to move some inputs for computation multiple times until the complete final result is obtained. This process of data tiling and block computation is called tiling. Tiling implementation is a compute program that determines parameters related to the data tiling algorithm (such as the block size to be moved each time and the total number of cycles) based on information such as the shape of an operator.
After tiling is implemented, the obtained parameters related to the tiling algorithm are passed to the kernel to guide parallel data tiling. The AI Core is not good at scalar computation in the tiling implementation. Therefore, the AI Core is executed on the host CPU independently.
As shown in the preceding figure, tiling implementation is a process of determining parameters related to the tiling algorithm based on information such as the operator shape. The information such as the operator shape can be considered as the input of tiling implementation, and the parameters related to the tiling algorithm can be considered as the output of tiling implementation. The input and output are carried by the parameters (TilingContext* context structure) of the tiling function. You can obtain the input of tiling implementation: input, output, and attribute information of the operator from the context structure. After tiling, you can obtain the output of the tiling implementation: TilingData structure (parameters related to the tiling algorithm), blockDim variable, TilingKey that is used to select different kernel implementation branches, and operator workspace, and then set the output to the context structure.
The concepts of TilingData, blockDim, TilingKey and workspace are described as follows:
- TilingData: parameter related to the tiling algorithm, such as the block size passed each time and the total number of cycles. This parameter is stored in a structure and designed by developers.
The TilingData structure can be defined in a single structure or nested structures.
- Single-structure definition method, which is defined in tile mode.
1 2 3 4 5 6 7 8 9
namespace optiling { BEGIN_TILING_DATA_DEF(MyAddTilingData) // Declare the name of the tiling structure. TILING_DATA_FIELD_DEF(uint32_t, field1); // Types and names of structure members. TILING_DATA_FIELD_DEF(uint32_t, field2); TILING_DATA_FIELD_DEF(uint32_t, field3); END_TILING_DATA_DEF; REGISTER_TILING_DATA_CLASS(MyAdd, MyAddTilingData) // Register the tiling structure with the operator. }
The method of assigning values to tiling structure members in the tiling implementation function is as follows:
1 2 3
MyAddTilingData myTiling; myTiling.set_field1(1); myTiling.set_field2(2);
- Structure nesting is supported.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
namespace optiling { BEGIN_TILING_DATA_DEF(MyStruct1) // Declare the name of structure 1. TILING_DATA_FIELD_DEF(uint32_t, field1); // Types and names of structure members. TILING_DATA_FIELD_DEF(uint32_t, field2); // Types and names of structure members. END_TILING_DATA_DEF; REGISTER_TILING_DATA_CLASS(MyStruct1Op, MyStruct1) // Register the structure with <op_type>Op. BEGIN_TILING_DATA_DEF(MyStruct2) // Declare the name of structure 2. TILING_DATA_FIELD_DEF(uint32_t, field3); // Types and names of structure members. TILING_DATA_FIELD_DEF(uint32_t, field4); // Types and names of structure members. END_TILING_DATA_DEF; REGISTER_TILING_DATA_CLASS(MyStruct2Op, MyStruct2) // Register the structure with <op_type>Op. BEGIN_TILING_DATA_DEF(MyAddTilingData) // Declare the name of the tiling structure. TILING_DATA_FIELD_DEF_STRUCT(MyStruct1, st1); // Reference structure of the structure member. TILING_DATA_FIELD_DEF_STRUCT(MyStruct2, st2); // Reference structure of the structure member. END_TILING_DATA_DEF; REGISTER_TILING_DATA_CLASS(MyAdd, MyAddTilingData) // Register the tiling structure with the operator. }
The method of assigning values to tiling structure members in the tiling implementation function is as follows:
1 2 3 4 5
MyAddTilingData myTiling; myTiling.st1.set_field1(1); myTiling.st1.set_field2(2); myTiling.st2.set_field3(3); myTiling.st2.set_field4(4);
- Single-structure definition method, which is defined in tile mode.
- blockDim: number of cores on which the kernel function will be executed. For example, if 8 MB data needs to be computed with 1 MB data on each core, blockDim is set to 8. However, to make full use of hardware resources, blockDim is generally set to the number of cores of the hardware platform, and data is tiled based on the number of cores.
blockDim is a concept about logical cores, and its value range is [1, 65535]. To fully utilize hardware resources, set this parameter to the number of physical cores or a multiple of the number of physical cores.
- In coupling mode and separation mode, the meaning and setting rules of blockDim at runtime are different. The details are as follows:
- Coupling mode: Because the Vector and Cube Units are integrated, blockDim is used to start multiple AI Core instances, without distinguishing between these units. The number of AI Cores can be obtained by calling GetCoreNumAiv or GetCoreNumAic.
- Separation mode
- For operators that contain only Vector Units, blockDim is used to set the number of vector (AIV) instances to be started. For example, if an AI processor has 40 vector cores, set blockDim to 40.
- For operators that contain only Cube Units, blockDim is used to set the number of cube (AIC) instances to be started. For example, if an AI processor has 20 Cube cores, set blockDim to 20.
- Operators for Vector/Cube fusion computing are started by groups of AIVs and AICs. blockDim is used to set the number of groups to be started. For example, if an AI processor has 40 Vector cores and 20 Cube cores, a group consists of two Vector cores and one Cube core. Set the number of groups to 20. In this case, 20 groups are started, including 40 Vector cores and 20 Cube cores. Note: In this scenario, the number of blockDim (logical cores) cannot exceed the number of physical cores (a physical core contains two Vector cores and one Cube core).
- The number of AIC and AIV cores can be obtained by calling GetCoreNumAic and GetCoreNumAiv, respectively.
- If the device resource limit feature is used, the value of blockDim set for the operator cannot exceed the number of cores returned by an API (like GetCoreNum, GetCoreNumAic, or GetCoreNumAiv) from PlatformAscendC. For example, if aclrtSetStreamResLimit is used to set the number of stream-level vector cores to 8, the return value of GetCoreNumAiv is 8. The value of blockDim set for the vector operator cannot exceed 8. Otherwise, resources of other streams are preempted, causing the resource limit to become invalid.
- In coupling mode and separation mode, the meaning and setting rules of blockDim at runtime are different. The details are as follows:
- TilingKey (optional): TilingKey is a method used within an operator to distinguish different implementations by tiling the kernel code. This method helps optimize the performance of a single kernel call and reduces unnecessary instruction cache (icache) misses and scalar processing time, similar to the template mechanism in C++. Different kernel implementation branches can be identified by TilingKey. After the TilingKey is set on the host, choose the corresponding selection. For example, an operator has different algorithm logics in different shapes. The kernel can use the tiling key to select a specific algorithm logic. However, the tiling algorithms on the host can be different, so the host and kernel can use the same tiling key for association.
Assume that the following kernel code exists:
1 2 3 4 5
if (condition) { ProcessA(); } else { ProcessB(); }
If ProcessA and ProcessB are large functions, the preceding code will be larger after compilation. However, only one selection is chosen each time the kernel runs. The condition judgment and jump are as large as 16 KB to 32 KB, which varies according to chips. Then, iCache miss occurs. You can use TilingKey to set different TilingKey 1 and TilingKey 2 for the processing functions of the two kernels.
1 2 3 4 5
if (TILING_KEY_IS(1)) { ProcessA(); } else if (TILING_KEY_IS(2)) { ProcessB(); }
In this way, the device kernel automatically identifies two TilingKeys during compilation, compiles two kernel entrypoint functions, and folds constants for condition judgment. In addition, this function needs to work with the host tiling function to set TilingKey to 1 for the scenario where ProcessA is used and set TilingKey to 2 for the scenario where ProcessB is used.
1 2 3 4 5 6 7 8 9 10
static ge::graphStatus TilingFunc(gert::TilingContext* context) { // some code if (condition) { context->SetTilingKey(1); } else { context->SetTilingKey(2); } return ge::GRAPH_SUCCESS; }
You can set --tiling_key to specify TilingKey, and compile the kernel code related to the TilingKey, to accelerate compilation.
- WorkspaceSize: The workspace is a memory block in the global memory on the device. The workspace size can be set in the tiling function. After the setting: In the single-operator API execution scenario, the first API can be called to obtain the workspace size, and then the developer can allocate the global memory of the corresponding size. In the graph input scenario, the framework automatically allocates the global memory of the corresponding size based on the configured size. After the workspace is allocated, the workspace memory can be used during operator kernel implementation.
The workspace memory consists of two parts: workspace memory required by Ascend C APIs and workspace memory used by operator implementation (on-demand).
- Workspace memory needs to be reserved for Ascend C APIs.
During API compute, some workspace memory is required as the cache. Therefore, the operator tiling function needs to reserve workspace memory for the API. The reserved memory size can be obtained by calling GetLibApiWorkSpaceSize.
- Workspace memory used for operator implementation (on-demand)
The memory needs to be allocated only when extra device memory is required for data exchange or caching in the operator. The memory is allocated based on the space computed by the operator.
The overall workspace memory is the sum of the preceding two parts. To set the workspace memory in the tiling function, perform the following steps:
1 2
auto workspaceSizes = context->GetWorkspaceSizes(1); // Use only one workspace. workspaceSizes[0] = sysWorkspaceSize + usrWorkspaceSize;
- Workspace memory needs to be reserved for Ascend C APIs.
Basic Process of Tiling
The following figure shows the development process of Tiling.
The following uses a simple Add operator as an example to describe the implementation process of tiling. In this sample, the shape size of the data to be processed can be evenly allocated to each core and aligned to the size of a data block (32 bytes).
Compile the header file for defining the TilingData structure of the operator. The file is named {Operator name}_tiling.h and stored in the op_host directory of the operator project. The sample code is as follows:
1 2 3 4 5 6 7 8 9 10 11 12 13 |
#ifndef ADD_CUSTOM_TILING_H #define ADD_CUSTOM_TILING_H #include "register/tilingdata_base.h" namespace optiling { BEGIN_TILING_DATA_DEF(TilingData) // Register a tiling class and uses the tiling name as the input parameter. TILING_DATA_FIELD_DEF(uint32_t, totalLength); // Add the tiling field to compute the total data volume. TILING_DATA_FIELD_DEF(uint32_t, tileNum); // Add the tiling field that specifies the total number of data blocks to be computed on each core. END_TILING_DATA_DEF; // Register the operator TilingData class with the corresponding AddCustom operator. REGISTER_TILING_DATA_CLASS(AddCustom, TilingData) } #endif // ADD_CUSTOM_TILING_H |
The procedure is as follows:
- Compile the code framework, add the judgment condition of #ifndef... to prevent repeated inclusion of header files, and include the register/tilingdata_base.h header file. Multiple macros for TilingData registration are defined in tilingdata_base.h. The sample code is as follows:
1 2 3 4 5 6 7 8 9
#ifndef ADD_CUSTOM_TILING_H #define ADD_CUSTOM_TILING_H #include "register/tilingdata_base.h" namespace optiling { // Tiling structure definition and registration code // ... } #endif // ADD_CUSTOM_TILING_H
- Design TilingData parameters, which are essentially related to parallel data tiling. In this example, the operator uses two tiling parameters: totalLength and tileNum. totalLength indicates the amount of data to be computed, and tileNum indicates the total number of computed data blocks on each core. For example, after totalLength is passed to the kernel, the compute amount of each core can be obtained by dividing totalLength by the number of cores involved in the compute. In this way, multi-core data is tiled.
- Define the TilingData structure. Define a TilingData class by calling BEGIN_TILING_DATA_DEF, add the totalLength and tileNum fields by calling TILING_DATA_FIELD_DEF, and end the TilingData definition by calling END_TILING_DATA_DEF. For details about the APIs, see TilingData Structure Definition.
1 2 3 4
BEGIN_TILING_DATA_DEF(TilingData) // Register a tiling class and uses the tiling name as the input parameter. TILING_DATA_FIELD_DEF(uint32_t, totalLength); // Add the tiling field to compute the total data volume. TILING_DATA_FIELD_DEF(uint32_t, tileNum); // Add the tiling field that specifies the total number of data blocks to be computed on each core. END_TILING_DATA_DEF;
- Register the TilingData structure, call REGISTER_TILING_DATA_CLASS to register the TilingData class, and associate the TilingData class with custom operator. The first parameter of REGISTER_TILING_DATA_CLASS is op_type (operator type). In this example, AddCustom is passed in, and the second parameter is the class name of TilingData. For details about the REGISTER_TILING_DATA_CLASS API, see TilingData Structure Registration.
1 2
// Register the operator TilingData class with the corresponding AddCustom operator. REGISTER_TILING_DATA_CLASS(AddCustom, TilingData)
Implement the Tiling function in the .cpp file of the operator host. The file is named operator name.cpp and stored in the op_host directory of the operator project. The prototype of the Tiling function is fixed. It receives a TilingContext as the input, and the input and output shape pointers can be obtained from the context. The registered tiling function is called by the framework, and the TilingContext parameter is passed during the call. The sample code is as follows:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 |
namespace optiling { const uint32_t BLOCK_DIM = 8; const uint32_t TILE_NUM = 8; static ge::graphStatus TilingFunc(gert::TilingContext *context) { TilingData tiling; uint32_t totalLength = context->GetInputShape(0)->GetOriginShape().GetShapeSize(); context->SetBlockDim(BLOCK_DIM); tiling.set_totalLength(totalLength); tiling.set_tileNum(TILE_NUM); tiling.SaveToBuffer(context->GetRawTilingData()->GetData(), context->GetRawTilingData()->GetCapacity()); context->GetRawTilingData()->SetDataSize(tiling.GetDataSize()); size_t *currentWorkspace = context->GetWorkspaceSizes(1); currentWorkspace[0] = 0; return ge::GRAPH_SUCCESS; } } // namespace optiling |
Perform the following steps:
- Obtain TilingContext, that is, the argument gert::TilingContext* context of the Tiling function.
- Set TilingData. After the TilingData class is defined in Step 3, you can create an instance of the class and call the set_{field_name} method to set the value of each field (field_name is the tiling field name defined in Step 3). After setting the tiling field, call the SaveToBuffer method to serialize and save the TilingData instance.
- Obtain the input and output shape information based on the context. In this sample, the GetInputShape API of TilingContext is used to obtain the input shape size.
1 2
// Obtain the input shape information. uint32_t totalLength = context->GetInputShape(0)->GetOriginShape().GetShapeSize();
- Set TilingData. Call the set_{field_name} method to set the field value of TilingData.
1 2 3 4 5
// Use TilingData to define a specific instance. TilingData tiling; // Set TilingData. tiling.set_totalLength(totalLength); tiling.set_tileNum(TILE_NUM);
- Call SaveToBuffer of the TilingData class to complete serialization and save the serialization to TilingContext. The first parameter of SaveToBuffer is the start address of the storage buffer, and the second parameter is the length of the buffer. Call GetRawTilingData to obtain the address of the untyped TilingData, and then call GetData to obtain the data pointer, which is used as the start address of the buffer. Alternatively, call GetRawTilingData to obtain the address of the untyped TilingData, and then call GetCapacity to obtain the length of the TilingData, which is used as the buffer length. After the SaveToBuffer operation is complete, you need to call SetDataSize to set the length of the TilingData. The length is obtained by calling the GetDataSize API of the TilingData class.
1 2 3
// Serialize and save the file. tiling.SaveToBuffer(context->GetRawTilingData()->GetData(), context->GetRawTilingData()->GetCapacity()); context->GetRawTilingData()->SetDataSize(tiling.GetDataSize());
- Obtain the input and output shape information based on the context. In this sample, the GetInputShape API of TilingContext is used to obtain the input shape size.
- Call the SetBlockDim API to set blockDim.
1context->SetBlockDim(BLOCK_DIM);
- (Optional) Call SetTilingKey to set the tiling key.
1context->SetTilingKey(1);
- (Optional) Call GetWorkspaceSizes to obtain the workspace size pointer and set the size. This is only an example. Set the size of workspace to 0.
1 2
size_t *currentWorkspace = context->GetWorkspaceSizes(1); currentWorkspace[0] = 0;