Restricting the Size of the TilingData Structure

[Priority] Medium

[Description] The TilingData structure carries tiling data. After the host computes the tiling based on the tiling strategy, the operator passes the tiling data from the host to the kernel as an input parameter. At this point the tiling data is stored in global memory (GM), where access is slow. Calling the GET_TILING_DATA macro copies the tiling data from the GM to the stack space of the AI Processor. This copy incurs overhead, and the stack space is limited, so the size of the TilingData structure needs to be restricted. The copy takes time at the µs level, so the optimization benefit is most noticeable in small-shape scenarios, where the copy makes up a larger share of the total execution time.

To limit the size of the TilingData structure, consider the following methods:

Reduce unnecessary TilingData structure variables.
Select a proper variable type based on the tiling data range.
Arrange the TilingData structure properly.

[Negative Example]

In the following example, the TilingData structure contains a redundant variable. blockDim has already been set through SetBlockDim and can be obtained on the kernel by calling GetBlockNum, so it does not need to be passed through the TilingData structure.
The variable data types are also improper. formerNum and tailNum are the number of cores that compute the entire data block and the number of cores that compute the tail data block, respectively. Their values do not exceed BLOCK_DIM, so they should use the uint8_t type. Based on the compute logic, variables such as formerLength do not exceed the range of uint32_t, so they should use the uint32_t type.

// Tiling structure definition
BEGIN_TILING_DATA_DEF(TilingDataUnalign)
  TILING_DATA_FIELD_DEF(uint64_t, blockDim);
  TILING_DATA_FIELD_DEF(uint64_t, formerNum);
  TILING_DATA_FIELD_DEF(uint64_t, tailNum);
  TILING_DATA_FIELD_DEF(uint64_t, formerLength);
  TILING_DATA_FIELD_DEF(uint64_t, tailLength);
  TILING_DATA_FIELD_DEF(uint64_t, alignNum);
END_TILING_DATA_DEF;

// The tiling function on the host computes the tiling structure information.
constexpr uint32_t BLOCK_DIM = 8;
constexpr uint32_t SIZE_OF_HALF = 2;
constexpr uint32_t BLOCK_SIZE = 32;
constexpr uint32_t ALIGN_NUM = BLOCK_SIZE / SIZE_OF_HALF;
static ge::graphStatus TilingFunc(gert::TilingContext *context)
{
    TilingDataUnalign tiling;
    uint32_t totalLength = context->GetInputTensor(0)->GetShapeSize();
    // BlockDim has been set through SetBlockDim.
    context->SetBlockDim(BLOCK_DIM);
    uint32_t totalLengthAligned = ((totalLength + ALIGN_NUM - 1) / ALIGN_NUM) * ALIGN_NUM;
    // By construction, formerNum and tailNum always fall within the range 0 to BLOCK_DIM.
    uint32_t formerNum = (totalLengthAligned / ALIGN_NUM) % BLOCK_DIM;
    uint32_t tailNum = BLOCK_DIM - formerNum;
    // Variables such as formerLength do not exceed the range of uint32_t based on the compute logic.
    uint32_t formerLength = ((totalLengthAligned / BLOCK_DIM + ALIGN_NUM - 1) / ALIGN_NUM) * ALIGN_NUM;
    uint32_t tailLength = (totalLengthAligned / BLOCK_DIM / ALIGN_NUM) * ALIGN_NUM;
    ...
}
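The value ranges claimed above can be checked by evaluating the same tiling arithmetic for a sample input. The following is a minimal standalone sketch, not part of the Ascend C framework: the helper functions AlignUp, FormerNum, TailNum, FormerLength, and TailLength are hypothetical names introduced here for illustration, and the input length of 1000 elements is an assumed sample value.

```cpp
#include <cstdint>

using std::uint32_t;

constexpr uint32_t BLOCK_DIM = 8;
constexpr uint32_t ALIGN_NUM = 16;  // BLOCK_SIZE / SIZE_OF_HALF = 32 / 2

// Hypothetical helpers that repeat the arithmetic of TilingFunc above.
constexpr uint32_t AlignUp(uint32_t n)
{
    return ((n + ALIGN_NUM - 1) / ALIGN_NUM) * ALIGN_NUM;
}
constexpr uint32_t FormerNum(uint32_t total)
{
    return (AlignUp(total) / ALIGN_NUM) % BLOCK_DIM;
}
constexpr uint32_t TailNum(uint32_t total)
{
    return BLOCK_DIM - FormerNum(total);
}
constexpr uint32_t FormerLength(uint32_t total)
{
    return AlignUp(AlignUp(total) / BLOCK_DIM);
}
constexpr uint32_t TailLength(uint32_t total)
{
    return (AlignUp(total) / BLOCK_DIM / ALIGN_NUM) * ALIGN_NUM;
}

// Assumed sample input: 1000 elements, aligned up to 1008.
static_assert(AlignUp(1000) == 1008, "aligned length");
static_assert(FormerNum(1000) == 7, "7 cores take the longer block");
static_assert(TailNum(1000) == 1, "1 core takes the shorter block");
static_assert(FormerLength(1000) == 128, "longer block length");
static_assert(TailLength(1000) == 112, "shorter block length");
// The per-core lengths cover the aligned input exactly: 7 * 128 + 1 * 112 = 1008.
static_assert(FormerNum(1000) * FormerLength(1000) +
              TailNum(1000) * TailLength(1000) == AlignUp(1000), "full coverage");
// The core counts never exceed BLOCK_DIM, which fits comfortably in uint8_t.
static_assert(BLOCK_DIM <= UINT8_MAX, "core counts fit in uint8_t");
```

Because formerNum is a remainder modulo BLOCK_DIM and tailNum is BLOCK_DIM minus that remainder, both stay within 0 to 8 for any input length, which is why a one-byte type is sufficient in this example.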

[Positive Example]

The redundant blockDim variable is removed, and each remaining variable uses the smallest data type that covers its value range.

BEGIN_TILING_DATA_DEF(TilingDataUnalign)
  TILING_DATA_FIELD_DEF(uint8_t, formerNum);
  TILING_DATA_FIELD_DEF(uint8_t, tailNum); 
  TILING_DATA_FIELD_DEF(uint32_t, formerLength);
  TILING_DATA_FIELD_DEF(uint32_t, tailLength);
  TILING_DATA_FIELD_DEF(uint32_t, alignNum);
END_TILING_DATA_DEF;
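To make the saving concrete, the two definitions can be mirrored with plain C++ structs and compared with sizeof. This is only an illustrative sketch: TilingWide and TilingNarrow are hypothetical names, and it assumes that a plain struct on a typical 64-bit toolchain has the same field sizes as the structure generated by the tiling macros.

```cpp
#include <cstdint>

using std::uint8_t;
using std::uint32_t;
using std::uint64_t;

// Hypothetical mirror of the negative example: six uint64_t fields,
// including the redundant blockDim.
struct TilingWide {
    uint64_t blockDim;
    uint64_t formerNum;
    uint64_t tailNum;
    uint64_t formerLength;
    uint64_t tailLength;
    uint64_t alignNum;
};

// Hypothetical mirror of the positive example: blockDim removed,
// core counts narrowed to uint8_t, lengths narrowed to uint32_t.
struct TilingNarrow {
    uint8_t formerNum;
    uint8_t tailNum;
    uint32_t formerLength;
    uint32_t tailLength;
    uint32_t alignNum;
};

// 6 fields * 8 bytes = 48 bytes.
static_assert(sizeof(TilingWide) == 48, "wide layout is 48 bytes");
// 2 + 2 (padding) + 3 * 4 = 16 bytes.
static_assert(sizeof(TilingNarrow) == 16, "narrow layout is 16 bytes");
```

Under these assumptions, the data copied from the GM to the stack shrinks to one third of its original size.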

[Negative Example]

In the following example, the TilingData structure is arranged improperly. Because memory access on the AI Processor requires 8-byte alignment, the Ascend C project framework pads the TilingData structure after you define it and ensures that the overall structure meets the 8-byte alignment requirement. In the following structure, 3 bytes are padded after each of the formerNum and tailNum variables, and 4 bytes are padded at the end of the structure for 8-byte alignment. That is, a total of 10 bytes (3 + 3 + 4) are padded to the TilingData structure.

BEGIN_TILING_DATA_DEF(TilingDataUnalign)
  TILING_DATA_FIELD_DEF(uint8_t, formerNum); // 3 bytes need to be padded to ensure correct access to the formerLength variable.
  TILING_DATA_FIELD_DEF(uint32_t, formerLength);
  TILING_DATA_FIELD_DEF(uint8_t, tailNum); // 3 bytes need to be padded to ensure correct access to the tailLength variable.
  TILING_DATA_FIELD_DEF(uint32_t, tailLength);
  TILING_DATA_FIELD_DEF(uint32_t, alignNum); // 4 bytes need to be padded so that the overall TilingData structure is 8-byte aligned.
END_TILING_DATA_DEF;

[Positive Example]

In the following example, the tiling parameters are reordered so that the bytes are arranged properly. Only 2 bytes need to be padded, which reduces the size of the TilingData structure from 24 bytes to 16 bytes.

BEGIN_TILING_DATA_DEF(TilingDataUnalign)
  TILING_DATA_FIELD_DEF(uint8_t, formerNum);
  TILING_DATA_FIELD_DEF(uint8_t, tailNum); // 2 bytes need to be padded to ensure correct access to the formerLength variable.
  TILING_DATA_FIELD_DEF(uint32_t, formerLength);
  TILING_DATA_FIELD_DEF(uint32_t, tailLength);
  TILING_DATA_FIELD_DEF(uint32_t, alignNum);
END_TILING_DATA_DEF;
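The padding counts in the two arrangements can be reproduced with plain C++ structs. This is a sketch under stated assumptions rather than the framework's actual layout code: TilingInterleaved and TilingGrouped are hypothetical names, and alignas(8) is used here to imitate the overall 8-byte alignment that the text attributes to the Ascend C project framework.

```cpp
#include <cstddef>
#include <cstdint>

using std::uint8_t;
using std::uint32_t;

// Hypothetical mirror of the improperly arranged structure.
struct alignas(8) TilingInterleaved {
    uint8_t formerNum;      // offset 0, followed by 3 padding bytes
    uint32_t formerLength;  // offset 4
    uint8_t tailNum;        // offset 8, followed by 3 padding bytes
    uint32_t tailLength;    // offset 12
    uint32_t alignNum;      // offset 16, followed by 4 tail padding bytes
};

// Hypothetical mirror of the properly arranged structure.
struct alignas(8) TilingGrouped {
    uint8_t formerNum;      // offset 0
    uint8_t tailNum;        // offset 1, followed by 2 padding bytes
    uint32_t formerLength;  // offset 4
    uint32_t tailLength;    // offset 8
    uint32_t alignNum;      // offset 12, no tail padding needed
};

// Payload is 1 + 1 + 4 + 4 + 4 = 14 bytes in both cases.
static_assert(offsetof(TilingInterleaved, formerLength) == 4, "3 bytes padded");
static_assert(offsetof(TilingInterleaved, tailLength) == 12, "3 bytes padded");
static_assert(sizeof(TilingInterleaved) == 24, "14 + 10 bytes of padding");

static_assert(offsetof(TilingGrouped, formerLength) == 4, "2 bytes padded");
static_assert(sizeof(TilingGrouped) == 16, "14 + 2 bytes of padding");
```

Grouping the small fields together lets them share one 4-byte slot, which is the general rule of thumb behind the reordering: place fields of the same size next to each other so that padding is inserted at most once.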
