Enabling Tiling Offload

In static graph mode, scheduling performance can be optimized by graph offloading. The complete computational graph is delivered to the device at once, and subsequent execution is performed independently on the device without the need for host participation. This reduces the overhead of host-device interaction and improves execution efficiency. For certain operators, tiling computation depends on the specific values input during runtime (tiling value dependency), requiring dynamic computation of tiling parameters during execution. In such cases, the tiling offload optimization solution can be used: The tiling computation is offloaded to the AI CPU on the device, enabling efficient computation throughout the entire process on the device.

The tiling-offload operator built based on the CANN package of the new version (supporting tiling offload) is incompatible with the operating environment of the older CANN package (not supporting tiling offload).
Currently, only fused operators (vector and cube computation fusion) support tiling offload.
Tiling offload is supported only by the following product models:
- Atlas A3 training products / Atlas A3 inference products
- Atlas A2 training products / Atlas A2 inference products

To enable tiling offload for a custom operator, perform the following steps. For details about the complete sample, see Tiling offload operator sample.

In the tiling offload scenario, the op_host directory structure of the operator project is as follows. The tiling implementation must be placed in a separate .cpp file, for example, add_custom_tiling_sink_tiling.cpp in the sample below.

├── op_host
│   ├── add_custom_tiling_sink.cpp // Operator prototype definition, as well as InferShape, and InferDataType implementation
│   ├── add_custom_tiling_sink_tiling.cpp // Tiling function implementation
│   ├── add_custom_tiling_sink_tiling.h // TilingData structure definition and tiling function declaration
│   └── CMakeLists.txt

The following uses the AddCustom operator as an example to describe how to implement the key code files:

Header file add_custom_tiling_sink_tiling.h for TilingData structure definition and tiling function declaration

Defines the TilingData structure.
Declares the tiling implementation function.

       
            #ifndef ADD_CUSTOM_TILING_SINK_TILING_H
#define ADD_CUSTOM_TILING_SINK_TILING_H
#include "register/tilingdata_base.h"
#include "register/op_def_registry.h"

namespace optiling {
BEGIN_TILING_DATA_DEF(TilingSinkTilingData)
TILING_DATA_FIELD_DEF(uint32_t, totalLength);
TILING_DATA_FIELD_DEF(uint32_t, tileNum);
END_TILING_DATA_DEF;

REGISTER_TILING_DATA_CLASS(AddCustomTilingSink, TilingSinkTilingData)  // Tiling structure definition

ge::graphStatus AddCustomSinkTilingFunc(gert::TilingContext* context);  // Tiling function declaration
} // namespace optiling
#endif // ADD_CUSTOM_TILING_SINK_TILING_H

File add_custom_tiling_sink.cpp, for operator prototype definition, as well as InferShape and InferDataType implementation, must contain add_custom_tiling_sink_tiling.h to associate the tiling function with the operator prototype definition.

Tiling offload is applicable only to the scenario where the tiling value dependency exists (that is, InferShape does not depend on input values, but tiling computation requires input values) and the operator input is not of the Const type. In this sample, the tiling value dependency for a non-Const input is configured for input y through ValueDepend.

        
         
           
           
             #include "add_custom_tiling_sink_tiling.h" // Include the header file.

// ...

namespace ops {
class AddCustomTilingSink : public OpDef {
public:
    explicit AddCustomTilingSink(const char *name) : OpDef(name)
    {
        this->Input("x")
            .ParamType(REQUIRED)
            .DataType({ge::DT_FLOAT})
            .Format({ge::FORMAT_ND});
        this->Input("y")
            .ParamType(REQUIRED)
            .DataType({ge::DT_FLOAT})
            .Format({ge::FORMAT_ND})
            .ValueDepend(OPTIONAL, DependScope::TILING); // Indicates that input y has tiling value dependency.
        this->Output("z")
            .ParamType(REQUIRED)
            .DataType({ge::DT_FLOAT})
            .Format({ge::FORMAT_ND});

        this->SetInferShape(ge::InferShape).SetInferDataType(ge::InferDataType);

        this->AICore().SetTiling(optiling::AddCustomSinkTilingFunc); // Association between the tiling function and the operator prototype definition
        
        // Replace it with the actual Ascend AI Processor model.
        this->AICore().AddConfig("ascendxxx");
    }
};
OP_ADD(AddCustomTilingSink);
} // namespace ops

            

          

        
       

Tiling function implementation file add_custom_tiling_sink_tiling.cpp

In the tiling function, whether the operator is currently in the compilation phase is determined by checking whether InputTensor (data pointer of the value-dependent input y) is null. In the tiling offload scenario, memory needs to be allocated for the operator during compilation, including the workspace required. To ensure runtime efficiency, the maximum workspace size should be properly set during compilation based on the operator's execution requirements to avoid memory shortages or waste. The AddCustomTilingSink sample does not require user workspace and therefore does not involve configuration. The fixed value is set here only as an example.
Register the tiling offload function: Include the header file device_op_impl_registry.h and use the DEVICE_IMPL_OP_OPTILING macro for registration.

       
        
          
          
            #include "add_custom_tiling_sink_tiling.h"
#include "register/device_op_impl_registry.h"
#include "tiling/platform/platform_ascendc.h"

namespace optiling {
static constexpr uint32_t BLOCK_DIM = 8;
static constexpr uint32_t TILE_NUM = 3;
static constexpr size_t MAX_WORKSPACE_SIZE = 32; // Maximum size of the user workspace required by the operator. The AddCustomTilingSink operator does not require user workspace. The fixed value is set here only as an example.
static constexpr size_t DEFAULT_WORKSPACE_SIZE = 0;
ge::graphStatus AddCustomSinkTilingFunc(gert::TilingContext *context)
{
    TilingSinkTilingData tiling;
    uint32_t totalLength = context->GetInputTensor(0)->GetShapeSize();
    context->SetBlockDim(BLOCK_DIM);
    tiling.set_totalLength(totalLength);
    tiling.set_tileNum(TILE_NUM);
    tiling.SaveToBuffer(context->GetRawTilingData()->GetData(), context->GetRawTilingData()->GetCapacity());
    context->GetRawTilingData()->SetDataSize(tiling.GetDataSize());
    size_t *currentWorkspace = context->GetWorkspaceSizes(1);
    auto platform = platform_ascendc::PlatformAscendC(context->GetPlatformInfo());
    size_t sysWorkspaceSize = platform.GetLibApiWorkSpaceSize();
    currentWorkspace[0] = sysWorkspaceSize + DEFAULT_WORKSPACE_SIZE; // Set the workspace size at runtime, which is the sum of the system workspace size and user workspace size.
    if (context->GetInputTensor(1) != nullptr && context->GetInputTensor(1)->GetData<float>() == nullptr) {
        // Determine whether the operator is currently in the compilation phase by checking whether the data of InputTensor is a null pointer.
        // In the tiling offload scenario, memory needs to be allocated for the operator during compilation, including the workspace required. To ensure runtime efficiency, the maximum workspace size should be properly set during compilation based on the operator's execution requirements to avoid memory shortages or waste.
        currentWorkspace[0] = sysWorkspaceSize + MAX_WORKSPACE_SIZE; // Set the workspace size during compilation, which is the sum of the system workspace size and the maximum user workspace size.
    }
    return ge::GRAPH_SUCCESS;
}
DEVICE_IMPL_OP_OPTILING(AddCustomTilingSink).Tiling(optiling::AddCustomSinkTilingFunc); // Register the tiling offload function.
} // namespace optiling

           

         

       
      

Operator kernel function implementation

Currently, only fused operators support tiling offload. To emulate the fused operator scenario, KERNEL_TASK_TYPE_DEFAULT is called to force the operator to run in the AIC and AIV hybrid scenario.

        
             extern "C" __global__ __aicore__ void add_custom_tiling_sink(GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR workspace, GM_ADDR tiling)
{
    GET_TILING_DATA(tiling_data, tiling);
    KERNEL_TASK_TYPE_DEFAULT(KERNEL_TYPE_MIX_AIC_1_2); // Force the operator to run in the AIC and AIV hybrid scenario to emulate the fused operator scenario.
    if ASCEND_IS_AIC {
        return;
    }
    AscendC::KernelAdd op;
    op.Init(x, y, z, tiling_data.totalLength, tiling_data.tileNum);
    op.Process();
}

Modify the build script CMakeLists.txt in the op_host directory and add a tiling offload compilation command. The code is as follows:

# Add a tiling offload compilation task using ascendc_device_library.
ascendc_device_library( TARGET cust_opmaster # Task name, which is fixed to cust_opmaster.
                        OPTION SHARED # Dynamic library (Currently, only dynamic libraries support graph offload.)
                        SRC ${CMAKE_CURRENT_SOURCE_DIR}/add_custom_tiling_sink_tiling.cpp ) # Source file of the tiling function implementation code

Parent topic: Integrating Operators into a GE Graph