--tiling_schedule_optimize

Applicability

Product	Supported
Atlas A3 training products/Atlas A3 inference products	√
Atlas A2 training products/Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	x
Atlas inference products	√
Atlas training products	x

Description

Sets whether to enable the optimization for tiling offload scheduling.

The internal storage of the AI Cores in the NPU cannot hold all the input and output data of an operator, so the input data is split into parts. The first part is transferred in, computed, and transferred out, and then the next part is processed in the same way. This process is called tiling. A computation program, called the tiling implementation, determines the tiling parameters (such as the block size transferred each time and the total number of cycles) based on operator information such as the shape.

The AI Core is not good at the scalar computation that the tiling implementation requires, so the tiling implementation is generally executed on the host CPU. With tiling offload, the tiling computation is instead performed on the CPU on the device. The tiling implementation is executed on the device when the following conditions are met:

The model is static-shape.
Operators in the model, such as the FusedInferAttentionScore and IncreFlashAttention fused operators, support tiling offload.
The operators that support tiling offload have value dependencies, that is, the output value of the previous operator that they depend on contains an execution result from the device. If the value depended on is a Const value, tiling offload is not required, and tiling is completed during model build.
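
The tiling computation and the conditions above can be illustrated with a minimal Python sketch. It is not the actual tiling implementation of any operator: the names TilingParams, compute_tiling, and tiling_runs_on_device, and the buffer capacity value, are hypothetical and chosen only for illustration. The sketch assumes that tiling is done over the flattened element count and that the only constraint is a fixed on-chip buffer capacity.

```python
import math
from dataclasses import dataclass


@dataclass
class TilingParams:
    block_size: int  # elements transferred in each full cycle
    num_cycles: int  # total number of transfer/compute cycles
    tail_size: int   # elements handled in the last cycle


def compute_tiling(shape, buffer_capacity):
    """Hypothetical tiling implementation: derive parameters from the shape."""
    total = math.prod(shape)
    if total <= 0 or buffer_capacity <= 0:
        raise ValueError("shape and buffer_capacity must be positive")
    # Each cycle moves at most buffer_capacity elements into on-chip storage.
    block_size = min(total, buffer_capacity)
    num_cycles = math.ceil(total / block_size)
    tail_size = total - (num_cycles - 1) * block_size
    return TilingParams(block_size, num_cycles, tail_size)


def tiling_runs_on_device(static_shape, supports_offload, depends_on_device_value):
    """Mirror the three conditions listed above (illustrative only)."""
    return static_shape and supports_offload and depends_on_device_value


# 4 x 1000 elements with room for 1024 elements per cycle (assumed value).
params = compute_tiling((4, 1000), buffer_capacity=1024)
print(params)

# A Const dependency means tiling is completed during build, not on the device.
print(tiling_runs_on_device(True, True, False))
```

In this sketch the 4000 elements are processed in four cycles: three full blocks of 1024 elements and a tail of 928 elements. A real tiling implementation would also account for data types, alignment, and the number of AI Cores.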

Argument

0 (default): Disables tiling offload.
1: Enables tiling offload.

Suggestions and Benefits

None

Example

--tiling_schedule_optimize=1