--tiling_schedule_optimize
Applicability
| Product | Supported |
|---|---|
| | √ |
| | √ |
| | x |
| | √ |
| | x |
Description
Sets whether to enable the optimization for tiling offload scheduling.
The internal storage of the AI Cores in the NPU cannot hold all the input and output data of an operator, so the input data is split into parts. The first part is transferred in, computed, and transferred out, then the next part is processed in the same way. This process is called tiling. A computation program, called the tiling implementation, determines the tiling parameters (such as the block size transferred each time and the total number of cycles) based on operator information such as the shape.
The AI Core is not well suited to the scalar computation in the tiling implementation, so the tiling implementation is generally executed on the CPU on the host. With tiling offload, the tiling computation is instead performed on the CPU on the device. The tiling implementation is executed on the device when the following conditions are met:
- The model is static-shape.
- Operators in the model, such as the FusedInferAttentionScore and IncreFlashAttention fused operators, support tiling offload.
- The operators that support tiling offload have value dependencies, that is, the value they depend on is output by a preceding operator and contains an execution result from the device. If the depended-on value is a Const value, tiling offload is not required, and tiling is completed during build.
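
The tiling process and the conditions above can be illustrated with a minimal Python sketch. It is not the actual tiling implementation of any operator. The names `TilingParams`, `compute_tiling`, `run_tiled`, and `tiling_runs_on_device`, as well as the buffer size used below, are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class TilingParams:
    block_elems: int  # elements transferred in per cycle
    num_cycles: int   # total number of transfer/compute cycles
    tail_elems: int   # elements in the last (possibly smaller) block


def compute_tiling(shape, bytes_per_elem, buffer_bytes):
    """Hypothetical tiling implementation: derive tiling parameters from the shape."""
    total = math.prod(shape)
    if total == 0:
        return TilingParams(0, 0, 0)
    block = min(buffer_bytes // bytes_per_elem, total)
    if block <= 0:
        raise ValueError("buffer cannot hold a single element")
    cycles = math.ceil(total / block)
    tail = total - (cycles - 1) * block
    return TilingParams(block, cycles, tail)


def run_tiled(data, params, compute):
    """Process the data part by part: transfer in, compute, transfer out."""
    out = []
    for i in range(params.num_cycles):
        start = i * params.block_elems
        part = data[start:start + params.block_elems]  # transfer in
        out.extend(compute(part))                      # compute, transfer out
    return out


def tiling_runs_on_device(static_shape, supports_offload, depends_on_device_value):
    """Assumed reading of the three conditions: all of them must hold."""
    return static_shape and supports_offload and depends_on_device_value


# Example: 1000 two-byte elements with an assumed 512-byte local buffer.
params = compute_tiling((4, 250), bytes_per_elem=2, buffer_bytes=512)
result = run_tiled(list(range(1000)), params, lambda part: [x * 2 for x in part])
```

In this sketch, `compute_tiling` consists only of scalar arithmetic on shape information, which is the kind of work the description says the AI Core handles poorly.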
See Also
None
Argument
- 0 (default): Disables tiling offload.
- 1: Enables tiling offload.
Suggestions and Benefits
None
Example
--tiling_schedule_optimize=1