Model Offload Scheduling

Introduction

When an AI model runs, the CPU and an AI processor (such as an NPU, also called the Ascend AI Processor) operate in tandem. The CPU on the host handles complex logic and control, while the NPU on the device excels at highly parallel, compute-intensive tasks. Efficient compute scheduling lets the host and device coordinate closely, which raises model performance and improves resource utilization in heterogeneous systems.

Host scheduling: The host CPU sends the model's operators to the device one at a time for execution. Each operator is represented as one or more tasks in an execution stream, and the Ascend AI Processor executes these tasks in stream order. This type of scheduling requires frequent interaction between the host and device. In actual training or inference scenarios, the model runs multiple times, and each run requires the host to deliver all operators in the model again.
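The per-run delivery cost of host scheduling can be illustrated with a minimal sketch. It is a toy simulation in plain Python, not the actual runtime: `DeviceStream` and `run_with_host_scheduling` are hypothetical names, and the operator list is made up for illustration.

```python
from collections import deque


class DeviceStream:
    """Hypothetical stand-in for a device execution stream (a FIFO of tasks)."""

    def __init__(self):
        self.queue = deque()
        self.executed = []

    def submit(self, task):
        # One host-to-device interaction: a task is placed on the stream.
        self.queue.append(task)

    def drain(self):
        # The device executes the queued tasks in stream order.
        while self.queue:
            self.executed.append(self.queue.popleft())


def run_with_host_scheduling(operators, stream, runs):
    """Deliver every operator from the host on every run; return the delivery count."""
    deliveries = 0
    for _ in range(runs):
        for op in operators:
            stream.submit(op)
            deliveries += 1
        stream.drain()
    return deliveries


ops = ["conv", "relu", "matmul"]  # illustrative operator names
stream = DeviceStream()
host_deliveries = run_with_host_scheduling(ops, stream, runs=4)
print(host_deliveries)  # 12: three operators delivered on each of four runs
```

In this sketch the number of host deliveries grows with both the operator count and the number of runs, which is the interaction cost that offloading is meant to reduce.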

Figure 1 Host scheduling

Static-shape model offloading (where the input tensor shape remains fixed): The input and output shapes of all operators in a static-shape model can be determined at build time, so the host can complete operator tiling computations in advance. Combined with the Ascend memory reuse algorithm, memory can be orchestrated at the model level. In this case, the GE uses static-graph offload scheduling. Operators are organized into a graph at build time and delivered to device streams when the model is loaded, but they are not executed immediately. Execution is triggered only when the host delivers a model execution task. Compared with host scheduling, offload scheduling greatly reduces the scheduling overhead on the host and the number of interactions required between the host and device.

Figure 2 Static-graph offload scheduling

Principles

Model offload scheduling consists of two phases: model loading and model execution.

Model loading: As in host scheduling, all operators in the graph are traversed and delivered to a device stream. The difference is that they are delivered in a single batch and are not executed immediately. Model loading is a one-off action that takes place when the model is executed for the first time, as shown in process 1 in the preceding figure.
Model execution: After the model is loaded, a model execution task can be delivered to an execution stream, in much the same way as a single operator is delivered. When the task arrives at the Ascend AI Processor (E in the execution stream in the preceding figure), the model is executed (process 3 in the preceding figure). To run the model multiple times, you only need to deliver the model execution task (process 2 in the preceding figure) multiple times.
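The two phases can be sketched as a small simulation. This is only an illustration of the load-once, execute-many pattern under simplifying assumptions: `OffloadedModel` and its methods are hypothetical names, not part of any Ascend API, and the device is modeled as a plain Python list.

```python
class OffloadedModel:
    """Hypothetical stand-in for a model scheduled with offloading."""

    def __init__(self, operators):
        self.operators = list(operators)
        self.device_tasks = None  # operators resident on the device stream after loading
        self.load_count = 0       # batch deliveries of all operators (process 1)
        self.execute_count = 0    # model execution tasks delivered (process 2)

    def load(self):
        """Process 1: deliver all operators in one batch; nothing is executed yet."""
        self.device_tasks = list(self.operators)
        self.load_count += 1

    def execute(self):
        """Processes 2 and 3: deliver one execution task, then run the loaded operators."""
        if self.device_tasks is None:
            self.load()  # loading happens on the first execution only
        self.execute_count += 1
        return [f"ran {op}" for op in self.device_tasks]


model = OffloadedModel(["conv", "relu", "matmul"])  # illustrative operator names
results = [model.execute() for _ in range(4)]
print(model.load_count, model.execute_count)  # 1 4
```

Across four runs the host performs one batch load and four single-task deliveries, regardless of how many operators the model contains.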

The following figure compares the timing of host scheduling and model offload scheduling. End-to-end (E2E) model offload execution takes less time than host scheduling, even though model delivery incurs header overhead at the beginning of model offload execution.
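A rough cost model can make the comparison concrete. The sketch below is a simplification under stated assumptions: every operator has the same host dispatch time and device execution time, the host is the bottleneck, and all function names and numbers are hypothetical rather than measured values.

```python
def host_scheduling_time(dispatch_us, exec_us, n_ops):
    """E2E time when the host delivers operators one at a time.

    The device can start an operator only after the host has delivered it
    and the previous operator has finished.
    """
    host_t = 0.0
    dev_t = 0.0
    for _ in range(n_ops):
        host_t += dispatch_us                 # host finishes delivering this operator
        dev_t = max(dev_t, host_t) + exec_us  # device runs it once it can start
    return dev_t


def offload_time(header_us, exec_us, n_ops):
    """E2E time with offloading: one header overhead, then back-to-back execution."""
    return header_us + n_ops * exec_us


# Made-up numbers: dispatch is slower than execution, so the host is the bottleneck.
host_e2e = host_scheduling_time(dispatch_us=10.0, exec_us=4.0, n_ops=100)
offload_e2e = offload_time(header_us=50.0, exec_us=4.0, n_ops=100)
print(host_e2e, offload_e2e)  # 1004.0 450.0
```

With these assumed values the device idles between operators under host scheduling, while offloading pays the header overhead once and then keeps the device busy.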

The following figure shows the host/device time sequence analysis of model offload scheduling.

Each time a model is delivered, the Feature Map buffer address and the input/output buffer addresses can be updated. If these buffers are updated, the operator-related addresses are refreshed during the header overhead of model offload (m_l_t in the preceding figure, which will be described later).
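One way to picture the address update is the sketch below. It assumes, purely for illustration, that each operator address is a fixed offset from the Feature Map buffer base, so a new base means recomputing every operator address during the header phase. The class, the offsets, and the base-plus-offset layout are all assumptions, not the documented implementation.

```python
class LoadedModel:
    """Hypothetical loaded model that tracks operator addresses."""

    def __init__(self, op_offsets, feature_map_base):
        # Assumption: each operator uses a fixed offset into the Feature Map buffer.
        self.op_offsets = dict(op_offsets)
        self.base = None
        self.op_addrs = {}
        self.refresh_count = 0
        self._refresh(feature_map_base)

    def _refresh(self, base):
        # Recompute every operator address from the new buffer base.
        self.base = base
        self.op_addrs = {op: base + off for op, off in self.op_offsets.items()}
        self.refresh_count += 1

    def deliver(self, feature_map_base):
        """Header phase of one model delivery: refresh only if the buffer changed."""
        if feature_map_base != self.base:
            self._refresh(feature_map_base)
        return self.op_addrs


model = LoadedModel({"conv": 0, "relu": 256}, feature_map_base=0x1000)
model.deliver(0x1000)          # same buffer: no address refresh
addrs = model.deliver(0x2000)  # new buffer: addresses refreshed in the header phase
print(model.refresh_count, hex(addrs["relu"]))  # 2 0x2100
```

In this toy model the refresh cost is paid only on deliveries where the buffer actually changes, which matches the idea that the update is part of the header overhead.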
