Host Bound Issue Classification
Introduction to Host Bound
In torch_npu training and inference scenarios, host-side (CPU) task delivery, such as operator scheduling and memory allocation, runs asynchronously with device-side (NPU) task execution. When the host takes longer to deliver tasks than the device takes to execute them, the device sits idle waiting for new tasks. The resulting performance bottleneck is called a host bound issue.
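The condition above can be illustrated with a small timeline model. This is a minimal sketch, not a profiler: the function name `simulate_pipeline` and the per-task times are hypothetical, and it assumes the host delivers tasks back to back while the device runs each task as soon as it has been delivered and the previous task has finished.

```python
def simulate_pipeline(host_times, device_times):
    """Return (total_time, device_idle_time) for an asynchronous host/device pipeline.

    host_times[i]   -- time the host needs to deliver task i (hypothetical units)
    device_times[i] -- time the device needs to execute task i
    """
    host_clock = 0.0    # moment the host finishes delivering the current task
    device_clock = 0.0  # moment the device finishes its current task
    idle = 0.0
    for h, d in zip(host_times, device_times):
        host_clock += h
        # The device can start only when the task has been delivered
        # and the previous task has completed.
        start = max(host_clock, device_clock)
        idle += start - device_clock
        device_clock = start + d
    return device_clock, idle


# Host slower than device: the device waits before every task.
host_bound = simulate_pipeline([10, 10, 10, 10], [4, 4, 4, 4])
# Device slower than host: the device waits only for the first delivery.
device_bound = simulate_pipeline([4, 4, 4, 4], [10, 10, 10, 10])
```

In this model, the host-slow case leaves the device idle for 28 of 44 time units, while the device-slow case leaves it idle for only 4 of 44.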
Host Bound Symptom
- The operator delivery lines are vertically dense, indicating that the NPU is waiting for the CPU to deliver tasks.
- The proportion of NPU free time is high because the NPU spends much of its time waiting for tasks.
Figure 1 Performance data in typical host bound scenarios
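The free time symptom can be quantified from device-side execution intervals. The sketch below assumes the intervals have already been extracted from profiling data as `(start, end)` pairs; the function name `npu_free_ratio` and the sample values are hypothetical, and no particular profiler export format is implied.

```python
def npu_free_ratio(kernel_intervals, window_start, window_end):
    """Fraction of [window_start, window_end] during which no kernel is running."""
    if window_end <= window_start:
        raise ValueError("window_end must be greater than window_start")
    busy = 0.0
    cursor = window_start  # everything before the cursor is already accounted for
    for start, end in sorted(kernel_intervals):
        start = max(start, cursor)  # skip overlap with earlier intervals
        end = min(end, window_end)  # clip to the observation window
        if end > start:
            busy += end - start
            cursor = end
    return 1.0 - busy / (window_end - window_start)


# Three short kernels in a 10-unit window: the device is busy for 4 units.
ratio = npu_free_ratio([(0, 2), (5, 6), (9, 10)], 0, 10)
```

A ratio close to 1 means the device is mostly waiting. What counts as too high depends on the workload, so no threshold is fixed here.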
Host Bound Optimization Methods
Host bound issues are mainly caused by operator delivery delay and CPU overload. Table 1 describes the common optimization methods.
| Optimization Method | Advantage | Description |
|---|---|---|
| Operator delivery optimization | Reducing the number of operator deliveries | After identifying bottlenecks (for details, see Fast and Slow Card Troubleshooting), optimize by using methods such as logic optimization, equivalent computing replacement, and operator fusion. For details, see Affinity Operator Tuning Strategy. |
| | Improving the operator delivery speed | |
| CPU computing optimization | Leveraging the advantages of heterogeneous computing | Minimize the use of AI CPU operators and preferentially select operators with better affinity. For details, see Affinity Operator Tuning Strategy. |
| | Leveraging the advantages of parallel computing | Promote asynchronous parallel processing between the CPU and NPU, for example, by placing the data processing logic in the DataLoader. Reduce stream synchronization operations: use operations such as item(), cpu(), and npu() with caution, and combine or avoid them where possible. |
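The advice on stream synchronization can be sketched as follows. This is an illustrative example under stated assumptions rather than a prescribed implementation: the helper names are hypothetical, and the code runs on the CPU as written. On an Ascend device, the model and tensors would presumably be moved to the NPU after importing torch_npu. The first helper calls item() on every batch, so the host waits for the device at each step; the second accumulates on the device and calls item() once.

```python
import torch
import torch.nn.functional as F


def mean_loss_sync_every_step(model, batches):
    """Call item() once per batch: the host blocks on the device at every step."""
    total = 0.0
    with torch.no_grad():
        for x, y in batches:
            loss = F.mse_loss(model(x), y)
            total += loss.item()  # device-to-host copy; the host waits here
    return total / len(batches)


def mean_loss_sync_once(model, batches):
    """Accumulate on the device and call item() a single time at the end."""
    device = next(model.parameters()).device
    total = torch.zeros((), device=device)
    with torch.no_grad():
        for x, y in batches:
            total += F.mse_loss(model(x), y)  # result stays on the device
    return (total / len(batches)).item()  # the only synchronization point


torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
batches = [(torch.randn(8, 4), torch.randn(8, 1)) for _ in range(5)]
per_step = mean_loss_sync_every_step(model, batches)
once = mean_loss_sync_once(model, batches)
```

Both helpers return the same mean loss up to floating-point rounding. The difference is how often the host has to stop delivering operators and wait for a result.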
In addition, delivery anomalies can occur, in which the time required for operator delivery increases significantly due to factors such as resource preemption and OS scheduling policy conflicts. For details, see Task Dispatch Anomaly Analysis.
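As a first check for such anomalies, unusually slow deliveries can be flagged against the typical delivery time. The sketch below is an assumption-laden illustration: the function name `find_slow_deliveries`, the microsecond unit, and the default `factor` of 5 are all hypothetical, and the durations are assumed to have been collected from profiling data beforehand.

```python
from statistics import median


def find_slow_deliveries(durations_us, factor=5.0):
    """Return indices of deliveries slower than `factor` times the median duration."""
    if not durations_us:
        return []
    threshold = factor * median(durations_us)
    return [i for i, d in enumerate(durations_us) if d > threshold]


# Hypothetical delivery durations; one delivery is far slower than the rest.
slow = find_slow_deliveries([12, 11, 13, 12, 240, 12, 11])
```

The median is used instead of the mean so that a few very slow deliveries do not raise the baseline they are compared against.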