Host Bound Issue Classification

Introduction to Host Bound

In torch_npu training and inference scenarios, host-side (CPU) task delivery, such as operator scheduling and memory allocation, runs asynchronously with device-side (NPU) task execution. When the host takes longer to deliver tasks than the device takes to execute them, the device sits idle waiting for new tasks. The resulting performance bottleneck is called a host bound issue.
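The relationship between delivery time and execution time can be illustrated with a toy model. The following is a minimal sketch, not torch_npu code: it assumes every task costs the same fixed delivery time on the host and the same fixed execution time on the device, and the function name `simulate` and its parameters are hypothetical.

```python
def simulate(host_ms, device_ms, n_tasks):
    """Toy model of asynchronous delivery (host) and execution (device).

    Assumption: every task needs host_ms to deliver and device_ms to execute.
    Returns (total_time_ms, device_idle_ms).
    """
    host_clock = 0.0    # time at which the host has delivered the latest task
    device_clock = 0.0  # time at which the device finishes its latest task
    idle = 0.0          # accumulated time the device spent waiting for the host
    for _ in range(n_tasks):
        host_clock += host_ms
        # The device can start only when the task is delivered and the device is free.
        start = max(host_clock, device_clock)
        idle += start - device_clock
        device_clock = start + device_ms
    return device_clock, idle


# Delivery faster than execution: the device waits only for the first task.
print(simulate(host_ms=1.0, device_ms=2.0, n_tasks=4))
# Delivery slower than execution: the device waits before every task (host bound).
print(simulate(host_ms=2.0, device_ms=1.0, n_tasks=4))
```

In this model the device idle time grows with every task once delivery is slower than execution, which is the host bound condition described above.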

Host Bound Symptoms

  1. The operator delivery lines are dense and close to vertical, indicating that the NPU is waiting for the CPU to deliver tasks.
  2. The proportion of NPU free (idle) time is high because the NPU spends much of its time waiting.
    Figure 1 Performance data in typical host bound scenarios
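The free time proportion in symptom 2 can be estimated from device execution intervals. The sketch below assumes you have already exported a list of (start, end) timestamps for device-side tasks from your profiling data. The function name `free_time_ratio` and the input format are hypothetical, and no threshold for "too high" is implied here.

```python
def free_time_ratio(intervals):
    """Fraction of the observed span during which no device task was running.

    intervals: list of (start, end) pairs in any order; overlaps are merged.
    """
    if not intervals:
        return 0.0
    ordered = sorted(intervals)
    span_start = ordered[0][0]
    span_end = max(end for _, end in ordered)
    busy = 0.0
    cur_start, cur_end = ordered[0]
    for start, end in ordered[1:]:
        if start > cur_end:
            # Gap found: close the current busy segment and start a new one.
            busy += cur_end - cur_start
            cur_start, cur_end = start, end
        else:
            # Overlapping or adjacent task: extend the current busy segment.
            cur_end = max(cur_end, end)
    busy += cur_end - cur_start
    span = span_end - span_start
    return 0.0 if span == 0 else (span - busy) / span


# Three 1-unit tasks separated by 2-unit gaps: the device is free 4/7 of the time.
ratio = free_time_ratio([(0, 1), (3, 4), (6, 7)])
print(f"free time ratio: {ratio:.2f}")
```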

Host Bound Optimization Methods

Host bound issues are mainly caused by operator delivery delay and CPU overload. Table 1 describes the common optimization methods.

Table 1 Common optimization methods for host bound

Optimization method: Operator delivery optimization

  Advantage: Reducing the number of operator deliveries
  Description: After identifying bottlenecks (see Fast and Slow Card Troubleshooting), optimize by using methods such as logic optimization, equivalent computation replacement, and operator fusion. For details, see Affinity Operator Tuning Strategy.

  Advantage: Improving the operator delivery speed
  Description:
    • Pipeline Optimization: Migrate some operator adaptation tasks to the level-2 pipeline to balance the load between the two pipeline levels and reduce the time required for task dequeue and wakeup. It is a common and efficient optimization method.
    • Core Binding Optimization: Improve task execution efficiency by configuring the processor affinity (that is, core binding) of operator tasks on the CPU side, which avoids cross-NUMA memory access and reduces task scheduling overhead.
    • Compilation Optimization: Use the link-time optimization (LTO) and profile-guided optimization (PGO) technologies of the BiSheng Compiler to compile and build the source code of Python, torch (PyTorch), and torch_npu (Ascend Extension for PyTorch), which effectively improves program performance.

Optimization method: CPU computing optimization

  Advantage: Leveraging the advantages of heterogeneous computing
  Description: Minimize the use of AI CPU operators and preferentially select operators with better affinity. For details, see Affinity Operator Tuning Strategy.

  Advantage: Leveraging the advantages of parallel computing
  Description:
    • Promote asynchronous parallel processing between the CPU and NPU, for example, by placing the data processing logic in the DataLoader.
    • Reduce stream synchronization operations. For example, exercise caution with operations such as item(), cpu(), and npu(), and combine or avoid them where possible.
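The Core Binding Optimization item in Table 1 can be illustrated at the operating-system level. The sketch below uses only the Python standard library and is not the torch_npu core-binding mechanism. The helper names `cores_for_device` and `bind_current_process` are hypothetical. It also assumes that cores are numbered contiguously within each NUMA node, which should be checked on the target machine (for example, with lscpu) before relying on it.

```python
import os


def cores_for_device(device_index, num_devices, total_cores):
    """Split total_cores evenly and return the core IDs assigned to one device.

    Assumption: contiguous core IDs belong to the same NUMA node.
    """
    if not 0 <= device_index < num_devices:
        raise ValueError("device_index out of range")
    per_device = total_cores // num_devices
    if per_device == 0:
        raise ValueError("fewer cores than devices")
    first = device_index * per_device
    return list(range(first, first + per_device))


def bind_current_process(cores):
    """Pin the calling process to the given cores. Returns False if unsupported."""
    if hasattr(os, "sched_setaffinity"):  # available on Linux only
        os.sched_setaffinity(0, cores)    # pid 0 means the calling process
        return True
    return False


# Example: the process driving device 1 of 8 on a 64-core host gets cores 8..15.
print(cores_for_device(device_index=1, num_devices=8, total_cores=64))
```

In a multi-process job, each process would call bind_current_process(cores_for_device(...)) with its own device index early in startup. Where torch_npu provides its own core-binding configuration, as described under Core Binding Optimization, that mechanism should be preferred over this manual approach.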

In addition, delivery exceptions can occur, in which the time required for operator delivery increases significantly due to factors such as resource preemption and OS scheduling policy conflicts. For details, see Task Dispatch Anomaly Analysis.
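As a first pass before a detailed analysis, unusually slow deliveries can be flagged by comparing each delivery time with the median. This is a minimal sketch under stated assumptions: the input is a list of per-operator delivery durations you have extracted yourself, the function name `find_slow_deliveries` is hypothetical, and the factor of 5 is an arbitrary illustrative threshold rather than a recommended value.

```python
from statistics import median


def find_slow_deliveries(durations_us, factor=5.0):
    """Return indices of deliveries that took longer than factor * median.

    durations_us: per-operator delivery durations in microseconds.
    factor: illustrative threshold multiplier (an assumption, tune per workload).
    """
    if not durations_us:
        return []
    threshold = factor * median(durations_us)
    return [i for i, d in enumerate(durations_us) if d > threshold]


# The fourth delivery (index 3) stands out against a median of about 10 us.
print(find_slow_deliveries([10, 12, 11, 300, 9, 10]))
```

Flagged indices only indicate where to look. Whether the cause is resource preemption or an OS scheduling conflict still has to be determined from system-level data.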