Multi-Streaming

Background

In recommendation inference scenarios, models typically contain a large number of operators, each with a small shape. The usual requirement is to maximize throughput (the volume of data per inference × the number of inferences per unit time) while keeping latency within an acceptable range. Small operator shapes tend to leave compute units underutilized, which limits performance. Multi-stream parallel inference is therefore commonly enabled to raise hardware utilization and, in turn, overall throughput.
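The throughput definition above can be made concrete with a small calculation. The following is a minimal sketch, not a measurement: the function names, the batch size, and the latency figures are all illustrative assumptions, and it assumes that every stream runs inferences back to back at the same latency.

```python
def inferences_per_second(num_streams: int, latency_ms: float) -> float:
    """Inferences completed per second across all streams.

    Assumption: each stream issues inferences back to back, and every
    inference takes latency_ms milliseconds.
    """
    return num_streams * 1000.0 / latency_ms


def throughput(samples_per_inference: int, num_streams: int, latency_ms: float) -> float:
    """Throughput = data volume per inference x inferences per unit time."""
    return samples_per_inference * inferences_per_second(num_streams, latency_ms)


# Illustrative (made-up) numbers: one stream at 4 ms versus four streams at 8 ms.
single_stream = throughput(samples_per_inference=32, num_streams=1, latency_ms=4.0)
multi_stream = throughput(samples_per_inference=32, num_streams=4, latency_ms=8.0)
print(f"single-stream: {single_stream:.0f} samples/s")
print(f"multi-stream:  {multi_stream:.0f} samples/s")
```

With these assumed numbers, per-inference latency doubles but throughput also doubles, which is the trade-off that multi-streaming makes.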

Restrictions

In multi-stream parallel scenarios, streams contend for shared resources such as bandwidth and compute. As a result, the latency of a single inference is longer than in a single-stream scenario, so latency-sensitive scenarios require latency control.
If multi-streaming is not combined with sub-core partitioning, streams may queue and contend heavily for compute resources. Whether to combine sub-core partitioning with multi-streaming should be decided based on measured data.
The number of streams must be tuned against the measured throughput of the model. In general, increasing the number of streams raises NPU utilization and hardware throughput. However, once the number of streams exceeds a certain threshold, cache thrashing can occur between streams, and overall throughput declines.
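One way to apply these restrictions is to benchmark several stream counts and keep the one with the highest throughput that still meets the latency budget. The sketch below assumes the measurements have already been collected. The helper name `pick_stream_count` is hypothetical, and the numbers are made up to show the typical shape of the curve (throughput rising and then falling as streams are added).

```python
from typing import Dict, Optional, Tuple


def pick_stream_count(
    measurements: Dict[int, Tuple[float, float]],
    latency_budget_ms: float,
) -> Optional[int]:
    """Return the stream count with the best throughput within the latency budget.

    measurements maps stream count -> (throughput in samples/s, latency in ms).
    Returns None when no configuration meets the budget.
    """
    feasible = {
        streams: tput
        for streams, (tput, latency_ms) in measurements.items()
        if latency_ms <= latency_budget_ms
    }
    if not feasible:
        return None
    return max(feasible, key=feasible.get)


# Made-up benchmark results, for illustration only.
measurements = {
    1: (1000.0, 4.0),
    2: (1800.0, 4.5),
    4: (2900.0, 5.5),
    8: (3100.0, 10.5),
    16: (2600.0, 24.0),  # past the threshold: throughput drops
}

print(pick_stream_count(measurements, latency_budget_ms=8.0))
print(pick_stream_count(measurements, latency_budget_ms=12.0))
```

The same sweep can be repeated with and without sub-core partitioning to decide empirically whether the two should be combined.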

Cases

TensorFlow demo: You can enable the multi-stream function by referring to GitCode.com.
PyTorch demo: The community PyTorch framework does not directly support multi-stream inference. Multi-process inference is therefore recommended instead.
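The multi-process mode can be sketched with the Python standard library alone. This is a minimal illustration rather than a reference implementation: `load_model` is a hypothetical placeholder that stands in for loading the real PyTorch model and binding it to a device, and the round-robin sharding is only one possible way to distribute work.

```python
import multiprocessing as mp


def load_model():
    """Hypothetical placeholder: load the model inside the worker process.

    A real worker would load the PyTorch model and bind it to its device here.
    """
    return lambda batch: [x * 2 for x in batch]  # stand-in for real inference


def split_round_robin(batches, num_workers):
    """Distribute batches across workers in round-robin order."""
    return [batches[i::num_workers] for i in range(num_workers)]


def worker(worker_id, shard, result_queue):
    """Run inference on one shard; each process owns its own model instance."""
    model = load_model()
    outputs = [model(batch) for batch in shard]
    result_queue.put((worker_id, outputs))


def run_multiprocess(batches, num_workers, timeout_s=60):
    """Start one process per shard and collect results keyed by worker id."""
    shards = split_round_robin(batches, num_workers)
    result_queue = mp.Queue()
    procs = [
        mp.Process(target=worker, args=(i, shards[i], result_queue))
        for i in range(num_workers)
    ]
    for proc in procs:
        proc.start()
    # Drain the queue before joining so workers are never blocked on a full pipe.
    results = dict(result_queue.get(timeout=timeout_s) for _ in procs)
    for proc in procs:
        proc.join()
    return results


if __name__ == "__main__":
    batches = [[1, 2], [3, 4], [5, 6], [7, 8]]
    print(run_multiprocess(batches, num_workers=2))
```

The number of worker processes plays the same role as the number of streams, so it should be tuned against measured throughput and latency in the same way.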
