Sub-core Partitioning

Background

For example, a single NPU card in an A2 server contains 24 AI Cube Cores and 48 AI Vector Cores. When an operator runs, the number of cores it uses is calculated from the tensor shape; typically, all 24 Cube Cores or all 48 Vector Cores are used. Using more cores reduces the workload per core and therefore the execution latency of the individual operator.

For CTR inference scenarios with multi-instance concurrency, performance analysis must consider total resource occupancy (core count) rather than latency alone. For instance, if a Cube operator takes 100 μs on 24 cores, the NPU can execute only that one operator during the 100 μs window. If the same operator takes 150 μs when limited to 12 cores, two such operators can run concurrently within 150 μs, so the average time per operator drops to 75 μs, a clear improvement over 100 μs.
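The arithmetic in this example can be written out as a small sketch. The function name and the assumption that operators pack perfectly onto disjoint groups of cores are illustrative, not part of any NPU API.

```python
def avg_time_per_operator(latency_us: float, cores_per_op: int, total_cores: int) -> float:
    """Average time per operator when as many copies as fit run side by side.

    Assumes (for illustration) that concurrent operators occupy disjoint
    core groups and do not slow each other down.
    """
    concurrent_ops = total_cores // cores_per_op
    if concurrent_ops < 1:
        raise ValueError("cores_per_op must not exceed total_cores")
    return latency_us / concurrent_ops


# Figures from the text: 24 Cube Cores in total.
full_cores = avg_time_per_operator(latency_us=100.0, cores_per_op=24, total_cores=24)
half_cores = avg_time_per_operator(latency_us=150.0, cores_per_op=12, total_cores=24)

print(f"24 cores per operator: {full_cores:.0f} us average")  # 100 us
print(f"12 cores per operator: {half_cores:.0f} us average")  # 75 us
```

Limiting cores pays off whenever the latency increase factor is smaller than the number of operators that can now run concurrently (here 1.5x versus 2x).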

The primary benefits of core limiting include:

For microsecond-level small operators, execution is dominated by startup overhead; using more cores increases this scheduling overhead.
Since CTR operator shapes are generally small, partitioning ensures fewer "mini-tilings" per core, which prevents computational waste in "tail blocks."
Inter-core access to the same memory address can cause contention, significantly reducing throughput. For indexing operators like Gather, high core counts exacerbate contention if data distribution is concentrated.

Core counts should not be limited too aggressively, however. In general, the limit for Cube cores and for Vector cores is each set to between 6 and 12.
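One way to picture the tail-block point is a simplified model in which an operator is split into equal tiles that are processed in rounds of one tile per core; cores left idle in the last round count as waste. This model, the function name, and the tile counts below are assumptions made for illustration and do not describe the actual tiling strategy of any operator.

```python
def core_utilization(num_tiles: int, num_cores: int) -> float:
    """Fraction of core time doing useful work under the simplified round model."""
    if num_tiles < 1 or num_cores < 1:
        raise ValueError("num_tiles and num_cores must be positive")
    rounds = -(-num_tiles // num_cores)  # ceiling division
    return num_tiles / (rounds * num_cores)


# Hypothetical small operator that splits into only 8 tiles.
for cores in (24, 12, 8):
    print(f"{cores:2d} cores: utilization {core_utilization(8, cores):.2f}")
```

Under this model, a small operator occupies all allocated cores for a full round even when most of them have nothing to do, so a lower core limit leaves the remaining cores free for other instances.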

Restrictions

In single-stream scenarios, inference latency with sub-core partitioning is typically higher than without it. In latency-sensitive scenarios, the latency degradation must be kept strictly within the bounds set by the service requirements.
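A minimal sketch of how this trade-off might be evaluated from profiling data: given measured single-operator latencies at several core limits, keep only the limits whose latency stays within an allowed degradation, then pick the one with the best average time per operator. The helper name, the perfect-packing assumption, and the 260 μs figure for 6 cores are hypothetical; only the 24-core and 12-core figures come from the example in the Background section.

```python
from typing import Dict, Optional, Tuple


def pick_core_limit(
    measured_latency_us: Dict[int, float],
    total_cores: int,
    baseline_us: float,
    max_degradation: float,
) -> Optional[Tuple[int, float]]:
    """Return (core_limit, avg_us_per_operator) or None if no limit fits the budget."""
    budget_us = baseline_us * (1.0 + max_degradation)
    best = None
    for cores, latency_us in measured_latency_us.items():
        if latency_us > budget_us:
            continue  # single-stream latency degrades too much
        avg_us = latency_us / (total_cores // cores)
        if best is None or avg_us < best[1]:
            best = (cores, avg_us)
    return best


# 24- and 12-core figures are from the text; the 6-core figure is made up.
measurements = {24: 100.0, 12: 150.0, 6: 260.0}
choice = pick_core_limit(measurements, total_cores=24, baseline_us=100.0, max_degradation=0.6)
print(choice)
```

With a 60% degradation allowance the 6-core option is rejected on latency even though its average time per operator would be lower, which is the kind of constraint this section describes.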

Cases

TensorFlow demo: For details about how to set the core limit, see the description of the aicore_num parameter in "Session Configuration Parameters" in TF Adapter APIs (1.x).
PyTorch demo: This feature is not supported.
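For the TensorFlow case, a session configuration along the following lines is a plausible starting point. It is a sketch rather than a verified recipe: it assumes a TensorFlow 1.x environment with the TF Adapter `npu_bridge` package installed on an Ascend host, and the value format `"12|12"` (AI Core limit and Vector Core limit separated by `|`) is an assumption that must be checked against the aicore_num description in the reference above.

```python
import tensorflow as tf
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig
from npu_bridge.npu_init import *  # TF Adapter 1.x; requires an Ascend environment

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"

# ASSUMPTION: value format and the chosen limits are illustrative only;
# confirm both against the aicore_num parameter description.
custom_op.parameter_map["aicore_num"].s = tf.compat.as_bytes("12|12")

# Commonly disabled when running on the NPU through TF Adapter.
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF

with tf.Session(config=config) as sess:
    pass  # build the graph and run inference here
```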
