Analyzing GPU Memory Bottlenecks

When a two-node DeepSeek cluster serves inference requests, the actual Batchsize cannot reach 192.

Collect the serving performance data.
Observe how the data changes. With Batchsize set to 192, the number of Decodes increases. When the number of Decodes reaches 185, the KV blocks are nearly exhausted: only 36 KV blocks remain (the initial number is 1747). The number of remaining KV blocks then drops to 0 and Batchsize falls to 72. Finally, when Batchsize is 192, the reference reaches 253.
Figure 1 Viewing data
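The block counts reported in this step can be turned into a quick utilization check. The following is a minimal sketch that uses only the observed figures (1747 initial blocks, 36 remaining blocks, 185 Decodes); the variable names are illustrative and are not fields of the profiling output.

```python
# Observed values from the serving performance data in this case.
TOTAL_KV_BLOCKS = 1747      # initial number of KV blocks
REMAINING_KV_BLOCKS = 36    # blocks left when the number of Decodes reaches 185
ACTIVE_DECODES = 185        # number of Decodes at that moment

# Blocks already allocated to running requests.
used_blocks = TOTAL_KV_BLOCKS - REMAINING_KV_BLOCKS

# Average number of blocks held by each decoding request at that moment.
blocks_per_decode = used_blocks / ACTIVE_DECODES

# Fraction of the KV block pool that is in use.
utilization = used_blocks / TOTAL_KV_BLOCKS

print(f"used blocks: {used_blocks}")
print(f"blocks per decode: {blocks_per_decode:.2f}")
print(f"pool utilization: {utilization:.1%}")
```

At this point almost the whole block pool is in use, so any further growth of the running requests leaves no room for new ones, which is consistent with Batchsize falling well below 192.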
In actual execution, the number of Decodes can approach 1000 when Batchsize is 150.
Figure 2 Actual execution
Estimate the upper limit of concurrency under the GPU memory bottleneck. The Profiling tool shows that 1747 KV Cache blocks are available. By default, each block holds 128 tokens. For a workload of 1024 input tokens and 2048 output tokens, the average context length is (1024+2048)/2 = 1536 tokens, which occupies 1536/128 = 12 blocks per request. The KV Cache can therefore hold about 1747/[(1024+2048)/2/128] ≈ 145 concurrent requests on average.
Figure 3 Number of blocks
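The estimate in this step can be reproduced with a few lines of arithmetic. The following is a minimal sketch, not part of the serving software: the function name estimate_max_concurrency and its parameters are hypothetical, and rounding the per-request block count up is an assumption made here (a partially filled block still occupies a whole block).

```python
import math


def estimate_max_concurrency(total_blocks: int, block_size: int,
                             input_tokens: int, output_tokens: int) -> int:
    """Estimate how many requests the KV Cache can hold on average.

    Follows the formula in this case: the average context length is taken
    as (input_tokens + output_tokens) / 2.
    """
    avg_context = (input_tokens + output_tokens) / 2
    # Assumption: a partially filled block still consumes one whole block.
    blocks_per_request = math.ceil(avg_context / block_size)
    return total_blocks // blocks_per_request


# Values from this case: 1747 blocks, 128 tokens per block,
# 1024 input tokens and 2048 output tokens.
estimate = estimate_max_concurrency(1747, 128, 1024, 2048)
print(estimate)
```

With the values from this case, each request needs 12 blocks on average and the estimate is 145, well below the target Batchsize of 192.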
Adjust the serving parameters: set the environment variable with export NPU_MEMORY_FRACTION=0.96 and set maxSeqLen to 3K to achieve the optimal effect.
