Case 1

Background

A fixed percentage of the customer's data is routed from the main drive to a cluster, which uses several NPUs to process requests. The goal of the performance optimization is to reduce the number of cards in the cluster while keeping the request processing latency within the threshold.

Analysis Process

Step 1: Local data test

  1. The batching policy is implemented in the customer's server framework. You can modify the batch accumulation time and level through parameters.
  2. The maximum level means that each batch size from 1 to the maximum level is configured as a separate level.
  3. The number of streams indicates the number of instances on an NPU, that is, the maximum number of inference jobs that can be executed concurrently.
  4. The first number indicates the number of cubes, and the second number indicates the number of vectors.
  5. The performance values in the last three columns are QPS values, calculated as 1000 / Average time (in ms) x Number of streams.

    According to the preceding table, the QPS is highest when the maximum batch size is 16 and the number of streams is 3. Note that this data comes from a local test under uniform load, so its batch size distribution does not reflect the actual online distribution. The results may therefore differ from the actual online performance and should be used only as a reference for targeted analysis.
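The QPS formula in note 5 can be sketched as a small helper. This is a minimal illustration, assuming the average time is measured in milliseconds; the function name `qps` and the input values are hypothetical and are not taken from the table.

```python
def qps(avg_time_ms: float, num_streams: int) -> float:
    """QPS = 1000 / average time (ms) x number of streams."""
    if avg_time_ms <= 0:
        raise ValueError("avg_time_ms must be positive")
    # Each stream completes 1000 / avg_time_ms inferences per second,
    # and the streams on one NPU run concurrently.
    return 1000.0 / avg_time_ms * num_streams


# Illustrative values only, not measurements from the table.
print(qps(avg_time_ms=10.0, num_streams=3))  # 300.0
```

Because the formula scales linearly with the number of streams, it presumably holds only as long as adding a stream does not increase the average time per inference.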

Step 2: Online batch size distribution

The following figure shows the batch size data captured from the online environment. Most batch sizes are small: 90% are less than 20, and 86% are less than 16.

Based on the local test performance, you can set the maximum batch size to 16 and the number of streams to 3.
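The shares quoted in this step can be derived from captured batch sizes with a simple count. The sketch below is an illustration only: the helper name `share_below` and the sample list are hypothetical and do not reproduce the customer's online data.

```python
def share_below(batch_sizes, threshold):
    """Return the fraction of batch sizes strictly less than threshold."""
    if not batch_sizes:
        raise ValueError("batch_sizes must not be empty")
    return sum(1 for bs in batch_sizes if bs < threshold) / len(batch_sizes)


# Hypothetical captured batch sizes, for illustration only.
samples = [1, 2, 2, 4, 8, 8, 12, 15, 18, 32]
for threshold in (16, 20):
    print(f"bs < {threshold}: {share_below(samples, threshold):.0%}")
```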

Step 3: Online batch size performance tuning

The online data shows that NPU latency deteriorates severely during peak hours, that is, latency increases with traffic. NPU utilization is therefore the suspected cause.

According to the local test data (12/24 core division scenario), a single request with a batch size of 20 is processed on the NPU as follows:

Used Instances    bs    Latency (ms)
1                 20    9.19
2                 10    7.01

  • If the maximum batch size is set to 20, the request is processed as a single batch, which takes 9.19 ms and occupies only 12/24 cores.
  • If the maximum batch size is set to 16, the request is split into two batches of size 10, which are processed concurrently in 7.01 ms. This is faster than the single batch of 20, but it occupies 24/48 cores, twice the hardware resources. As a result, the NPU usage is too high.
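The trade-off in the two cases above can be checked with a short calculation. The sketch below assumes that an oversized request is split evenly into the smallest number of batches that fit under the maximum batch size (inferred from the 20 into 10 + 10 example, not confirmed as the framework's actual policy), and it measures resource cost in instance-milliseconds. The names `split_request` and `occupancy` are hypothetical; the latencies are the two local measurements listed above.

```python
import math

# Measured local latencies (ms) per batch size, from the data above.
LATENCY_MS = {20: 9.19, 10: 7.01}


def split_request(request_bs: int, max_bs: int) -> list[int]:
    """Split a request evenly into the fewest batches of size <= max_bs.

    Assumption: even splitting, inferred from the 20 -> 10 + 10 example.
    """
    n_chunks = math.ceil(request_bs / max_bs)
    base, extra = divmod(request_bs, n_chunks)
    return [base + 1] * extra + [base] * (n_chunks - extra)


def occupancy(request_bs: int, max_bs: int) -> tuple[int, float, float]:
    """Return (instances used, wall-clock latency in ms, instance-ms consumed)."""
    chunks = split_request(request_bs, max_bs)
    latencies = [LATENCY_MS[c] for c in chunks]
    # Batches run concurrently, so wall-clock latency is the slowest batch,
    # while the resource cost is the sum over all occupied instances.
    return len(chunks), max(latencies), sum(latencies)


for max_bs in (20, 16):
    used, latency, cost = occupancy(20, max_bs)
    print(f"max bs {max_bs}: {used} instance(s), {latency} ms, {cost:.2f} instance-ms")
```

Under these assumptions, splitting saves about 2.2 ms of latency for the request but consumes roughly 14.02 instance-ms instead of 9.19, about 50% more NPU time per request. That extra occupancy is what matters when traffic peaks.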

After the maximum batch size was modified, the average online latency did not change significantly, but the average NPU usage decreased by 5%, and the latency deterioration during peak hours was significantly alleviated.