Advanced Tuning of DeepSeek
Advanced tuning benefits both the serving layer and LLM inference. Some advanced tuning methods may require support from related components.
Analyzing the GPU Memory
Once routine optimization hits a bottleneck, a straightforward next step is to analyze GPU memory usage and use the result to choose better configurations, such as the number of concurrent tasks and the parallel policy. Quantization can also be used to reduce both the computation workload and GPU memory usage. Currently, the best solution is W8A8 (8-bit weights and 8-bit activations).
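The effect of weight quantization on GPU memory can be estimated with simple arithmetic. The following is a minimal sketch, not a measurement: the function name and the parameter count are hypothetical, and it covers only the weights (not the KV cache, activations, or runtime overhead).

```python
def weight_memory_gib(num_params: float, bits_per_weight: int) -> float:
    """Approximate memory needed to hold the model weights, in GiB."""
    bytes_total = num_params * bits_per_weight / 8
    return bytes_total / 2**30


# Hypothetical parameter count, for illustration only.
num_params = 100e9

mem_16bit = weight_memory_gib(num_params, 16)  # unquantized 16-bit weights
mem_8bit = weight_memory_gib(num_params, 8)    # 8-bit weights, as in W8A8

print(f"16-bit weights: {mem_16bit:.1f} GiB")
print(f" 8-bit weights: {mem_8bit:.1f} GiB")
```

Under this estimate, 8-bit weights take half the space of 16-bit weights, and the memory freed can be given to the KV cache or to a higher number of concurrent tasks.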
Tuning the Parallel Policy
In the current 16-card inference scenario, the optimal configuration is TP=8, DP=2, MOE_TP=4, and MOE_EP=4. However, users have different hosts (Arm or x86) and different input and output requirements, so the optimal parallel policy differs between deployments and needs to be adjusted accordingly.
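When trying alternative parallel policies, a quick consistency check can catch invalid combinations before a deployment is started. The sketch below assumes that the Attention degrees (TP x DP) and the MoE degrees (MOE_TP x MOE_EP) must each multiply to the number of cards; this constraint is an assumption inferred from the 16-card example above, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelPolicy:
    tp: int
    dp: int
    moe_tp: int
    moe_ep: int


def check_policy(policy: ParallelPolicy, world_size: int) -> list[str]:
    """Return a list of problems; an empty list means the policy is consistent."""
    problems = []
    # Assumption: the Attention part must cover all cards exactly once.
    if policy.tp * policy.dp != world_size:
        problems.append(
            f"TP*DP = {policy.tp * policy.dp}, expected {world_size}"
        )
    # Assumption: the MoE part must cover all cards exactly once.
    if policy.moe_tp * policy.moe_ep != world_size:
        problems.append(
            f"MOE_TP*MOE_EP = {policy.moe_tp * policy.moe_ep}, expected {world_size}"
        )
    return problems


# The 16-card configuration described above.
reference = ParallelPolicy(tp=8, dp=2, moe_tp=4, moe_ep=4)
print(check_policy(reference, world_size=16))
```

A candidate policy that returns a non-empty list can be discarded before any benchmark time is spent on it.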
Optimizing the Communication Policy
Different communication policies generate different amounts of traffic, so you need to evaluate the traffic based on the chosen parallel policy.
Tuning suggestions:
- Minimize the TP degree of Attention and increase the DP degree to avoid KV cache replication and the redundant memory access it causes.
- A pure DP policy saves KV cache space, but the model weights occupy more memory because each DP replica holds its own copy.
- Expert parallelism (EP) communication is configured through the ep_level keyword in the config.json configuration file in the ATB-Models installation directory. In theory, AlltoAll communication generates less traffic.
- Communication can be implemented through LCCL or HCCL. LCCL generally delivers better performance, but it may not work correctly with some parallel policies.
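The trade-off between TP and DP for Attention can be illustrated with a simplified memory model. The sketch below assumes that Attention weights are sharded across the TP ranks and replicated across the DP groups, and that every rank in a TP group holds a full copy of the KV cache for the requests of its group. All sizes and names are hypothetical and chosen only to show the trend.

```python
def attention_footprint(tp: int, dp: int, attn_weight_gib: float,
                        kv_per_request_gib: float, total_requests: int):
    """Return (weight GiB per card, KV cache GiB per card) under the model above."""
    # Weights: sharded across TP ranks, replicated across DP groups.
    weight_per_card = attn_weight_gib / tp
    # Requests are split evenly across DP groups.
    requests_per_group = total_requests / dp
    # Assumption: the KV cache is replicated on every rank of a TP group.
    kv_per_card = requests_per_group * kv_per_request_gib
    return weight_per_card, kv_per_card


# Hypothetical sizes on 16 cards, for illustration only.
for tp, dp in [(16, 1), (8, 2), (1, 16)]:
    weights, kv = attention_footprint(
        tp, dp, attn_weight_gib=16.0, kv_per_request_gib=0.5, total_requests=32
    )
    print(f"TP={tp:2d} DP={dp:2d}  weights/card={weights:5.1f} GiB  KV/card={kv:5.1f} GiB")
```

Under these assumptions, raising DP shrinks the KV cache held on each card while the weight share per card grows, which matches the first two suggestions above.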
Other Optimization Methods
- Weight format conversion: Convert the weights to the NZ format ahead of time so that less time is spent on format conversion during inference.
- Total request setting: Performance in the Decode phase ramps up over the course of a run. Because the request rate is limited, the batch size is small in the early stage and the bottleneck is not reached; the batch size in the Decode phase then increases gradually. Therefore, you are advised to set the total number of requests to 10 times the number of concurrent requests.
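The request-count rule above can be expressed as a small helper for benchmark scripts. This is a minimal sketch; the function name is hypothetical and not part of any tool mentioned here.

```python
def recommended_total_requests(concurrency: int, multiplier: int = 10) -> int:
    """Total requests to send, following the 10x-concurrency suggestion."""
    if concurrency <= 0:
        raise ValueError("concurrency must be a positive integer")
    return concurrency * multiplier


# Example: a benchmark run with 64 concurrent requests.
total = recommended_total_requests(64)
print(total)
```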