MTP
Multi-Token Prediction (MTP) is a parallel decoding method used by DeepSeek to generate multiple tokens at a time. The core idea is that, during inference, the model predicts not only the next token but several tokens at once, which markedly improves generation efficiency.
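To give a rough feel for the efficiency gain, the sketch below estimates how many tokens one decoding step could yield. It assumes a speculative-style scheme in which the extra predicted tokens are checked in order and each is accepted independently with the same probability; that acceptance model, the function name `expected_tokens_per_step`, and the sample rate 0.8 are illustrative assumptions, not measured MindIE behavior.

```python
def expected_tokens_per_step(num_speculative_tokens: int, acceptance_rate: float) -> float:
    """Expected tokens produced per decoding step under a simple acceptance model.

    Assumption (not from the MindIE documentation): speculative token i is kept
    only if tokens 1..i-1 were kept, and each check passes independently with
    probability `acceptance_rate`. One token is always produced per step.
    """
    if num_speculative_tokens < 0:
        raise ValueError("num_speculative_tokens must be >= 0")
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be within [0, 1]")
    # 1 guaranteed token + p + p^2 + ... + p^k for k speculative tokens.
    return sum(acceptance_rate ** i for i in range(num_speculative_tokens + 1))


for k in (0, 1, 2):
    print(k, round(expected_tokens_per_step(k, 0.8), 2))
```

Under this model the gain from each additional speculative token shrinks, which is one plausible reason a small value of num_speculative_tokens is usually enough.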
Constraints
- This feature is supported by the Atlas 800I A2 inference server and Atlas 800I A3 SuperPoD Server.
- Only the W8A8 and KV cache INT8 quantization models of DeepSeek-R1 and DeepSeek-V3 support this feature.
- This feature supports W4A8 quantization.
- This feature cannot be used with parallel decoding, Multi-LoRA, SplitFuse, or long sequence.
- This feature does not support postprocessing parameters related to multi-sequence inference, such as n, best_of, use_beam_search, and logprobs.
- MTP postprocessing supports only repetition penalty.
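The postprocessing constraints above can be checked on the client side before a request is sent. The following is a minimal sketch: the helper `find_unsupported_params` is hypothetical and not part of MindIE, it covers only the four parameters named above (the documented list is introduced with "such as", so it may not be exhaustive), and it conservatively treats any non-None value as a conflict.

```python
# Parameters the constraints list names as unsupported when MTP is enabled.
UNSUPPORTED_WITH_MTP = ("n", "best_of", "use_beam_search", "logprobs")


def find_unsupported_params(request_params: dict) -> list:
    """Return the names of request parameters that conflict with MTP.

    Hypothetical helper: a parameter is flagged whenever it is present with a
    non-None value, even if that value would be harmless (for example n=1).
    """
    return [name for name in UNSUPPORTED_WITH_MTP
            if request_params.get(name) is not None]


request = {"n": 2, "logprobs": True, "repetition_penalty": 1.05}
conflicts = find_unsupported_params(request)
if conflicts:
    print("Remove these parameters before using MTP:", ", ".join(conflicts))
```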
Parameters
Table 1 describes the parameters required for enabling the MTP feature.
| Parameter | Value Type | Value Range | Description |
|---|---|---|---|
| plugin_params | std::string | plugin_type: mtp; num_speculative_tokens: [1] | Configuration example: {\"plugin_type\":\"mtp\",\"num_speculative_tokens\": 1}. [Note] num_speculative_tokens configuration suggestions: In low-latency scenarios, you can set it to 1 or 2. In high-throughput scenarios, you are advised to set this parameter to 1. |
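Because plugin_params is a string that itself contains JSON, its inner quotation marks must be escaped when it is written into config.json. The sketch below builds the value with Python's standard json module so the escaping is not done by hand; the helper name `build_mtp_plugin_params` is an assumption for illustration and is not a MindIE API.

```python
import json


def build_mtp_plugin_params(num_speculative_tokens: int = 1) -> str:
    """Serialize the MTP plugin settings into the string expected by plugin_params."""
    if isinstance(num_speculative_tokens, bool) or not isinstance(num_speculative_tokens, int):
        raise TypeError("num_speculative_tokens must be an integer")
    if num_speculative_tokens < 1:
        raise ValueError("num_speculative_tokens must be at least 1")
    return json.dumps({
        "plugin_type": "mtp",
        "num_speculative_tokens": num_speculative_tokens,
    })


plugin_params = build_mtp_plugin_params(1)
print(plugin_params)  # {"plugin_type": "mtp", "num_speculative_tokens": 1}

# Serializing the enclosing object escapes the inner quotes, which yields the
# \"...\" form used in the configuration example.
print(json.dumps({"plugin_params": plugin_params}))
```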
Running Inference
- Open the config.json file of the Server.
    cd {MindIE installation directory}/latest/mindie-service/
    vi conf/config.json
- Set serving parameters. Add the plugin_params field (the first entry of ModelConfig in the example that follows) to the config.json file of the Server. For details about the fields, see Table 1. For details about the serving parameters, see Configuration Parameters (Service-Specific). The following is a parameter configuration example:
    "ModelDeployConfig" :
    {
        "maxSeqLen" : 2560,
        "maxInputTokenLen" : 2048,
        "truncation" : false,
        "ModelConfig" : [
            {
                "plugin_params": "{\"plugin_type\":\"mtp\",\"num_speculative_tokens\": 1}",
                "modelInstanceType" : "Standard",
                "modelName" : "DeepSeek-R1_w8a8",
                "modelWeightPath" : "/data/weights/DeepSeek-R1_w8a8",
                "worldSize" : 8,
                "cpuMemSize" : 5,
                "npuMemSize" : -1,
                "backendType" : "atb",
                "trustRemoteCode" : false
            }
        ]
    },
- Start the service.
    ./bin/mindieservice_daemon
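As an alternative to editing config.json by hand, the plugin_params field can be added with a small script before the service is started. The following is a sketch under stated assumptions: the helper names `find_model_deploy_config` and `enable_mtp` are hypothetical, and because the example above shows only the ModelDeployConfig fragment, the script searches for that key wherever it sits in the file instead of assuming a fixed location.

```python
import json
from typing import Any, Optional


def find_model_deploy_config(node: Any) -> Optional[dict]:
    """Depth-first search for the first "ModelDeployConfig" object in parsed JSON."""
    if isinstance(node, dict):
        if isinstance(node.get("ModelDeployConfig"), dict):
            return node["ModelDeployConfig"]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_model_deploy_config(child)
        if found is not None:
            return found
    return None


def enable_mtp(config: dict, num_speculative_tokens: int = 1) -> int:
    """Set plugin_params on every ModelConfig entry; return how many were updated."""
    deploy = find_model_deploy_config(config)
    if deploy is None:
        raise KeyError("ModelDeployConfig not found in configuration")
    plugin_params = json.dumps(
        {"plugin_type": "mtp", "num_speculative_tokens": num_speculative_tokens}
    )
    models = deploy.get("ModelConfig", [])
    for model in models:
        model["plugin_params"] = plugin_params
    return len(models)


# In-memory stand-in for the parsed contents of conf/config.json.
sample = {
    "ModelDeployConfig": {
        "maxSeqLen": 2560,
        "ModelConfig": [{"modelName": "DeepSeek-R1_w8a8", "worldSize": 8}],
    }
}
updated = enable_mtp(sample, num_speculative_tokens=1)
print(f"updated {updated} model entry/entries")
print(json.dumps(sample, indent=4))
```

To apply it to a real file, load the file with json.load, call enable_mtp on the result, and write it back with json.dump; keeping a backup copy of config.json first is advisable.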