MTP

MTP (Multi-Token Prediction) is a parallel decoding method used by DeepSeek models to generate multiple tokens per decoding step. Instead of predicting only the next token, the model predicts several upcoming tokens at once during inference, which improves generation efficiency.
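The idea can be illustrated with a toy draft-and-verify loop. This is a minimal sketch, not the MindIE or DeepSeek implementation: target_next and draft_next are hypothetical stand-ins for the main model and an MTP layer, and the acceptance rule (keep drafted tokens only while they match the main model) is an assumption borrowed from generic speculative decoding.

```python
from typing import List, Tuple


def target_next(prefix: List[int]) -> int:
    """Toy 'main model' (hypothetical): the next token is (last + 1) mod 10."""
    return (prefix[-1] + 1) % 10


def draft_next(prefix: List[int]) -> int:
    """Toy 'MTP layer' (hypothetical): agrees with the main model except after token 4."""
    last = prefix[-1]
    return 0 if last == 4 else (last + 1) % 10


def generate(prompt: List[int], max_new_tokens: int,
             num_speculative_tokens: int) -> Tuple[List[int], int]:
    """Return (new_tokens, decoding_steps) for a toy draft-and-verify loop."""
    tokens = list(prompt)
    steps = 0
    while len(tokens) - len(prompt) < max_new_tokens:
        steps += 1
        # The main model always contributes one token per step.
        first = target_next(tokens)
        # Draft extra tokens on top of it.
        draft: List[int] = []
        for _ in range(num_speculative_tokens):
            draft.append(draft_next(tokens + [first] + draft))
        # Verify: accept drafted tokens while they match the main model.
        # A real system would check all drafts in one batched forward pass;
        # here that pass is only counted as a single step.
        accepted = [first]
        for token in draft:
            if token == target_next(tokens + accepted):
                accepted.append(token)
            else:
                break
        tokens.extend(accepted)
    new_tokens = tokens[len(prompt):len(prompt) + max_new_tokens]
    return new_tokens, steps


baseline, baseline_steps = generate([0], 8, 0)
with_mtp, mtp_steps = generate([0], 8, 1)
print(baseline == with_mtp, baseline_steps, mtp_steps)
```

In this toy setup the generated tokens are identical with and without drafting; only the number of decoding steps changes, which is where the efficiency gain comes from.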

Constraints

  • This feature is supported by the Atlas 800I A2 inference server and Atlas 800I A3 SuperPoD Server.
  • Only the W8A8 and KV cache INT8 quantization models of DeepSeek-R1 and DeepSeek-V3 support this feature.
  • This feature supports W4A8 quantization.
  • This feature cannot be used with parallel decoding, Multi-LoRA, SplitFuse, or long sequence.
  • This feature does not support postprocessing parameters related to multi-sequence inference, such as n, best_of, use_beam_search, and logprobs.
  • MTP postprocessing supports only repetition penalty.
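
A client can check a request against the multi-sequence constraint before sending it. The following is a minimal sketch: find_unsupported_mtp_params is a hypothetical helper, not part of MindIE, and it conservatively assumes that the mere presence of a listed parameter in the request is a problem, whatever its value.

```python
from typing import Any, Dict, List

# Multi-sequence postprocessing parameters listed above as unsupported with MTP.
UNSUPPORTED_WITH_MTP = ("n", "best_of", "use_beam_search", "logprobs")


def find_unsupported_mtp_params(request: Dict[str, Any]) -> List[str]:
    """Return the names of request fields that MTP does not support."""
    return [name for name in UNSUPPORTED_WITH_MTP if name in request]


# Hypothetical request body, used only for illustration.
request = {"prompt": "Hello", "repetition_penalty": 1.1, "n": 2, "logprobs": True}
print(find_unsupported_mtp_params(request))
```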

Parameters

Table 1 describes the parameters required for enabling the MTP feature.

Table 1 Supplementary parameters of MTP: ModelConfig in ModelDeployConfig

Parameter: plugin_params
Value Type: std::string
Value Range: plugin_type: mtp; num_speculative_tokens: [1]
Description:
  • plugin_type: mtp indicates that MTP is enabled.
  • num_speculative_tokens indicates the number of MTP layers. The value can be 1 or 2.
  • If no plugin function is required, remove this field from the configuration.
  Configuration example:
    {\"plugin_type\":\"mtp\",\"num_speculative_tokens\": 1}

[Note] num_speculative_tokens configuration suggestions:

In low-latency scenarios, you can set it to 1 or 2. In high-throughput scenarios, you are advised to set this parameter to 1.
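
Because plugin_params is a JSON object stored inside a string, its inner quotes must be escaped when it is written into config.json. The sketch below shows one way to generate the value rather than escaping it by hand. build_plugin_params is a hypothetical helper, and the 1-or-2 check simply mirrors the range stated in Table 1.

```python
import json

# Range taken from the description in Table 1.
ALLOWED_SPECULATIVE_TOKENS = (1, 2)


def build_plugin_params(num_speculative_tokens: int = 1) -> str:
    """Return the plugin_params value as a compact JSON string."""
    if num_speculative_tokens not in ALLOWED_SPECULATIVE_TOKENS:
        raise ValueError("num_speculative_tokens must be 1 or 2")
    params = {
        "plugin_type": "mtp",
        "num_speculative_tokens": num_speculative_tokens,
    }
    return json.dumps(params, separators=(",", ":"))


plugin_params = build_plugin_params(1)
# Serializing the enclosing object escapes the inner quotes automatically.
print(json.dumps({"plugin_params": plugin_params}))
```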

Running Inference

  1. Open the config.json file of the Server.
    cd {MindIE installation directory}/latest/mindie-service/
    vi conf/config.json
  2. Set serving parameters. Add the plugin_params field (the "plugin_params" line in the following example) to the ModelConfig section of the Server's config.json file. For details about the field, see Table 1. For details about the serving parameters, see Configuration Parameters (Service-Specific). The following is a parameter configuration example:
    "ModelDeployConfig" :
    {
        "maxSeqLen" : 2560,
        "maxInputTokenLen" : 2048,
        "truncation" : false,
        "ModelConfig" : [
            {
                "plugin_params": "{\"plugin_type\":\"mtp\",\"num_speculative_tokens\": 1}",
                "modelInstanceType" : "Standard",
                "modelName" : "DeepSeek-R1_w8a8",
                "modelWeightPath" : "/data/weights/DeepSeek-R1_w8a8",
                "worldSize" : 8,
                "cpuMemSize" : 5,
                "npuMemSize" : -1,
                "backendType" : "atb",
                "trustRemoteCode" : false
            }
        ]
    },
  3. Start the service.
    ./bin/mindieservice_daemon
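
Step 2 can also be scripted instead of edited by hand. The following is a minimal sketch under stated assumptions: enable_mtp, find_key, and patch_file are hypothetical helpers, not MindIE tools. Because the example above does not show where ModelDeployConfig sits inside config.json, the sketch searches for that key rather than assuming a path. Back up config.json first, since rewriting the file also changes its formatting.

```python
import json
from typing import Any, Optional


def find_key(node: Any, key: str) -> Optional[Any]:
    """Depth-first search for the first value stored under `key`."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def enable_mtp(config: dict, num_speculative_tokens: int = 1) -> dict:
    """Add plugin_params to every ModelConfig entry, modifying config in place."""
    deploy = find_key(config, "ModelDeployConfig")
    if deploy is None:
        raise KeyError("ModelDeployConfig not found in configuration")
    plugin_params = json.dumps(
        {"plugin_type": "mtp", "num_speculative_tokens": num_speculative_tokens},
        separators=(",", ":"),
    )
    for model in deploy["ModelConfig"]:
        model["plugin_params"] = plugin_params
    return config


def patch_file(path: str, num_speculative_tokens: int = 1) -> None:
    """Load config.json from `path`, enable MTP, and write the file back."""
    with open(path, "r", encoding="utf-8") as f:
        config = json.load(f)
    enable_mtp(config, num_speculative_tokens)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(config, f, indent=4)
```

For example, calling patch_file("conf/config.json") from the mindie-service directory before starting the daemon would apply the same change as the manual edit in step 2.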