Tensor Parallelism

Tensor parallelism (TP) is a model parallelism strategy that splits tensors (such as weight matrices and activation values) among multiple devices (such as NPUs) to implement distributed model inference.
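
The splitting idea can be illustrated with a minimal sketch. It is not MindIE code: plain Python lists stand in for tensors, the "devices" are simulated as list entries, and the helper names (matmul, split_columns, column_parallel_matmul) are hypothetical. It assumes a column-wise split of a weight matrix, which is one common way to apply TP.

```python
# Illustrative only: no NPU or MindIE API is involved.

def matmul(x, w):
    """Multiply matrix x (m x k) by matrix w (k x n)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]

def split_columns(w, tp):
    """Split w (k x n) into tp column shards of equal width."""
    n = len(w[0])
    if n % tp != 0:
        raise ValueError("column count must be divisible by tp")
    width = n // tp
    return [[row[i * width:(i + 1) * width] for row in w] for i in range(tp)]

def column_parallel_matmul(x, w, tp):
    """Each simulated device multiplies x by its own shard of w.

    The per-device outputs are then concatenated along the column axis,
    which reproduces the result of the unsplit multiplication.
    """
    partials = [matmul(x, shard) for shard in split_columns(w, tp)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]

# The sharded computation matches the single-device computation.
assert column_parallel_matmul(x, w, tp=2) == matmul(x, w)
```

Each shard holds only a fraction of the weight matrix, so the per-device matrix computation shrinks as the split count grows, at the cost of a gather step to reassemble the output.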

Constraints

  • This feature is supported by the Atlas 800I A2 inference server and Atlas 800I A3 SuperPoD Server.
  • DeepSeek-V3 and DeepSeek-R1 support local TP splitting of the LmHead matrix, local TP splitting of the O project matrix, and TP greater than 1.
  • Prefill-decode disaggregation scenarios with distributed decode nodes support local TP splitting of the LmHead and O project matrices, which reduces the matrix computation time and inference latency.
  • In prefill-decode disaggregation scenarios with distributed, low-latency decode nodes, if TP exceeds 1, TP splitting of MLA is supported, which reduces the decode inference latency in small-batch and low-latency scenarios.
  • If TP exceeds 1, TP splitting of MLA cannot be enabled together with local TP splitting of the O project matrix, and enabling it together with local TP splitting of the LmHead matrix is not recommended.

Parameters

Table 1 describes the parameters required for enabling local TP splitting of the LmHead matrix.

Table 1 Supplementary parameters for local TP splitting of the LmHead matrix: models in ModelConfig

  Parameter                 Value Type   Value Range                        Description
  deepseekv2
    parallel_options
      lm_head_local_tp      Integer      [1, worldSize / Number of nodes]   TP split count for LmHead.

  • Only the DeepSeek-R1, DeepSeek-V3, and DeepSeek-V3.1 models support this feature.
  • The default value is -1, indicating that splitting is disabled.

Table 2 describes the parameters required for enabling local TP splitting of the O project matrix.

Table 2 Supplementary parameters for local TP splitting of the O project matrix: models in ModelConfig

  Parameter                 Value Type   Value Range                        Description
  deepseekv2
    parallel_options
      o_proj_local_tp       Integer      [1, worldSize / Number of nodes]   Split count for the Attention O matrix.

  • Only the DeepSeek-R1, DeepSeek-V3, and DeepSeek-V3.1 models support this feature.
  • The default value is -1, indicating that splitting is disabled.
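
The value rules in Table 1 and Table 2 can be expressed as a small check. This is a sketch, not part of MindIE: the function name check_local_tp is hypothetical, and it assumes that "worldSize / Number of nodes" is an integer division and that -1 is accepted alongside the documented range because it is the "disabled" default.

```python
def check_local_tp(name, value, world_size, num_nodes):
    """Return None if value is acceptable, otherwise an error message.

    Accepts -1 (splitting disabled) or an integer in
    [1, world_size // num_nodes]. Integer division is an assumption.
    """
    if value == -1:
        return None  # default: local TP splitting is disabled
    upper = world_size // num_nodes
    if isinstance(value, int) and 1 <= value <= upper:
        return None
    return f"{name}={value} is not -1 and is outside [1, {upper}]"

# Hypothetical deployment: worldSize 16 spread over 2 nodes, so the upper bound is 8.
for name, value in [("lm_head_local_tp", -1), ("o_proj_local_tp", 8)]:
    problem = check_local_tp(name, value, world_size=16, num_nodes=2)
    print(name, "OK" if problem is None else problem)
```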

Running Inference

  1. Open the config.json file of the Server.
    cd {MindIE installation directory}/latest/mindie-service/
    vi conf/config.json
  2. Set serving parameters. Add the corresponding parameters to the config.json file of the Server based on Table 1 and Table 2. For details about the serving parameters, see Configuration Parameters (Service-Specific). The parameter configuration example is as follows.

    The following example uses the DeepSeek-R1 model with TP splitting enabled (tp set to 2) and local TP splitting of the LmHead and O project matrices disabled (lm_head_local_tp and o_proj_local_tp both set to -1).

    "ModelDeployConfig" :
    {
       "maxSeqLen" : 2560,
       "maxInputTokenLen" : 2048,
       "truncation" : false,
       "ModelConfig" : [
         {
             "modelInstanceType" : "Standard",
             "modelName" : "DeepSeek-R1_w8a8",
             "modelWeightPath" : "/data/weights/DeepSeek-R1_w8a8",
             "worldSize" : 8,
             "cpuMemSize" : 5,
             "npuMemSize" : -1,
             "backendType" : "atb",
             "trustRemoteCode" : false,
             "tp": 2,
             "models": {
                "deepseekv2": {
                    "parallel_options": {
                        "lm_head_local_tp": -1,
                        "o_proj_local_tp": -1
                    }
                }
             }
          }
       ]
    },
    
  3. Start the service.
    ./bin/mindieservice_daemon
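
Before starting the service, the combination rules from the Constraints section can be checked against a ModelConfig entry. The following is a minimal sketch under stated assumptions: the function name check_parallel_options is hypothetical, it operates on an already-parsed ModelConfig entry rather than on the full config.json, and treating a missing "tp" key as 1 is an assumption rather than documented behavior.

```python
def check_parallel_options(model_config):
    """Return (errors, warnings) for one ModelConfig entry given as a dict."""
    tp = model_config.get("tp", 1)  # assumption: a missing "tp" means no TP splitting
    opts = (model_config.get("models", {})
            .get("deepseekv2", {})
            .get("parallel_options", {}))
    o_proj = opts.get("o_proj_local_tp", -1)    # -1 is the documented default
    lm_head = opts.get("lm_head_local_tp", -1)  # -1 is the documented default

    errors, warnings = [], []
    if tp > 1 and o_proj != -1:
        errors.append("tp > 1 cannot be combined with o_proj_local_tp")
    if tp > 1 and lm_head != -1:
        warnings.append("tp > 1 together with lm_head_local_tp is not advised")
    return errors, warnings

# Parallelism-related fields of the ModelConfig entry from the example above.
entry = {
    "modelName": "DeepSeek-R1_w8a8",
    "worldSize": 8,
    "tp": 2,
    "models": {
        "deepseekv2": {
            "parallel_options": {"lm_head_local_tp": -1, "o_proj_local_tp": -1}
        }
    },
}

errors, warnings = check_parallel_options(entry)
print("errors:", errors)
print("warnings:", warnings)
```

The check separates hard conflicts from discouraged combinations because the Constraints section forbids the first and only advises against the second.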