Tensor Parallelism

Tensor parallelism (TP) is a model parallelism strategy that splits tensors (such as weight matrices and activation values) among multiple devices (such as NPUs) to implement distributed model inference.
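To make the idea of splitting a weight matrix concrete, here is a minimal, framework-free sketch. It is not how MindIE implements TP; it only assumes one common sharding scheme (splitting the weight matrix by columns) and simulates the devices as list entries. The function names are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def split_columns(weight, tp):
    """Split a weight matrix into tp equal column shards, one per simulated device."""
    cols = len(weight[0])
    assert cols % tp == 0, "column count must be divisible by tp"
    width = cols // tp
    return [[row[i * width:(i + 1) * width] for row in weight] for i in range(tp)]


def column_parallel_matmul(x, weight, tp):
    """Each simulated device multiplies x by its own shard; results are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(weight, tp)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]              # activations
w = [[1, 0, 2, 1], [0, 1, 1, 3]]  # weight matrix with 4 output columns
full = matmul(x, w)
sharded = column_parallel_matmul(x, w, tp=2)
print(full == sharded)
```

Each shard holds only half of the weight columns, so each device performs half of the multiplications, and concatenating the partial outputs reproduces the unsplit result.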

Constraints

This feature is supported by the Atlas 800I A2 inference server and Atlas 800I A3 SuperPoD Server.
DeepSeek-V3 and DeepSeek-R1 support local TP splitting of the LmHead matrix, local TP splitting of the O project matrix, and TP values greater than 1.
Prefill-decode disaggregation scenarios with distributed decode nodes support local TP splitting of the LmHead and O project matrices, which reduces the matrix computation time and inference latency.
In prefill-decode disaggregation scenarios with distributed, low-latency decode nodes, TP splitting of multi-head latent attention (MLA) is supported when TP exceeds 1. This reduces decode inference latency for small batches.
If TP exceeds 1, local TP splitting of the O project matrix cannot be enabled at the same time, and enabling local TP splitting of the LmHead matrix at the same time is not recommended.
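The last constraint can be checked before the service is started. The following is a minimal sketch, not part of MindIE: the function name and return shape are hypothetical, the parameter names are taken from Table 1 and Table 2, and -1 is treated as "disabled" as those tables state.

```python
DISABLED = -1  # value that Table 1 and Table 2 give for "splitting disabled"


def check_tp_combination(tp, lm_head_local_tp=DISABLED, o_proj_local_tp=DISABLED):
    """Return (errors, warnings) for a combination of TP settings."""
    errors, warnings = [], []
    if tp > 1:
        # Unsupported combination: tp > 1 together with O project local TP.
        if o_proj_local_tp != DISABLED:
            errors.append("o_proj_local_tp cannot be enabled when tp > 1")
        # Allowed but discouraged: tp > 1 together with LmHead local TP.
        if lm_head_local_tp != DISABLED:
            warnings.append("lm_head_local_tp is not recommended when tp > 1")
    return errors, warnings


print(check_tp_combination(tp=2))
print(check_tp_combination(tp=2, lm_head_local_tp=4, o_proj_local_tp=4))
```

Returning errors and warnings separately mirrors the difference between "cannot be enabled" and "not advised" in the constraint.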

Parameters

Table 1 describes the parameters required for enabling local TP splitting of the LmHead matrix.

**Table 1** Supplementary parameters for local TP splitting of the LmHead matrix: models in ModelConfig

The parameter is nested under deepseekv2 > parallel_options.

| Parameter | Value Type | Value Range | Description |
|---|---|---|---|
| lm_head_local_tp | Integer | [1, worldSize / Number of nodes] | TP split count for LmHead. Only the DeepSeek-R1, DeepSeek-V3, and DeepSeek-V3.1 models support this feature. The default value is -1, indicating that splitting is disabled. |

Table 2 describes the parameters required for enabling local TP splitting of the O project matrix.

**Table 2** Supplementary parameters for local TP splitting of the O project matrix: models in ModelConfig

The parameter is nested under deepseekv2 > parallel_options.

| Parameter | Value Type | Value Range | Description |
|---|---|---|---|
| o_proj_local_tp | Integer | [1, worldSize / Number of nodes] | Split count for the Attention O matrix. Only the DeepSeek-R1, DeepSeek-V3, and DeepSeek-V3.1 models support this feature. The default value is -1, indicating that splitting is disabled. |
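The value range in both tables depends on worldSize and the number of nodes. As a rough illustration, the sketch below computes the upper bound and validates a candidate value. The helper names are hypothetical, and the code assumes that worldSize divides evenly by the number of nodes, which this topic does not state.

```python
def local_tp_upper_bound(world_size, num_nodes):
    """Upper bound of the range [1, worldSize / Number of nodes]."""
    if num_nodes < 1 or world_size < 1:
        raise ValueError("world_size and num_nodes must be positive")
    if world_size % num_nodes != 0:
        # Assumption of this sketch, not a documented rule.
        raise ValueError("world_size is expected to be a multiple of num_nodes")
    return world_size // num_nodes


def is_valid_local_tp(value, world_size, num_nodes):
    """Accept -1 (splitting disabled) or any integer in the documented range."""
    if value == -1:
        return True
    return 1 <= value <= local_tp_upper_bound(world_size, num_nodes)


print(local_tp_upper_bound(world_size=16, num_nodes=2))
print(is_valid_local_tp(8, world_size=16, num_nodes=2))
```

The same check applies to both lm_head_local_tp and o_proj_local_tp because the two tables give the same range.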

Running Inference

Open the config.json file of the Server.

```
cd {MindIE installation directory}/latest/mindie-service/
vi conf/config.json
```

Set the serving parameters. Add the parameters described in Table 1 and Table 2 to the config.json file of the Server as required. For details about the serving parameters, see Configuration Parameters (Service-Specific). A configuration example follows.

The following example uses the DeepSeek-R1 model with TP splitting enabled ("tp" set to 2) and local TP splitting of the LmHead and O project matrices disabled (both parameters set to -1).

```
"ModelDeployConfig" :
{
    "maxSeqLen" : 2560,
    "maxInputTokenLen" : 2048,
    "truncation" : false,
    "ModelConfig" : [
        {
            "modelInstanceType" : "Standard",
            "modelName" : "DeepSeek-R1_w8a8",
            "modelWeightPath" : "/data/weights/DeepSeek-R1_w8a8",
            "worldSize" : 8,
            "cpuMemSize" : 5,
            "npuMemSize" : -1,
            "backendType" : "atb",
            "trustRemoteCode" : false,
            "tp": 2,
            "models": {
                "deepseekv2": {
                    "parallel_options": {
                        "lm_head_local_tp": -1,
                        "o_proj_local_tp": -1
                    }
                }
            }
        }
    ]
},
```
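If the file is edited by a script rather than by hand, the nested keys can be set programmatically. The sketch below is an assumption-laden illustration: the helper name is hypothetical, and it operates on a dictionary shaped like the ModelDeployConfig fragment above. Where that section sits inside the full config.json is not shown in this topic, so loading the file and locating the section are left out.

```python
import json


def set_local_tp(model_deploy_config, model_index=0,
                 lm_head_local_tp=-1, o_proj_local_tp=-1):
    """Set the two local TP parameters on one ModelConfig entry, creating
    the models > deepseekv2 > parallel_options nesting if it is missing."""
    model = model_deploy_config["ModelConfig"][model_index]
    options = (model.setdefault("models", {})
                    .setdefault("deepseekv2", {})
                    .setdefault("parallel_options", {}))
    options["lm_head_local_tp"] = lm_head_local_tp
    options["o_proj_local_tp"] = o_proj_local_tp
    return model_deploy_config


# Reduced stand-in for the ModelDeployConfig section shown above.
deploy = {"ModelConfig": [{"modelName": "DeepSeek-R1_w8a8", "worldSize": 8, "tp": 2}]}
set_local_tp(deploy)
print(json.dumps(deploy["ModelConfig"][0]["models"], indent=4))
```

Using setdefault keeps any other keys that already exist under models or deepseekv2 untouched.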

Start the service.
```
./bin/mindieservice_daemon
```
