Tensor Parallelism
Tensor parallelism (TP) is a model parallelism strategy that splits tensors (such as weight matrices and activation values) among multiple devices (such as NPUs) to implement distributed model inference.
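The splitting idea can be illustrated with a minimal sketch. It uses plain Python lists in place of device tensors and shows only a column-wise split of a weight matrix. The function names (`matmul`, `split_columns`, `column_parallel_matmul`) are hypothetical and are not part of MindIE.

```python
# Illustrative only: plain Python lists stand in for tensors on NPUs.

def matmul(x, w):
    """Multiply an (m x k) matrix by a (k x n) matrix."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def split_columns(w, tp):
    """Split the columns of w into tp equal shards, one per device."""
    n = len(w[0])
    if n % tp != 0:
        raise ValueError("column count must be divisible by tp")
    step = n // tp
    return [[row[i * step:(i + 1) * step] for row in w] for i in range(tp)]


def column_parallel_matmul(x, w, tp):
    """Each simulated device multiplies x by its shard; outputs are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, tp)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]


x = [[1, 2], [3, 4]]
w = [[1, 0, 2, 1], [0, 1, 1, 3]]
# The sharded computation reproduces the unsharded result.
assert column_parallel_matmul(x, w, tp=2) == matmul(x, w)
```

In a real deployment each shard lives on a different device and the concatenation is a collective communication step. The sketch only shows why the split does not change the result.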
Constraints
- This feature is supported by the Atlas 800I A2 inference server and Atlas 800I A3 SuperPoD Server.
- DeepSeek-V3 and DeepSeek-R1 support local TP splitting of the LmHead matrix, local TP splitting of the O projection matrix, and TP greater than 1.
- Prefill-decode disaggregation scenarios with distributed decode nodes support local TP splitting of the LmHead and O projection matrices, which reduces the matrix computation time and inference latency.
- In prefill-decode disaggregation scenarios with distributed, low-latency decode nodes, if TP exceeds 1, TP splitting of MLA (Multi-head Latent Attention) is supported, which reduces the decode inference latency in small-batch and low-latency scenarios.
- If TP exceeds 1, this feature cannot be enabled together with local TP splitting of the O projection matrix, and enabling it together with local TP splitting of the LmHead matrix is not recommended.
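The last constraint can be sketched as a small pre-flight check. This is only one possible reading of the rule: it assumes that a local split count greater than 1 means the local split is enabled (a count of 1 leaves the matrix unsplit, and the configuration example in this section uses -1 when the split is disabled). The helper names are hypothetical and are not part of MindIE.

```python
def is_local_tp_enabled(value):
    # Assumption: a split count above 1 means the local split is active.
    return value > 1


def check_tp_constraints(tp, lm_head_local_tp=-1, o_proj_local_tp=-1):
    """Return (errors, warnings) for a combination of TP settings."""
    errors, warnings = [], []
    if tp > 1:
        if is_local_tp_enabled(o_proj_local_tp):
            errors.append("tp > 1 cannot be combined with o_proj_local_tp")
        if is_local_tp_enabled(lm_head_local_tp):
            warnings.append("tp > 1 together with lm_head_local_tp is not recommended")
    return errors, warnings


# tp = 2 with both local splits disabled raises no findings.
print(check_tp_constraints(tp=2, lm_head_local_tp=-1, o_proj_local_tp=-1))
```

Returning errors and warnings separately mirrors the wording of the constraint, which forbids one combination and only discourages the other.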
Parameters
Table 1 describes the parameters required for enabling local TP splitting of the LmHead matrix.
| Parameter | Value Type | Value Range | Description |
|---|---|---|---|
| deepseekv2 | | | |
| parallel_options | | | |
| lm_head_local_tp | Integer | [1, worldSize / Number of nodes] | TP split count for LmHead. |
Table 2 describes the parameters required for enabling local TP splitting of the O projection matrix.
| Parameter | Value Type | Value Range | Description |
|---|---|---|---|
| deepseekv2 | | | |
| parallel_options | | | |
| o_proj_local_tp | Integer | [1, worldSize / Number of nodes] | Split count for the Attention O matrix. |
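The value range shared by both parameters can be checked with a short sketch. It assumes that "worldSize / Number of nodes" is an integer division, which the tables do not state explicitly. The function names are hypothetical.

```python
def local_tp_upper_bound(world_size, num_nodes):
    """Upper bound of the documented range: worldSize / number of nodes."""
    if num_nodes < 1 or world_size < num_nodes:
        raise ValueError("expected world_size >= num_nodes >= 1")
    # Assumption: the division is an integer division.
    return world_size // num_nodes


def is_valid_local_tp(value, world_size, num_nodes):
    """True if value lies in [1, worldSize / number of nodes]."""
    return isinstance(value, int) and 1 <= value <= local_tp_upper_bound(world_size, num_nodes)


# One node with worldSize 8: the range is [1, 8].
print(is_valid_local_tp(8, world_size=8, num_nodes=1))
# Two nodes with worldSize 16: the range is still [1, 8].
print(is_valid_local_tp(16, world_size=16, num_nodes=2))
```

The sketch checks only the documented range. The value -1 used in the configuration example of this section falls outside that range and would need separate handling.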
Running Inference
- Open the config.json file of the Server.
    cd {MindIE installation directory}/latest/mindie-service/
    vi conf/config.json
- Set serving parameters. Add the corresponding parameters to the config.json file of the Server based on Table 1 and Table 2. For details about the serving parameters, see Configuration Parameters (Service-Specific). A configuration example follows.
  The example uses the DeepSeek-R1 model, with TP splitting enabled and local TP splitting of the LmHead and O projection matrices disabled.
    "ModelDeployConfig" :
    {
        "maxSeqLen" : 2560,
        "maxInputTokenLen" : 2048,
        "truncation" : false,
        "ModelConfig" : [
            {
                "modelInstanceType" : "Standard",
                "modelName" : "DeepSeek-R1_w8a8",
                "modelWeightPath" : "/data/weights/DeepSeek-R1_w8a8",
                "worldSize" : 8,
                "cpuMemSize" : 5,
                "npuMemSize" : -1,
                "backendType" : "atb",
                "trustRemoteCode" : false,
                "tp": 2,
                "models": {
                    "deepseekv2": {
                        "parallel_options": {
                            "lm_head_local_tp": -1,
                            "o_proj_local_tp": -1
                        }
                    }
                }
            }
        ]
    },
- Start the service.
    ./bin/mindieservice_daemon
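The parameter-setting step above can also be scripted instead of edited by hand. The following is a minimal sketch that operates on a `ModelDeployConfig` object already loaded as a Python dictionary. The function `set_parallel_options` and the shortened sample are hypothetical, and where `ModelDeployConfig` sits inside the full config.json is not shown in this section, so locating it is left to the caller.

```python
import copy
import json


def set_parallel_options(model_deploy_config, lm_head_local_tp, o_proj_local_tp, model_index=0):
    """Return a copy of ModelDeployConfig with the deepseekv2 parallel options set."""
    updated = copy.deepcopy(model_deploy_config)
    model = updated["ModelConfig"][model_index]
    # Create models -> deepseekv2 -> parallel_options if any level is missing.
    options = (
        model.setdefault("models", {})
        .setdefault("deepseekv2", {})
        .setdefault("parallel_options", {})
    )
    options["lm_head_local_tp"] = lm_head_local_tp
    options["o_proj_local_tp"] = o_proj_local_tp
    return updated


# Shortened, illustrative ModelDeployConfig; a real one carries more fields.
sample = json.loads(
    '{"ModelConfig": [{"modelName": "DeepSeek-R1_w8a8", "worldSize": 8, "tp": 2}]}'
)
result = set_parallel_options(sample, lm_head_local_tp=-1, o_proj_local_tp=-1)
print(json.dumps(result, indent=4))
```

The function returns a modified copy rather than changing its input, so the original configuration stays available for comparison before anything is written back to disk.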