Expert Parallelism
MoE models support expert parallelism (EP), which deploys experts on different devices to implement expert-level parallel computing.
Currently, two EP forms are implemented:
1. EP based on AllGather communication (ep_level = 1)
2. EP based on AllToAll and communication-computing fusion (ep_level = 2)
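As a rough illustration of what deploying experts on different devices means, the following sketch splits expert indices evenly across EP ranks. The even, contiguous split and the function name `experts_for_rank` are assumptions made for illustration only; the actual placement strategy is not described in this section.

```python
def experts_for_rank(num_experts: int, ep_size: int, rank: int) -> list[int]:
    """Return the expert indices hosted by one EP rank.

    ASSUMPTION: experts are split evenly and contiguously across ranks.
    This is a hypothetical placement, not the documented behavior.
    """
    if ep_size <= 0 or num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by a positive ep_size")
    if not 0 <= rank < ep_size:
        raise ValueError("rank must be in [0, ep_size)")
    per_rank = num_experts // ep_size
    start = rank * per_rank
    return list(range(start, start + per_rank))


# Example: 16 experts over 8 devices gives 2 experts per device.
placement = {rank: experts_for_rank(16, 8, rank) for rank in range(8)}
```

With this split, every expert lives on exactly one device, so tokens routed to an expert must be sent to the device that hosts it, which is why the two EP forms differ in their communication pattern.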
Constraints
- The DeepSeek-V2, DeepSeek-V3, and DeepSeek-R1 models support this feature.
- If the number of parallel experts exceeds 32, DeepSeek-V3 and DeepSeek-R1 automatically enable the grouped matmul fused operator to improve computing performance.
Parameters
Table 1 describes the serving parameters required for enabling the EP feature.
| Parameter | Value Type | Value Range | Description |
|---|---|---|---|
| deepseekv2 | | | |
| ep_level | Integer | [1,2] | EP implementation form. 1: EP based on AllGather communication. 2: EP based on AllToAll and communication-computing fusion. NOTE: If ep_level is set to 2 when two servers are deployed, the two servers must be connected through a switch. Otherwise, the service will fail to start. |
| enable_init_routing_cutoff | Bool | | Whether to allow top k result truncation. |
| topk_scaling_factor | Float | (0,1] | Top k result truncation parameter. |
| alltoall_ep_buffer_scale_factors | list[list[int, float]] | Each member in the list contains two numbers. The first number is a non-negative integer, and the second number is a floating-point number greater than 0. The members are sorted in descending order based on the first number. | Size of the AllToAll communication buffer. Each inner list contains two elements: the first is the sequence length, and the second is the buffer coefficient. The sequence length is the condition for selecting the buffer coefficient. Example: [[1048576, 1.32], [524288, 1.4], [262144, 1.53], [131072, 1.8], [32768, 3.0], [8192, 5.2], [0, 8.0]] |
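The constraints on alltoall_ep_buffer_scale_factors in Table 1 can be checked before the service is started. The sketch below validates the documented value range and then shows one plausible way a coefficient could be selected by sequence length. The selection rule (first entry whose sequence length is less than or equal to the actual length) is an assumption, as are the function names; Table 1 only states that the sequence length is the selection condition.

```python
EXAMPLE_FACTORS = [[1048576, 1.32], [524288, 1.4], [262144, 1.53],
                   [131072, 1.8], [32768, 3.0], [8192, 5.2], [0, 8.0]]


def validate_scale_factors(factors):
    """Raise ValueError if the list violates the value range in Table 1."""
    previous = None
    for member in factors:
        if not isinstance(member, (list, tuple)) or len(member) != 2:
            raise ValueError(f"each member must contain two numbers: {member!r}")
        seq_len, coeff = member
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(seq_len, bool) or not isinstance(seq_len, int) or seq_len < 0:
            raise ValueError(f"first number must be a non-negative integer: {member!r}")
        if isinstance(coeff, bool) or not isinstance(coeff, (int, float)) or coeff <= 0:
            raise ValueError(f"second number must be greater than 0: {member!r}")
        # Members must be in descending order of the first number.
        # Equal values are tolerated here; the table does not say either way.
        if previous is not None and seq_len > previous:
            raise ValueError("members must be sorted in descending order")
        previous = seq_len


def select_coefficient(factors, seq_len):
    """ASSUMPTION: return the coefficient of the first entry whose
    sequence length is <= seq_len. The real selection rule may differ."""
    for threshold, coeff in factors:
        if seq_len >= threshold:
            return coeff
    raise ValueError("no entry matches this sequence length")


validate_scale_factors(EXAMPLE_FACTORS)
```

Under the assumed rule, a final entry whose first number is 0 acts as a catch-all for short sequences.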
Usage Example
Scenarios where ep_level is set to 2:
```json
"ModelDeployConfig" :
{
    "maxSeqLen" : 2560,
    "maxInputTokenLen" : 2048,
    "truncation" : false,
    "ModelConfig" : [
        {
            "modelInstanceType" : "Standard",
            "modelName" : "DeepSeek-R1_w8a8",
            "modelWeightPath" : "/data/weights/DeepSeek-R1_w8a8",
            "worldSize" : 8,
            "cpuMemSize" : 5,
            "npuMemSize" : -1,
            "backendType" : "atb",
            "trustRemoteCode" : false,
            "moe_ep": 8,
            "models": {
                "deepseekv2": {
                    "ep_level": 2,
                    "alltoall_ep_buffer_scale_factors": [[1048576, 1.32], [524288, 1.4], [262144, 1.53], [131072, 1.8], [32768, 3.0], [8192, 5.2], [0, 8.0]]
                }
            }
        }
    ]
},
```
In general, adding alltoall_ep_buffer_scale_factors to the configuration is not recommended.
Scenarios where ep_level is set to 1 and top k result truncation is enabled:
```json
"ModelDeployConfig" :
{
    "maxSeqLen" : 66000,
    "maxInputTokenLen" : 65000,
    "truncation" : false,
    "ModelConfig" : [
        {
            "modelInstanceType" : "Standard",
            "modelName" : "DeepSeek-R1_w8a8",
            "modelWeightPath" : "/data/weights/DeepSeek-R1_w8a8",
            "worldSize" : 8,
            "cpuMemSize" : 5,
            "npuMemSize" : -1,
            "backendType" : "atb",
            "trustRemoteCode" : false,
            "moe_ep": 8,
            "models": {
                "deepseekv2": {
                    "ep_level": 1,
                    "enable_init_routing_cutoff": true,
                    "topk_scaling_factor": 0.25
                }
            }
        }
    ]
},
```
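Before deploying a configuration like the ones above, the scalar EP parameters can be checked against the value ranges in Table 1. The following is a minimal sketch; the function name `check_ep_params` is hypothetical, and treating every parameter as optional is an assumption, since this section does not state which parameters are mandatory.

```python
def check_ep_params(params: dict) -> list[str]:
    """Return a list of problems found in a "deepseekv2" parameter block.

    Ranges follow Table 1. ASSUMPTION: all parameters are optional, so
    only the parameters that are present are checked.
    """
    errors = []

    if "ep_level" in params:
        value = params["ep_level"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or value not in (1, 2):
            errors.append("ep_level must be the integer 1 or 2")

    if "enable_init_routing_cutoff" in params:
        if not isinstance(params["enable_init_routing_cutoff"], bool):
            errors.append("enable_init_routing_cutoff must be true or false")

    if "topk_scaling_factor" in params:
        value = params["topk_scaling_factor"]
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        if not is_number or not 0 < value <= 1:
            errors.append("topk_scaling_factor must be in the range (0, 1]")

    return errors


# The parameter block from the ep_level = 1 example passes all checks.
problems = check_ep_params(
    {"ep_level": 1, "enable_init_routing_cutoff": True, "topk_scaling_factor": 0.25}
)
```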
Running Inference
- Set serving parameters. This feature must be used together with MindIE Motor. Add related parameters to the serving config.json file based on Table 1. For details about the path of the config.json file, see the software package file list in "MindIE Configuration" > "Server Configuration" > "Multi-Node Inference" in MindIE Installation Guide.
- Start the service. For details, see "Quick Start" > "Service Startup" in MindIE Motor Development Guide.
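Adding the parameters from Table 1 to the configuration can also be scripted. The sketch below merges EP parameters into every ModelConfig entry of an already loaded ModelDeployConfig object. The helper name `add_ep_params` is hypothetical, and because this section does not show where ModelDeployConfig sits inside config.json, the function deliberately works on the in-memory object rather than on a file path.

```python
import copy
import json


def add_ep_params(model_deploy_config: dict, ep_params: dict,
                  model_key: str = "deepseekv2") -> dict:
    """Return a copy of a ModelDeployConfig object with EP parameters
    merged into the "models" block of every ModelConfig entry.

    ASSUMPTION: the caller has already located and loaded the
    ModelDeployConfig object from the serving config.json file.
    """
    updated = copy.deepcopy(model_deploy_config)
    for entry in updated.get("ModelConfig", []):
        # Create "models" and the model block if they do not exist yet.
        entry.setdefault("models", {}).setdefault(model_key, {}).update(ep_params)
    return updated


# Shortened example entry; a real entry carries more fields.
example = {"ModelConfig": [{"modelName": "DeepSeek-R1_w8a8", "worldSize": 8, "moe_ep": 8}]}
result = add_ep_params(example, {"ep_level": 2})
print(json.dumps(result, indent=4))
```

Working on a copy keeps the original object unchanged, so the result can be reviewed before it is written back to the file.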