Expert Parallelism

Mixture-of-experts (MoE) models support expert parallelism (EP), which places different experts on different devices so that the experts compute in parallel.
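To make the idea concrete, the following minimal sketch assigns experts to devices in contiguous, equal-sized blocks. The contiguous layout and the names expert_owner, local_experts, num_experts, and ep_size are illustrative assumptions; the placement that MindIE actually uses is not described here.

```python
def expert_owner(expert_id: int, num_experts: int, ep_size: int) -> int:
    """Return the EP rank hosting expert_id under a contiguous, even split.

    The contiguous layout is an assumption made for illustration only.
    """
    if num_experts % ep_size != 0:
        raise ValueError("this sketch requires num_experts to be divisible by ep_size")
    if not 0 <= expert_id < num_experts:
        raise ValueError("expert_id out of range")
    experts_per_rank = num_experts // ep_size
    return expert_id // experts_per_rank


def local_experts(rank: int, num_experts: int, ep_size: int) -> range:
    """Return the expert IDs hosted on one EP rank under the same assumed layout."""
    if num_experts % ep_size != 0:
        raise ValueError("this sketch requires num_experts to be divisible by ep_size")
    if not 0 <= rank < ep_size:
        raise ValueError("rank out of range")
    experts_per_rank = num_experts // ep_size
    start = rank * experts_per_rank
    return range(start, start + experts_per_rank)
```

For example, with 256 routed experts split across 8 devices, each device would host 32 experts under this layout.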

Currently, two EP forms are implemented:

1. EP based on AllGather communication (ep_level = 1)

2. EP based on AllToAll and communication-computing fusion (ep_level = 2)

Constraints

  • The DeepSeek-V2, DeepSeek-V3, and DeepSeek-R1 models support this feature.
  • If the number of parallel experts exceeds 32, DeepSeek-V3 and DeepSeek-R1 automatically enable the grouped matmul fused operator to improve computing performance.

Parameters

Table 1 describes the serving parameters required for enabling the EP feature.

Table 1 Supplementary parameters of the EP feature: models in ModelConfig

All parameters in this table belong to the deepseekv2 entry of models.

ep_level
  Value Type: Integer
  Value Range: 1 or 2
  Description: EP implementation form.
    1: EP based on AllGather communication
    2: EP based on AllToAll and communication-computing fusion
    NOTE: If ep_level is set to 2 and two servers are deployed, the two servers must be connected through a switch. Otherwise, the service fails to start.

enable_init_routing_cutoff
  Value Type: Bool
  Value Range: true or false
  Description: Whether to enable top-k result truncation.
    • The default value is false (truncation disabled).
    • This parameter can be set when ep_level is set to 1.

topk_scaling_factor
  Value Type: Float
  Value Range: (0,1]
  Description: Top-k result truncation parameter.
    • When ep_level is set to 1, the latter part of hidden_states on each device is invalid data. Setting this truncation parameter reduces the device memory overhead.
    • This parameter requires enable_init_routing_cutoff to be set to true.

alltoall_ep_buffer_scale_factors
  Value Type: list[list[int, float]]
  Value Range: Each member of the list contains two numbers. The first is a non-negative integer, and the second is a floating-point number greater than 0. The members are sorted in descending order of the first number.
  Description: Scale factors that determine the size of the AllToAll communication buffer. In each member, the first number is a sequence length and the second number is a buffer coefficient. The sequence length is the condition for selecting the buffer coefficient. Example:
    [[1048576, 1.32], [524288, 1.4], [262144, 1.53], [131072, 1.8], [32768, 3.0], [8192, 5.2], [0, 8.0]]
    • You are advised to configure this parameter when ep_level is set to 2 and you need fine-grained control over device memory.
    • This parameter does not take effect when ep_level is set to 1.
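The constraints in Table 1 can be checked before the service is started. The following is a minimal sketch of such a check, not part of MindIE: the function names validate_deepseekv2_params and _check_buffer_factors are hypothetical, and treating enable_init_routing_cutoff with ep_level = 2 as an error is one reading of the table rather than documented behavior.

```python
from numbers import Real


def _check_buffer_factors(factors) -> list[str]:
    """Check the documented shape of alltoall_ep_buffer_scale_factors."""
    if not isinstance(factors, list):
        return ["alltoall_ep_buffer_scale_factors must be a list"]
    problems, lengths = [], []
    for member in factors:
        if not (isinstance(member, (list, tuple)) and len(member) == 2):
            problems.append(f"{member!r}: each member must contain exactly two numbers")
            continue
        seq_len, coeff = member
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(seq_len, bool) or not isinstance(seq_len, int) or seq_len < 0:
            problems.append(f"{member!r}: first number must be a non-negative integer")
        else:
            lengths.append(seq_len)
        if isinstance(coeff, bool) or not isinstance(coeff, Real) or coeff <= 0:
            problems.append(f"{member!r}: second number must be greater than 0")
    if lengths != sorted(lengths, reverse=True):
        problems.append("members must be sorted in descending order of the first number")
    return problems


def validate_deepseekv2_params(params: dict) -> list[str]:
    """Return the problems found in a models.deepseekv2 section (empty if none)."""
    problems = []

    ep_level = params.get("ep_level")
    if isinstance(ep_level, bool) or ep_level not in (1, 2):
        problems.append("ep_level must be 1 or 2")

    cutoff = params.get("enable_init_routing_cutoff", False)  # documented default: false
    if not isinstance(cutoff, bool):
        problems.append("enable_init_routing_cutoff must be true or false")
    elif cutoff and ep_level != 1:
        # Assumption: the table says the parameter "can be set" when ep_level is 1.
        problems.append("enable_init_routing_cutoff applies when ep_level is 1")

    if "topk_scaling_factor" in params:
        factor = params["topk_scaling_factor"]
        if isinstance(factor, bool) or not isinstance(factor, Real) or not 0 < factor <= 1:
            problems.append("topk_scaling_factor must be in (0, 1]")
        if cutoff is not True:
            problems.append("topk_scaling_factor requires enable_init_routing_cutoff = true")

    if "alltoall_ep_buffer_scale_factors" in params:
        problems.extend(_check_buffer_factors(params["alltoall_ep_buffer_scale_factors"]))

    return problems
```

Returning a list of messages instead of raising on the first error lets all problems in a configuration be reported in one pass.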

Usage Example

Scenarios where ep_level is set to 2:

"ModelDeployConfig" :
{
   "maxSeqLen" : 2560,
   "maxInputTokenLen" : 2048,
   "truncation" : false,
   "ModelConfig" : [
     {
         "modelInstanceType" : "Standard",
         "modelName" : "DeepSeek-R1_w8a8",
         "modelWeightPath" : "/data/weights/DeepSeek-R1_w8a8",
         "worldSize" : 8,
         "cpuMemSize" : 5,
         "npuMemSize" : -1,
         "backendType" : "atb",
         "trustRemoteCode" : false,
         "moe_ep": 8,
         "models": {
             "deepseekv2": {
                 "ep_level": 2,
                 "alltoall_ep_buffer_scale_factors": [[1048576, 1.32], [524288, 1.4], [262144, 1.53], [131072, 1.8], [32768, 3.0], [8192, 5.2], [0, 8.0]]
             }
         }
     }
   ]
},

In most cases, you do not need to set alltoall_ep_buffer_scale_factors. Omit it unless you need fine-grained control over the size of the AllToAll communication buffer.
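This document states only that the sequence length is the condition for selecting the buffer coefficient. Given the descending order and the final member with length 0, one plausible reading is that the first member whose sequence length does not exceed the actual sequence length is selected. The sketch below implements that reading; the function name select_buffer_coefficient and the lookup rule itself are assumptions, not confirmed MindIE behavior.

```python
def select_buffer_coefficient(factors, seq_len):
    """Pick a buffer coefficient for seq_len from a descending list of [length, coefficient].

    Assumed rule: the first member whose length threshold is <= seq_len wins.
    """
    for threshold, coefficient in factors:
        if seq_len >= threshold:
            return coefficient
    raise ValueError("no member matches; the list should end with a member whose length is 0")


# The example list from the configuration above.
EXAMPLE_FACTORS = [
    [1048576, 1.32], [524288, 1.4], [262144, 1.53], [131072, 1.8],
    [32768, 3.0], [8192, 5.2], [0, 8.0],
]
```

Under this reading, a sequence length of 2048 falls below every threshold except 0 and would select 8.0, while a sequence length of 40000 would select 3.0.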

Long-sequence scenarios where ep_level is set to 1:
"ModelDeployConfig" :
{
   "maxSeqLen" : 66000,
   "maxInputTokenLen" : 65000,
   "truncation" : false,
   "ModelConfig" : [
     {
         "modelInstanceType" : "Standard",
         "modelName" : "DeepSeek-R1_w8a8",
         "modelWeightPath" : "/data/weights/DeepSeek-R1_w8a8",
         "worldSize" : 8,
         "cpuMemSize" : 5,
         "npuMemSize" : -1,
         "backendType" : "atb",
         "trustRemoteCode" : false,
         "moe_ep": 8,
         "models": {
             "deepseekv2": {
                 "ep_level": 1,
                 "enable_init_routing_cutoff": true,
                 "topk_scaling_factor": 0.25
             }
         }
     }
   ]
},

Running Inference

  1. Set serving parameters. This feature must be used together with MindIE Motor. Add related parameters to the serving config.json file based on Table 1. For details about the path of the config.json file, see the software package file list in "MindIE Configuration" > "Server Configuration" > "Multi-Node Inference" in MindIE Installation Guide.
  2. Start the service. For details, see "Quick Start" > "Service Startup" in MindIE Motor Development Guide.
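Step 1 can also be scripted. The following minimal sketch adds the EP parameters to every ModelConfig entry of a parsed config.json. The helper names are hypothetical, and because the location of ModelDeployConfig inside config.json is not shown in this document, the sketch searches for that key instead of assuming a fixed path. The moe_ep key is taken from the usage examples above.

```python
import json
from pathlib import Path


def find_model_deploy_config(node):
    """Depth-first search for the first "ModelDeployConfig" object in parsed JSON."""
    if isinstance(node, dict):
        if isinstance(node.get("ModelDeployConfig"), dict):
            return node["ModelDeployConfig"]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_model_deploy_config(child)
        if found is not None:
            return found
    return None


def apply_ep_params(config: dict, moe_ep: int, deepseekv2_params: dict) -> dict:
    """Set moe_ep and models.deepseekv2 on every ModelConfig entry, in place."""
    deploy = find_model_deploy_config(config)
    if deploy is None:
        raise KeyError("ModelDeployConfig not found in the configuration")
    for model in deploy.get("ModelConfig", []):
        model["moe_ep"] = moe_ep
        # Keep any other entries already present under "models".
        model.setdefault("models", {})["deepseekv2"] = dict(deepseekv2_params)
    return config


def update_config_file(path, moe_ep: int, deepseekv2_params: dict) -> None:
    """Read config.json, apply the EP parameters, and write the file back."""
    path = Path(path)
    config = json.loads(path.read_text(encoding="utf-8"))
    apply_ep_params(config, moe_ep, deepseekv2_params)
    path.write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
```

A call such as update_config_file(config_path, 8, {"ep_level": 2}) would reproduce the ep_level = 2 example, where config_path is the location of config.json in your installation. Back up the file before modifying it.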