MoE
A mixture-of-experts (MoE) model changes the traditional transformer structure in two key ways. First, a sparse MoE layer replaces the feed-forward network (FFN) in the transformer block. The layer contains multiple FFNs, each acting as an expert, and only a subset of the experts is activated for each token during inference. Second, a routing mechanism selects which experts to activate: at each layer, the router determines which experts a token is sent to. Together, the two mechanisms let an MoE model draw on the knowledge of many experts while computing only a few of them per token. As a result, an MoE model can run inference faster than a traditional dense model with the same number of parameters.
Typical MoE models include Mixtral 8x7B, Mixtral 8x22B, DeepSeek-16B-MoE, DeepSeek-V2, DeepSeek-V3, DeepSeek-R1, Qwen3-30B-A3B, and Qwen3-235B-A22B.
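The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation used by any of the listed models: the function names (`route_top_k`, `moe_layer`), the toy experts, and the choice to renormalize the weights of the selected experts are all assumptions made for the example. Real models differ in how they score and normalize experts.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights.

    Renormalizing over the selected experts is an assumption of this
    sketch; some models use the raw softmax probabilities instead.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, router_logits, experts, k=2):
    """Run only the selected experts and sum their weighted outputs."""
    out = [0.0] * len(token)
    for idx, weight in route_top_k(router_logits, k):
        expert_out = experts[idx](token)  # the other experts are never called
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out

# Toy experts (hypothetical): expert i simply scales its input by i + 1.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]

logits = [0.1, 2.0, -1.0, 2.0]  # one router score per expert for this token
selected = route_top_k(logits, k=2)
output = moe_layer([1.0, -1.0], logits, experts, k=2)
print(selected)
print(output)
```

Only two of the four toy experts run for this token, which is the source of the inference savings: the parameter count grows with the number of experts, while the per-token compute grows only with `k`.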
Constraints
| Supported Model | Data Format | Quantization | Parallel Mode | Hardware Platform | Multi-Server Multi-Device Inference |
|---|---|---|---|---|---|
| Mixtral 8x7B | FP16 | Not supported | TP | Atlas 800I A2 inference server | Not supported |
| Mixtral 8x22B | FP16 | Not supported | TP | Atlas 800I A2 inference server | Not supported |
| DeepSeek-16B-MoE | FP16 | Not supported | TP | Atlas 800I A2 inference server | Not supported |
| DeepSeek-V2 | BF16 | Supported | TP and EP | Atlas 800I A2 inference server | Supported |
| DeepSeek-V3 | BF16 | Supported | TP and EP | Atlas 800I A2 inference server | Supported |
| DeepSeek-R1 | BF16 | Supported | TP and EP | Atlas 800I A2 inference server | Supported |
| Qwen3-30B-A3B | BF16 | Supported | TP | Atlas 800I A2 inference server | Not supported |
| Qwen3-235B-A22B | BF16 | Supported | TP | Atlas 800I A2 inference server | Supported |
Model configuration parameters
For details about the model-specific parameters of each model, see the config.json file shipped with that model's official weights.
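As a quick way to see which MoE-related settings a given config.json contains, the following sketch lists every entry whose key mentions "expert". It is only an illustration: the helper name `moe_fields` is hypothetical, the key-name filter is a heuristic (field names differ between model families), and the sample values written to the temporary directory are made up for the demonstration.

```python
import json
import tempfile
from pathlib import Path

def moe_fields(weight_dir):
    """Return the config.json entries whose key mentions 'expert'.

    Filtering on the key name is a heuristic assumed for this sketch;
    check the model's own documentation for the authoritative field list.
    """
    config_path = Path(weight_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return {key: value for key, value in config.items() if "expert" in key.lower()}

# Demonstration with a made-up config; the values are illustrative only.
with tempfile.TemporaryDirectory() as tmp:
    sample = {"hidden_size": 1024, "num_experts_per_tok": 2, "n_routed_experts": 8}
    (Path(tmp) / "config.json").write_text(json.dumps(sample), encoding="utf-8")
    demo = moe_fields(tmp)

print(demo)
```

To inspect a real model, pass the directory that holds its weights instead of the temporary directory.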
Running Inference
MoE models run inference in the same way as other models. Follow the standard LLM inference procedure; no additional parameters are required.
The following uses DeepSeek-16B-MoE as an example. Run the following commands to perform a dialog test. The input prompt is "What's deep learning".
cd ${ATB_SPEED_HOME_PATH}
bash examples/models/deepseek/run_pa_deepseek_moe.sh {Model weight path}