MoE

Mixture-of-experts (MoE) models change the standard transformer structure in two ways. First, a sparse MoE layer replaces the feed-forward network (FFN) in each transformer block. The layer contains several FFNs, each acting as an expert, and only a subset of the experts is activated for any given token during inference. Second, a routing mechanism selects which experts to activate: at each layer, the router decides which experts a token is sent to. Together, these two mechanisms let an MoE model draw on the capacity of many experts while computing only a few of them per token. Compared with a dense model that has the same total number of parameters, an MoE model therefore needs less computation per token during inference.
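The routing step described above can be sketched in a few lines of plain Python. This is only an illustration of top-k routing under stated assumptions, not the implementation used by any of the models listed on this page: the function names (`route_top_k`, `moe_layer`), the toy experts, and the choice to renormalize the weights over the selected experts are all hypothetical, and real models differ in these details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k experts with the highest router probability.

    Returns a list of (expert_index, weight) pairs. The weights are
    renormalized so that they sum to 1 over the selected experts
    (an assumption; not every model does this).
    """
    probs = softmax(router_logits)
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:k]
    norm = sum(probs[i] for i in chosen)
    return [(i, probs[i] / norm) for i in chosen]


def moe_layer(token, experts, router_logits, k=2):
    """Combine the outputs of only the selected experts for one token."""
    output = [0.0] * len(token)
    for index, weight in route_top_k(router_logits, k):
        expert_out = experts[index](token)  # unselected experts never run
        output = [o + weight * v for o, v in zip(output, expert_out)]
    return output


# Toy experts: each one simply scales the token by a fixed factor.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]  # hypothetical router scores for one token

selected = route_top_k(logits, k=2)
output = moe_layer([1.0, 1.0], experts, logits, k=2)
```

Here only two of the four toy experts are evaluated for the token, which is the source of the compute saving: the cost per token scales with k rather than with the total number of experts.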

Typical models of the MoE structure include Mixtral 8x7B, Mixtral 8x22B, DeepSeek-16B-MoE, DeepSeek-V2, DeepSeek-V3, DeepSeek-R1, Qwen3-30B-A3B, and Qwen3-235B-A22B.

Constraints

Table 1 lists the capabilities supported for each model.
Table 1 Supported capabilities

| Supported Model | Data Format | Quantization | Parallel Mode | Hardware Platform | Multi-Server Multi-Device Inference |
|---|---|---|---|---|---|
| Mixtral 8x7B | FP16 | Not supported | TP | Atlas 800I A2 inference server | Not supported |
| Mixtral 8x22B | FP16 | Not supported | TP | Atlas 800I A2 inference server | Not supported |
| DeepSeek-16B-MoE | FP16 | Not supported | TP | Atlas 800I A2 inference server | Not supported |
| DeepSeek-V2 | BF16 | Supported | TP and EP | Atlas 800I A2 inference server | Supported |
| DeepSeek-V3 | BF16 | Supported | TP and EP | Atlas 800I A2 inference server | Supported |
| DeepSeek-R1 | BF16 | Supported | TP and EP | Atlas 800I A2 inference server | Supported |
| Qwen3-30B-A3B | BF16 | Supported | TP | Atlas 800I A2 inference server | Not supported |
| Qwen3-235B-A22B | BF16 | Supported | TP | Atlas 800I A2 inference server | Supported |

Model Configuration Parameters

The inherent parameters of each model are defined in the config.json file that ships with the model's official weights. Refer to that file for details about how they are configured.
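As a quick way to see which MoE-related parameters a given config.json defines, the file can be inspected with a short script. The sketch below is a hedged illustration: the helper names (`load_config`, `summarize_moe_config`) are hypothetical, and the key names it looks for (`num_local_experts`, `n_routed_experts`, `num_experts`, `num_experts_per_tok`) are assumptions based on commonly published configurations. Check them against the actual file for your model.

```python
import json
from pathlib import Path

# Candidate key names are assumptions; different model families use
# different names, and a given config.json may use none of these.
CANDIDATE_KEYS = {
    "total_experts": ("num_local_experts", "n_routed_experts", "num_experts"),
    "experts_per_token": ("num_experts_per_tok",),
}


def load_config(weight_dir):
    """Read config.json from a model weight directory."""
    path = Path(weight_dir) / "config.json"
    return json.loads(path.read_text(encoding="utf-8"))


def summarize_moe_config(config):
    """Return the first matching value for each MoE setting, or None."""
    summary = {}
    for label, keys in CANDIDATE_KEYS.items():
        summary[label] = next((config[k] for k in keys if k in config), None)
    return summary


# Example with an in-memory dictionary instead of a real file.
example = summarize_moe_config({"num_local_experts": 8, "num_experts_per_tok": 2})
```

A value of None in the summary means that none of the assumed key names was found, not that the model lacks the setting.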

Running Inference

MoE models are run in the same way as other models. Follow the standard LLM inference procedure; no additional parameters need to be set.

The following uses DeepSeek-16B-MoE as an example. Run the following commands to perform a dialog test. The test prompt is "What's deep learning".

cd ${ATB_SPEED_HOME_PATH}
bash examples/models/deepseek/run_pa_deepseek_moe.sh {Model weight path}