MoE

The Mixture of Experts (MoE) architecture makes two key changes to the traditional transformer structure. First, a sparse MoE layer replaces the feed-forward network (FFN) in the transformer structure. The layer contains multiple FFNs, each acting as an expert, and only a subset of the experts is activated for each token during inference. Second, a routing mechanism selects which experts to activate: at each layer, the router determines which experts process the token. Together, the two mechanisms give an MoE model the capacity of a large pool of experts while keeping per-token computation low. Compared with a traditional dense model with the same number of parameters, an MoE model performs less computation per token because it activates only some of its experts.
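The routing step described above can be sketched in a few lines of plain Python. This is a minimal illustration only, not the implementation used by any model listed here: the function names (`route_top_k`, `moe_layer`), the toy experts, and the choice to renormalize the weights over the selected experts are all assumptions made for the example.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Weights are renormalized over the selected experts (an assumption;
    real models differ in how they normalize).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]

def moe_layer(token, experts, router_logits, k=2):
    """Weighted sum of the outputs of the selected experts only."""
    out = [0.0] * len(token)
    for idx, weight in route_top_k(router_logits, k):
        y = experts[idx](token)  # only the selected experts are evaluated
        out = [o + weight * v for o, v in zip(out, y)]
    return out

# Hypothetical toy experts: each one simply scales its input.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
# Hypothetical router scores for one token; in a real model these come
# from a learned linear layer applied to the token's hidden state.
logits = [0.1, 2.0, -1.0, 2.0]

selected = route_top_k(logits, k=2)
output = moe_layer([1.0, -1.0], experts, logits, k=2)
print(selected)
print(output)
```

In this toy run, only two of the four experts are evaluated for the token, which is where the per-token compute savings come from.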

Typical models of the MoE structure include Mixtral 8x7B, Mixtral 8x22B, DeepSeek-16B-MoE, DeepSeek-V2, DeepSeek-V3, DeepSeek-R1, Qwen3-30B-A3B, and Qwen3-235B-A22B.

Constraints

Table 1 lists the features supported for each model.

**Table 1** Supported capabilities
Supported Model	Data Format	Quantization	Parallel Mode	Hardware Platform	Multi-Server Multi-Device Inference
Mixtral 8x7B	FP16	Not supported	TP	Atlas 800I A2 inference server	Not supported
Mixtral 8x22B	FP16	Not supported	TP	Atlas 800I A2 inference server	Not supported
DeepSeek-16B-MoE	FP16	Not supported	TP	Atlas 800I A2 inference server	Not supported
DeepSeek-V2	BF16	Supported	TP and EP	Atlas 800I A2 inference server	Supported
DeepSeek-V3	BF16	Supported	TP and EP	Atlas 800I A2 inference server	Supported
DeepSeek-R1	BF16	Supported	TP and EP	Atlas 800I A2 inference server	Supported
Qwen3-30B-A3B	BF16	Supported	TP	Atlas 800I A2 inference server	Not supported
Qwen3-235B-A22B	BF16	Supported	TP	Atlas 800I A2 inference server	Supported

Model Configuration Parameters

For details about the inherent parameters of each model, see the config.json file provided with the model's official weights.
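As a rough aid for reading a model's config.json, the following sketch pulls out the expert-related fields. The key names it looks for (`num_local_experts`, `n_routed_experts`, `num_experts`, `num_experts_per_tok`) are assumptions based on commonly published MoE configurations and may not match every model; the helper names and the sample dictionary are hypothetical. Always check the actual file shipped with your weights.

```python
import json

# Candidate key names are assumptions; different model families use
# different names, so verify them against your model's config.json.
EXPERT_COUNT_KEYS = ("num_local_experts", "n_routed_experts", "num_experts")
ACTIVE_EXPERT_KEYS = ("num_experts_per_tok",)

def load_config(path):
    """Load a config.json file into a dictionary."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def summarize_moe_config(config):
    """Return the expert counts found in a config dict, or None if absent."""
    def first_present(keys):
        for key in keys:
            if key in config:
                return config[key]
        return None

    return {
        "total_experts": first_present(EXPERT_COUNT_KEYS),
        "active_experts_per_token": first_present(ACTIVE_EXPERT_KEYS),
    }

# Illustrative in-memory config; for a real model, call
# load_config("<weight directory>/config.json") instead.
example = {"num_local_experts": 8, "num_experts_per_tok": 2}
summary = summarize_moe_config(example)
print(summary)
```

A missing key yields `None` rather than an error, so the same helper can be pointed at configurations from different model families.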

Running Inference

MoE models are run in the same way as other models. Follow the usual LLM inference procedure; no additional parameters are required.

The following uses DeepSeek-16B-MoE as an example. Run the following commands to perform a dialog test. The input prompt is "What's deep learning".

cd ${ATB_SPEED_HOME_PATH}
bash examples/models/deepseek/run_pa_deepseek_moe.sh {Model weight path}
