Inconsistent Model Structure

Case: After an MoE model is migrated from the GPU to the NPU, the NPU loss does not match the GPU loss.

Figure 1 Loss mismatch

Locating method: Check the code or print the model structure for comparison.
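One way to carry out the comparison is to save the printed structure of each model to text and diff the two dumps. The following is a minimal sketch using only the Python standard library. The two structure strings, the layer class names, and the helper diff_model_structures are hypothetical placeholders, not output from the actual model in this case.

```python
import difflib


def diff_model_structures(gpu_text, npu_text):
    """Return unified-diff lines between two printed model structures."""
    return list(
        difflib.unified_diff(
            gpu_text.splitlines(),
            npu_text.splitlines(),
            fromfile="gpu_model",
            tofile="npu_model",
            lineterm="",
        )
    )


# Hypothetical dumps, e.g. captured from print(model) on each platform.
gpu_structure = """DecoderLayer(
  (residual): Residual()
  (input_layernorm): RMSNorm()
  (self_attn): Attention()
)"""

npu_structure = """DecoderLayer(
  (input_layernorm): RMSNorm()
  (residual): Residual()
  (self_attn): Attention()
)"""

diff_lines = diff_model_structures(gpu_structure, npu_structure)

# Keep only added/removed lines, skipping the "---"/"+++" file headers.
changed = [
    line
    for line in diff_lines
    if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
]

for line in diff_lines:
    print(line)
```

Note that in PyTorch, print(model) lists submodules in the order they were registered, which is not necessarily the order in which forward() calls them. An empty diff therefore does not rule out an ordering difference, and the forward code should still be reviewed.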

The code review reveals a discrepancy: in the NPU model the residual layer follows input_layernorm, whereas in the GPU model it precedes input_layernorm. The two models therefore apply these layers in a different order.

Figure 2 Model structure comparison

Solution: In the NPU model, place input_layernorm after the residual layer so that the order matches the GPU model.
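To illustrate why the order matters, the toy sketch below applies a normalization and a residual addition in both orders and shows that the outputs differ. It assumes a plain layer normalization without learned parameters and a simple element-wise residual add. The actual layers in the model may differ, and the input values are made up for illustration.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def add(a, b):
    """Element-wise addition, standing in for the residual connection."""
    return [p + q for p, q in zip(a, b)]


# Made-up values for illustration only.
hidden = [1.0, 2.0, 4.0]
residual = [0.5, -1.0, 3.0]

# Order A: residual first, then normalization.
norm_after_residual = layer_norm(add(hidden, residual))

# Order B: normalization first, then residual.
residual_after_norm = add(layer_norm(hidden), residual)

orders_differ = any(
    abs(p - q) > 1e-6 for p, q in zip(norm_after_residual, residual_after_norm)
)

print("residual -> norm:", norm_after_residual)
print("norm -> residual:", residual_after_norm)
print("outputs differ:", orders_differ)
```

Because the two orders produce different activations from the first step onward, the difference propagates through every subsequent layer, which is consistent with a loss curve that diverges from the baseline.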

Result: After the model structures are aligned, the NPU loss matches the GPU loss.

Figure 3 Loss matched