Inconsistent First Step Loss (or Inconsistent Inference with the Same Weight)
Before using the tool to locate the fault, fix the configuration problems described in Checking the Environment and the randomness problems described in Reproducing an Issue.
Case: The first-step loss of a speech model does not match.

Locating method: Use the dump tool of msprobe to collect the mix-level data at step 0. The config.json file is as follows:
{
    "task": "statistics",
    "dump_path": "/home/data_dump",
    "rank": [],
    "step": [0],
    "level": "mix",
    "enable_dataloader": false,
    "statistics": {
        "scope": [],
        "list": [],
        "data_mode": ["all"],
        "summary_mode": "statistics"
    }
}
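When the same configuration is needed for several runs, it can be generated from a script instead of being edited by hand, which also guarantees that the file is valid JSON. The following is a minimal sketch. The helper name write_dump_config and its parameters are hypothetical and are not part of msprobe. The field names and values are copied from the config.json file shown above.

```python
import json
from pathlib import Path


def write_dump_config(path, dump_path="/home/data_dump", steps=(0,)):
    """Write a statistics-task dump config with the fields used in this case."""
    config = {
        "task": "statistics",
        "dump_path": dump_path,
        "rank": [],
        "step": list(steps),          # only step 0 is collected in this case
        "level": "mix",
        "enable_dataloader": False,   # serialized as JSON false
        "statistics": {
            "scope": [],
            "list": [],
            "data_mode": ["all"],
            "summary_mode": "statistics",
        },
    }
    Path(path).write_text(json.dumps(config, indent=4), encoding="utf-8")
    return config
```

Calling write_dump_config("config.json") reproduces the file above. Passing a different steps value changes only the "step" field.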
- Insert the dump tool in the code as shown in Figure 2.
Figure 2 Inserting the dump tool in the code
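The insertion pattern can be sketched as a training loop in which each iteration is bracketed by start, stop, and step calls on a debugger object. In msprobe, that object is typically a PrecisionDebugger imported from msprobe.pytorch and constructed with the path of config.json. Treat that, and whether start needs the model as an argument, as assumptions to check against the documentation of the msprobe version in use. The function run_steps_with_dump and the class RecordingDebugger below are hypothetical helpers that let the loop be dry-run without msprobe installed.

```python
def run_steps_with_dump(debugger, train_step, batches, model=None):
    """Run train_step on each batch, bracketed by dump start/stop/step calls."""
    losses = []
    for batch in batches:
        # Begin data collection for this iteration.
        if model is not None:
            debugger.start(model=model)
        else:
            debugger.start()
        losses.append(train_step(batch))   # forward, loss, backward, update
        debugger.stop()                    # end data collection
        debugger.step()                    # advance the step counter
    return losses


class RecordingDebugger:
    """Hypothetical stand-in that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def start(self, model=None):
        self.calls.append("start")

    def stop(self):
        self.calls.append("stop")

    def step(self):
        self.calls.append("step")
```

With "step": [0] in config.json, only the first iteration of such a loop is expected to be dumped, which is what the first-step loss comparison needs.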

- Analyze the differences using the hierarchical visualization tool.
Create the compare.json file and copy the following content to the file:
{
    "npu_path": "./npu_dump",
    "bench_path": "./bench_dump",
    "is_print_compare_log": true
}
The visualization command is as follows:
msprobe -f pytorch graph -i ./compare.json -o ./output
The generated .vis file is saved in the output directory. Open it in TensorBoard to view the visualization page.
Figure 3 Visualizing the comparison result
The GELU operator appears in red, signaling potential accuracy issues.
- You can also use Model Accuracy Analyzer to compare the accuracy. The compare.json configuration is as follows:
{
    "npu_path": "./npu_dump/dump.json",
    "bench_path": "./bench_dump/dump.json",
    "stack_path": "./npu_dump/stack.json",
    "is_print_compare_log": true
}
For a single device, you can compare the dump.json files directly. If multiple devices are used, compare the results by step.
Run the following command to obtain the comparison result in CSV format:
msprobe -f pytorch compare -i ./compare.json -o ./output -s
The table shows that the input difference of the GELU operator is small, but the output difference is large.
Figure 4 Accuracy comparison results
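The pattern described above (small input difference, large output difference) can also be searched for automatically when the table is long. The following is a minimal sketch using only the Python standard library. The column names name and max_diff, the thresholds, and the sample rows are illustrative assumptions, not the actual layout or data of the msprobe CSV. Adjust name_col and diff_col to the headers in the generated file.

```python
import csv
import io
from collections import defaultdict


def find_suspect_ops(csv_text, name_col="name", diff_col="max_diff",
                     input_tol=1e-3, output_tol=1e-1):
    """Return operators whose inputs agree but whose outputs diverge."""
    in_diff = defaultdict(float)
    out_diff = defaultdict(float)
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = row[name_col]
        diff = abs(float(row[diff_col]))
        # Assumed naming: "<operator>.input.<i>" and "<operator>.output.<i>".
        if ".input." in name:
            op = name.split(".input.")[0]
            in_diff[op] = max(in_diff[op], diff)
        elif ".output." in name:
            op = name.split(".output.")[0]
            out_diff[op] = max(out_diff[op], diff)
    return sorted(
        op for op in out_diff
        if op in in_diff
        and in_diff[op] <= input_tol
        and out_diff[op] >= output_tol
    )


# Illustrative rows only; these numbers are not taken from the case.
SAMPLE = """name,max_diff
Functional.linear.0.forward.input.0,0.0002
Functional.linear.0.forward.output.0,0.0004
Functional.gelu.0.forward.input.0,0.0004
Functional.gelu.0.forward.output.0,0.75
"""

print(find_suspect_ops(SAMPLE))
```

An operator is reported only when both conditions hold, which separates an operator that introduces an error from one that merely passes on an error it received upstream.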
Solution:
Moving the GELU operator to the CPU fixed the accuracy issue but reduced performance, which confirms that the issue is caused by the GELU operator.
Contact operator technical support to obtain the GELU fix package for PyTorch.
Result: The accuracy meets the requirement without transferring the operator to the CPU.
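The CPU fallback in the solution works as a diagnostic because the CPU result acts as a trusted reference for the same inputs. The same idea can be sketched without any framework: exact GELU is x * Phi(x), where Phi is the standard normal cumulative distribution function, and it is a smooth function, so nearly identical inputs should give nearly identical outputs. The helper names and the factor threshold below are hypothetical, and this pure-Python reference is not the PyTorch CPU kernel itself.

```python
import math


def gelu_reference(x):
    """Exact GELU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def max_abs_diff(xs, ys):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(a - b) for a, b in zip(xs, ys))


def output_diff_is_suspicious(npu_in, bench_in, npu_out, bench_out, factor=10.0):
    """Flag outputs that differ far more than the input difference explains."""
    in_diff = max_abs_diff(npu_in, bench_in)
    # Difference expected if GELU were computed correctly on the NPU inputs.
    expected = max_abs_diff([gelu_reference(v) for v in npu_in], bench_out)
    out_diff = max_abs_diff(npu_out, bench_out)
    return out_diff > factor * max(in_diff, expected, 1e-6)
```

If the check still reports a suspicious difference after the fix package is installed, the dump and comparison steps can be repeated to see whether another operator is involved.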