Accuracy Issue Overview and Scenarios
Large language models (LLMs) such as ChatGPT and DeepSeek are advancing rapidly and have become an important development direction in the AI field. Training an LLM demands substantial computing power and involves data, models, frameworks, operators, and hardware. At this scale, training is challenging and prone to accuracy issues.
Several factors can cause training accuracy issues. The main symptoms are convergence that does not meet expectations, sharp increases or spikes in loss, NaN errors, and poor performance on downstream evaluation tasks.
Training Accuracy Scenarios
- With benchmarks: Migrate an LLM or another type of deep neural network that was trained on a benchmark platform (such as a GPU or another training framework) to the NPU.
- Without benchmarks: Develop and train models on the NPU.
This document focuses on the mainstream scenario, migration. In this scenario, the NPU training process or result is inconsistent with that of the benchmark (a GPU, or another framework on the NPU), and the deviation exceeds the tolerance threshold. This is called mismatching. Mismatching can be further classified into the following types:
- First step difference
The loss at step 0 or in the first several steps differs from that of the benchmark, and the average error is greater than 1%, as shown in the following figure.
Figure 1 First step difference
- Long-term loss difference
The loss matches the benchmark in the early stage but diverges in the later stage, where the average error is greater than 1%, as shown in the following figure.
Figure 2 Long-term loss difference
- Overflow or NaN
Overflows, NaNs, or spikes occur more frequently after the migration than on the benchmark, as shown in the following figure.
Figure 3 Overflow or NaN
- Training losses remain similar, but downstream metrics show significant differences.
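The first three types above can be told apart from two per-step loss logs. The following is a minimal sketch, not a standard tool. It assumes that "average error" means the mean relative error against the benchmark loss, and that the "first several steps" are the first 10 steps. Both are assumptions, as are the hypothetical names mean_relative_error, classify_mismatch, early_steps, and threshold. It does not detect spikes and cannot check downstream metrics.

```python
import math


def mean_relative_error(npu_losses, bench_losses):
    """Mean of |npu - bench| / |bench| over steps where both values are finite.

    Assumption: the 1% "average error" is interpreted as a mean relative error.
    """
    errors = [
        abs(n - b) / abs(b)
        for n, b in zip(npu_losses, bench_losses)
        if math.isfinite(n) and math.isfinite(b) and b != 0
    ]
    return sum(errors) / len(errors) if errors else float("nan")


def classify_mismatch(npu_losses, bench_losses, early_steps=10, threshold=0.01):
    """Label a pair of loss curves with one of the mismatch types.

    early_steps (hypothetical) is how many steps count as "the first several steps".
    threshold is the 1% tolerance expressed as a fraction.
    """
    # Overflow or NaN: the NPU run has more non-finite losses than the benchmark.
    npu_bad = sum(not math.isfinite(x) for x in npu_losses)
    bench_bad = sum(not math.isfinite(x) for x in bench_losses)
    if npu_bad > bench_bad:
        return "overflow_or_nan"

    # First step difference: the early window already exceeds the tolerance.
    early = mean_relative_error(npu_losses[:early_steps], bench_losses[:early_steps])
    if early > threshold:
        return "first_step_difference"

    # Long-term loss difference: early steps match, later steps do not.
    late = mean_relative_error(npu_losses[early_steps:], bench_losses[early_steps:])
    if late > threshold:
        return "long_term_loss_difference"

    return "matched"
```

For example, classify_mismatch(npu, bench, early_steps=2) treats only steps 0 and 1 as the early window. In practice, the window size and tolerance should follow the threshold agreed for the specific model.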
Note that the same symptom can have different and complex root causes. This document describes the overall troubleshooting roadmap and the standard process for locating accuracy issues in model training. It also provides typical troubleshooting cases with detailed steps so that you can quickly learn the troubleshooting process and methods.