Accuracy Tuning Process
This section describes how to tune accuracy when the ported model runs properly on the Ascend AI Processor but still has accuracy issues or converges poorly.
Background
After the ported model is trained on the Ascend AI Processor (NPU for short) and runs properly, its accuracy may still fall short of the requirements, or the model may converge poorly. Issues that you may encounter when the model runs on the NPU include, but are not limited to, the following:
- The loss curve differs greatly from that of the benchmark model.
- The validation accuracy differs greatly from that of the benchmark model.
These accuracy issues are difficult to locate for the following reasons:
- The training is completed without exceptions.
- No warning or error is recorded in the logs.
- The differences are found only during comparison with the benchmark model.
This section provides guidance for you to tune the accuracy.
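Because these differences surface only when the NPU results are compared with the benchmark, it can help to make that comparison explicit. The following is a minimal sketch, not part of any Ascend tool: it assumes you have already collected per-step loss values from both runs as plain lists, and the function name `compare_loss_curves` and the 5% relative tolerance are illustrative choices.

```python
def compare_loss_curves(npu_losses, benchmark_losses, rel_tol=0.05):
    """Return (step, npu_loss, benchmark_loss, rel_diff) for every step whose
    relative difference from the benchmark exceeds rel_tol.

    rel_tol is an assumed threshold; choose one that suits your model.
    """
    flagged = []
    for step, (npu, ref) in enumerate(zip(npu_losses, benchmark_losses)):
        # Guard against division by zero when the benchmark loss is 0.
        denom = max(abs(ref), 1e-12)
        rel_diff = abs(npu - ref) / denom
        if rel_diff > rel_tol:
            flagged.append((step, npu, ref, rel_diff))
    return flagged


# Hypothetical loss values, for illustration only.
benchmark = [2.30, 1.90, 1.50, 1.20, 1.00]
npu = [2.31, 1.92, 1.80, 1.70, 1.65]

for step, npu_loss, ref_loss, rel_diff in compare_loss_curves(npu, benchmark):
    print(f"step {step}: NPU={npu_loss:.2f} benchmark={ref_loss:.2f} "
          f"relative diff={rel_diff:.1%}")
```

The first flagged step gives a rough indication of where the two curves start to diverge, which narrows the range of training steps worth inspecting.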
Tuning Approach
Accuracy problems can have various causes, for example, the provided benchmark model, an error introduced during model porting, or the accuracy of operators in the network. The following flowchart summarizes the accuracy tuning workflow, with the possible causes highlighted.

| No. | Step | Description |
|---|---|---|
| 1 | Check the following items before accuracy tuning: | |
| 2 | Before accuracy tuning, install the one-click accuracy analyzer on the NPU training environment. | |
| 3 | Frequent floating-point exceptions may occur during training. A typical symptom is the loss scale dropping sharply or falling directly to 1. In such cases, analyze the overflow data to identify the cause of the exception. | |
| 4 | At network run time, the system fuses operators according to built-in fusion patterns to improve network performance. Because most fusions are performed automatically, your model may contain an operator that the fusion implementations do not yet cover, which can affect model accuracy. You can disable fusion to determine whether the problem occurs in the operator fusion phase. | |
| 5 | If the accuracy still does not meet expectations after the preceding steps, collect operator execution results (dump data) during training and compare them with the results of the benchmark operators (for example, from TensorFlow). This helps quickly pinpoint the operators that have accuracy issues. | |
| 6 | At network run time, a calculation with the same inputs may produce different outputs. If such random behavior occurs, run training twice, collect operator results (dump data) from both runs, and compare them to quickly identify the operators that cause the randomness. | |
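The symptom described in step 3 (the loss scale dropping sharply or falling to 1) can be checked mechanically once loss scale values have been recorded. The following is a minimal sketch under stated assumptions: the loss scale history is a plain list that you have logged yourself, and the function name `find_loss_scale_drops` and the `drop_ratio` threshold are hypothetical, not part of any Ascend or framework API.

```python
def find_loss_scale_drops(loss_scales, drop_ratio=4.0):
    """Return (step, previous_scale, current_scale) for every step where the
    loss scale fell by at least drop_ratio times or dropped to 1.

    drop_ratio is an assumed threshold for what counts as a "sharp" drop.
    """
    events = []
    for step in range(1, len(loss_scales)):
        prev, cur = loss_scales[step - 1], loss_scales[step]
        sharp_drop = cur > 0 and prev / cur >= drop_ratio
        hit_floor = cur <= 1 and prev > 1
        if sharp_drop or hit_floor:
            events.append((step, prev, cur))
    return events


# Hypothetical loss scale history, for illustration only.
history = [65536, 65536, 32768, 32768, 4096, 4096, 1]

for step, prev, cur in find_loss_scale_drops(history):
    print(f"step {step}: loss scale dropped from {prev} to {cur}")
```

The steps reported here are the ones where overflow data is most worth analyzing. A single halving, as between the second and third entries above, is not flagged in this sketch.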
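Steps 5 and 6 both come down to comparing operator outputs from two sources: NPU versus benchmark in step 5, and two NPU runs in step 6. The following is a minimal sketch of such a comparison, not the behavior of the actual analyzer. It assumes the dump data has already been loaded into dictionaries that map operator names to flat lists of floats; the operator names, the helper functions, and the cosine similarity threshold are all hypothetical.

```python
import math


def compare_tensors(a, b):
    """Return (max_abs_err, cosine_similarity) for two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("tensors must have the same number of elements")
    max_abs_err = max((abs(x - y) for x, y in zip(a, b)), default=0.0)
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Treat two all-zero tensors as identical, otherwise as unrelated.
        cosine = 1.0 if norm_a == norm_b else 0.0
    else:
        cosine = dot / (norm_a * norm_b)
    return max_abs_err, cosine


def first_divergent_operator(dump_a, dump_b, op_order, min_cosine=0.99):
    """Return the first operator in execution order whose outputs differ
    beyond the assumed cosine similarity threshold, or None if all match."""
    for name in op_order:
        _, cosine = compare_tensors(dump_a[name], dump_b[name])
        if cosine < min_cosine:
            return name
    return None


# Hypothetical dump data, for illustration only.
op_order = ["conv1", "relu1", "fc1"]
npu_dump = {"conv1": [0.5, -1.0, 2.0], "relu1": [0.5, 0.0, 2.0], "fc1": [3.0, -0.2]}
benchmark_dump = {"conv1": [0.5, -1.0, 2.0], "relu1": [0.5, 0.0, 2.0], "fc1": [0.1, 2.9]}

print(first_divergent_operator(npu_dump, benchmark_dump, op_order))
```

Walking the operators in execution order matters because an error in one operator propagates to everything downstream, so the first mismatch is the most likely origin. For the run-to-run comparison in step 6, a stricter criterion may be appropriate, such as requiring `max_abs_err` to be exactly 0.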