Accuracy Tuning Workflow
This section describes how to tune accuracy when the ported model functions properly on the Ascend AI Processor but its accuracy does not meet requirements or it converges poorly.
Background of Accuracy Tuning
After the ported model is trained on the Ascend AI Processor (NPU for short) and functions properly, its accuracy may not meet requirements or it may converge poorly. When the ported model is executed on the NPU, the problems that may be encountered include but are not limited to the following:
- The loss curve differs greatly from that of the benchmark model.
- The validation accuracy differs greatly from that of the benchmark model.
These accuracy issues are difficult to locate for the following reasons:
- Training completes without exceptions.
- No warning or error is recorded in the logs.
- The differences surface only when the results are compared with those of the benchmark model.
This section provides guidance on locating and tuning such accuracy issues.
Accuracy Tuning Analysis
Accuracy problems can stem from many sources, for example, the benchmark model itself, an error introduced during model porting, or insufficient operator accuracy on the network. The following table summarizes the accuracy tuning workflow and highlights the possible causes.

| No. | Description |
|---|---|
| 1 | Check the basic items before accuracy tuning, for example, whether the dataset, hyperparameters, and training scripts are consistent with those of the benchmark model. |
| 2 | Before accuracy tuning, install the One-Click Accuracy Analyzer in your NPU training environment. |
| 3 | At network runtime, floating-point overflow may happen from time to time. A typical symptom is that the loss scale decreases repeatedly or drops directly to 1. In this case, analyze the overflow and underflow data to determine the source of the problem (see the loss-scale monitoring sketch after this table). |
| 4 | At network runtime, the system fuses operators according to built-in fusion patterns for better network performance. Because most fusions are performed automatically, your model may contain a pattern that the fusion implementations do not yet handle well, which affects model accuracy. You can disable fusion to determine whether the problem occurs in the operator fusion phase (see the fusion-switch sketch after this table). |
| 5 | If the accuracy problem does not appear in the preceding steps, dump the compute result of each operator during training and compare the dump data with that of each benchmark operator (such as the TensorFlow equivalents) to quickly spot the faulty operators (see the comparison sketch after this table). |
| 6 | At network runtime, a calculation with the same inputs may produce different outputs. If such random errors happen, perform training twice, collect the compute result (that is, the dump data) of each operator from both runs, and compare the data to quickly locate the suspicious operator layer that causes the errors (see the two-run comparison sketch after this table). |
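To make the overflow symptom in step 3 visible, you can log the dynamic loss scale at every training step. Below is a minimal sketch, assuming a TensorFlow 2 Keras mixed-precision setup; the model, dataset, and hyperparameter names are placeholders, and your Ascend training script may wrap the optimizer differently.

```python
import tensorflow as tf

# Wrap the optimizer with a dynamic loss scale and print the scale each step.
# A scale that keeps halving, or falls all the way to 1, means floating-point
# overflow is happening in almost every update window.
optimizer = tf.keras.mixed_precision.LossScaleOptimizer(
    tf.keras.optimizers.SGD(learning_rate=0.01), dynamic=True)

@tf.function
def train_step(model, x, y):
    with tf.GradientTape() as tape:
        loss = tf.reduce_mean(
            tf.keras.losses.sparse_categorical_crossentropy(y, model(x)))
        # Scale the loss so that small float16 gradients do not underflow.
        scaled_loss = optimizer.get_scaled_loss(loss)
    scaled_grads = tape.gradient(scaled_loss, model.trainable_variables)
    grads = optimizer.get_unscaled_gradients(scaled_grads)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

def run(model, dataset):
    for step, (x, y) in enumerate(dataset):
        loss = train_step(model, x, y)
        # optimizer.loss_scale is the current dynamic loss scale.
        tf.print("step", step, "loss", loss, "loss_scale", optimizer.loss_scale)
```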
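For step 4, fusion is typically switched off through a configuration file passed to the session. The sketch below assumes the NpuOptimizer custom optimizer and its fusion_switch_file option described in the Ascend TensorFlow adapter documentation; verify the exact option names and the switch-file format against the documentation for your CANN version.

```python
import tensorflow as tf
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

config = tf.compat.v1.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
# fusion_switch_file points to a JSON file that turns fusion rules off, e.g.:
# {"Switch": {"GraphFusion": {"ALL": "off"}, "UBFusion": {"ALL": "off"}}}
custom_op.parameter_map["fusion_switch_file"].s = tf.compat.as_bytes(
    "/path/to/fusion_switch.cfg")  # hypothetical path
# Keep TensorFlow's own graph remapping out of the way, as the adapter docs advise.
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF

with tf.compat.v1.Session(config=config) as sess:
    pass  # build and run the training graph as usual, then recheck accuracy
```

If the accuracy recovers with fusion disabled, re-enable the fusion rules group by group to narrow the problem down to a single pattern.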
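For step 5, once the per-operator dump data and the benchmark outputs have been exported, the comparison itself is plain array arithmetic. The sketch below is a tool-agnostic example using NumPy; it assumes both outputs have already been loaded as arrays of the same shape, and the metrics are common choices rather than anything mandated by the tooling.

```python
import numpy as np

def compare_tensors(npu_out: np.ndarray, ref_out: np.ndarray):
    """Compare one operator's NPU output with its benchmark output."""
    npu = npu_out.astype(np.float64).ravel()
    ref = ref_out.astype(np.float64).ravel()
    max_abs_err = float(np.max(np.abs(npu - ref)))
    # Relative error, guarded against division by zero.
    max_rel_err = float(np.max(np.abs(npu - ref) / np.maximum(np.abs(ref), 1e-7)))
    # Cosine similarity close to 1.0 means the results agree in direction.
    denom = np.linalg.norm(npu) * np.linalg.norm(ref)
    cosine = float(np.dot(npu, ref) / denom) if denom > 0 else float("nan")
    return max_abs_err, max_rel_err, cosine
```

Walking the operators in execution order, the first one whose error jumps by orders of magnitude relative to its inputs is the most likely culprit.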
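For step 6, the two-run comparison reduces to diffing two dump directories in execution order. The sketch below assumes a hypothetical layout with one .npy file per operator output, named identically across runs; adapt the loading code to the format your dump tool actually produces.

```python
import glob
import os
import numpy as np

def first_divergence(run1_dir, run2_dir):
    """Return the first operator dump that differs between two runs."""
    for path1 in sorted(glob.glob(os.path.join(run1_dir, "*.npy"))):
        path2 = os.path.join(run2_dir, os.path.basename(path1))
        a, b = np.load(path1), np.load(path2)
        if a.shape != b.shape or not np.array_equal(a, b):
            # The first mismatch in execution order marks the layer where
            # randomness enters the computation.
            return os.path.basename(path1)
    return None

print(first_divergence("dump_run1", "dump_run2"))
```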