Accuracy Tuning Workflow

Background of Accuracy Tuning

After a ported model trains on the Ascend AI Processor (NPU for short) without functional errors, its accuracy may still fall short of requirements, or it may converge poorly. Problems that may be encountered when the ported model runs on the Ascend AI Processor include but are not limited to the following:

  • The loss curve differs greatly from that of the benchmark model.
  • The validation accuracy differs greatly from that of the benchmark model.

These accuracy issues are difficult to locate for the following reasons:

  • The training is completed without exceptions.
  • No warning or error is recorded in logs.
  • The differences are found only when compared with the benchmark model.

This section describes how to locate and resolve such accuracy issues.

Accuracy Tuning Analysis

The possible causes of accuracy issues are as follows:

  1. Bad benchmark model
  2. Improper model porting
  3. Operator accuracy errors

The following flowchart summarizes the workflow for accuracy tuning with the possible causes highlighted.

Table 1 Main steps of accuracy tuning

| No. | Step | Description |
| --- | --- | --- |
| 1 | Pre-tuning Check | Check the following items before accuracy tuning: (1) Unported script: check that the benchmark model is qualified. (2) Ported script: check that no errors occurred during model porting. |
| 2 | One-Click Accuracy Analyzer Deployment | Before accuracy tuning, install One-Click Accuracy Analyzer on your training NPU. |
| 3 | Floating-Point Exception Detection | Floating-point exceptions may occur while the network runs. A typical symptom is that the loss scale decreases repeatedly or drops directly to 1. In this case, analyze the overflow and underflow data to determine the source of the problem. |
| 4 | Fusion Exception Detection | While the network runs, the system fuses operators according to built-in fusion patterns to improve network performance. Because most fusions are performed automatically, your model may contain an operator that the fusion implementations do not yet cover, which can affect model accuracy. You can disable fusion to determine whether the problem occurs in the operator fusion phase. |
| 5 | Network Accuracy Comparison | If the steps above do not reveal the accuracy problem, dump the compute result of each operator during training and compare the dump data with that of the corresponding benchmark operator (such as the TensorFlow equivalent) to quickly spot the faulty operators. |
| 6 | Random Error Detection | While the network runs, a calculation with the same inputs may produce different outputs. If such random errors occur, run the training twice, collect the compute result (that is, the dump data) of each operator, and compare the two sets of data to quickly locate the operator layer that causes the errors. |
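
The dump comparison used in steps 5 and 6 can be sketched as follows. This is a minimal illustration in plain Python, not the One-Click Accuracy Analyzer itself. It assumes that the dump data of each operator has already been loaded into a flat list of floats keyed by operator name. The function names, the operator names, and the 0.99 cosine threshold are hypothetical choices made for this example only.

```python
import math
from typing import Dict, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two flattened tensors of equal length."""
    if len(a) != len(b):
        raise ValueError("tensors must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 and norm_b == 0.0:
        return 1.0  # two all-zero tensors are treated as identical
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # only one tensor is all zeros
    return dot / (norm_a * norm_b)


def max_abs_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference (0.0 for empty tensors)."""
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def first_divergent_operator(
    dump_a: Dict[str, List[float]],
    dump_b: Dict[str, List[float]],
    op_order: Sequence[str],
    min_cosine: float = 0.99,  # hypothetical threshold, tune per network
) -> Optional[Tuple[str, float, float]]:
    """Walk operators in execution order and return the first mismatch.

    Returns (operator name, cosine similarity, max absolute error), or
    None if every operator passes. A NaN similarity (for example, caused
    by inf or NaN values in a dump) is also reported as a mismatch.
    """
    for name in op_order:
        cos = cosine_similarity(dump_a[name], dump_b[name])
        if math.isnan(cos) or cos < min_cosine:
            return name, cos, max_abs_error(dump_a[name], dump_b[name])
    return None


# Hypothetical dump data: three operators, already flattened to lists.
npu_dump = {"conv1": [1.0, 2.0, 3.0], "relu1": [1.0, 0.0, 3.0], "fc": [0.5, -0.5]}
ref_dump = {"conv1": [1.0, 2.0, 3.0], "relu1": [1.0, 2.0, 3.0], "fc": [0.5, 0.5]}
order = ["conv1", "relu1", "fc"]

result = first_divergent_operator(npu_dump, ref_dump, order)
print(result)  # the first operator reported here is "relu1"
```

For step 5, pass the NPU dump and the benchmark dump. For step 6, pass the dumps collected from two NPU runs with the same inputs. Reporting only the first mismatch in execution order is a deliberate choice: errors in one operator usually propagate to every operator after it, so the earliest divergence is the most useful place to start.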