Accuracy Tuning Process

This section describes how to tune the accuracy when the ported model functions properly on the Ascend AI Processor but still suffers from accuracy issues or poor convergence.

Background

After the ported model is trained on the Ascend AI Processor (NPU for short) and runs properly, its accuracy may not meet requirements, or it may converge poorly. Issues that may arise when the model runs on the Ascend AI Processor include but are not limited to the following:

  • The loss curve differs greatly from that of the benchmark model.
  • The validation accuracy differs greatly from that of the benchmark model.

These accuracy issues are difficult to locate for the following reasons:

  • Training completes without exceptions.
  • No warning or error is recorded in the logs.
  • The differences are found only during comparison with the benchmark model.
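
Because these differences surface only in a comparison, it can help to compare the two loss curves step by step. The following is a minimal sketch, assuming that the per-step loss values of both runs have already been exported as plain lists. The function name, the sample values, and the 5% relative tolerance are illustrative assumptions, not part of any Ascend tool.

```python
def loss_curve_deviation(npu_losses, benchmark_losses, rel_tol=0.05):
    """Return (step, npu_loss, benchmark_loss, rel_diff) for every step whose
    relative difference from the benchmark exceeds rel_tol.

    rel_tol=0.05 is an assumed tolerance; choose one that suits your model.
    """
    if len(npu_losses) != len(benchmark_losses):
        raise ValueError("loss curves must cover the same steps")
    flagged = []
    for step, (npu, ref) in enumerate(zip(npu_losses, benchmark_losses)):
        # Guard against division by zero when the benchmark loss is 0.
        denom = max(abs(ref), 1e-12)
        rel_diff = abs(npu - ref) / denom
        if rel_diff > rel_tol:
            flagged.append((step, npu, ref, rel_diff))
    return flagged


# Hypothetical per-step losses, for illustration only.
benchmark = [2.30, 1.90, 1.50, 1.20, 1.00]
npu = [2.31, 1.92, 1.80, 1.70, 1.65]

for step, npu_loss, ref_loss, rel in loss_curve_deviation(npu, benchmark):
    print(f"step {step}: NPU {npu_loss:.2f} vs benchmark {ref_loss:.2f} ({rel:.0%} off)")
```

In this example, the curves agree for the first two steps and diverge from step 2 onward, which is the kind of gap that training logs alone would not reveal.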

This section provides guidance for you to tune the accuracy.

Tuning Approach

Accuracy problems can stem from several sources, for example, an unqualified benchmark model, an error introduced during model porting, or insufficient operator accuracy on the network. Table 1 summarizes the accuracy tuning workflow and the possible causes that each step addresses.

Table 1 Accuracy tuning process

| No. | Step | Description |
| --- | --- | --- |
| 1 | Pre-tuning Check | Check the following items before accuracy tuning. Original script: Verify that the benchmark model is qualified. Ported script: Check that no errors occurred during model porting. |
| 2 | Model Accuracy Analyzer Deployment | Before accuracy tuning, install the one-click accuracy analyzer on your training NPU. |
| 3 | Floating-Point Exception Detection | Frequent floating-point exceptions may occur during training. Typical symptoms are the loss scale dropping sharply or falling directly to 1. In such cases, analyze the overflow data to identify the cause of the exception. |
| 4 | Fusion Exception Detection | At network run time, the system fuses operators according to built-in fusion patterns for better network performance. Because most fusions are performed automatically, your model may contain an operator that the fusion implementations do not yet cover, which can affect model accuracy. You can disable fusion to determine whether the problem occurs in the operator fusion phase. |
| 5 | Network-wide Accuracy Comparison | If the accuracy still does not meet expectations after the preceding steps, collect operator execution results (dump data) during training and compare them with the results of the benchmark operators (such as TensorFlow operators). This helps quickly pinpoint operators with accuracy issues. |
| 6 | Random Error Detection | At network run time, the same computation with the same inputs may produce different outputs. If such random errors occur, run training twice and collect operator results (dump data) from both runs. Compare the results to quickly identify the operators that cause randomness. |
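
The loss scale symptom described in step 3 can be checked from a training log before you analyze overflow data. The following is a minimal sketch, assuming that the loss scale value of each step has already been extracted into a list. The function name, the sample values, and the thresholds are illustrative assumptions. A single halving is treated as normal dynamic loss scaling behavior, so only larger drops are flagged.

```python
def find_loss_scale_collapses(loss_scales, drop_ratio=0.25, floor=1.0):
    """Return (step, previous_scale, current_scale) wherever the loss scale
    drops sharply or reaches the floor value.

    drop_ratio and floor are assumed thresholds, not values defined by the
    Ascend tooling.
    """
    events = []
    for step in range(1, len(loss_scales)):
        prev, cur = loss_scales[step - 1], loss_scales[step]
        sharp_drop = cur <= prev * drop_ratio   # fell to a quarter or less
        hit_floor = cur <= floor < prev         # newly dropped to 1
        if sharp_drop or hit_floor:
            events.append((step, prev, cur))
    return events


# Hypothetical loss scale values per step, for illustration only.
scales = [65536.0, 65536.0, 32768.0, 4096.0, 4096.0, 1.0, 1.0]

for step, prev, cur in find_loss_scale_collapses(scales):
    print(f"step {step}: loss scale fell from {prev:g} to {cur:g}")
```

The steps reported by such a check are candidates for closer inspection of the overflow data.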
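
The dump data comparisons in steps 5 and 6 follow the same idea: walk through the operators in execution order and report the first one whose outputs disagree. The following is a minimal sketch, assuming that the dump data has already been parsed into dictionaries that map operator names to flat lists of floats. Cosine similarity and maximum absolute error are used here as assumed comparison metrics. The function names, thresholds, and sample data are hypothetical and do not reflect the dump file format or the analyzer's actual interface.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Two all-zero outputs match; a zero and a nonzero output do not.
        return 1.0 if norm_a == norm_b else 0.0
    return dot / (norm_a * norm_b)


def max_abs_error(a, b):
    """Largest element-wise absolute difference."""
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def first_mismatch(npu_dump, ref_dump, op_order, min_cosine=0.99, max_error=1e-3):
    """Step 5: first operator whose NPU output deviates from the benchmark.

    min_cosine and max_error are assumed thresholds.
    """
    for op in op_order:
        a, b = npu_dump[op], ref_dump[op]
        if len(a) != len(b):
            return op
        if cosine_similarity(a, b) < min_cosine or max_abs_error(a, b) > max_error:
            return op
    return None


def first_nondeterministic_op(run1_dump, run2_dump, op_order):
    """Step 6: first operator whose outputs differ between two identical runs."""
    for op in op_order:
        if run1_dump[op] != run2_dump[op]:
            return op
    return None


# Hypothetical parsed dump data, for illustration only.
op_order = ["conv1", "relu1", "softmax"]
benchmark = {"conv1": [0.5, -1.0, 2.0], "relu1": [0.5, 0.0, 2.0], "softmax": [0.1, 0.2, 0.7]}
npu_run1 = {"conv1": [0.5, -1.0, 2.0], "relu1": [0.5, 0.0, 2.0], "softmax": [0.1, 0.3, 0.6]}
npu_run2 = {"conv1": [0.5, -1.0, 2.0], "relu1": [0.5, 0.0, 2.0004], "softmax": [0.1, 0.3, 0.6]}

print("first deviating operator:", first_mismatch(npu_run1, benchmark, op_order))
print("first nondeterministic operator:", first_nondeterministic_op(npu_run1, npu_run2, op_order))
```

Only the first differing operator is reported because every operator downstream of it receives different inputs and is therefore expected to differ as well. The run-to-run check uses exact equality, since two runs with the same inputs are expected to match exactly.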