Checklist

Before you start accuracy debugging, use the following checklist to rule out errors in the benchmark model or in the model porting process.

Table 1 Pre-debugging checklist

No.

Item

Description

Result

Benchmark model script check

1

Constant Validation Accuracy

The benchmark model should produce stable, reproducible validation accuracy. If it does not, it cannot serve as the accuracy benchmark.

Passed/Failed/Not checked

2

Mixed Precision Training

Because the Ascend AI Processor (NPU) hardware architecture supports only mixed precision training for user models, the user model must be trained with mixed precision. If mixed precision training is not enabled, or is enabled but does not take effect, the NPU may fail to train the model, or the accuracy of the trained model may fall short of expectations.

Passed/Failed/Not checked

Ported script check

3

Mixed Precision Training on NPU

Confirm that the model has been successfully ported to the NPU before you start accuracy debugging. Ensure that mixed precision training, and distributed training if it is used, are enabled during porting.

Passed/Failed/Not checked

4

Loss Scaling on NPU

Loss scaling must be enabled in the script ported to the NPU. The LossScaleManager parameters generally also need to be tuned to preserve accuracy, because the mixed precision computation characteristics of the NPU differ from those of the GPU/CPU.

Passed/Failed/Not checked

5

Dataset Processing

Check the integrity of the dataset. Training datasets are typically large, so they can easily end up incomplete.

Passed/Failed/Not checked

6

Data Preprocessing

The data preprocessing code may contain a variable that is set automatically based on the available resources, which can lead to a different dataset shuffle order. Check the API calls related to data preprocessing in the code and minimize such differences.

Passed/Failed/Not checked

7

Shard Method

The data preprocessing code of the user model may shard the dataset across nodes based on file names or the number of files. Because the file read API may sort file names differently on different nodes, this can result in significant sharding discrepancies, or even in the same file shards being assigned to more than one node. Add debugging code to rule out such problems and ensure that the sharding policy is consistent with that of the benchmark model.

Passed/Failed/Not checked

8

Training Procedure

Process errors during training, such as not clearing intermediate activations, are common and cause the accuracy to differ from that of the benchmark model. Familiarize yourself with the training process and check your training and validation steps.

Passed/Failed/Not checked

9

Model Hyperparameters

The hyperparameters set in the ported script may differ from those set in the benchmark model. Ensure that the hyperparameters in use are the same as those set in the benchmark model.

Passed/Failed/Not checked
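
For item 5 (Dataset Processing), one possible way to check dataset integrity is to compare a checksum manifest of the copy used on the NPU against a manifest built from a known-good copy. The following is a minimal sketch that uses only the Python standard library. The helper names (file_sha256, build_manifest, compare_manifests) are hypothetical and are not part of any Ascend or TensorFlow API.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each file path (relative to data_dir) to its SHA-256 digest."""
    root = Path(data_dir)
    return {
        p.relative_to(root).as_posix(): file_sha256(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_manifests(reference, candidate):
    """Report files that are missing, unexpected, or changed in candidate."""
    shared = set(reference) & set(candidate)
    return {
        "missing": sorted(set(reference) - set(candidate)),
        "extra": sorted(set(candidate) - set(reference)),
        "changed": sorted(n for n in shared if reference[n] != candidate[n]),
    }
```

Build the reference manifest once on the machine that holds the complete dataset, build the candidate manifest on the training host, and treat any non-empty list in the comparison result as a failed check.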
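
For item 6 (Data Preprocessing), the sketch below illustrates why a resource-derived setting can change the shuffle order. It is a simplified, pure-Python shuffle buffer written for illustration, not the implementation used by any framework. The function name buffered_shuffle and the constant PINNED_BUFFER_SIZE are assumptions. The point is that the order is reproducible only when both the seed and the buffer size are identical, so a buffer size or worker count derived from the host's resources should be pinned to an explicit value in both scripts.

```python
import random


def buffered_shuffle(items, buffer_size, seed):
    """Shuffle a stream with a fixed-size buffer and a seeded generator."""
    if buffer_size < 1:
        raise ValueError("buffer_size must be at least 1")
    rng = random.Random(seed)
    buffer, output = [], []
    for item in items:
        if len(buffer) < buffer_size:
            buffer.append(item)        # still filling the buffer
            continue
        slot = rng.randrange(buffer_size)
        output.append(buffer[slot])    # emit a randomly chosen buffered item
        buffer[slot] = item            # replace it with the incoming item
    rng.shuffle(buffer)                # drain the remainder at end of stream
    output.extend(buffer)
    return output


# Hypothetical fixed value shared by the benchmark and ported scripts,
# instead of a value computed from the CPU count or memory of each host.
PINNED_BUFFER_SIZE = 4
order_a = buffered_shuffle(range(10), PINNED_BUFFER_SIZE, seed=0)
order_b = buffered_shuffle(range(10), PINNED_BUFFER_SIZE, seed=0)
```

With the same seed and the same pinned buffer size, order_a and order_b are identical. If the buffer size were computed per host, two machines could produce different orders even with the same seed.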
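
For item 7 (Shard Method), the debugging code could look something like the following sketch. It sorts the file names explicitly before sharding, so the result does not depend on the order in which each node lists the files, and then checks that the shards are disjoint and cover every file. The function names and the .tfrecord file names are hypothetical examples.

```python
from collections import Counter


def shard_files(file_names, num_shards, shard_index):
    """Return the files of one shard, taken from an explicitly sorted list."""
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must be in [0, num_shards)")
    return sorted(file_names)[shard_index::num_shards]


def check_shards(shards, all_files):
    """Return (duplicated, missing) file names across the given shards."""
    counts = Counter(name for shard in shards for name in shard)
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    missing = sorted(set(all_files) - set(counts))
    return duplicated, missing


# Two nodes that list the same directory in different orders.
node0_listing = ["b.tfrecord", "a.tfrecord", "d.tfrecord", "c.tfrecord"]
node1_listing = ["a.tfrecord", "b.tfrecord", "c.tfrecord", "d.tfrecord"]

# Sharding by position in the raw listing: both nodes pick the same files.
raw_shards = [node0_listing[0::2], node1_listing[1::2]]
raw_report = check_shards(raw_shards, node1_listing)

# Sharding after an explicit sort: the shards are disjoint and complete.
sorted_shards = [
    shard_files(node0_listing, 2, 0),
    shard_files(node1_listing, 2, 1),
]
sorted_report = check_shards(sorted_shards, node1_listing)
```

In this example raw_report lists b.tfrecord and d.tfrecord as duplicated and a.tfrecord and c.tfrecord as missing, while sorted_report is empty. Logging the shard contents on every node and running a check of this kind on the collected logs is one way to compare the sharding policy with that of the benchmark model.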
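
For item 9 (Model Hyperparameters), a simple way to confirm that both scripts use the same settings is to dump the effective hyperparameters of each run to a dictionary and diff them. The sketch below assumes the hyperparameters are available as flat dictionaries. The function name diff_hyperparameters and the example keys and values are hypothetical.

```python
import math

MISSING = "<missing>"  # placeholder for a key that one side does not define


def diff_hyperparameters(benchmark, ported, rel_tol=1e-9):
    """Return {name: (benchmark_value, ported_value)} for every mismatch."""
    diffs = {}
    for name in sorted(set(benchmark) | set(ported)):
        a = benchmark.get(name, MISSING)
        b = ported.get(name, MISSING)
        both_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in (a, b)
        )
        # Compare numbers with a tolerance, everything else exactly.
        same = math.isclose(a, b, rel_tol=rel_tol) if both_numeric else a == b
        if not same:
            diffs[name] = (a, b)
    return diffs


benchmark_hparams = {"learning_rate": 0.1, "batch_size": 256, "momentum": 0.9}
ported_hparams = {"learning_rate": 0.1, "batch_size": 32, "weight_decay": 1e-4}
hparam_diffs = diff_hyperparameters(benchmark_hparams, ported_hparams)
```

Here hparam_diffs reports batch_size as changed, momentum as missing from the ported script, and weight_decay as missing from the benchmark. An empty result is what the check in item 9 is looking for.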