Checklist

Before you start accuracy tuning, use the following checklist to rule out errors in the benchmark model or in the model porting process.

Table 1 Pre-tuning checklist

No.

Item

Description

Result

Benchmark model script check

1

Constant Validation Accuracy

The benchmark model should produce stable, reproducible validation accuracy. If it does not meet this requirement, it cannot serve as the accuracy benchmark.

Passed/Failed/Not checked

2

Mixed Precision Training

Because the hardware architecture of the Ascend AI Processor (NPU) supports only mixed precision training for the user model, the user model must be trained with mixed precision. If mixed precision training is not enabled, or is not enabled successfully, the NPU may fail to train the model, or the accuracy of the trained model may fall short of expectations.

Passed/Failed/Not checked

Ported script check

3

Mixed Precision Training on the NPU

Before accuracy tuning, ensure that the model has been successfully migrated to the NPU, that distributed training (if involved) is enabled, and that mixed precision training was enabled during model migration.

Passed/Failed/Not checked

4

Loss Scaling on the NPU

Loss scaling must be enabled in the script migrated to the NPU. The LossScaleManager parameters generally need to be configured, because the NPU differs from the GPU in mixed precision computing.

Passed/Failed/Not checked

6

Dataset Processing

Check the integrity of the dataset. Training datasets are usually large and can easily become incomplete.

Passed/Failed/Not checked

7

Data Preprocessing

The data preprocessing code may contain variables that are set automatically based on the available resources, which can lead to different dataset shuffle orders. Check the API calls related to data preprocessing in the code to minimize such differences.

Passed/Failed/Not checked

8

Shard Method

The data preprocessing code of the user model may shard datasets across nodes based on file names or on the number of files. Because the file read API may sort file names differently on different nodes, this can cause large differences between the user model and the benchmark model, or even cause the same files to be sharded to a single node more than once. Add debugging code to rule out such problems and ensure that the sharding policy is consistent with that of the benchmark model.

Passed/Failed/Not checked

9

Training Procedure

Process errors during training, such as failing to clear intermediate activations, are common and cause the accuracy to differ from that of the benchmark model. Familiarize yourself with the training process and check your training and validation steps.

Passed/Failed/Not checked

10

Model Hyperparameters

The hyperparameters set in the ported script may differ from those set in the benchmark model. Ensure that the hyperparameters in use are the same as those set in the benchmark model.

Passed/Failed/Not checked
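
For item 8 (Shard Method), the debugging code mentioned in the description could look something like the following minimal sketch. It assumes that files are sharded by taking every n-th file from a listing, which may not match the sharding policy of your model. The helper names shard_files and check_shards and the file names are hypothetical and used only for illustration. The sketch shows how differently ordered file listings on two nodes can produce overlapping shards, and how sorting the file names before sharding avoids this.

```python
from collections import Counter


def shard_files(file_names, num_nodes, node_rank):
    """Return the files assigned to one node (hypothetical helper).

    Sorting first makes the assignment independent of the order in which
    the file system or file read API happens to list the files.
    """
    ordered = sorted(file_names)
    return ordered[node_rank::num_nodes]


def check_shards(file_names, shards):
    """Return (duplicated, missing): files assigned more than once or never."""
    counts = Counter(name for shard in shards for name in shard)
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    missing = sorted(set(file_names) - set(counts))
    return duplicated, missing


# Two nodes that see the same four files listed in different orders.
listing_node0 = ["train-00002", "train-00000", "train-00003", "train-00001"]
listing_node1 = ["train-00000", "train-00001", "train-00002", "train-00003"]

# Without sorting, each node slices its own listing, so shards can overlap.
unsorted_shards = [listing_node0[0::2], listing_node1[1::2]]

# With sorting, every node derives its shard from the same file order.
sorted_shards = [
    shard_files(listing_node0, num_nodes=2, node_rank=0),
    shard_files(listing_node1, num_nodes=2, node_rank=1),
]

print("unsorted:", check_shards(listing_node1, unsorted_shards))
print("sorted:  ", check_shards(listing_node1, sorted_shards))
```

In this example the unsorted case reports one file that is read twice and one file that is never read, while the sorted case reports neither. Running the same check on the shards of both the ported script and the benchmark model is one way to confirm that the two use the same sharding policy.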