Checklist
Before you start accuracy tuning, use the following checklist to rule out errors in the benchmark model or in the model porting process.
No. |
Item |
Description |
Result |
|---|---|---|---|
Benchmark model script check |
|||
1 |
The benchmark model should produce consistent predictions across runs. If it does not, it cannot serve as the accuracy benchmark. |
Passed/Failed/Not checked |
|
2 |
Because the hardware architecture of the Ascend AI Processor (NPU) supports only mixed precision training for the user model, the user model must be trained with mixed precision. If mixed precision training is not enabled, or is enabled but does not take effect, the NPU may fail to train the model or the accuracy of the trained model may fall short of expectations. |
Passed/Failed/Not checked |
|
Ported script check |
|||
3 |
Before accuracy tuning, ensure that the model has been successfully migrated to the NPU, and that distributed training (if used) and mixed precision training were enabled during migration. |
Passed/Failed/Not checked |
|
4 |
Loss scaling must be enabled in the script migrated to the NPU. Generally, the LossScaleManager parameters need to be configured, as the NPU differs from the GPU in mixed precision computing. |
Passed/Failed/Not checked |
|
6 |
Check the dataset integrity. Training datasets are usually large and can easily end up incomplete. |
Passed/Failed/Not checked |
|
7 |
The data preprocessing code may contain a variable that is set automatically based on available resources, which can lead to a different dataset shuffle order. Check the API calls related to data preprocessing in the code to minimize such differences. |
Passed/Failed/Not checked |
|
8 |
The data preprocessing code of the user model may shard the dataset across nodes based on file names or the number of files. Because the file read API may sort file names differently on different nodes, this can cause large differences between the user model and the benchmark model, or even cause the same files to be assigned repeatedly to a single node. Add debugging code to rule out such problems and ensure that the sharding policy is consistent with that of the benchmark model. |
Passed/Failed/Not checked |
|
9 |
Process errors during training, such as failing to clear intermediate activations, are common and cause the accuracy to differ from that of the benchmark model. Get familiar with the training process and check your training and validation steps. |
Passed/Failed/Not checked |
|
10 |
The hyperparameters set in the ported script may differ from those set in the benchmark model. Ensure that the hyperparameters in use are the same as those set in the benchmark model. |
Passed/Failed/Not checked |
|
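
For item 8, the debugging code could look something like the following minimal sketch. It assumes that shards are assigned by file name, and it sorts the names before splitting them so that the result does not depend on the order in which a node lists its files. The function names `shard_files` and `check_sharding` are hypothetical and are not part of any Ascend or framework API.

```python
def shard_files(file_names, num_shards, shard_index):
    """Return the files for one shard using a sorted, round-robin split."""
    if not 0 <= shard_index < num_shards:
        raise ValueError("shard_index must be in [0, num_shards)")
    # Sort first so that every node derives the same order,
    # regardless of how its file read API listed the files.
    ordered = sorted(file_names)
    return ordered[shard_index::num_shards]


def check_sharding(file_names, num_shards):
    """Debug check: every file is assigned to exactly one shard."""
    assignment = {}
    for index in range(num_shards):
        for name in shard_files(file_names, num_shards, index):
            if name in assignment:
                raise AssertionError(
                    f"{name} assigned to shards {assignment[name]} and {index}"
                )
            assignment[name] = index
    unassigned = set(file_names) - set(assignment)
    if unassigned:
        raise AssertionError(f"files not assigned to any shard: {sorted(unassigned)}")
    return assignment
```

Running `check_sharding` on the file list seen by each node, and comparing the returned assignments across nodes and against the benchmark model, is one way to confirm that the sharding policy matches.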
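
For item 10, one way to compare hyperparameters is to dump them from both scripts into plain dictionaries and diff the two. The sketch below assumes that this is possible for your scripts. The helper name `diff_hyperparameters` and the `"<missing>"` placeholder are illustrative choices, not an existing API.

```python
import math


def diff_hyperparameters(benchmark, ported, rel_tol=1e-9):
    """Return {name: (benchmark_value, ported_value)} for every mismatch."""
    mismatches = {}
    for name in sorted(set(benchmark) | set(ported)):
        ref = benchmark.get(name, "<missing>")
        new = ported.get(name, "<missing>")
        both_numeric = all(
            isinstance(value, (int, float)) and not isinstance(value, bool)
            for value in (ref, new)
        )
        if both_numeric:
            # Tolerate floating-point representation noise only.
            same = math.isclose(ref, new, rel_tol=rel_tol)
        else:
            same = ref == new
        if not same:
            mismatches[name] = (ref, new)
    return mismatches
```

An empty result means the two sets of hyperparameters agree. Any entry in the result names a hyperparameter to correct in the ported script before accuracy tuning.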
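
For item 6, dataset integrity can be checked by comparing a manifest of file sizes and checksums built on the benchmark machine with one built on the NPU machine. The following sketch uses only the Python standard library. The manifest layout and the function names are assumptions made for illustration.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash a file in chunks so that large dataset files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(data_dir):
    """Map each relative file path to its (size in bytes, SHA-256 digest)."""
    root = Path(data_dir)
    return {
        path.relative_to(root).as_posix(): (path.stat().st_size, file_sha256(path))
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def compare_manifests(reference, local):
    """Report files that are missing, unexpected, or different in the local copy."""
    shared = set(reference) & set(local)
    return {
        "missing": sorted(set(reference) - set(local)),
        "extra": sorted(set(local) - set(reference)),
        "changed": sorted(name for name in shared if reference[name] != local[name]),
    }
```

If all three lists in the report are empty, the local dataset copy matches the reference copy file for file.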