Check Before Porting
When a model is ported from the GPU/CPU to the NPU for training, perform the following checks to rule out problems introduced during porting.
- Check the consistency of the model training accuracy.
Run training several times on both the GPU/CPU and the NPU. If the accuracy on the GPU/CPU and on the NPU fluctuates within the same range, NPU training has no accuracy problem. If the average GPU/CPU accuracy is significantly higher than the average NPU accuracy, and the gap exceeds the normal run-to-run fluctuation, NPU training may have an accuracy problem.
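The comparison above can be sketched in plain Python. This is only an illustration, not part of the TF Adapter: the function name npu_accuracy_suspect, the sample accuracy values, and the rule "gap larger than k standard deviations of the GPU/CPU runs" are all assumptions chosen for the example. Pick a threshold that matches the fluctuation you actually observe for your model.

```python
from statistics import mean, stdev

def npu_accuracy_suspect(ref_acc, npu_acc, k=2.0):
    """Return True if the NPU mean accuracy is lower than the reference
    (GPU/CPU) mean by more than k standard deviations of the reference runs.

    The k-sigma rule is an assumption for this sketch, not a vendor criterion.
    """
    gap = mean(ref_acc) - mean(npu_acc)   # positive when the reference is better
    tolerance = k * stdev(ref_acc)        # normal run-to-run fluctuation
    return gap > tolerance

# Hypothetical final accuracies from five training runs each.
gpu_runs = [0.761, 0.758, 0.763, 0.760, 0.759]
npu_ok   = [0.760, 0.757, 0.762, 0.759, 0.761]
npu_bad  = [0.741, 0.739, 0.744, 0.740, 0.742]

print(npu_accuracy_suspect(gpu_runs, npu_ok))
print(npu_accuracy_suspect(gpu_runs, npu_bad))
```

With these sample values, the first NPU series stays within the GPU fluctuation range and the second one does not.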
- Check the model configuration after porting.
- Ensure that the mixed precision mode on the NPU is the same as that on the GPU. Set the precision_mode_v2 option to origin. For details, see the "Session Configuration Parameters" in TF Adapter APIs (1.x).
- Ensure that loss scaling is correctly enabled on the NPU. If LossScaleManager is used on the GPU/CPU for dynamic loss scaling, NPULossScaleOptimizer must be used on the NPU. For details, see the "Loss Scale" in TensorFlow 1.15 Model Porting Guide.
- Ensure that the configurations, such as the dataset, data preprocessing mode, and model hyperparameters, used for training on the GPU/CPU and NPU are the same, except for the API modifications involved in the porting.
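One way to verify that the two training setups match is to dump each configuration to a dictionary and compare them key by key. The sketch below is a hypothetical helper, not a TF Adapter API: the name diff_configs, the configuration keys, and the sample values are assumptions made for illustration.

```python
def diff_configs(ref_cfg, npu_cfg, expected_to_differ=()):
    """Return {key: (ref_value, npu_value)} for every unexpected mismatch.

    Keys listed in expected_to_differ (for example, settings changed by the
    porting itself) are skipped.
    """
    mismatches = {}
    for key in sorted(set(ref_cfg) | set(npu_cfg)):
        if key in expected_to_differ:
            continue
        ref_val = ref_cfg.get(key, "<missing>")
        npu_val = npu_cfg.get(key, "<missing>")
        if ref_val != npu_val:
            mismatches[key] = (ref_val, npu_val)
    return mismatches

# Hypothetical training configurations for the two platforms.
gpu_cfg = {"dataset": "train_v1", "batch_size": 256, "learning_rate": 0.1, "device": "gpu"}
npu_cfg = {"dataset": "train_v1", "batch_size": 128, "learning_rate": 0.1, "device": "npu"}

mismatches = diff_configs(gpu_cfg, npu_cfg, expected_to_differ={"device"})
print(mismatches)
```

Any key reported here, such as the batch size in this sample, should be aligned before the accuracy of the two platforms is compared.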
- Use the high-precision mode.
If the accuracy problem persists after the preceding checks are complete, enable the high-precision mode for NPU training and perform training again to check whether the problem is caused by the operator precision mode.
Training configuration example in session.run mode:
custom_op.parameter_map["op_select_implmode"].s = tf.compat.as_bytes("high_precision")
For details, see the "Session Configuration Parameters" in TF Adapter APIs (1.x).
Training configuration example in Estimator mode:
config = NPURunConfig(op_select_implmode="high_precision")
For details, see the "NPURunConfig Configuration Parameters" in TF Adapter APIs (1.x).