Check Before Porting

When a model is ported from the GPU/CPU to the NPU for training, perform the following checks to rule out problems introduced by the porting itself.

  1. Check the consistency of the model training accuracy.

    Run training several times on both the GPU/CPU and the NPU. If the accuracy on the NPU fluctuates within the same range as the accuracy on the GPU/CPU, NPU training has no accuracy problem. If the average accuracy on the GPU/CPU is significantly higher than that on the NPU and the gap exceeds the normal run-to-run fluctuation, NPU training may have an accuracy problem.
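
    The comparison can be sketched in plain Python as follows. This is only an illustration: the function name compare_accuracy, the two-standard-deviation threshold, and the accuracy values are assumptions made for the example, not part of any Ascend tool, and the threshold should be adapted to the fluctuation actually observed for the model.

```python
from statistics import mean, stdev


def compare_accuracy(baseline_runs, npu_runs, num_std=2.0):
    """Compare final accuracies from repeated GPU/CPU and NPU training runs.

    Returns (gap, tolerance, suspicious), where gap is the baseline mean
    minus the NPU mean and tolerance is num_std times the sample standard
    deviation of the baseline runs. The threshold is an assumed heuristic.
    """
    gap = mean(baseline_runs) - mean(npu_runs)
    tolerance = num_std * stdev(baseline_runs)
    return gap, tolerance, gap > tolerance


# Hypothetical accuracies from five runs each (placeholder values).
gpu_acc = [0.761, 0.758, 0.763, 0.760, 0.759]
npu_ok = [0.760, 0.757, 0.762, 0.759, 0.761]
npu_low = [0.741, 0.739, 0.744, 0.740, 0.742]

for name, runs in (("npu_ok", npu_ok), ("npu_low", npu_low)):
    gap, tolerance, suspicious = compare_accuracy(gpu_acc, runs)
    print(f"{name}: gap={gap:.4f} tolerance={tolerance:.4f} suspicious={suspicious}")
```

    A gap inside the tolerance suggests that the NPU result is within normal fluctuation. A larger gap is a reason to continue with the checks below.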

  2. Check the model configuration after porting.
    • Ensure that the mixed precision mode on the NPU is the same as that on the GPU. To do so, set the precision_mode_v2 option to origin. For details, see "Session Configuration Parameters" in TF Adapter APIs (1.x).
    • Ensure that loss scaling is correctly enabled on the NPU. If LossScaleManager is used on the GPU/CPU for dynamic loss scaling, NPULossScaleOptimizer must be used on the NPU. For details, see "Loss Scale" in TensorFlow 1.15 Model Porting Guide.
    • Ensure that the configurations, such as the dataset, data preprocessing mode, and model hyperparameters, used for training on the GPU/CPU and NPU are the same, except for the API modifications involved in the porting.
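
    One way to check the last point is to dump the effective training configuration on each platform and compare the two. The sketch below assumes that both configurations are available as flat Python dictionaries. The helper diff_configs, the key names, and the values are hypothetical and chosen only for illustration.

```python
MISSING = "<missing>"


def diff_configs(baseline, ported, ignore=()):
    """Return {key: (baseline_value, ported_value)} for every key that differs.

    Keys listed in `ignore` are skipped, which is useful for settings that
    are expected to change during porting (for example the device type).
    """
    diffs = {}
    for key in sorted(set(baseline) | set(ported)):
        if key in ignore:
            continue
        left = baseline.get(key, MISSING)
        right = ported.get(key, MISSING)
        if left != right:
            diffs[key] = (left, right)
    return diffs


# Hypothetical configurations (placeholder keys and values).
gpu_cfg = {"dataset": "train_v1", "batch_size": 256, "learning_rate": 0.1,
           "loss_scale": "dynamic", "device": "gpu"}
npu_cfg = {"dataset": "train_v1", "batch_size": 128, "learning_rate": 0.1,
           "loss_scale": "dynamic", "device": "npu"}

mismatches = diff_configs(gpu_cfg, npu_cfg, ignore={"device"})
for key, (gpu_value, npu_value) in mismatches.items():
    print(f"{key}: GPU/CPU={gpu_value!r} NPU={npu_value!r}")
```

    Any key reported here, other than the ones deliberately ignored, should be aligned before the accuracy results of the two platforms are compared.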
  3. Use the high-precision mode.

    If the accuracy problem persists after the preceding checks are complete, enable the high-precision mode for NPU training and perform training again to check whether the problem is caused by the operator precision mode.

    Training configuration example in session.run mode:

    custom_op.parameter_map["op_select_implmode"].s = tf.compat.as_bytes("high_precision")

    For details, see the "Session Configuration Parameters" in TF Adapter APIs (1.x).

    Training configuration example in Estimator mode:

    config = NPURunConfig(op_select_implmode="high_precision")

    For details, see the "NPURunConfig Configuration Parameters" in TF Adapter APIs (1.x).
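
    The one-line session.run setting above relies on an existing custom_op object. The fragment below sketches where that object usually comes from, assuming TensorFlow 1.15 with the TF Adapter (npu_bridge) installed on an Ascend host. It also sets precision_mode_v2 to origin as described in step 2. Treat it as a configuration sketch rather than a complete training script, and verify each option against "Session Configuration Parameters" for the installed version.

```python
# Configuration sketch: requires TensorFlow 1.15 and the TF Adapter
# (npu_bridge) on an Ascend host. It will not import elsewhere.
import tensorflow as tf
from npu_bridge.npu_init import *  # registers the NPU optimizer with TensorFlow
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True  # run training on the NPU

# Step 2: keep the precision of the original graph.
custom_op.parameter_map["precision_mode_v2"].s = tf.compat.as_bytes("origin")
# Step 3: select the high-precision operator implementations.
custom_op.parameter_map["op_select_implmode"].s = tf.compat.as_bytes("high_precision")

# Disable the graph rewrites that conflict with the NPU optimizer.
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
config.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # The training loop goes here.
```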