Training with Mixed Precision

Introduction to Mixed Precision

Mixed precision is a widely used technique for improving training performance. It combines the float16 and float32 data types when training deep neural networks: lowering the precision of selected computations increases computing parallelism and reduces memory usage and memory accesses. This makes mixed precision a good choice for training large networks without compromising the accuracy achieved with float32 alone.

You can enable mixed precision by configuring precision_mode_v2 or precision_mode in the script.

For details about precision_mode and precision_mode_v2, see Accuracy Tuning.

If automatic mixed precision is enabled, you are advised to also enable the LossScaleOptimizer to compensate for the accuracy loss caused by precision reduction. For details about how to port the LossScaleOptimizer, see Replacing LossScaleOptimizer. After analyzing profile data, you can manually modify the precision mode of individual operators. See Modifying the Blocklist and Trustlist for Mixed Precision to specify which operators have their precision reduced or preserved.
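
The effect that loss scaling compensates for can be illustrated without any NPU. The following is a minimal sketch that uses only the Python standard library to emulate float16 rounding. It does not use npu_device or the LossScaleOptimizer, and the helper name to_float16, the gradient value, and the scale 1024 are illustrative assumptions, not recommended settings.

```python
import struct


def to_float16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


LOSS_SCALE = 1024.0  # illustrative value only
grad = 1e-8          # a small gradient value

# Without scaling, the value is too small for float16 and is flushed to zero.
unscaled = to_float16(grad)

# With scaling, the value survives the float16 round trip and is
# divided by the scale afterwards in higher precision.
recovered = to_float16(grad * LOSS_SCALE) / LOSS_SCALE

print(unscaled)
print(recovered)
```

The first printed value is zero, while the second stays close to the original gradient. This is the kind of loss that scaling the loss before the backward pass is meant to avoid.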

Setting the Precision Mode

This section describes how to set the mixed precision mode, using precision_mode_v2 set to mixed_float16 as an example.

Before initializing the NPU, set precision_mode_v2 in your training script by referring to Accuracy Tuning.

import npu_device as npu

# Enable automatic mixed precision: both float16 and float32 are used for neural network processing.
npu.global_options().precision_mode_v2 = 'mixed_float16'
# Initialize the NPU after the global options have been set.
npu.open().as_default()

Modifying the Blocklist and Trustlist for Mixed Precision

When automatic mixed precision is enabled, the system automatically reduces the precision of some operators on a network from float32 to float16 based on the built-in tuning policy. This improves system performance and reduces memory usage with little accuracy loss.

The built-in tuning policy is defined in /opp/built-in/op_impl/ai_core/tbe/config/<soc_version>/aic-<soc_version>-ops-info.json under the CANN installation directory. Each operator entry can carry a precision_reduce field, for example:
"Conv2D":{
    "precision_reduce":{
        "flag":"true"
    },
  • When precision_mode_v2 is set to mixed_float16, or precision_mode is set to allow_mix_precision_fp16 or allow_mix_precision:
    • If the field value is true, the operator is on the mixed precision trustlist and its precision will be reduced from float32 to float16.
    • If the field value is false, the operator is on the mixed precision blocklist and its precision will not be reduced from float32 to float16.
    • If an operator does not have the precision_reduce option configured, the operator is on the graylist and will follow the same precision processing as the upstream operator.
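
The three rules above can be sketched as a small lookup function. This is an illustration only, assuming that each operator entry is a dictionary shaped like the Conv2D excerpt. The function name classify_operator and the operators OpB and OpC are hypothetical.

```python
def classify_operator(op_info: dict) -> str:
    """Return 'trustlist', 'blocklist', or 'graylist' for one operator entry."""
    flag = op_info.get("precision_reduce", {}).get("flag")
    if flag is None:
        # No precision_reduce option: follow the upstream operator.
        return "graylist"
    if flag == "true":
        # Precision is reduced from float32 to float16.
        return "trustlist"
    if flag == "false":
        # Precision is kept at float32.
        return "blocklist"
    raise ValueError(f"unexpected precision_reduce flag: {flag!r}")


# Hypothetical entries; only Conv2D mirrors the excerpt above.
ops = {
    "Conv2D": {"precision_reduce": {"flag": "true"}},
    "OpB": {"precision_reduce": {"flag": "false"}},
    "OpC": {},
}
for name, info in ops.items():
    print(name, classify_operator(info))
```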

Starting from the built-in tuning policy, you can specify which operators have their precision reduced or preserved in either of the following ways:

  • (Recommended) Use modify_mixlist to modify the blocklist, trustlist, and graylist of mixed precision.

    Before initializing the NPU, set modify_mixlist in your training script to the path of the configuration file by referring to Accuracy Tuning. The following is an example:

    import npu_device as npu
    npu.global_options().modify_mixlist = "/home/test/ops_info.json"
    npu.open().as_default()
    
    ops_info.json is the configuration file of the blocklist, trustlist, and graylist for mixed precision. Multiple operators are separated by commas (,). An example is as follows:
    {
      "black-list": {                  // Blocklist
         "to-remove": [                // Move an operator from the blocklist to the graylist.
         "Xlog1py"
         ],
         "to-add": [                   // Move an operator from the trustlist or graylist to the blocklist.
         "Matmul",
         "Cast"
         ]
      },
      "white-list": {                  // Trustlist
         "to-remove": [                // Move an operator from the trustlist to the graylist.
         "Conv2D"
         ],
         "to-add": [                   // Move an operator from the blocklist or graylist to the trustlist.
         "Bias"
         ]
      }
    }

    Assume that operator A is in the trustlist by default. If you want to move it to the blocklist, follow any of the positive examples below:

    1. (Positive example) Directly add the operator to the blocklist.
      {
        "black-list": { 
           "to-add": ["A"]
        }
      }
      

      The operator is removed from the trustlist and added to the blocklist.

    2. (Positive example) Delete the operator from the trustlist and add it to the blocklist.
      {
        "black-list": {
           "to-add": ["A"]
        },
        "white-list": {
           "to-remove": ["A"]
        }
      }
      

      The operator is removed from the trustlist and added to the blocklist.

    3. (Negative example) Simply delete the operator from the trustlist. In this case, the operator will be moved to the graylist instead of the blocklist.
      {
        "white-list": {
           "to-remove": ["A"]
        }
      }
      

      The operator will be deleted from the trustlist and added to the graylist.

      If an operator is simply removed from the blocklist or trustlist, it will be added to the graylist.
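
The list movements described in the three examples can be modeled in a few lines of Python. This is a sketch of the documented behavior only, not the CANN implementation. The function apply_mixlist, the set-based representation of the lists, and the choice to process removals before additions are assumptions made for illustration.

```python
def apply_mixlist(lists, mixlist):
    """Model the documented list changes.

    lists:   {"black": set, "white": set, "gray": set} of operator names
    mixlist: dict with optional "black-list" / "white-list" sections, each
             holding optional "to-add" / "to-remove" lists
    """
    black, white, gray = (set(lists[k]) for k in ("black", "white", "gray"))

    # An operator removed from the blocklist or trustlist moves to the graylist.
    for op in mixlist.get("black-list", {}).get("to-remove", []):
        if op in black:
            black.remove(op)
            gray.add(op)
    for op in mixlist.get("white-list", {}).get("to-remove", []):
        if op in white:
            white.remove(op)
            gray.add(op)

    # An operator added to a list is taken out of whichever list held it.
    for op in mixlist.get("black-list", {}).get("to-add", []):
        white.discard(op)
        gray.discard(op)
        black.add(op)
    for op in mixlist.get("white-list", {}).get("to-add", []):
        black.discard(op)
        gray.discard(op)
        white.add(op)

    return {"black": black, "white": white, "gray": gray}


# Operator A starts in the trustlist, as in the examples above.
defaults = {"black": set(), "white": {"A"}, "gray": set()}

example_1 = apply_mixlist(defaults, {"black-list": {"to-add": ["A"]}})
example_2 = apply_mixlist(
    defaults,
    {"black-list": {"to-add": ["A"]}, "white-list": {"to-remove": ["A"]}},
)
example_3 = apply_mixlist(defaults, {"white-list": {"to-remove": ["A"]}})

for name, result in [("1", example_1), ("2", example_2), ("3", example_3)]:
    location = [key for key, ops in result.items() if "A" in ops]
    print("example", name, "-> A is in", location)
```

In this model, examples 1 and 2 both leave A in the blocklist, while example 3 leaves it in the graylist. The model does not define what happens if the same operator appears in the to-add lists of both sections.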

  • Modify the operator information library.

    Modifying the built-in operator information library may affect other networks. Proceed with caution.

    1. Go to /opp/built-in/op_impl/ai_core/tbe/config/<soc_version> under the CANN installation directory.
    2. Grant write permission on the aic-<soc_version>-ops-info.json file:
      chmod u+w aic-<soc_version>-ops-info.json

      All .json files in the current directory are loaded into the operator information library. Therefore, if you back up the original .json file, store the backup in another directory.

    3. Modify or add the precision_reduce field of the corresponding operator in the aic-<soc_version>-ops-info.json file in the operator information library.
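
Step 3 can also be scripted. The following is a minimal sketch, assuming that the file is a JSON object keyed by operator name, as the Conv2D excerpt suggests. The function set_precision_reduce and the file names in the demo are hypothetical. Note that rewriting the file with json.dump changes its formatting, and that the backup is deliberately stored outside the configuration directory.

```python
import json
import shutil
import tempfile
from pathlib import Path


def set_precision_reduce(info_path, op_name, reduce, backup_dir):
    """Set precision_reduce.flag for one operator, backing up the file first."""
    info_path = Path(info_path)
    backup_dir = Path(backup_dir)

    # Every .json file in the config directory is loaded, so keep backups elsewhere.
    if backup_dir.resolve() == info_path.parent.resolve():
        raise ValueError("backup_dir must differ from the config directory")
    backup_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(info_path, backup_dir / info_path.name)

    with info_path.open(encoding="utf-8") as f:
        ops = json.load(f)
    if op_name not in ops:
        raise KeyError(op_name)

    # Modify the field if present, or add it if missing.
    field = ops[op_name].setdefault("precision_reduce", {})
    field["flag"] = "true" if reduce else "false"

    with info_path.open("w", encoding="utf-8") as f:
        json.dump(ops, f, indent=4)


# Demo on a throwaway file; the real file lives under the CANN installation directory.
with tempfile.TemporaryDirectory() as tmp:
    info = Path(tmp) / "config" / "ops-info.json"  # stand-in file name
    info.parent.mkdir()
    info.write_text(json.dumps({"Conv2D": {"precision_reduce": {"flag": "true"}}}))
    set_precision_reduce(info, "Conv2D", False, Path(tmp) / "backup")
    print(json.loads(info.read_text())["Conv2D"]["precision_reduce"]["flag"])
```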