Gradient Splitting Tuning

This section provides guidance on gradient splitting tuning in TensorFlow-based training scenarios, including tuning precautions, environment variable configuration, the tuning procedure, and restrictions.

Tuning Precautions

  1. Ensure that the training script can be successfully executed on the Ascend AI Processor and that the functionality and accuracy meet expectations.
  2. Before gradient splitting tuning, disable the profiling function and the static memory allocation mode (that is, set static_memory_policy to 0, for example, custom_op.parameter_map["static_memory_policy"].i = 0) to avoid affecting the tuning result; see the configuration sketch after this list. Gradient splitting tuning supports only the static graph mode. For details about how to disable the profiling function, see the Performance Tuning Tool User Guide.
  3. Before performing gradient splitting tuning, ensure that the training script is a distributed script.
  4. If the gradient splitting tuning has been performed on the current network and the repository information has been generated, you do not need to enable the tuning again.
  5. AOE does not allow different users to use the same device for tuning at the same time.
  6. If there is only one AOE process, ensure that the available disk space in the home directory of the user who performs tuning is greater than or equal to 20 GB. If there are multiple AOE processes, increase the disk space accordingly.
  7. Gradient splitting tuning does not support model parallelism.
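
Precaution 2 can be applied in the TF Adapter session configuration. The following is a minimal sketch, assuming the TF1-style custom_op configuration; the profiling_mode option name is an assumption and should be checked against the Performance Tuning Tool User Guide:

import tensorflow as tf
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig
from npu_bridge.npu_init import *

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
# Keep profiling disabled during tuning (option name assumed; see the Performance Tuning Tool User Guide).
custom_op.parameter_map["profiling_mode"].b = False
# Disable the static memory allocation mode (0 = dynamic memory), as required by precaution 2.
custom_op.parameter_map["static_memory_policy"].i = 0
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF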

Environment Variable Configuration

Before using the TFAdapter to initiate tuning, you need to set the following environment variables:
  • Basic environment variables of the CANN software

    The CANN portfolio provides a process-level environment variable setting script to automatically set environment variables. The following commands use the default installation paths for the root and non-root users as examples. Replace them with the actual installation paths.

    # Install Toolkit as the root user.
    . /usr/local/Ascend/ascend-toolkit/set_env.sh 
    # Install Toolkit as a non-root user.
    . ${HOME}/Ascend/ascend-toolkit/set_env.sh 
  • AOE depends on Python. Take Python 3.7.5 as an example. Run the following commands as the running user to configure the environment variables related to Python 3.7.5:
    # Set the Python 3.7.5 library path.
    export LD_LIBRARY_PATH=/usr/local/python3.7.5/lib:$LD_LIBRARY_PATH
    # If multiple Python 3 versions exist in the user environment, use Python 3.7.5.
    export PATH=/usr/local/python3.7.5/bin:$PATH

    Replace the Python 3.7.5 installation path based on the actual requirements. You can also write the preceding commands to the ~/.bashrc file and run the source ~/.bashrc command to make the modification take effect immediately.

  • Configure the tuning mode.
    # Tuning mode. The value 4 indicates GDAT tuning. This environment variable is mandatory.
    export AOE_MODE=4
    # (Optional) Specify the storage path of the custom repository after tuning.
    export TUNE_BANK_PATH=/home/HwHiAiUser/custom_tune_bank

    Table 1 Environment variables

    AOE_MODE (Mandatory)
      Tuning mode. The value can be 4 (GDAT tuning).

    TUNE_BANK_PATH (Optional)
      Path of the custom repository generated after tuning. The path must be an absolute path or a relative path to the path of the AOE tuning engine. The path must exist and the user must have the read, write, and execute permissions on the path. If the path specified by TUNE_BANK_PATH does not exist or the user does not have the required permissions on the path, the tuning process will report an error and exit.
      • If this environment variable is not configured, the custom repository is stored in ${HOME}/Ascend/latest/data/aoe/custom/graph/${soc_version} by default.
      • If this environment variable is configured, the custom repository is stored in the path specified by this environment variable. An absolute path starts with a slash (/), for example, /home/HwHiAiUser/gdat/output.
      NOTE:
      If multiple users share the repository, the users must set TUNE_BANK_PATH to the same path and have the read and write permissions on the path.
      If the repository path is customized before tuning, you also need to configure this environment variable if you want to use the custom repository during model conversion.

    You can write the commands for configuring the environment variables to a custom script for future use.
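
Because tuning reports an error and exits when TUNE_BANK_PATH points to a path that does not exist or lacks the required permissions (see Table 1), it can be useful to validate the path before launching training. A minimal Python sketch, assuming the variable has already been exported in the shell:

import os

bank_path = os.environ.get("TUNE_BANK_PATH")
if bank_path:
    # Table 1: the path must exist and the user needs read, write, and execute permissions.
    if not os.path.isdir(bank_path):
        raise RuntimeError("TUNE_BANK_PATH does not exist: %s" % bank_path)
    if not os.access(bank_path, os.R_OK | os.W_OK | os.X_OK):
        raise RuntimeError("Missing read/write/execute permission on: %s" % bank_path)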

Tuning Procedure

  1. Ensure that collective communication operators exist in the training graph when the script is executed on a single device.

    Check whether the script contains either of the following settings:

    • Manually inserted collective communication operator, for example:
      from npu_bridge.hccl import hccl_ops
      from npu_bridge.npu_init import *
      # Insert an AllReduce collective communication operator on the tensor to be reduced.
      if get_npu_rank_size() > 0:
          result = hccl_ops.allreduce(tensor, "sum")
    • Collective communication operator inserted by the distributed optimizer, for example:
      from npu_bridge.estimator.npu.npu_optimizer import NPUOptimizer
      from npu_bridge.npu_init import *
      # Define the optimizer.
      optimizer = LAMBOptimizer(......)
      # Wrap it so that AllReduce operators are inserted for distributed training.
      optimizer = NPUOptimizer(optimizer, is_distributed=True)
  2. Run the distributed training script on a single device.

    The following uses ResNet50_HC on device 0 as an example:

    "for((RANK_ID_n=$RANK_ID_START;RANK_ID_n<$((RANK_SIZE+RANK_ID_START));RANK_ID_n++))"

    Change it to:

    for((RANK_ID_n=$RANK_ID_START;RANK_ID_n<$((1+RANK_ID_START));RANK_ID_n++))
  3. Run the training script to generate a custom repository.
    Key log information about tuning during training is as follows:
    # Enable TFAdapter tuning.
    in tune mode, training graph handled by tools
    # Start the tool for tuning.
    Aoe tuning graph.

    The generated custom repository is stored in the path specified by the TUNE_BANK_PATH environment variable. If this environment variable is not configured, the custom repository is stored in the ${HOME}/Ascend/latest/data/aoe/custom/graph/${soc_version} directory by default. For details about how to use the tuned custom repository, see Usage of Tuned Custom Repositories.

  4. After the training is complete, delete the AOE_MODE environment variable (command: unset AOE_MODE) to disable the tuning mode. To use the custom repository, ensure that the TUNE_BANK_PATH environment variable is valid.
  5. Restore the training script running on a single device to a distributed training script and perform distributed training. The network implements AllReduce fusion based on the gradient splitting policy in the repository.
  6. During training, the following key log information indicates that the gradient splitting points from the repository are applied:
    Use fusion library value

Restrictions

If gradient splitting points have been manually configured in the training script, the configured splitting policy is preferentially used. The configuration method is as follows:
from hccl.split.api import set_split_strategy_by_idx
# Split the AllReduce fusion at gradient indices 118, 159, and 160.
set_split_strategy_by_idx([118, 159, 160])
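
For context, the following sketch shows where such a call might sit in a training script. Placing it before the distributed optimizer is constructed is an assumption, and the Adam optimizer is only a stand-in for the script's own optimizer:

import tensorflow as tf
from hccl.split.api import set_split_strategy_by_idx
from npu_bridge.estimator.npu.npu_optimizer import NPUOptimizer

# Manually configured splitting policy: split the AllReduce fusion at
# gradient indices 118, 159, and 160 (the example values above).
set_split_strategy_by_idx([118, 159, 160])

# Adam is only a stand-in; use the script's own optimizer (for example, LAMB).
optimizer = tf.train.AdamOptimizer(learning_rate=1e-3)
optimizer = NPUOptimizer(optimizer, is_distributed=True)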