Training with a Single Device

This section details how to run a ported TensorFlow training script on a single device.

If you do not want to port a model yourself, you can obtain a training script that has already been ported and adapted from https://gitee.com/ascend/modelzoo to experience the training process.

Each device corresponds to one training process. Running multiple training processes on a single device is not supported.

Prerequisites

  • You have set up a basic software and hardware environment powered by one Ascend AI Processor or prepared an Ascend basic image that contains TensorFlow-related modules as described in Environment Setup.
  • You have prepared a TensorFlow training script and a matched dataset.
  • If HCCL APIs are used in the training script, you need to configure the device resources before training, using either a configuration file (ranktable file) or environment variables. For this single-device training, you only need to configure the current device resource and start the training process. This section does not describe the procedure; for details, see Distributed Training with Multiple Devices.

    If you use the ranktable file, set the distributed environment variables RANK_ID to 0 and RANK_SIZE to 1.
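In shell form, the single-device settings are simply:

```shell
# Ranktable-based single-device job: this process is device index 0 of a 1-device job.
export RANK_ID=0    # index of the current device within the job
export RANK_SIZE=1  # total number of devices participating in training
```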

Procedure

  1. Configure the environment variables required for starting the training process.
    # Set one of the following environment variables for the installation path of the infrastructure software on which training depends. The following assumes that the installation user is HwHiAiUser:
    # Method 1: Install Ascend-CANN-Toolkit for training on an Ascend AI device, which serves as the development environment.
    . /home/HwHiAiUser/Ascend/ascend-toolkit/set_env.sh 
    # Method 2: Install Ascend-CANN-NNAE on an Ascend AI device.
    . /home/HwHiAiUser/Ascend/nnae/set_env.sh 
    
    # TF Adapter Python library. ${TFPLUGIN_INSTALL_PATH} indicates the installation path of the TF Adapter package.
    export PYTHONPATH=${TFPLUGIN_INSTALL_PATH}:$PYTHONPATH
    
    # If multiple Python 3 versions exist in the operating environment, specify your Python installation path in the environment variable. The following takes Python 3.7.5 installation as an example.
    export PATH=/usr/local/python3.7.5/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/python3.7.5/lib:$LD_LIBRARY_PATH
    
    # Add the path of the current script to PYTHONPATH. For example:
    export PYTHONPATH="$PYTHONPATH:/root/models"
    
    export JOB_ID=10087        # User-defined training job ID. Only letters, digits, hyphens (-), and underscores (_) are supported. You are advised not to use a number starting with 0.
    export ASCEND_DEVICE_ID=0  # Logical ID of the Ascend AI Processor. Optional for single-device training; defaults to 0, meaning training runs on device 0.
    

    If you need to upgrade GCC in OSs such as CentOS, Debian, and BC-Linux, add ${install_path}/lib64 to the dynamic library search path variable LD_LIBRARY_PATH, where ${install_path} is the GCC installation path. For details, see 5.
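Before moving on, a quick sanity check can confirm that the job variables are exported. This is a sketch re-using the example values from this step:

```shell
# Example values from the step above.
export JOB_ID=10087
export ASCEND_DEVICE_ID=0

# Verify that each variable is visible to child processes (i.e., actually exported).
for var in JOB_ID ASCEND_DEVICE_ID; do
  if [ -n "$(printenv "$var")" ]; then
    echo "ok: $var=$(printenv "$var")"
  else
    echo "missing: $var"
  fi
done
```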

  2. (Optional) Configure environment variables for auxiliary functions.
    • Enable computational graph dump by setting the corresponding environment variables before starting the training script, to facilitate fault locating.
      export DUMP_GE_GRAPH=2                   # 1: dumps all; 2: dumps without data such as weights; 3: dumps only the network structure.
      export DUMP_GRAPH_PATH=/home/dumpgraph   # Path for storing the dump graph files.
      

      After the training job starts, dump graph files, including .pbtxt and .txt files, are generated in ${DUMP_GRAPH_PATH}/${pid}_${deviceid}. Because these files are numerous and large, skip the dump when there is no fault locating need.

    • If you want the files generated during program compilation and execution to be flushed to the normalized directory, you can use the environment variables ASCEND_CACHE_PATH and ASCEND_WORK_PATH to set the paths for storing shared files and process-exclusive files, respectively.
      export ASCEND_CACHE_PATH=/repo/task001/cache
      export ASCEND_WORK_PATH=/repo/task001/172.16.1.12_01_03
      

      For details about the restrictions on the usage of the environment variables ASCEND_CACHE_PATH and ASCEND_WORK_PATH and the description of the flushed files, see Installation and Configuration > Flush File Configuration in Environment Variables.

      • Before setting the environment variables, run the env command to check whether ASCEND_CACHE_PATH and ASCEND_WORK_PATH exist. It is recommended that all functions use the same planned path.
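If graph dump is enabled as in the first bullet, the dump directory can grow quickly. A small helper like the following (a sketch, not part of the toolkit) tallies the dumped files by extension so you can gauge the volume:

```python
import os
from collections import Counter

def summarize_dump_dir(path):
    """Count dumped graph files by extension (.pbtxt, .txt, ...) under `path`."""
    counts = Counter()
    for _root, _dirs, files in os.walk(path):
        for name in files:
            counts[os.path.splitext(name)[1]] += 1
    return dict(counts)

# Point this at ${DUMP_GRAPH_PATH}/${pid}_${deviceid}; a nonexistent path yields {}.
print(summarize_dump_dir(os.environ.get("DUMP_GRAPH_PATH", "/home/dumpgraph")))
```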
  3. Run your training script to start the training process.

    python3 /home/xxx.py
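A thin launch wrapper (the script path is a placeholder, as above) keeps a local copy of the console output, which is handy for the result and log checks that follow:

```shell
# Launch the training script and tee stdout/stderr into a per-job log file.
SCRIPT="${1:-/home/xxx.py}"              # placeholder path; pass your script instead
python3 "$SCRIPT" 2>&1 | tee "train_${JOB_ID:-0}.log"
```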

Training Result Check

  1. Check that the training process runs normally and the loss converges.

  2. After training, find the following directories and files:
    • model directory: stores checkpoint files and model files. Whether this directory is generated depends on the script implementation. If saver = tf.train.Saver() and saver.save() are used in the training script to save the model, the directory typically contains a checkpoint index file together with model.ckpt-<step>.data-*, model.ckpt-<step>.index, and model.ckpt-<step>.meta files.

    • kernel_meta directory: stores the .o and .json files of operators when op_debug_level is set to 3 in the training script. These files can be used for subsequent fault locating.
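The convergence check in step 1 can be scripted. The sketch below assumes log lines containing `loss = <value>` (adjust the regex to your script's actual output) and uses a simple heuristic for "converged":

```python
import re

LOSS_RE = re.compile(r"loss\s*=\s*([0-9]*\.?[0-9]+)")

def loss_values(log_lines):
    """Extract numeric loss values from training log lines."""
    values = []
    for line in log_lines:
        m = LOSS_RE.search(line)
        if m:
            values.append(float(m.group(1)))
    return values

def roughly_converged(losses, window=5, tol=0.05):
    """Heuristic: mean of the last `window` losses is within `tol` of the global minimum."""
    if len(losses) < window:
        return False
    tail = losses[-window:]
    return sum(tail) / window <= min(losses) + tol
```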

Troubleshooting

If the script execution fails, analyze and locate the fault based on the following logs:

Path of run logs generated when the app is running on the host: $HOME/ascend/log/run/plog/plog-pid_*.log.

Path of the run logs generated when the app is running on the device: $HOME/ascend/log/run/device-id/device-pid_*.log.

$HOME indicates the root directory of the user on the host.

For more information, see Log Reference.

You can identify the error module and determine the cause by using ERROR-level logs.
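An ERROR line typically names the reporting module right after the level tag, e.g. `[ERROR] GE(2045,python3):...` (this line shape is an assumption for illustration; see Log Reference for the authoritative layout). A quick filter:

```python
import re

# Assumed line shape: "[LEVEL] MODULE(pid,process_name):message".
ERROR_RE = re.compile(r"^\[ERROR\]\s+(\w+)\(")

def error_modules(log_lines):
    """Return the sorted set of module names that emitted ERROR-level lines."""
    modules = set()
    for line in log_lines:
        m = ERROR_RE.match(line)
        if m:
            modules.add(m.group(1))
    return sorted(modules)
```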

Figure 1 Error log example
Table 1 Fault location techniques

Module Name  | Error                                                                     | Solution
------------ | ------------------------------------------------------------------------- | --------
System error | Environment and version mismatch                                          | Check the version mapping and system installation.
GE           | GE graph compilation or verification error                                | Specific error causes are provided for verification errors; modify the network script as prompted.
Runtime      | Initialization or graph execution failure due to an environment exception | If initialization fails, check the environment configuration and whether the environment is occupied by other processes.