Performing Training
After configuring resources for distributed training using environment variables, start the training process as described in this section.
Prerequisites
- You are familiar with Precautions.
- You have prepared a TensorFlow training script and a matched dataset.
- You have set the environment variables for resource information on each training device. For details, see Configuring Resource Information.
- When performing training on multiple devices, ensure that the same model is executed on every device. Otherwise, the training job fails. For details, see How Do I Fix Application Errors Caused by Model Execution on Multiple Devices?.
Single-Server Multi-Device Scenario
When performing training on multiple devices, ensure the training process is initiated on each participating device.
Assume that there is only one AI Server node and eight devices on the node. You can construct a startup script to cyclically start the training process on each device.
- Create a startup script named tf_start_8p.sh as follows.
# Set one of the following environment variables for the installation path of the infrastructure software on which training depends. The following assumes that the installation user is HwHiAiUser.
# Method 1: Ascend-CANN-Toolkit (the training development environment) is installed on the Ascend AI device.
. /home/HwHiAiUser/Ascend/ascend-toolkit/set_env.sh
# Method 2: Ascend-CANN-NNAE is installed on the Ascend AI device.
. /home/HwHiAiUser/Ascend/nnae/set_env.sh

# TF Adapter Python library. ${TFPLUGIN_INSTALL_PATH} indicates the installation path of the TF Adapter package.
export PYTHONPATH=${TFPLUGIN_INSTALL_PATH}:$PYTHONPATH

# User-defined job ID, which can contain uppercase letters, lowercase letters, digits, hyphens (-), and underscores (_).
export JOB_ID=10087

# Start one training process on each of the eight devices.
for ((CURRENT_DEVICE = 0; CURRENT_DEVICE < 8; CURRENT_DEVICE++)); do
    export ASCEND_DEVICE_ID=${CURRENT_DEVICE}
    # Execute the training script. Replace the training script path, name, and other input parameters as required.
    nohup python3 /home/test/main.py > /home/test/train_${ASCEND_DEVICE_ID}.log 2>&1 &
done
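The loop above maps the loop counter directly to ASCEND_DEVICE_ID and a per-device log file. A minimal sketch of that mapping, with the training-process launch replaced by an echo (the /home/test paths follow the example script and are placeholders):

```shell
# Print the ASCEND_DEVICE_ID and log file each of the eight training
# processes would use, mirroring the loop in tf_start_8p.sh.
for ((CURRENT_DEVICE = 0; CURRENT_DEVICE < 8; CURRENT_DEVICE++)); do
    ASCEND_DEVICE_ID=${CURRENT_DEVICE}
    echo "device ${ASCEND_DEVICE_ID} -> /home/test/train_${ASCEND_DEVICE_ID}.log"
done
```

Each process therefore writes to its own log, which is what lets you inspect a single device's failure without untangling interleaved output.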
(Optional) Before starting the training process, configure environment variables for the following auxiliary functions.
- Enable computational graph dump by setting the corresponding environment variables before starting the training script to facilitate fault locating.
export DUMP_GE_GRAPH=2  # 1: dump all graph data; 2: dump graphs without data such as weights; 3: dump only the network structure.
export DUMP_GRAPH_PATH=/home/dumpgraph  # Path for storing the dump graph files.
After the training job is started, several dump graph files, including .pbtxt and .txt files, are generated in the path specified by ${DUMP_GRAPH_PATH}/${pid}_${deviceid}. Because the dump files are numerous and large, skip this step if fault locating is not needed.
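A hypothetical illustration of locating the dumped graphs afterwards. The ${pid}_${deviceid} directory layout follows the description above, but the pid (12345), device id (0), and file name used here are made up for the sketch; a real run produces its own names:

```shell
# Simulate the dump directory a training run would leave behind, then
# locate all .pbtxt graph files under it.
export DUMP_GRAPH_PATH=$(mktemp -d)          # stand-in for /home/dumpgraph
mkdir -p "${DUMP_GRAPH_PATH}/12345_0"        # hypothetical ${pid}_${deviceid} directory
touch "${DUMP_GRAPH_PATH}/12345_0/graph_build.pbtxt"   # hypothetical dump file name
find "${DUMP_GRAPH_PATH}" -name "*.pbtxt"
```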
- If you want the files generated during program compilation and execution to be written to a unified directory, use the environment variables ASCEND_CACHE_PATH and ASCEND_WORK_PATH to set the storage paths for shared files and process-exclusive files, respectively.
export ASCEND_CACHE_PATH=/repo/task001/cache
export ASCEND_WORK_PATH=/repo/task001/172.16.1.12_01_03
For details about the restrictions on the usage of the environment variables ASCEND_CACHE_PATH and ASCEND_WORK_PATH and the description of the flushed files, see Installation and Configuration > Flush File Configuration in Environment Variables.
- Before setting the environment variables, run the env command to check whether ASCEND_CACHE_PATH and ASCEND_WORK_PATH exist. It is recommended that all functions use the same planned path.
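A sketch of that pre-check followed by the assignment. The grep pattern only inspects the current environment; /repo/task001 is the example base path from the snippet above, not a required location:

```shell
# Check whether the flush-path variables already exist before setting them.
env | grep -E '^ASCEND_(CACHE|WORK)_PATH=' || echo "flush-path variables not set yet"
# Point both functions at the same planned base directory.
export ASCEND_CACHE_PATH=/repo/task001/cache
export ASCEND_WORK_PATH=/repo/task001/172.16.1.12_01_03
```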
- Run the startup script to start the training process.
bash tf_start_8p.sh
Multi-Server Multi-Device Scenario
When performing training on multiple devices, ensure the training process is initiated on each participating device.
Assume that there are two AI Server nodes involved in distributed training and each AI Server node has eight devices. You can perform the following steps to construct a startup script to cyclically start the training process on each device.
- Create a startup script named tf_start_16p.sh as follows.
# Set one of the following environment variables for the installation path of the infrastructure software on which training depends. The following assumes that the installation user is HwHiAiUser.
# Method 1: Ascend-CANN-Toolkit (the training development environment) is installed on the Ascend AI device.
. /home/HwHiAiUser/Ascend/ascend-toolkit/set_env.sh
# Method 2: Ascend-CANN-NNAE is installed on the Ascend AI device.
. /home/HwHiAiUser/Ascend/nnae/set_env.sh

# TF Adapter Python library. ${TFPLUGIN_INSTALL_PATH} indicates the installation path of the TF Adapter package.
export PYTHONPATH=${TFPLUGIN_INSTALL_PATH}:$PYTHONPATH

# Obtain the input parameters --server_index and --devices_num.
for para in "$@"; do
    if [[ $para == --server_index* ]]; then
        server_index=${para#*=}
    elif [[ $para == --devices_num* ]]; then
        devices_num=${para#*=}
    fi
done
rank_size=${devices_num}

# User-defined job ID, which can contain uppercase letters, lowercase letters, digits, hyphens (-), and underscores (_).
export JOB_ID=10087

# Start one training process on each local device of this node.
for ((CURRENT_DEVICE = rank_size * server_index; CURRENT_DEVICE < (server_index + 1) * rank_size; CURRENT_DEVICE++)); do
    export ASCEND_DEVICE_ID=$((CURRENT_DEVICE - rank_size * server_index))
    # Execute the training script. Replace the training script path, name, and other input parameters as required.
    nohup python3 /home/test/main.py > /home/test/train_${ASCEND_DEVICE_ID}.log 2>&1 &
done
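The index arithmetic in the loop can be checked in isolation. Assuming node 1 of a two-node, eight-device-per-node setup (server_index=1, devices_num=8), the global indices the loop walks through are 8 through 15, while the exported ASCEND_DEVICE_ID values on the node start again at 0:

```shell
# Reproduce the loop-bound arithmetic from tf_start_16p.sh for node 1.
server_index=1
devices_num=8
rank_size=${devices_num}
first=$((rank_size * server_index))                 # first global index on this node
last=$(((server_index + 1) * rank_size - 1))        # last global index on this node
local_first=$((first - rank_size * server_index))   # local ASCEND_DEVICE_ID of the first process
echo "node ${server_index}: global indices ${first}-${last}, local device ids start at ${local_first}"
```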
(Optional) Before starting the training process, configure environment variables for the following auxiliary functions.
- Enable computational graph dump by setting the corresponding environment variables before starting the training script to facilitate fault locating.
export DUMP_GE_GRAPH=2  # 1: dump all graph data; 2: dump graphs without data such as weights; 3: dump only the network structure.
export DUMP_GRAPH_PATH=/home/dumpgraph  # Path for storing the dump graph files.
After the training job is started, several dump graph files, including .pbtxt and .txt files, are generated in the path specified by ${DUMP_GRAPH_PATH}/${pid}_${deviceid}. Because the dump files are numerous and large, skip this step if fault locating is not needed.
- If you want the files generated during program compilation and execution to be written to a unified directory, use the environment variables ASCEND_CACHE_PATH and ASCEND_WORK_PATH to set the storage paths for shared files and process-exclusive files, respectively.
export ASCEND_CACHE_PATH=/repo/task001/cache
export ASCEND_WORK_PATH=/repo/task001/172.16.1.12_01_03
For details about the restrictions on the usage of the environment variables ASCEND_CACHE_PATH and ASCEND_WORK_PATH and the description of the flushed files, see Installation and Configuration > Flush File Configuration in Environment Variables.
- Before setting the environment variables, run the env command to check whether ASCEND_CACHE_PATH and ASCEND_WORK_PATH exist. It is recommended that all functions use the same planned path.
- Run the startup script to start the training process.
# Start the training process on node 0.
bash tf_start_16p.sh --server_index=0 --devices_num=8
# Start the training process on node 1.
bash tf_start_16p.sh --server_index=1 --devices_num=8
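The startup script extracts the values from these flags with the parameter-expansion form ${para#*=}, which strips everything up to and including the first '='. A minimal sketch of that parsing, with the logic wrapped in a hypothetical parse_args helper for testing:

```shell
# Mirror the argument parsing in tf_start_16p.sh: accept flags of the
# form --server_index=N and --devices_num=N.
parse_args() {
    for para in "$@"; do
        if [[ $para == --server_index* ]]; then
            server_index=${para#*=}    # e.g. "--server_index=1" -> "1"
        elif [[ $para == --devices_num* ]]; then
            devices_num=${para#*=}     # e.g. "--devices_num=8"  -> "8"
        fi
    done
}
parse_args --server_index=1 --devices_num=8
echo "server_index=${server_index} devices_num=${devices_num}"
```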