Training Execution (Configuring Resources via the Rank Table)

You can configure the NPU resources for collective communication in the rank table file, and specify the NPU resources to use when starting the training process.

Prerequisites

Preparing the Rank Table Resource Configuration File

The rank table is a JSON file that records information about all NPUs involved in collective communication. Prepare the rank table resource configuration file as described in "Reference" > "Cluster Information Configuration" of the HCCL User Guide.
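As a quick sanity check before launching, the number of rank entries in the rank table can be compared with the intended RANK_SIZE. The sketch below is illustrative only: the count_ranks helper and the sample file content are assumptions made for this example, and the authoritative field layout is the one given in the HCCL User Guide.

```shell
#!/bin/bash
# Hypothetical helper: count the "rank_id" keys in a rank table file.
# Assumes each participating NPU has exactly one "rank_id" entry.
count_ranks() {
    grep -o '"rank_id"' "$1" | wc -l | tr -d ' '
}

# Illustrative two-device rank table. All values are placeholders;
# check the field names against the HCCL User Guide before relying on them.
sample=$(mktemp)
cat > "$sample" <<'EOF'
{
    "server_count": "1",
    "server_list": [
        {
            "server_id": "10.0.0.10",
            "device": [
                {"device_id": "0", "device_ip": "192.168.1.8", "rank_id": "0"},
                {"device_id": "1", "device_ip": "192.168.1.9", "rank_id": "1"}
            ]
        }
    ],
    "status": "completed",
    "version": "1.0"
}
EOF

RANK_SIZE=2
ranks_found=$(count_ranks "$sample")
if [ "$ranks_found" -eq "$RANK_SIZE" ]; then
    echo "rank table matches RANK_SIZE=$RANK_SIZE"
else
    echo "rank table has $ranks_found ranks, expected $RANK_SIZE" >&2
fi
rm -f "$sample"
```

In practice, point count_ranks at the file referenced by RANK_TABLE_FILE instead of the temporary sample.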

Single-Server Multi-Device Scenario

When performing training on multiple devices, ensure the training process is initiated on each participating device.

Assume that there is only one AI Server node with eight devices. Perform the following steps to create a startup script that starts the training process on each device in a loop.

  1. Construct a startup script named tf_start_8p.sh as follows.
    # Configure environment variables of the CANN software. The default installation path of the root user is used as an example.
    source /usr/local/Ascend/cann/set_env.sh
    
    # TF Adapter Python library. ${TFPLUGIN_INSTALL_PATH} indicates the installation path of the TF Adapter package.
    export PYTHONPATH=${TFPLUGIN_INSTALL_PATH}:$PYTHONPATH
    
    export RANK_SIZE=8
    export RANK_TABLE_FILE=/home/test/rank_table_8p.json    # Path of the rank table resource configuration file. Replace it with the actual path.
    export JOB_ID=10087      # User-defined task ID, which can contain uppercase letters, lowercase letters, digits, hyphens (-), and underscores (_).
    
    for ((RANK_ID=0; RANK_ID<RANK_SIZE; RANK_ID++));
    do
        export RANK_ID=$RANK_ID
        export ASCEND_DEVICE_ID=$RANK_ID
        # Execute the training script. Replace the training script path, name, and other input parameters as required.
        nohup python3 /home/test/main.py > /home/test/train_$ASCEND_DEVICE_ID.log 2>&1 &
    done
    
    (Optional) Before starting the training process, you can set environment variables for the following auxiliary functions.
    • To facilitate fault locating, enable computational graph dump by setting the following environment variables before starting the training script.
      export DUMP_GE_GRAPH=2                  # 1: dumps all; 2: dumps without data such as weights; 3: dumps only the network structure.
      export DUMP_GRAPH_PATH=/home/dumpgraph  # Specify the path for storing dump graph files by using this environment variable.
      

      After the training job starts, dump graph files, including .pbtxt and .txt files, are generated in ${DUMP_GRAPH_PATH}/pid_${pid}_deviceid_${deviceid}. Because the dump files are numerous and large, skip the dump unless you need it for fault locating.

    • If you want the files generated during program compilation and execution to be flushed to a unified storage directory, you can use the environment variables ASCEND_CACHE_PATH and ASCEND_WORK_PATH to set the paths for storing shared files and process-exclusive files, respectively.
      export ASCEND_CACHE_PATH=/repo/task001/cache
      export ASCEND_WORK_PATH=/repo/task001/172.16.1.12_01_03
      

      For details about the restrictions on the usage of the environment variables ASCEND_CACHE_PATH and ASCEND_WORK_PATH and the description of the flushed files, see "Installation" in Environment Variables.

      • Before setting the environment variables, run the env command to check whether ASCEND_CACHE_PATH and ASCEND_WORK_PATH are already set. It is recommended that all functions use the same planned path.
  2. Run your script to start the training process.
    bash tf_start_8p.sh
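
The optional check mentioned in step 1 (whether ASCEND_CACHE_PATH and ASCEND_WORK_PATH are already set) can be sketched as follows. This is a minimal example, not part of the product tooling; check_var is a hypothetical helper name introduced only for illustration.

```shell
#!/bin/bash
# Report whether a variable is already present in the exported environment.
# check_var is a hypothetical helper used only in this sketch.
check_var() {
    local name
    name=$1
    if env | grep -q "^${name}="; then
        echo "${name} is already set to: $(printenv "$name")"
    else
        echo "${name} is not set"
    fi
}

check_var ASCEND_CACHE_PATH
check_var ASCEND_WORK_PATH
```

If either variable is already set, decide whether to reuse the existing path or override it before exporting a new value in the startup script.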
    

Multi-Server Multi-Device Scenario

When performing training on multiple devices, ensure the training process is initiated on each participating device.

Assume that two AI Server nodes are involved in distributed training and each node has eight devices. Perform the following steps to create a startup script that starts the training process on each device in a loop.

  1. Construct a startup script named tf_start_16p.sh as follows.
    # Configure environment variables of the CANN software. The default installation path of the root user is used as an example.
    source /usr/local/Ascend/cann/set_env.sh
    
    # TF Adapter Python library. ${TFPLUGIN_INSTALL_PATH} indicates the installation path of the TF Adapter package.
    export PYTHONPATH=${TFPLUGIN_INSTALL_PATH}:$PYTHONPATH
    
    # Obtain input parameters.
    for para in "$@"
    do
        if [[ $para == --server_index* ]]; then
            server_index=${para#*=}
        elif [[ $para == --devices_num* ]]; then
            devices_num=${para#*=}
        elif [[ $para == --servers_num* ]]; then
            servers_num=${para#*=}
        fi
    done
    
    rank_size=${devices_num}    # Number of devices on each server.
    export RANK_SIZE=$((devices_num * servers_num))    # Total number of devices across all servers.
    export RANK_TABLE_FILE=/home/test/rank_table.json   # Path of the rank table resource configuration file. Replace it with the actual path.
    export JOB_ID=10087  # User-defined task ID, which can contain uppercase letters, lowercase letters, digits, hyphens (-), and underscores (_).
    
    for ((RANK_ID=rank_size*server_index; RANK_ID<(server_index+1)*rank_size; RANK_ID++));
    do
        # Set environment variables.
        export RANK_ID=$RANK_ID
        export ASCEND_DEVICE_ID=$((RANK_ID - rank_size*server_index))
        # Execute the training script. Replace the training script path, name, and other input parameters as required.
        nohup python3 /home/test/main.py > /home/test/train_$ASCEND_DEVICE_ID.log 2>&1 &
    done
    
    (Optional) Before starting the training process, you can set environment variables for the following auxiliary functions.
    • To facilitate fault locating, enable computational graph dump by setting the following environment variables before starting the training script.
      export DUMP_GE_GRAPH=2                  # 1: dumps all; 2: dumps without data such as weights; 3: dumps only the network structure.
      export DUMP_GRAPH_PATH=/home/dumpgraph  # Specify the path for storing dump graph files by using this environment variable.
      

      After the training job starts, dump graph files, including .pbtxt and .txt files, are generated in ${DUMP_GRAPH_PATH}/pid_${pid}_deviceid_${deviceid}. Because the dump files are numerous and large, skip the dump unless you need it for fault locating.

    • If you want the files generated during program compilation and execution to be flushed to a unified storage directory, you can use the environment variables ASCEND_CACHE_PATH and ASCEND_WORK_PATH to set the paths for storing shared files and process-exclusive files, respectively.
      export ASCEND_CACHE_PATH=/repo/task001/cache
      export ASCEND_WORK_PATH=/repo/task001/172.16.1.12_01_03
      

      For details about the restrictions on the usage of the environment variables ASCEND_CACHE_PATH and ASCEND_WORK_PATH and the description of the flushed files, see "Installation" in Environment Variables.

      • Before setting the environment variables, run the env command to check whether ASCEND_CACHE_PATH and ASCEND_WORK_PATH are already set. It is recommended that all functions use the same planned path.
  2. Run your script to start the training process.
    # Start the training process on node 0.
    bash tf_start_16p.sh --server_index=0 --devices_num=8 --servers_num=2
    # Start the training process on node 1.
    bash tf_start_16p.sh --server_index=1 --devices_num=8 --servers_num=2
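
To make the loop bounds in tf_start_16p.sh easier to follow, the sketch below prints the RANK_ID and ASCEND_DEVICE_ID pairs that the script would assign on a given node. The print_rank_map helper is hypothetical and exists only for this illustration. It mirrors the arithmetic of the startup script and assumes, as the script does, that rank IDs are assigned contiguously in node order.

```shell
#!/bin/bash
# print_rank_map is a hypothetical helper for illustration only.
# Arguments: server_index (0-based node index), devices_num (devices per node).
print_rank_map() {
    local server_index
    local devices_num
    local first
    local last
    local rank
    server_index=$1
    devices_num=$2
    first=$((server_index * devices_num))            # First global rank on this node.
    last=$(((server_index + 1) * devices_num - 1))   # Last global rank on this node.
    rank=$first
    while [ "$rank" -le "$last" ]; do
        # The local device ID is the global rank minus the node's first rank.
        echo "RANK_ID=${rank} ASCEND_DEVICE_ID=$((rank - first))"
        rank=$((rank + 1))
    done
}

# Node 1 of a cluster with eight devices per node.
print_rank_map 1 8
```

For node 1 with eight devices, this lists RANK_ID 8 to 15 paired with ASCEND_DEVICE_ID 0 to 7, which is why each node's log files are named train_0.log to train_7.log.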