Automatic Porting and Training

Overview

The following describes how to use the tool to port a ResNet-50 network.

Model and Dataset Downloads

  1. Download a ResNet-50 model from GitHub.

    git clone -b r2.6.0 https://github.com/tensorflow/models.git

    If the source code is downloaded to the /root/models directory, you can find the downloaded script in the /root/models/official/vision/image_classification/resnet/ directory.

  2. Download a dataset.

    Download the ImageNet 2012 dataset and transform data into TFRecords by using the imagenet_to_gcs.py script. For details, click here.

    Save the processed dataset to the /root/models/data/imagenet_TF/ directory.

Automatic Model Porting

  1. Before porting, manually add dataset sharding logic as required in Restrictions.
    dataset = tf.data.Dataset.from_tensor_slices(filenames)
    import npu_device as npu
    # Shard logic added by the NPU. The dataset and global batch will be sharded based on the number of clusters.
    dataset, batch_size = npu.distribute.shard_and_rebatch_dataset(dataset, batch_size)
    #if input_context:
    #  logging.info(
    #      'Sharding the dataset: input_pipeline_id=%d num_input_pipelines=%d',
    #      input_context.input_pipeline_id, input_context.num_input_pipelines)
    #  dataset = dataset.shard(input_context.num_input_pipelines,
    #                          input_context.input_pipeline_id)
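    Conceptually, the added call splits the input files across the devices in the cluster and divides the global batch size evenly among them. The following is a minimal plain-Python sketch of that behavior only; it is not the actual npu_device API, which operates on tf.data.Dataset objects:

```python
# Conceptual sketch only: illustrates what shard-and-rebatch does.
# The real npu_device.distribute.shard_and_rebatch_dataset works on
# tf.data.Dataset objects; plain Python lists stand in here.
def shard_and_rebatch(filenames, global_batch_size, rank_id, rank_size):
    """Give each rank every rank_size-th file and an even share of the batch."""
    shard = filenames[rank_id::rank_size]
    local_batch_size = global_batch_size // rank_size
    return shard, local_batch_size

# With 2 devices, rank 0 receives files 0 and 2 and half the global batch.
files, batch = shard_and_rebatch(["a", "b", "c", "d"], 256, rank_id=0, rank_size=2)
```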
    
  2. Install the tool dependencies in the operating environment.

    pip3 install pandas

    pip3 install openpyxl

    pip3 install google_pasta

  3. Perform automatic porting using the porting tool.
    1. Run the following command to navigate to the tool directory:

      cd <TFPlugin installation directory>/tfplugin/latest/python/site-packages/npu_device/convert_tf2npu/

    2. Execute script porting.
      • To perform training with a single device, run the following command:

        python3 main.py -i /root/models/official/vision/image_classification/resnet/ -o /root/models/resnet50/ -r /root/models/resnet50/ -m /root/models/official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py

      • To perform distributed training, run the following command:

        python3 main.py -i /root/models/official/vision/image_classification/resnet/ -o /root/models/resnet50/ -r /root/models/resnet50/ -m /root/models/official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py -d tf_strategy

        In the preceding command, -d specifies the distributed policy used by the original script, and tf_strategy indicates the tf.distribute.Strategy policy.

  4. Check the porting report in the /root/models/resnet50/report_npu_*** directory.
  5. Check the resultant script in the /root/models/resnet50/output_npu_*** directory.

    Rename the original script folder. For example, change /root/models/official/vision/image_classification/resnet to resnet_org.

    The library imports depend on the original directory structure. Therefore, rename the /root/models/resnet50/resnet_npu_*** directory to resnet and copy it back to the original directory. The following is the code snippet:

    cp -r /root/models/resnet50/resnet_npu_*** /root/models/official/vision/image_classification/resnet
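    The rename-and-copy steps above can be sketched as a small shell helper. The function name and argument order are illustrative, not part of the toolkit:

```shell
#!/bin/bash
# Illustrative helper: back up the original package and put the ported
# scripts in its place, so the library imports keep resolving.
swap_in_ported() {
    local pkg_dir="$1"     # original script directory (e.g., .../resnet)
    local ported_dir="$2"  # ported output (e.g., /root/models/resnet50/resnet_npu_***)
    mv "$pkg_dir" "${pkg_dir}_org"   # keep the original as <name>_org
    cp -r "$ported_dir" "$pkg_dir"   # ported scripts take the original path
}

# Example with the paths used in this guide:
# swap_in_ported /root/models/official/vision/image_classification/resnet \
#                /root/models/resnet50/resnet_npu_***
```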

Training with a Single Device

  1. The original script supports distributed training, and the ported script uses the HCCL APIs. Therefore, before training with a single device, prepare the resource information configuration file for that device. Skip this step if this does not apply to your script. (In this example, the resource information is set using the configuration file. You can also use the method described in Training Execution (Setting Environment Variables).)
    Ensure that the single-device resource configuration file contains only one device resource. Assume that the file is named rank_table_1p.json. The following provides a file template.
    {
        "server_count": "1",
        "server_list": [
            {
                "device": [
                    {
                        "device_id": "0",
                        "device_ip": "192.168.1.8",
                        "rank_id": "0"
                    }
                ],
                "server_id": "10.0.0.10"
            }
        ],
        "status": "completed",
        "version": "1.0"
    }
    

    For details about the configuration file, see Preparing the Ranktable Resource Configuration File.
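    Before training, the ranktable file can be sanity-checked with a short Python snippet. This check is illustrative and not part of the CANN tooling; it only confirms the fields this guide relies on:

```python
import json

# Illustrative sanity check for a ranktable file: parse it and confirm
# the status, device count, and rank_id numbering are consistent.
def check_rank_table(text, expected_ranks):
    table = json.loads(text)
    devices = [d for server in table["server_list"] for d in server["device"]]
    assert table["status"] == "completed"
    assert len(devices) == expected_ranks
    assert sorted(d["rank_id"] for d in devices) == [str(i) for i in range(expected_ranks)]
    return devices

sample = """
{
  "server_count": "1",
  "server_list": [
    {"device": [{"device_id": "0", "device_ip": "192.168.1.8", "rank_id": "0"}],
     "server_id": "10.0.0.10"}
  ],
  "status": "completed",
  "version": "1.0"
}
"""
devices = check_rank_table(sample, expected_ranks=1)
```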

  2. Configure the environment variables required for starting the training process.

    After the CANN software is installed, log in as the CANN running user and run the source ${install_path}/set_env.sh command to set the environment variables, where ${install_path} indicates the CANN installation path, for example, /usr/local/Ascend/ascend-toolkit. Then, perform the following configurations:

    # Set the environment variable for the installation path of the infrastructure software on which training depends. The following assumes that the installation user is HwHiAiUser.
    # Method 1: Install Ascend-CANN-Toolkit for training on an Ascend AI device, which serves as the development environment.
    . /home/HwHiAiUser/Ascend/ascend-toolkit/set_env.sh 
    # Method 2: Install Ascend-CANN-NNAE on an Ascend AI device.
    . /home/HwHiAiUser/Ascend/nnae/set_env.sh 
    
    # TFPlugin dependency
    . /home/HwHiAiUser/Ascend/tfplugin/set_env.sh
    
    # If multiple Python 3 versions exist in the operating environment, specify your Python installation path in the environment variable. The following takes Python 3.7.5 installation as an example.
    export PATH=/usr/local/python3.7.5/bin:$PATH
    export LD_LIBRARY_PATH=/usr/local/python3.7.5/lib:$LD_LIBRARY_PATH
    
    # Script directory, for example:
    export PYTHONPATH="$PYTHONPATH:/root/models"
    export JOB_ID=10086        # User-defined training job ID. Only letters, digits, hyphens (-), and underscores (_) are supported. You are advised not to use a number starting with 0.
    export ASCEND_DEVICE_ID=0  # Logical ID of the Ascend AI Processor, optional in single-device training and defaulted to 0, indicating that training is performed on device 0.
    export RANK_ID=0           # Rank ID of a training process in the collective communication process group. Fixed at 0 in single-device training.
    export RANK_SIZE=1         # Rank size of a device corresponding to the current training process in the cluster. Fixed at 1 in single-device training.
    export RANK_TABLE_FILE=/root/rank_table_1p.json # This parameter needs to be configured only when the hvd API or the shard API of the tf.data.Dataset object is used in the original training script. Note that the device_id parameter in rank_table takes precedence over the environment variable ASCEND_DEVICE_ID.
    
  3. Run your training script to start the training process.

    python3 /root/models/official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py

  4. Check whether the training is successful.

Distributed Training with Multiple Devices

The following uses 2-device training as an example to describe how to use the ported script to perform distributed training on the Ascend AI Processor.

  1. Prepare a 2-device resource configuration file. Assume that the file is named rank_table_2p.json. The following provides a file template.
    {
        "server_count": "1",
        "server_list": [
            {
                "device": [
                    {
                        "device_id": "0",
                        "device_ip": "192.168.1.8",
                        "rank_id": "0"
                    },
                    {
                        "device_id": "1",
                        "device_ip": "192.168.1.9",
                        "rank_id": "1"
                    }
                ],
                "server_id": "10.0.0.10"
            }
        ],
        "status": "completed",
        "version": "1.0"
    }

    The two devices must be in the same network segment. In this template, the device_ip values 192.168.1.8 and 192.168.1.9 of devices 0 and 1 are in the same network segment.
    

    For details about the configuration file, see Preparing the Ranktable Resource Configuration File.
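    The same-network-segment requirement can also be checked programmatically. The following sketch is illustrative and not part of the toolkit; it assumes a /24 mask, which you should adjust via prefix_len if your network uses a different one:

```python
import ipaddress
import json

# Illustrative check (not part of the toolkit): confirm that all device_ip
# addresses in the rank table fall in the same network segment.
def same_segment(rank_table_text, prefix_len=24):
    table = json.loads(rank_table_text)
    networks = {
        ipaddress.ip_network(f"{d['device_ip']}/{prefix_len}", strict=False)
        for server in table["server_list"]
        for d in server["device"]
    }
    return len(networks) == 1

sample = json.dumps({
    "server_list": [{"device": [
        {"device_ip": "192.168.1.8"},
        {"device_ip": "192.168.1.9"}
    ]}]
})
```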

  2. Start the training processes in different shells.

    Start training process 0.

    Set the CANN environment variables as described in Training with a Single Device (run the source ${install_path}/set_env.sh command as the CANN running user). Then, perform the following configurations:

    export PYTHONPATH="$PYTHONPATH:/root/models"
    export RANK_ID=0 
    export RANK_SIZE=2 
    export RANK_TABLE_FILE=/home/test/rank_table_2p.json 
    python3 /root/models/official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py
    

    Start training process 1.

    Set the CANN environment variables as described in Training with a Single Device (run the source ${install_path}/set_env.sh command as the CANN running user). Then, perform the following configurations:

    export PYTHONPATH="$PYTHONPATH:/root/models"
    export ASCEND_DEVICE_ID=1 
    export RANK_ID=1 
    export RANK_SIZE=2 
    export RANK_TABLE_FILE=/home/test/rank_table_2p.json 
    python3 /root/models/official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py
    

    Alternatively, you can customize a startup script that starts multiple training processes in a loop. For details, click here.
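    Such a loop-based launcher might look like the following sketch. The function name and log-file naming are illustrative; the rank table path is the one used in this example:

```shell
#!/bin/bash
# Illustrative launcher: start one training process per rank, each with its
# own RANK_ID/ASCEND_DEVICE_ID, logging to a separate file.
launch_ranks() {
    local rank_size="$1"; shift
    export RANK_SIZE="$rank_size"
    for i in $(seq 0 $((rank_size - 1))); do
        RANK_ID="$i" ASCEND_DEVICE_ID="$i" "$@" > "train_${i}.log" 2>&1 &
    done
    wait   # block until every training process exits
}

# Example with the paths used in this guide:
# export PYTHONPATH="$PYTHONPATH:/root/models"
# export RANK_TABLE_FILE=/home/test/rank_table_2p.json
# launch_ranks 2 python3 /root/models/official/vision/image_classification/resnet/resnet_ctl_imagenet_main.py
```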