Model Development and Migration

The PyTorch GPU2Ascend tool quickly migrates GPU-based training scripts to NPU-based scripts, reducing the manual porting workload for developers. This sample lets you try the tool's migration workflow on a real training script. After a successful migration, the GPU-based training script can run on NPUs.

PyTorch GPU2Ascend is one of the analysis and migration tools. For details, see Analysis and Migration Tool.

Prerequisites

This sample uses the ResNet-50 model. Download its training script, main.py.

Environment Setup

  1. Prepare a training server equipped with Ascend 910 AI Processors and install the corresponding driver and firmware.
  2. Install the Ascend-CANN-Toolkit. For details, see Installing the CANN Software Package.
  3. Install PyTorch and Ascend Extension for PyTorch. This sample uses PyTorch 2.1.0 as an example. For details, see Ascend Extension for PyTorch Configuration and Installation.
  4. Configure the environment variable.

    After the CANN software is installed, log in to the environment as the CANN running user and run the source ${install_path}/set_env.sh command to set the environment variables before you build or run your application. ${install_path} indicates the CANN installation path, for example, /usr/local/Ascend/ascend-toolkit.

  5. Download the main.py file and upload it to your personal directory on the training server.
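
Before migrating, it can help to confirm that the Python environment actually sees the packages installed in the steps above. The following is a minimal sketch and not part of the PyTorch GPU2Ascend tool: the helper name check_npu_environment is hypothetical, and the sketch assumes that the extension is importable as torch_npu and exposes torch_npu.npu.is_available(). It probes for the packages first and imports them only if both are present.

```python
import importlib
import importlib.util


def check_npu_environment():
    """Report whether torch and torch_npu are importable and an NPU is visible.

    Hypothetical helper for illustration; not part of PyTorch GPU2Ascend.
    """
    report = {"torch": False, "torch_npu": False, "npu_available": False}

    # find_spec only locates the package; it does not import it.
    report["torch"] = importlib.util.find_spec("torch") is not None
    report["torch_npu"] = importlib.util.find_spec("torch_npu") is not None

    if report["torch"] and report["torch_npu"]:
        try:
            importlib.import_module("torch")
            torch_npu = importlib.import_module("torch_npu")
            # Assumption: torch_npu.npu.is_available() mirrors
            # torch.cuda.is_available().
            report["npu_available"] = bool(torch_npu.npu.is_available())
        except Exception as exc:  # for example, set_env.sh was not sourced
            report["error"] = repr(exc)
    return report


if __name__ == "__main__":
    for key, value in check_npu_environment().items():
        print(f"{key}: {value}")
```

Run the script as the CANN running user after sourcing set_env.sh. If torch_npu is found but an error is reported, the environment variables from step 4 are a likely place to look first.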

Migration

  1. Import the library code for automatic migration into the training script (main.py).
     23
     24 import torch_npu
     25 from torch_npu.contrib import transfer_to_npu
     26
    # Lines 24 and 25 are the inserted library imports for automatic migration. With them in place, the script can be used directly for training in the NPU environment.
  2. After the migration is complete, run the following command to start training on the NPU:
    python main.py -a resnet50 -b 32 --gpu 1 --dummy

    If training proceeds normally and the iteration log starts to print, the training function has been migrated successfully.

    Use GPU: 1 for training
    => creating model 'resnet50'
    => Dummy data is used!
    Epoch: [0][    1/40037] Time  8.287 ( 8.287)    Data  0.504 ( 0.504)    Loss 7.0919e+00 (7.0919e+00)    Acc@1   0.00 (  0.00)   Acc@5   0.00 (  0.00)
    Epoch: [0][   11/40037] Time  0.097 ( 1.268)    Data  0.000 ( 0.479)    Loss 1.5627e+01 (1.8089e+01)    Acc@1   0.00 (  0.00)   Acc@5   3.12 (  0.57)
    Epoch: [0][   21/40037] Time  0.096 ( 0.710)    Data  0.000 ( 0.253)    Loss 7.7462e+00 (1.4883e+01)    Acc@1   0.00 (  0.00)   Acc@5   0.00 (  0.60)
  3. If the weights are saved successfully, the weight-saving function has also been migrated successfully.
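
The checks in steps 2 and 3 can also be scripted. The sketch below is illustrative only: the function names parse_iterations and checkpoint_saved are hypothetical, the regular expression is written against the sample log shown above, and the checkpoint file name checkpoint.pth.tar is an assumption about what main.py writes, so adjust it to whatever your script actually saves.

```python
import os
import re

# Matches iteration lines such as:
# "Epoch: [0][   11/40037] Time  0.097 ( 1.268) ... Loss 1.5627e+01 (1.8089e+01)"
ITER_LINE = re.compile(
    r"Epoch:\s*\[(?P<epoch>\d+)\]\[\s*(?P<step>\d+)/(?P<total>\d+)\]"
    r".*?Loss\s+(?P<loss>[0-9.eE+-]+)"
)


def parse_iterations(log_text):
    """Return (epoch, step, loss) tuples for every iteration line in the log."""
    results = []
    for line in log_text.splitlines():
        match = ITER_LINE.search(line)
        if match:
            results.append(
                (int(match["epoch"]), int(match["step"]), float(match["loss"]))
            )
    return results


def checkpoint_saved(path="checkpoint.pth.tar"):
    """Return True if a non-empty file exists at path.

    The default file name is an assumption, not taken from the tool.
    """
    return os.path.isfile(path) and os.path.getsize(path) > 0


if __name__ == "__main__":
    sample = (
        "Use GPU: 1 for training\n"
        "=> creating model 'resnet50'\n"
        "Epoch: [0][    1/40037] Time  8.287 ( 8.287)    Data  0.504 ( 0.504)"
        "    Loss 7.0919e+00 (7.0919e+00)\n"
    )
    print("training started:", bool(parse_iterations(sample)))
    print("checkpoint saved:", checkpoint_saved())
```

To use it on a real run, redirect the training output to a file, pass the file contents to parse_iterations, and treat a non-empty result as evidence that the iteration log has started.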