Model Development and Migration

In PyTorch training scenarios, the PyTorch GPU2Ascend tool quickly migrates GPU-based training scripts to Ascend NPU-based scripts, reducing the developer's workload. This sample lets developers try out the tool's migration workflow. Once migrated successfully, a GPU-based training script can run in the Ascend NPU environment.

PyTorch GPU2Ascend is one of the analysis and migration tools. For details, see Analysis and Migration Tool User Guide.

Prerequisites

  1. You have performed operations in Environment Setup.
  2. This sample uses the ResNet-50 model. You have created a training script file named pytorch_main.py and either copied the script content from PyTorch Training Script Sample in the GPU Environment into it or downloaded the main.py file.
  3. You have uploaded the pytorch_main.py file to any directory on the training server. (Ensure that you have read and write permissions for the files in the directory.)
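
The prerequisites above can be checked with a short preflight script before starting the migration. The following is a minimal sketch, not part of the PyTorch GPU2Ascend tool: the function name preflight and the result keys are hypothetical, and it assumes the script is named pytorch_main.py as in this sample. It only tests whether the torch and torch_npu packages can be found and whether the script and its directory are readable and writable; it does not verify that an NPU device is actually usable.

```python
import importlib.util
import os
from pathlib import Path


def preflight(script_path: str) -> dict:
    """Return a dict of simple yes/no checks for the migration prerequisites.

    Hypothetical helper for illustration only.
    """
    script = Path(script_path)
    workdir = script.resolve().parent
    return {
        # Prerequisite 2: the training script file exists.
        "script_exists": script.is_file(),
        # Prerequisite 3: read and write permissions on the script and its directory.
        "script_read_write": os.access(script, os.R_OK | os.W_OK),
        "dir_read_write": os.access(workdir, os.R_OK | os.W_OK),
        # Prerequisite 1: the packages can be located (they are not imported here).
        "torch_found": importlib.util.find_spec("torch") is not None,
        "torch_npu_found": importlib.util.find_spec("torch_npu") is not None,
    }


if __name__ == "__main__":
    for name, ok in preflight("pytorch_main.py").items():
        print(f"{name}: {'ok' if ok else 'NOT OK'}")
```

Run the script from the directory that contains pytorch_main.py. Any item reported as NOT OK points to a prerequisite that should be revisited before migration.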

Performing Migration

You need to migrate the GPU-based training script to a script that supports the Ascend NPU environment, and then perform training.

  1. Import the library code for automatic migration into the training script pytorch_main.py.
    The following uses PyTorch Training Script Sample in the Ascend NPU Environment as an example.
     22
     23 import torch_npu
     24 from torch_npu.contrib import transfer_to_npu
     25
    # Lines 23 and 24 are the inserted library code for automatic migration. With these two lines, the script can be used directly for training in the Ascend NPU environment.
    
  2. After the migration is complete, run the following command to start training in the Ascend NPU environment:
    python pytorch_main.py -a resnet50 -b 32 --gpu 1 --dummy

    If training starts normally and iteration logs like the following are printed, the training function has been migrated successfully.

    Using device: npu
    => creating model 'resnet50'
    => Dummy data is used!
    Epoch: [0][    1/40037] Time  4.923 ( 4.923)    Data  0.502 ( 0.502)    Loss 7.0165e+00 (7.0165e+00)    Acc@1   0.00 (  0.00)   Acc@5   0.00 (  0.00)
    Epoch: [0][   11/40037] Time  0.061 ( 0.996)    Data  0.000 ( 0.541)    Loss 1.2860e+01 (1.9566e+01)    Acc@1   0.00 (  0.00)   Acc@5   0.00 (  0.85)
    Epoch: [0][   21/40037] Time  0.063 ( 0.551)    Data  0.000 ( 0.285)    Loss 9.5061e+00 (1.5033e+01)    Acc@1   0.00 (  0.00)   Acc@5   0.00 (  0.74)
    
  3. If the weights are saved successfully, the weight-saving function has also been migrated successfully.
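
The success check in step 2 can also be done programmatically on captured training output. The following is a minimal sketch that assumes the log has exactly the format shown in this sample; the function names parse_iteration_lines and training_started_on_npu are hypothetical and not part of any Ascend tool. It looks for the "Using device: npu" line and extracts the step counter and loss values from each iteration line.

```python
import re

# Matches iteration lines such as:
# Epoch: [0][   11/40037] Time ... Loss 1.2860e+01 (1.9566e+01) ...
ITER_RE = re.compile(
    r"Epoch: \[(?P<epoch>\d+)\]\[\s*(?P<step>\d+)/(?P<total>\d+)\]"
    r".*?Loss\s+(?P<loss>[-+0-9.eE]+)\s+\(\s*(?P<avg_loss>[-+0-9.eE]+)\)"
)


def parse_iteration_lines(log_text: str) -> list:
    """Extract epoch, step, total steps, and loss values from iteration lines."""
    records = []
    for line in log_text.splitlines():
        match = ITER_RE.search(line)
        if match:
            records.append({
                "epoch": int(match["epoch"]),
                "step": int(match["step"]),
                "total": int(match["total"]),
                "loss": float(match["loss"]),
                "avg_loss": float(match["avg_loss"]),
            })
    return records


def training_started_on_npu(log_text: str) -> bool:
    """True if the log reports the npu device and at least one iteration line."""
    return "Using device: npu" in log_text and len(parse_iteration_lines(log_text)) > 0


if __name__ == "__main__":
    # Two lines copied from the sample output above.
    sample_log = (
        "Using device: npu\n"
        "Epoch: [0][    1/40037] Time  4.923 ( 4.923)    Data  0.502 ( 0.502)"
        "    Loss 7.0165e+00 (7.0165e+00)    Acc@1   0.00 (  0.00)   Acc@5   0.00 (  0.00)\n"
    )
    print(training_started_on_npu(sample_log))
    print(parse_iteration_lines(sample_log))
```

To use it on a real run, save the training output to a file and pass the file content to training_started_on_npu. If the training script prints a different log format, the regular expression needs to be adjusted accordingly.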