Automatic Migration

This section describes how to migrate PyTorch training scripts from GPU platforms to the Ascend NPU platform. Automatic migration supports PyTorch 2.1.0 and 2.2.0 training scripts. It is the simplest migration mode and requires the fewest modifications: you only need to import the migration library at the top of the training script.

Restrictions

  • The automatic migration tool relies on Python's dynamic features, which torch.jit.script does not support. If the original training script contains torch.jit.script, using automatic migration causes a conflict, so the torch.jit.script function is currently shielded during automatic migration. If your script must use torch.jit.script, use Migration Using PyTorch GPU2Ascend instead.
  • The automatic migration tool may conflict with the third-party libraries adapted to Ascend. If a conflict occurs, use Migration Using PyTorch GPU2Ascend for migration.
  • Currently, automatic migration does not support the channels_last memory format. You are advised to use contiguous instead.
  • Only PyTorch 2.1.0 and later versions are supported.
  • If the original script uses the NCCL backend, the automatic migration tool replaces it with HCCL when init_process_group initializes the process group. If subsequent code checks whether the backend is NCCL, for example, assert backend in ['gloo', 'nccl'] or if backend == 'nccl', manually change the string nccl to hccl.
  • If the user training script contains the torch.cuda.default_generators interface, which is not supported on the Ascend NPU platform, manually change it to torch_npu.npu.default_generators.
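As an illustration of the backend adjustment above, here is a minimal sketch. The helper function is hypothetical; in practice you edit the strings in your script directly:

```python
# Hypothetical helper illustrating the nccl -> hccl rename that the
# automatic migration tool applies at init_process_group time.
def ascend_backend(backend: str) -> str:
    """Map a GPU-era backend name to its Ascend equivalent."""
    return "hccl" if backend == "nccl" else backend

backend = ascend_backend("nccl")
# A GPU-era check such as `assert backend in ['gloo', 'nccl']`
# must be updated to match the replaced backend:
assert backend in ['gloo', 'hccl']
```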

Migration Operation

  1. Import the library code for automatic migration.

    Insert the following lines at the top of the training entry .py file, for example, train.py:

    import torch
    import torch_npu
    from torch_npu.contrib import transfer_to_npu
    .....
    
  2. The migration is complete. Run the modified model script on the Ascend NPU platform, following the original script's training process described in Training Configuration.
  3. If the weights are saved successfully after training completes, the weight saving function is successfully migrated. If the migration fails, rectify the fault by referring to Handling Migration Exceptions.
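A minimal post-training check of the weight-saving step can be sketched as follows. The checkpoint path and size threshold are assumptions, not part of the migration tool:

```python
import os

def checkpoint_saved(path: str, min_bytes: int = 1) -> bool:
    """Return True if the checkpoint file exists and is non-empty."""
    return os.path.isfile(path) and os.path.getsize(path) >= min_bytes

# Usage sketch (path name is an assumption):
#   torch.save(model.state_dict(), "model.pt")
#   assert checkpoint_saved("model.pt")
```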

Handling Migration Exceptions

  • If the model includes evaluation or online inference, import the automatic migration library into the corresponding script as well. Verify the migration by checking that the evaluation or inference results and log output are consistent with those on the GPU or CPU.
  • If errors are reported for some CUDA APIs during training, certain operators or framework APIs may not be supported. You can perform the following operations to address the errors:
    • Use the analysis and migration tool to analyze the model script, obtain the list of APIs whose support statuses are not clear, and submit an issue to the Ascend open source community for help.
    • For details about how to adapt an Ascend C operator, see "Huawei-developed Ascend Plug-in > Single-Operator Adaptation OpPlugin Development" in Ascend Extension for PyTorch Suites and Third-Party Libraries.
    • Perform the following steps to move the unsupported APIs to CPUs for execution:
      1. Obtain the Ascend PyTorch source package. For details, see Installing PyTorch.
      2. Go to the directory where the obtained source package is stored and modify npu_native_functions.yaml.
        cd pytorch/torch_npu/csrc/aten
        vi npu_native_functions.yaml

        Add the operator API names under the tocpu configuration.

        tocpu:
          - angle
          - mode
          - nanmedian.dim_values
          - nansum
          - native_dropout
          - native_dropout_backward
          - poisson
          - vdot
          - view_as_complex
          - view_as_real
        
      3. Recompile the framework plugin package and install it by referring to Installing PyTorch.
      4. Execute the migrated training script again to check whether the model can be trained.
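The consistency check for evaluation and inference results mentioned under Handling Migration Exceptions can be sketched as follows. The function name and tolerance are assumptions; adjust the tolerance to your model's numerical sensitivity:

```python
import math

def results_match(migrated, reference, rel_tol=1e-3):
    """Return True if every migrated score matches its GPU/CPU reference
    score within the given relative tolerance."""
    return len(migrated) == len(reference) and all(
        math.isclose(a, b, rel_tol=rel_tol)
        for a, b in zip(migrated, reference)
    )
```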