Automatic Migration

This section describes how to migrate PyTorch training scripts from GPU platforms to the Ascend NPU platform. The automatic migration mode supports training scripts for PyTorch 1.11.0, 2.1.0, 2.2.0, 2.3.1, 2.4.0, 2.5.1, and 2.6.0. This mode is simple and requires the fewest modifications: you only need to import the library code into the training script.

In automatic migration mode, PyTorch 1.11.0 does not support Atlas A3 training products/Atlas A3 inference products.

Restrictions

  • The automatic migration tool relies on the dynamic features of Python, which torch.jit.script does not support. Using automatic migration on a training script that contains torch.jit.script therefore causes conflicts, so the torch.jit.script function is currently disabled (shielded) during automatic migration. If the user script must use torch.jit.script, use Migration Using PyTorch GPU2Ascend for migration.
  • The automatic migration tool may conflict with the third-party libraries adapted to Ascend. If a conflict occurs, use Migration Using PyTorch GPU2Ascend for migration.
  • Currently, automatic migration does not support the channels_last memory format. You are advised to use the contiguous format instead.
  • If the original script uses the NCCL backend, the automatic migration tool replaces it with HCCL when init_process_group initializes the process group. If later code checks whether the backend is NCCL, for example, assert backend in ['gloo', 'nccl'] or if backend == 'nccl', manually change the string nccl to hccl.
  • If the user training script uses the torch.cuda.default_generators interface, which is not supported by the Ascend NPU platform, manually change it to torch_npu.npu.default_generators.
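The manual nccl-to-hccl change described in the restrictions above can be sketched as follows. This is a minimal illustration only: the function names validate_backend and uses_accelerator_collectives are hypothetical and stand in for whatever backend checks the original script contains.

```python
# Hypothetical backend checks from a training script; the function names
# are illustrative and are not part of torch or torch_npu.

def validate_backend(backend: str) -> str:
    """Return the backend name if the migrated script accepts it."""
    # Original GPU check:  assert backend in ['gloo', 'nccl']
    # The process group is initialized with HCCL after automatic migration,
    # so the literal 'nccl' is changed to 'hccl' by hand.
    assert backend in ["gloo", "hccl"], f"unsupported backend: {backend}"
    return backend


def uses_accelerator_collectives(backend: str) -> bool:
    """Tell whether the accelerator collective backend is in use."""
    # Original GPU check:  if backend == 'nccl'
    return backend == "hccl"
```

Only the string literals change; the surrounding control flow of the original script stays as it was.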

Migration Operation

  1. Import the library code for automatic migration.

    Insert the following import statements at the top of the training entry .py file, for example, at the top of train.py:

    import torch
    import torch_npu
    from torch_npu.contrib import transfer_to_npu   
    .....
  2. The migration is now complete. Run the modified model script on the Ascend NPU platform, following the training process provided by the original script, as described in Training Configuration.
  3. After the training is complete, the migration tool automatically saves the weights, which indicates that the migration is successful. If the migration fails, rectify the fault by referring to Handling Migration Exceptions.
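For projects with several entry scripts, step 1 can be applied mechanically. The sketch below is one possible way to do that, not part of the migration tool: add_migration_imports is a hypothetical helper that works on the script text only, so it does not require torch or torch_npu to be installed. It does not handle a shebang line, an encoding declaration, or from __future__ imports, which would have to stay ahead of the inserted lines.

```python
# The three import lines from step 1, inserted verbatim.
MIGRATION_IMPORTS = (
    "import torch\n"
    "import torch_npu\n"
    "from torch_npu.contrib import transfer_to_npu\n"
)


def add_migration_imports(source: str) -> str:
    """Return the script text with the automatic-migration imports on top.

    Hypothetical helper: it only edits text and never imports torch_npu.
    """
    if "from torch_npu.contrib import transfer_to_npu" in source:
        return source  # already prepared for migration; leave untouched
    return MIGRATION_IMPORTS + source
```

A typical use would be to read train.py as text, pass it through add_migration_imports, and write the result back before running the script on the NPU platform.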

Handling Migration Exceptions

  • If the model contains evaluation and online inference functions, you can import the automatic migration library code into the corresponding scripts and determine whether the migration is successful by checking whether the evaluation/inference results and log output on the NPU are consistent with those obtained on GPUs or CPUs.
  • If errors are reported for some CUDA APIs during training, certain operator or framework APIs may not be supported. You can perform the following operations to address the errors:
    • Use the analysis and migration tool to analyze the model script, obtain the list of APIs whose support statuses are not clear, and submit an issue to the Ascend open-source community for help.
    • For details about how to adapt Ascend C operators, see "OpPlugin-based Operator Adaptation" in the Ascend Extension for PyTorch Feature Guide.
    • Perform the following steps to move the unsupported APIs to CPUs for execution:

      This method applies only to PyTorch 1.11.0.

      1. Obtain the Ascend PyTorch source package. For details, see "Method 2: Installing from Source Code" in the Ascend Extension for PyTorch Software Installation Guide.
      2. Go to the directory where the obtained source package is stored and modify npu_native_functions.yaml.
        cd pytorch/torch_npu/csrc/aten
        vi npu_native_functions.yaml

        Add the operator API names under the tocpu configuration.

        tocpu:
          - angle
          - mode
          - nanmedian.dim_values
          - nansum
          - native_dropout
          - native_dropout_backward
          - poisson
          - vdot
          - view_as_complex
          - view_as_real
        
      3. Recompile the framework plugin package and install it. For details, see "Method 2: Installing from Source Code" in the Ascend Extension for PyTorch Software Installation Guide.
      4. Execute the migrated training script again to check whether the model can be trained.
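The edit to npu_native_functions.yaml in step 2 can also be scripted. The following is a minimal sketch that assumes the file layout shown above, that is, a tocpu: key followed directly by "- name" list items. The helper add_tocpu_operators is hypothetical and works on the file text without a YAML library; the real file may contain other structure that this sketch does not account for.

```python
from typing import Sequence


def add_tocpu_operators(yaml_text: str, operators: Sequence[str]) -> str:
    """Append operator names to the list under 'tocpu:' (hypothetical helper).

    Assumes 'tocpu:' is followed directly by '- name' items, as in the
    excerpt shown in this section.
    """
    lines = yaml_text.splitlines()
    start = next(
        (i for i, line in enumerate(lines) if line.strip() == "tocpu:"), None
    )
    if start is None:
        raise ValueError("no 'tocpu:' key found")

    # Walk over the existing list items and remember their names.
    end = start + 1
    existing = set()
    while end < len(lines) and lines[end].lstrip().startswith("- "):
        existing.add(lines[end].strip()[2:].strip())
        end += 1

    # Reuse the indentation of the existing items, or default to two spaces.
    if end > start + 1:
        first = lines[start + 1]
        indent = first[: len(first) - len(first.lstrip())]
    else:
        indent = "  "

    new_lines = []
    for op in operators:
        if op not in existing:  # skip names that are already listed
            new_lines.append(f"{indent}- {op}")
            existing.add(op)

    result = "\n".join(lines[:end] + new_lines + lines[end:])
    return result + ("\n" if yaml_text.endswith("\n") else "")
```

After the file is updated, steps 3 and 4 still apply: the plugin package has to be recompiled and reinstalled before the change takes effect. As noted above, this method applies only to PyTorch 1.11.0.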