Migration Operation

Procedure

  1. Start script migration.
    You can initiate a script migration task in any of the following ways:
    • Choose Ascend > Migration Tools > PyTorch GPU2Ascend on the toolbar.
    • Click the PyTorch GPU2Ascend icon on the toolbar.
    • Right-click the training project and choose PyTorch GPU2Ascend from the shortcut menu.
  2. Configure parameters as required.
    Figure 1 (PyTorch GPU2Ascend parameters) shows the PyTorch GPU2Ascend dialog box.
    Table 1 PyTorch GPU2Ascend parameters

    PyTorch Version
      (Required) PyTorch version of the script to be migrated. Currently, only 1.5.0 and 1.8.1 are supported. The default version is 1.5.0.

    Input Path
      (Required) Directory of the original script files to be migrated. Click the folder icon to select it.

    Output Path
      (Required) Output path for the script migration result files. Click the folder icon to select it.
      If multi-device migration is disabled, the output directory is named xxx_msft; if it is enabled, the output directory is named xxx_msft_multi. xxx indicates the name of the folder containing the original script.

    Custom Rule
      (Optional) Enables custom migration rules. When this parameter is enabled, the Rule File parameter is displayed.

    Rule File
      (Required) Available only when Custom Rule is enabled. Click the folder icon to select the path of the JSON file containing the user-defined migration rules. The JSON file consists of three parts: function parameter modification, function name modification, and module name modification. For details about how to write the JSON file for custom migration rules, see Examples of Custom Migration Rules.

    Distributed Rule
      (Optional) Migrates a single-device GPU script to a multi-device NPU script. This parameter can be used only when data is loaded via torch.utils.data.DataLoader.

    Main File
      (Required) Available only when Distributed Rule is enabled. Click the folder icon and select the entry Python file of the training script.

    Target Model
      (Optional) Available only when Distributed Rule is enabled. Specifies the variable name of the target model; the default value is model.
      If Amp Transplant is used at the same time, the two target models must be the same.

    Replace Unsupported APIs
      (Optional) When enabled, the following unsupported APIs are replaced with functionally similar APIs, which may degrade accuracy and performance:
      • apex.parallel.DistributedDataParallel
      • torch.cuda.get_device_properties
      • torch.nn.Conv3d
      • torch.nn.functional.pad
      • torch.nn.ReflectionPad2d
      • torch.nn.ReplicationPad2d
      • torch.nn.SyncBatchNorm
      • torch.nn.SyncBatchNorm.convert_sync_batchnorm
      • torch.repeat_interleave
      • torch.set_default_tensor_type

    Amp Transplant
      (Optional) Available only when PyTorch Version is set to 1.5.0. When enabled, a torch.cuda.amp mixed-precision training script is migrated to an apex.amp mixed-precision training script.
      NOTE:
      • This function may degrade network performance and is subject to significant limitations.
      • The mixed-precision migration is performed only when the original network uses torch.cuda.amp for mixed-precision training.

    Target Model
      (Optional) Available only when Amp Transplant is enabled. Specifies the input model name; the default value is model.
      If Distributed Rule is used at the same time, the two target models must be the same.

  3. Click Transplant to execute the migration task.

    After the migration, check the result file in the Output Path directory.

    ├── xxx_msft/xxx_msft_multi          // Directory storing the script migration result.
    │   ├── generated script files       // Same directory structure as the original script directory.
    │   ├── msFmkTranspltlog.txt         // Script migration log file. Each log file is at most 1 MB; larger logs are split across multiple files, up to a maximum of 10.
    │   ├── unsupported_op.csv           // List of unsupported operators.
    │   ├── change_list.csv              // Change history file.
    │   ├── run_distributed_npu.sh       // Multi-device launch shell script.
    │   ├── ascend_function              // Generated only when Replace Unsupported APIs is enabled; contains equivalent operator implementations.
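The xxx_msft / xxx_msft_multi naming above follows the rule in Table 1. A minimal sketch in shell, where "resnet50" stands in for the original script's folder name (an assumed example):

```shell
# Output directory naming used by the migration tool (see Table 1):
# <folder>_msft when multi-device migration is disabled,
# <folder>_msft_multi when it is enabled.
src_folder="resnet50"
single_out="${src_folder}_msft"        # multi-device migration disabled
multi_out="${src_folder}_msft_multi"   # multi-device migration enabled
echo "${single_out} ${multi_out}"
```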

After Migration

  • If the Replace Unsupported APIs parameter is enabled, add the parent file path to the environment variable PYTHONPATH before executing the migrated model.
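For example, if the migration output landed in /home/user/resnet50_msft (a hypothetical path; substitute the actual parent directory of ascend_function), the variable could be set as follows:

```shell
# Prepend the migration output directory (hypothetical path) so that
# the generated ascend_function package can be imported.
export PYTHONPATH=/home/user/resnet50_msft:$PYTHONPATH
```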
  • If you need to use the get_device_properties(device) interface in the similar_api.py file under ascend_function, manually edit the parameter values in StubDevicePropertise(object) as required.
  • If the Distributed Rule parameter is enabled, the following run_distributed_npu.sh file is generated after the migration:
    export MASTER_ADDR=127.0.0.1
    export MASTER_PORT=29688
    export HCCL_WHITELIST_DISABLE=1
    
    NPUS=($(seq 0 7))
    export RANK_SIZE=${#NPUS[@]}
    rank=0
    for i in ${NPUS[@]}
    do
        export DEVICE_ID=${i}
        export RANK_ID=${rank}
        echo run process ${rank}
        please input your shell script here > output_npu_${i}.log 2>&1 &
        let rank++
    done
    Table 2 Parameters

    MASTER_ADDR
      IP address of the training server.

    MASTER_PORT
      Port of the training server.

    HCCL_WHITELIST_DISABLE
      Whether to disable the HCCL communication whitelist. The value 1 disables it.

    NPUS
      IDs of the NPUs on which to run.

    RANK_SIZE
      Number of Ascend AI Processors used.

    DEVICE_ID
      Physical ID of the Ascend AI Processor.

    RANK_ID
      Logical ID of the Ascend AI Processor to be invoked.
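As a sketch of how NPUS and RANK_SIZE interact, the device list can be narrowed to a subset by editing NPUS (devices 4-7 below are an assumed example); RANK_SIZE then follows the array length automatically:

```shell
# Restrict training to four specific devices (assumed example):
NPUS=(4 5 6 7)
export RANK_SIZE=${#NPUS[@]}   # array length -> number of processes
echo "RANK_SIZE=${RANK_SIZE}"
```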

    Before executing the migrated model, replace the please input your shell script here line in the run_distributed_npu.sh file with the original training shell command of the model. After run_distributed_npu.sh is executed, logs are generated for each specified NPU.
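A minimal sketch of that replacement, using sed on a one-line stand-in for the generated file ("python3 train.py" is an assumed example of the original training command):

```shell
# Create a one-line stand-in for the generated launch script.
printf '%s\n' 'please input your shell script here > output_npu_0.log 2>&1 &' > run_distributed_npu.sh
# Substitute the placeholder with the model's original training command
# ("python3 train.py" is an assumed example).
sed -i 's|please input your shell script here|python3 train.py|' run_distributed_npu.sh
cat run_distributed_npu.sh
```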

  • The migrated script targets a different platform from the original script, so exceptions may be thrown during debugging and running (for example, due to operator differences) and terminate the process. Debug and resolve such exceptions based on the specific exception information.
  • After the analysis and migration, you can perform training by following the instructions in Model Training.