Migration Operation

Migration Procedure

  1. Use any of the following methods to start script migration:
    • Click the X2MindSpore icon on the toolbar.
    • Choose Ascend > Migration Tools > X2MindSpore from the menu bar.
    • Right-click a folder in the project directory and choose X2MindSpore from the shortcut menu.
  2. Configure parameters as required.

    After X2MindSpore starts, the page shown in Figure 1 is displayed. Configure the parameters as required.

    Figure 1 X2MindSpore parameter configuration page (Distributed enabled)
    Figure 2 X2MindSpore parameter configuration page (Graph enabled)
    Table 1 X2MindSpore parameters

    Framework

    (Required) Framework of the original script to be migrated. The options are as follows:

    • PyTorch (default)
    • TensorFlow 1
    • TensorFlow 2

    Input Path

    (Required) Directory of the original project to be migrated.

    Output Path

    (Required) Output path of the script migration result files.

    • If migration to multi-device scripts is disabled, that is, Distributed is disabled, the output directory is named xxx_x2ms.
    • If migration to multi-device scripts is enabled, that is, Distributed is enabled, the output directory is named xxx_x2ms_multi.

    xxx indicates the name of the folder that houses the original scripts.

    Distributed

    (Optional) Migrates GPU single-device scripts to multi-device scripts. This parameter applies only to the PyTorch and TensorFlow 2 frameworks and is disabled by default.

    After this parameter is enabled, configure the Device parameter to specify the device for which the multi-device scripts are generated. The options are as follows:

    • Ascend (default)
    • GPU

    Graph

    (Optional) Enables the migrated scripts to run in Graph mode in MindSpore 1.8 to 1.10. This parameter is disabled by default, in which case the scripts are migrated to PyNative mode.

    Currently, only the ResNet, BiT, and UNet series models in Model List can be migrated to Graph mode. This parameter cannot be used together with Distributed.

    After this parameter is enabled, you can set the Target Model parameter to specify the variable name of the target model. The default value is model.

  3. Click Transplant to execute the migration task.

    After the migration, check the result file in the Output Path directory.

    ├── xxx_x2ms/xxx_x2ms_multi          // Directory storing the script migration result.
    │   ├── migrated script files        // Same directory structure as the script directory before migration.
    │   ├── x2ms_adapter                 // Adaptation-layer files.
    │   ├── unsupported_api.csv          // Unsupported APIs.
    │   ├── custom_supported_api.csv     // APIs supported through tool-specific customization (only training scripts of the PyTorch framework are supported).
    │   ├── supported_api.csv            // Supported APIs.
    │   ├── deleted_api.csv              // Deleted APIs.
    │   ├── x2mindspore.log              // Migration log. Each log file is capped at 1 MB; larger logs are split across a maximum of 10 files.
    │   ├── run_distributed_ascend.sh    // Shell script for multi-device startup; generated when Distributed is enabled and Device is set to Ascend.
    │   ├── rank_table_2pcs.json         // Example networking information file for a two-device environment; generated when Distributed is enabled and Device is set to Ascend.
    │   ├── rank_table_8pcs.json         // Example networking information file for an eight-device environment; generated when Distributed is enabled and Device is set to Ascend.
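
    The CSV reports above are a quick way to size the remaining porting work. As a sketch (assuming the API name is in the first column of unsupported_api.csv; the exact column layout may vary by tool version), the unsupported APIs can be summarized as follows:

    ```shell
    #!/bin/bash
    # Summarize unsupported APIs reported by X2MindSpore.
    # RESULT_DIR is a hypothetical example path; point it at your actual output directory.
    RESULT_DIR="${HOME}/output/xxx_x2ms"
    CSV="${RESULT_DIR}/unsupported_api.csv"

    if [ -f "${CSV}" ]; then
      # Skip the header row, take the first column (assumed to hold the API name),
      # then print each unique API with its occurrence count, most frequent first.
      tail -n +2 "${CSV}" | cut -d, -f1 | sort | uniq -c | sort -rn
    else
      echo "unsupported_api.csv not found in ${RESULT_DIR}"
    fi
    ```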
  4. Before executing the migrated model files, add the output project path to the environment variable PYTHONPATH. The following is an example:
    export PYTHONPATH=${HOME}/output/xxx_x2ms:$PYTHONPATH
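
    Before launching training, it can be worth checking that the adaptation layer is actually importable from the configured path. A minimal sketch, assuming the example output directory ${HOME}/output/xxx_x2ms:

    ```shell
    #!/bin/bash
    # Add the migrated project to PYTHONPATH and verify that the
    # x2ms_adapter package can be imported (the path is an example).
    export PYTHONPATH="${HOME}/output/xxx_x2ms:${PYTHONPATH}"

    if python3 -c "import x2ms_adapter" 2>/dev/null; then
      echo "x2ms_adapter is importable"
    else
      echo "x2ms_adapter not found - check the output path" >&2
    fi
    ```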

Follow-up Operations

  • If the Distributed parameter is enabled, you need to run the multi-device script on the device specified by Device after the migration.
    • If Ascend is specified for Device:
      1. Refer to Configuring Distributed Environment Variables to configure the generated networking information file (.json) in the multi-device environment.
      2. Replace the "please input your shell script here" statement in the run_distributed_ascend.sh file with the command that executes the model's original training script.
        #!/bin/bash
        echo "=============================================================================================================="
        echo "Please run the script as: "
        echo "bash run_distributed_ascend.sh RANK_TABLE_FILE RANK_SIZE RANK_START DEVICE_START"
        echo "For example: bash run_distributed_ascend.sh /path/rank_table.json 8 0 0"
        echo "It is better to use the absolute path."
        echo "=============================================================================================================="
        execute_path=$(pwd)
        echo "${execute_path}"
        export RANK_TABLE_FILE=$1
        export RANK_SIZE=$2
        RANK_START=$3
        DEVICE_START=$4
        for((i=0;i<RANK_SIZE;i++));
        do
          export RANK_ID=$((i+RANK_START))
          export DEVICE_ID=$((i+DEVICE_START))
          rm -rf "${execute_path}"/device_$RANK_ID
          mkdir "${execute_path}"/device_$RANK_ID
          cd "${execute_path}"/device_$RANK_ID || exit
          "please input your shell script here" > train$RANK_ID.log 2>&1 &
        done

        The script creates a device_{RANK_ID} directory in the project path and executes the network script from inside that directory. Therefore, when replacing the placeholder with the Python training command, adjust any relative paths accordingly.
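
        For example, if the original project is started with python3 train.py from the project root, the placeholder line might become the following (train.py and its arguments are hypothetical; the path is resolved via ${execute_path} because the script has already changed into device_$RANK_ID):

        ```shell
        # Example replacement for the "please input your shell script here" line.
        # The entry point and its arguments below are illustrative only; ${execute_path}
        # and $RANK_ID are already set by run_distributed_ascend.sh.
        python3 "${execute_path}"/train.py --epochs 90 > train$RANK_ID.log 2>&1 &
        ```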

      3. Run the run_distributed_ascend.sh script to start the original project. For example, in an eight-device environment, run the following command:
        bash run_distributed_ascend.sh RANK_TABLE_FILE RANK_SIZE RANK_START DEVICE_START
        • RANK_TABLE_FILE: Networking information file (.json) in the multi-device environment.
        • RANK_SIZE: Number of devices to be invoked.
        • RANK_START: Logical start ID of the specified device to be invoked. Currently, only 'single-server multi-device' is supported, so set the value to 0.
        • DEVICE_START: Physical start ID of the specified device to be invoked.
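
        Putting the arguments together with the generated eight-device rank table (the path below is an example):

        ```shell
        # Run on eight devices, starting from logical rank 0 and physical device 0.
        bash run_distributed_ascend.sh "${HOME}/output/xxx_x2ms_multi/rank_table_8pcs.json" 8 0 0
        ```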

      For details about MindSpore distributed training (Ascend), see Distributed Parallel Training Example (Ascend).

    • If GPU is specified for Device:

      On the GPU hardware platform, MindSpore uses OpenMPI's mpirun for distributed training. Run the following command to launch the multi-device script:

      mpirun -n {number_of_GPUs_running_the_multi-device_script} {original_training_shell_script_command_of_the_model}
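
      For instance, to run a hypothetical launcher script train.sh on eight GPUs (the script name and GPU count are examples):

      ```shell
      # Launch the migrated training script on 8 GPUs via OpenMPI.
      mpirun -n 8 bash train.sh
      ```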

      For details about MindSpore distributed training (GPU), see Distributed Parallel Training Example (GPU).

  • If the Graph parameter is enabled, modify the construct function of the WithLossCell class in the training script so that it contains only the forward propagation and loss calculation of the model. For details, see the Transplant advice comments in the migrated script.
  • Because the framework of the migrated script differs from that of the original script, restrictions in MindSpore may cause exceptions that terminate the process while you debug and run the migrated script. Debug and resolve such exceptions based on the specific exception information.
  • After the analysis and migration, you can perform training by following the instructions in Model Training.