Migration Operation

Procedure

  1. Start script migration.
    You can initiate a script migration task in any of the following ways:
    • Choose Ascend > Migration Tools > X2MindSpore on the toolbar.
    • Click the X2MindSpore icon on the toolbar.
    • Right-click a folder in the project directory and choose X2MindSpore from the shortcut menu.
  2. Configure parameters as required.

    Figure 1 shows the X2MindSpore parameter configuration page displayed after the tool starts. Configure the parameters as required.

    Figure 1 X2MindSpore parameter configuration page (Distributed enabled)
    Figure 2 X2MindSpore parameter configuration page (Graph enabled)
    Table 1 X2MindSpore parameters

    Framework
      (Required) Framework of the original script to be migrated. The values are as follows.
      • PyTorch (default)
      • TensorFlow 1
      • TensorFlow 2

    Input Path
      (Required) Directory of the original project to be migrated.

    Output Path
      (Required) Output path of the analysis and migration result files. A directory with the _x2ms or _x2ms_multi suffix is generated in this path.
      If migration to multi-device is disabled, the output directory is named xxx_x2ms; if it is enabled, the directory is named xxx_x2ms_multi. xxx indicates the name of the folder containing the original script.

    Distributed
      (Optional) Migrates GPU single-device scripts to multi-device scripts. This parameter is available when Framework is set to PyTorch or TensorFlow 2; the function is currently unavailable for TensorFlow 1. It is disabled by default. If enabled, the Device parameter is displayed.

    Device
      (Required) Device type targeted by the generated multi-device scripts. This parameter is displayed only when Distributed is enabled. The values are as follows.
      • Ascend (default)
      • GPU

    Graph
      (Optional) If enabled, the migrated scripts can run in Graph mode in MindSpore 1.8 or later. This parameter is available when Framework is set to PyTorch. If enabled, the Target Model parameter is displayed. It is disabled by default, that is, scripts are migrated to PyNative mode by default.
      Currently, only ResNet and BiT series models can be migrated to Graph mode. This parameter cannot be used together with Distributed.

    Target Model
      (Optional) Available only when Graph is enabled. Indicates the variable name of the target model; the default value is model.

  3. Click Transplant to execute the migration task.

    After the migration, check the result file in the Output Path directory.

    ├── xxx_x2ms/xxx_x2ms_multi          // Directory storing the script migration result.
    │   ├── migrated script files        // Same directory structure as the script files before migration.
    │   ├── x2ms_adapter                 // Adapter layer code.
    │   ├── unsupported_api.csv          // Unsupported APIs.
    │   ├── custom_supported_api.csv     // APIs supported through tool-specific customization (PyTorch training scripts only).
    │   ├── supported_api.csv            // Supported APIs.
    │   ├── deleted_api.csv              // Deleted APIs.
    │   ├── x2mindspore.log              // Migration log. Each log file is at most 1 MB; larger logs are split across multiple files, up to 10 files.
    │   ├── run_distributed_ascend.sh    // Shell script for launching multi-device training. Generated when Distributed is enabled and Device is set to Ascend.
    │   ├── rank_table_2pcs.json         // Example networking information file for a 2-device environment. Generated when Distributed is enabled and Device is set to Ascend.
    │   ├── rank_table_8pcs.json         // Example networking information file for an 8-device environment. Generated when Distributed is enabled and Device is set to Ascend.
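    The CSV reports are the quickest way to gauge the remaining manual porting work. A minimal sketch, assuming the output landed in ./xxx_x2ms (the directory and the sample row below are illustrative only, created so the snippet is self-contained):

```shell
# Stand-in for the real output directory; in practice, point OUT at the
# actual xxx_x2ms or xxx_x2ms_multi directory under Output Path.
OUT=./xxx_x2ms
mkdir -p "$OUT"
printf 'torch.some_api,1\n' > "$OUT/unsupported_api.csv"   # illustrative row

# Count the APIs that still need manual porting.
count=$(wc -l < "$OUT/unsupported_api.csv")
echo "unsupported API entries: $count"
```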
  4. Before executing the migrated model files, add the output project path to the environment variable PYTHONPATH.
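    For example (the path below is a placeholder for your actual Output Path directory):

```shell
# Prepend the migrated project directory (placeholder path) to PYTHONPATH
# so Python can resolve the x2ms_adapter package at import time.
export PYTHONPATH=/path/to/output/xxx_x2ms:$PYTHONPATH
echo "$PYTHONPATH"
```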

After Migration

  • If the Distributed parameter is enabled, you need to run the multi-device script on the device specified by Device after the migration.
    • Device specifies the Ascend device.
      1. Refer to Configuring Distributed Environment Variables to configure the generated distributed environment variable file.
      2. Replace the please input your shell script here statement in the run_distributed_ascend.sh file with the execution command of the original Python training script of the model.
        #!/bin/bash
        echo "=============================================================================================================="
        echo "Please run the script as: "
        echo "bash run_distributed_ascend.sh RANK_TABLE_FILE RANK_SIZE RANK_START DEVICE_START"
        echo "For example: bash run_distributed_ascend.sh /path/rank_table.json 8 0 0"
        echo "It is better to use the absolute path."
        echo "=============================================================================================================="
        execute_path=$(pwd)
        echo "${execute_path}"
        export RANK_TABLE_FILE=$1
        export RANK_SIZE=$2
        RANK_START=$3
        DEVICE_START=$4
        for((i=0;i<RANK_SIZE;i++));
        do
          export RANK_ID=$((i+RANK_START))
          export DEVICE_ID=$((i+DEVICE_START))
          rm -rf "${execute_path}"/device_$RANK_ID
          mkdir "${execute_path}"/device_$RANK_ID
          cd "${execute_path}"/device_$RANK_ID || exit
          "please input your shell script here" > train$RANK_ID.log 2>&1 &
        done
        Table 2 Parameters

        RANK_TABLE_FILE
          Networking information file for the multi-device environment.

        RANK_SIZE
          Number of Ascend AI Processors to be used.

        RANK_START
          Logical start ID of the Ascend AI Processors to be invoked. Currently, only single-server multi-device training is supported, so set this value to 0.

        DEVICE_START
          Physical start ID of the Ascend AI Processors to be invoked.

        The script creates a device_{RANK_ID} directory in the project path and runs the training command from inside that directory. Therefore, when replacing the Python training script command, pay attention to the change in its relative path.
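        The path shift can be sketched as follows. The project layout and the train.py entry point are hypothetical stand-ins; adapt both to your own project:

```shell
# Demonstrates why the replacement command needs a parent-relative path:
# run_distributed_ascend.sh creates and enters device_$RANK_ID before
# launching the training command.
RANK_ID=0
mkdir -p demo_project/device_$RANK_ID
touch demo_project/train.py               # stand-in for the real entry point
cd demo_project/device_$RANK_ID || exit 1
# A command that was "python3 train.py" in the project root therefore
# becomes, for example:
#   python3 ../train.py > train$RANK_ID.log 2>&1 &
[ -f ../train.py ] && echo "entry point reachable via ../train.py"
```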

      3. Run the run_distributed_ascend.sh script to start the original project. For example, in an 8-device environment, run the following command:
        bash run_distributed_ascend.sh RANK_TABLE_FILE RANK_SIZE RANK_START DEVICE_START

      For details about MindSpore distributed training (Ascend), see Distributed Parallel Training Example (Ascend).

    • Device specifies the GPU device.

      On the GPU hardware platform, MindSpore uses OpenMPI's mpirun command for distributed training. Run the following command to launch the multi-device script:

      mpirun -n {number_of_GPUs_running_the_multi-device_script} {original_training_shell_script_command_of_the_model}

      For details about MindSpore distributed training (GPU), see Distributed Parallel Training Example (GPU).
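      As a concrete sketch, assuming eight GPUs and a hypothetical run_train.sh that wraps the model's original training command:

```shell
# Hypothetical launch: run_train.sh stands in for the model's original
# training shell command; -n 8 starts eight processes, one per GPU.
mpirun -n 8 bash run_train.sh
```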

  • If the Graph parameter is enabled, change the construct function of the WithLossCell class in the training script to include only the forward propagation and loss calculation of the model. For details, see Transplant advice in the migrated script.
  • Because the framework of the migrated script differs from that of the original script, restrictions in MindSpore may cause an exception to be thrown during debugging and running, terminating the process. Debug and resolve such exceptions based on the specific exception information.
  • After the analysis and migration, you can perform training by following the instructions in Model Training.