Migration Operation
Procedure
- Start script migration. You can initiate a script migration task in any of the following ways:
  - Choose the X2MindSpore option on the toolbar.
  - Click the X2MindSpore icon on the toolbar.
  - Right-click a folder in the project directory and choose X2MindSpore from the shortcut menu.
- Configure parameters as required.
Figure 1 (X2MindSpore parameter configuration page) shows the page displayed after X2MindSpore is started.
Figure 2 X2MindSpore parameter configuration page (Graph enabled)
Table 1 X2MindSpore parameters

- Framework: (Required) Framework of the original script to be migrated. The options are as follows:
  - PyTorch (default)
  - TensorFlow 1
  - TensorFlow 2
- Input Path: (Required) Directory of the original project to be migrated.
- Output Path: (Required) Output path of the analysis and migration result files. A directory with the _x2ms or _x2ms_multi suffix is generated in this path. If migration to multi-device is disabled, the output directory is named xxx_x2ms; if it is enabled, the output directory is named xxx_x2ms_multi. xxx indicates the name of the folder where the original script is located.
- Distributed: (Optional) Migrates GPU single-device scripts to multi-device scripts. This parameter is available when Framework is set to PyTorch or TensorFlow 2; the function is currently unavailable for TensorFlow 1. It is disabled by default. If this parameter is enabled, the Device parameter is displayed.
- Device: (Required) Specifies the target device for migrating GPU single-device scripts to multi-device scripts. This parameter is displayed only when Distributed is enabled. The options are as follows:
  - Ascend (default)
  - GPU
- Graph: (Optional) If enabled, the migrated scripts can run in Graph mode in MindSpore 1.8 or later. This parameter is available when Framework is set to PyTorch and is disabled by default, that is, scripts are migrated to PyNative mode by default. If this parameter is enabled, the Target Model parameter is displayed. Currently, only the ResNet and BiT series models in Table 1 can be migrated to Graph mode, and this parameter cannot be used together with Distributed.
- Target Model: (Optional) Available only when Graph is enabled. Indicates the variable name of the target model; the default value is model.
- Click Transplant to execute the migration task.
After the migration, check the result file in the Output Path directory.
├── xxx_x2ms/xxx_x2ms_multi       // Directory for storing the script migration result.
│   ├── migrated script file      // The directory structure is the same as that of the script file directory before migration.
│   ├── x2ms_adapter              // Mediation file.
│   ├── unsupported_api.csv       // File of unsupported APIs.
│   ├── custom_supported_api.csv  // File of supported APIs customized for the tool (only training scripts of the PyTorch framework are supported).
│   ├── supported_api.csv         // File of supported APIs.
│   ├── deleted_api.csv           // File of deleted APIs.
│   ├── x2mindspore.log           // Migration log. The maximum size of a log file is 1 MB. If the size exceeds 1 MB, the log is stored in multiple files, up to a maximum of 10.
│   ├── run_distributed_ascend.sh // Shell script for starting multi-device training, generated when Distributed is enabled and Device is set to Ascend.
│   ├── rank_table_2pcs.json      // Example networking information file for a 2-device environment, generated when Distributed is enabled and Device is set to Ascend.
│   ├── rank_table_8pcs.json      // Example networking information file for an 8-device environment, generated when Distributed is enabled and Device is set to Ascend.
- Before executing the migrated model files, add the output project path to the environment variable PYTHONPATH.
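For example (a minimal sketch; /path/to/output/xxx_x2ms below is a hypothetical output directory, not a path produced by this document), the variable can be set as follows:

```shell
# Hypothetical output directory; replace with the actual directory
# generated under your Output Path (xxx_x2ms or xxx_x2ms_multi).
export PYTHONPATH=/path/to/output/xxx_x2ms:$PYTHONPATH
```

Add this line to your shell profile if the migrated project is executed regularly, so the x2ms_adapter mediation package can be resolved in every session.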
After Migration
- If the Distributed parameter is enabled, you need to run the multi-device script on the device specified by Device after the migration.
- Device specifies the Ascend device.
- Refer to Configuring Distributed Environment Variables to configure the generated distributed environment variable file.
- Replace the "please input your shell script here" placeholder statement in the run_distributed_ascend.sh file with the execution command of the original Python training script of the model.
#!/bin/bash
echo "=============================================================================================================="
echo "Please run the script as: "
echo "bash run_distributed_ascend.sh RANK_TABLE_FILE RANK_SIZE RANK_START DEVICE_START"
echo "For example: bash run_distributed_ascend.sh /path/rank_table.json 8 0 0"
echo "It is better to use the absolute path."
echo "=============================================================================================================="
execute_path=$(pwd)
echo "${execute_path}"
export RANK_TABLE_FILE=$1
export RANK_SIZE=$2
RANK_START=$3
DEVICE_START=$4
for((i=0;i<RANK_SIZE;i++));
do
    export RANK_ID=$((i+RANK_START))
    export DEVICE_ID=$((i+DEVICE_START))
    rm -rf "${execute_path}"/device_$RANK_ID
    mkdir "${execute_path}"/device_$RANK_ID
    cd "${execute_path}"/device_$RANK_ID || exit
    "please input your shell script here" > train$RANK_ID.log 2>&1 &
done

Table 2 run_distributed_ascend.sh parameters

- RANK_TABLE_FILE: Networking information file in the multi-device environment.
- RANK_SIZE: Number of Ascend AI Processors.
- RANK_START: Logical start ID of the Ascend AI Processor to be invoked. Currently, only single-server multi-device is supported, so set this value to 0.
- DEVICE_START: Physical start ID of the Ascend AI Processor to be invoked.
The script creates a device_{RANK_ID} directory in the project path and executes the training command from inside that directory. Therefore, when replacing the placeholder with the Python training command, pay attention to how its relative paths change.
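The replacement can be sketched as follows. This is a self-contained illustration, not tool output: the entry script name train.py and its location in the project root (one level above the device_$RANK_ID working directory) are assumptions, and a stub train.py stands in for the real training script.

```shell
# Sketch of the placeholder replacement in run_distributed_ascend.sh.
# The script cds into device_$RANK_ID before launching, so the original
# entry script (assumed to live in the project root) is reached via ../ .
RANK_ID=0
mkdir -p device_$RANK_ID
printf 'print("training...")\n' > train.py      # stub for the real entry script
cd device_$RANK_ID
python3 ../train.py > train$RANK_ID.log 2>&1 &  # replaces "please input your shell script here"
wait
cat train$RANK_ID.log                           # prints "training..."
```

In the generated script, RANK_ID and the device_$RANK_ID directory are managed by the for loop; only the `python3 ../train.py > train$RANK_ID.log 2>&1 &` line would actually be inserted in place of the placeholder.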
- Run the run_distributed_ascend.sh script to start the original project. For example, in an 8-device environment, run the following command:
bash run_distributed_ascend.sh RANK_TABLE_FILE RANK_SIZE RANK_START DEVICE_START
For details about MindSpore distributed training (Ascend), see Distributed Parallel Training Example (Ascend).
- Device specifies the GPU device.
On the GPU hardware platform, MindSpore uses mpirun of OpenMPI for distributed training. You can run the following command to run the multi-device script:
mpirun -n {number_of_GPUs_running_the_multi-device_script} {original_training_shell_script_command_of_the_model}

For details about MindSpore distributed training (GPU), see Distributed Parallel Training Example (GPU).
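As a sketch of how the pieces compose (hypothetical values: eight GPUs and an original entry command `bash train.sh`, neither of which comes from the tool), the launch command can be assembled like this; the final mpirun line is only printed here and would be executed directly on the GPU host:

```shell
# Hypothetical values; substitute your GPU count and original training command.
NUM_GPUS=8
TRAIN_CMD="bash train.sh"
echo "mpirun -n ${NUM_GPUS} ${TRAIN_CMD}"   # prints: mpirun -n 8 bash train.sh
```

On a host with OpenMPI installed, run the printed command itself (without echo) to start one training process per GPU.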
- If the Graph parameter is enabled, change the construct function of the WithLossCell class in the training script to include only the forward propagation and loss calculation of the model. For details, see Transplant advice in the migrated script.
- The migrated script runs on a different framework from the original script. During debugging and running, restrictions in MindSpore may therefore cause exceptions that terminate the process; such exceptions must be further debugged and resolved based on the specific exception information.
- After the analysis and migration, you can perform training by following the instructions in Model Training.
