Migrating GPU Single-device Scripts to NPU Multi-device Scripts
If distributed training is enabled during the migration, that is, if you are migrating GPU single-device scripts to NPU multi-device scripts, perform the following operations to obtain the result files.
- Replace training script statements.
Replace the please input your shell script here statement in the run_distributed_npu.sh file generated by the migration command with the model's original training shell command. For example, replace please input your shell script here with the training command bash model_train_script.sh --data_path data_path (see the example after the listing below).
The content of the run_distributed_npu.sh file is as follows.

```bash
export MASTER_ADDR=127.0.0.1
export MASTER_PORT=29688
export HCCL_WHITELIST_DISABLE=1
NPUS=($(seq 0 7))
export RANK_SIZE=${#NPUS[@]}
rank=0
for i in ${NPUS[@]}
do
    export DEVICE_ID=${i}
    export RANK_ID=${rank}
    echo run process ${rank}
    please input your shell script here > output_npu_${i}.log 2>&1 &
    let rank++
done
```
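For example, with the illustrative training command above, the loop body after the replacement would read roughly as follows (model_train_script.sh is only the example name used earlier, not a file the tool generates):

```bash
# Loop body after replacing the placeholder with the example training command
export DEVICE_ID=${i}
export RANK_ID=${rank}
echo run process ${rank}
bash model_train_script.sh --data_path data_path > output_npu_${i}.log 2>&1 &
let rank++
```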
Table 1 Parameters in run_distributed_npu.sh

| Parameter | Description |
| --- | --- |
| MASTER_ADDR | IP address of the training server. |
| MASTER_PORT | Port of the training server. |
| HCCL_WHITELIST_DISABLE | Trustlist verification for HCCL communication (set to 1 to disable verification). |
| NPUS | NPUs specified for running the script. |
| RANK_SIZE | Number of devices to be invoked. |
| DEVICE_ID | ID of the device to be invoked. |
| RANK_ID | Logical ID of the device to be invoked. |
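The NPUS array is the usual knob for scaling the run, since RANK_SIZE is derived from it. As a plausible edit (not part of the generated file), a four-device run could be configured as:

```bash
NPUS=($(seq 0 3))             # use devices 0-3 instead of 0-7
export RANK_SIZE=${#NPUS[@]}  # now evaluates to 4
```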
- After the replacement, execute the run_distributed_npu.sh file. Each specified NPU writes its output to its own log file, output_npu_<device ID>.log.
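A typical launch-and-monitor sequence, assuming the default log names from the script above, might be:

```bash
bash run_distributed_npu.sh   # spawns one training process per NPU in NPUS
tail -f output_npu_0.log      # follow the log written by device 0
```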
- View result files.
After the script migration is complete, go to the result output path to view the result files. The following uses the migration from GPU single-device scripts to NPU multi-device scripts as an example. The result files are as follows.
```
├── xxx_msft/xxx_msft_multi      // Directory for storing the script migration result.
│   ├── generated_script_file    // The directory structure is the same as that of the script file directory before the migration.
│   ├── msFmkTranspltlog.txt     // Script porting log file. A log file is at most 1 MB; if it exceeds 1 MB, the log is split across multiple files, up to a maximum of 10.
│   ├── cuda_op_list.csv         // List of analyzed CUDA operators.
│   ├── unknown_api.csv          // List of APIs with uncertain support status.
│   ├── unsupported_api.csv      // List of unsupported APIs.
│   ├── change_list.csv          // Change history file.
│   ├── run_distributed_npu.sh   // Multi-device boot shell script.
```
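To get a quick overview of which APIs need manual attention, you can pretty-print the CSV reports on the command line; this assumes plain comma-separated files and the standard coreutils column tool:

```bash
# Align the comma-separated report into readable columns for review
column -s, -t < unsupported_api.csv | less
```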
- Check the migrated .py script. You can see that the CUDA APIs in the script have been replaced with NPU APIs.
```python
def main():
    args = parser.parse_args()

    if args.seed is not None:
        random.seed(args.seed)
        torch.manual_seed(args.seed)
        cudnn.deterministic = True
        cudnn.benchmark = False
        warnings.warn('You have chosen to seed training. '
                      'This will turn on the CUDNN deterministic setting, '
                      'which can slow down your training considerably! '
                      'You may see unexpected behavior when restarting '
                      'from checkpoints.')

    if args.gpu is not None:
        warnings.warn('You have chosen a specific GPU. This will completely '
                      'disable data parallelism.')

    if args.dist_url == "env://" and args.world_size == -1:
        args.world_size = int(os.environ["WORLD_SIZE"])

    args.distributed = args.world_size > 1 or args.multiprocessing_distributed

    if torch_npu.npu.is_available():
        ngpus_per_node = torch_npu.npu.device_count()
    else:
        ngpus_per_node = 1
    if args.multiprocessing_distributed:
        # Since we have ngpus_per_node processes per node, the total world_size
        # needs to be adjusted accordingly
        args.world_size = ngpus_per_node * args.world_size
        # Use torch.multiprocessing.spawn to launch distributed processes: the
        # main_worker process function
        mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
    else:
        # Simply call main_worker function
        main_worker(args.gpu, ngpus_per_node, args)
```
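For context, the main_worker function spawned above typically binds each process to its NPU and initializes the collective process group. The following is a minimal sketch, not the migration tool's actual output; the placeholder model and the global-rank formula are assumptions based on the standard PyTorch ImageNet example:

```python
import torch
import torch.distributed as dist
import torch_npu  # Ascend adapter for PyTorch

def main_worker(npu, ngpus_per_node, args):
    # Bind this process to its NPU (the GPU original would call torch.cuda.set_device).
    torch_npu.npu.set_device(npu)
    if args.distributed:
        # HCCL is the collective backend on Ascend NPUs, replacing NCCL.
        rank = args.rank * ngpus_per_node + npu  # assumed global-rank formula
        dist.init_process_group(backend="hccl", init_method=args.dist_url,
                                world_size=args.world_size, rank=rank)
    # Modules and tensors are placed with the "npu" device string instead of "cuda".
    model = torch.nn.Linear(8, 8).to(f"npu:{npu}")  # placeholder model for illustration
```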