Manual Migration for Single-Card Training
Obtain the main.py script as described in the sample code instructions and comment out the code related to the mps module, then perform the following migration steps.
- Import the torch_npu module in the main.py script.
import torch
import torch_npu
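Once both imports are in place, the "npu" device type becomes available to standard torch calls. The following minimal sanity-check sketch is not part of main.py; it assumes torch_npu is installed and at least one NPU is visible:
import torch
import torch_npu  # registers the "npu" device type with torch

print(torch.__version__)
print(torch_npu.npu.is_available())   # True if an NPU device is usable

x = torch.ones(2, 3).npu()            # move a tensor to the default NPU
print(x.device)                       # e.g. npu:0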
- Change the node's GPU count to the NPU count.
Original code:
if torch.cuda.is_available():
    ngpus_per_node = torch.cuda.device_count()
else:
    ngpus_per_node = 1
Modified code:
if torch_npu.npu.is_available():
    ngpus_per_node = torch_npu.npu.device_count()
else:
    ngpus_per_node = 1
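To see this step in isolation, the short sketch below (an illustration, not part of main.py) prints the detected device count and binds the current process to the first NPU, using only the interfaces shown in this section:
import torch
import torch_npu

if torch_npu.npu.is_available():
    ngpus_per_node = torch_npu.npu.device_count()   # NPUs visible on this node
    torch_npu.npu.set_device(0)                     # bind this process to NPU 0
else:
    ngpus_per_node = 1                              # CPU-only fallback

print("devices per node:", ngpus_per_node)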
- Migrate the model and the loss function to the Ascend AI Processor for computation, replacing every CUDA interface with its NPU counterpart.
Original code:
if not torch.cuda.is_available() and not torch.backends.mps.is_available():
    print('using CPU, this will be slow')
elif args.distributed:
    # For multiprocessing distributed, DistributedDataParallel constructor
    # should always set the single device scope, otherwise,
    # DistributedDataParallel will use all available devices.
    if torch.cuda.is_available():
        if args.gpu is not None:
            torch.cuda.set_device(args.gpu)
            model.cuda(args.gpu)
            # When using a single GPU per process and per
            # DistributedDataParallel, we need to divide the batch size
            # ourselves based on the total number of GPUs of the current node.
            args.batch_size = int(args.batch_size / ngpus_per_node)
            args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
            model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
        else:
            model.cuda()
            # DistributedDataParallel will divide and allocate batch_size to all
            # available GPUs if device_ids are not set
            model = torch.nn.parallel.DistributedDataParallel(model)
elif args.gpu is not None and torch.cuda.is_available():
    torch.cuda.set_device(args.gpu)
    model = model.cuda(args.gpu)
#elif torch.backends.mps.is_available():
#    device = torch.device("mps")
#    model = model.to(device)
else:
    # DataParallel will divide and allocate batch_size to all available GPUs
    if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
        model.features = torch.nn.DataParallel(model.features)
        model.cuda()
    else:
        model = torch.nn.DataParallel(model).cuda()

if torch.cuda.is_available():
    if args.gpu:
        device = torch.device('cuda:{}'.format(args.gpu))
    else:
        device = torch.device("cuda")
#elif torch.backends.mps.is_available():
#    device = torch.device("mps")
else:
    device = torch.device("cpu")
# define loss function (criterion), optimizer, and learning rate scheduler
criterion = nn.CrossEntropyLoss().to(device)
Modified code (the mps branches stay commented out, as in the original code above):
if not torch_npu.npu.is_available() and not torch.backends.mps.is_available():
    print('using CPU, this will be slow')
elif args.distributed:
    # For multiprocessing distributed, DistributedDataParallel constructor
    # should always set the single device scope, otherwise,
    # DistributedDataParallel will use all available devices.
    if torch_npu.npu.is_available():
        if args.gpu is not None:
            torch_npu.npu.set_device(args.gpu)
            model.npu(args.gpu)
            # When using a single GPU per process and per
            # DistributedDataParallel, we need to divide the batch size
            # ourselves based on the total number of GPUs of the current node.
            args.batch_size = int(args.batch_size / ngpus_per_node)
            args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
            model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
        else:
            model.npu()
            # DistributedDataParallel will divide and allocate batch_size to all
            # available GPUs if device_ids are not set
            model = torch.nn.parallel.DistributedDataParallel(model)
elif args.gpu is not None and torch_npu.npu.is_available():
    torch_npu.npu.set_device(args.gpu)
    model = model.npu(args.gpu)
#elif torch.backends.mps.is_available():
#    device = torch.device("mps")
#    model = model.to(device)
else:
    # DataParallel will divide and allocate batch_size to all available GPUs
    if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
        model.features = torch.nn.DataParallel(model.features)
        model.npu()
    else:
        model = torch.nn.DataParallel(model).npu()

if torch_npu.npu.is_available():
    if args.gpu:
        device = torch.device('npu:{}'.format(args.gpu))
    else:
        device = torch.device("npu")
#elif torch.backends.mps.is_available():
#    device = torch.device("mps")
else:
    device = torch.device("cpu")
# define loss function (criterion), optimizer, and learning rate scheduler
criterion = nn.CrossEntropyLoss().to(device)
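For a plain single-card run that does not go through DistributedDataParallel, the same migration boils down to the device-agnostic pattern below. This is a self-contained sketch with a toy model standing in for the real network; it is not taken verbatim from main.py:
import torch
import torch.nn as nn
import torch_npu

# Pick the target device: NPU 0 if available, otherwise fall back to CPU.
if torch_npu.npu.is_available():
    torch_npu.npu.set_device(0)
    device = torch.device("npu:0")
else:
    device = torch.device("cpu")

model = nn.Linear(16, 4).to(device)            # toy model in place of the ResNet
criterion = nn.CrossEntropyLoss().to(device)   # keep the loss on the same device

x = torch.randn(8, 16, device=device)
target = torch.randint(0, 4, (8,), device=device)
print(criterion(model(x), target).item())      # one forward pass plus loss on the NPU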
- Replace the interfaces and device used for resuming training from a checkpoint with their NPU counterparts.
Original code:
if args.gpu is None:
    checkpoint = torch.load(args.resume)
elif torch.cuda.is_available():
    # Map model to be loaded to specified single gpu.
    loc = 'cuda:{}'.format(args.gpu)
    checkpoint = torch.load(args.resume, map_location=loc)
Modified code:
if args.gpu is None:
    checkpoint = torch.load(args.resume)
elif torch_npu.npu.is_available():
    # Map model to be loaded to specified single gpu.
    loc = 'npu:{}'.format(args.gpu)
    checkpoint = torch.load(args.resume, map_location=loc)
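To see the mapping in isolation, here is a self-contained sketch of saving a checkpoint and loading it back onto the NPU with map_location. The filename and dictionary key are illustrative assumptions, not values prescribed by main.py:
import torch
import torch.nn as nn
import torch_npu

device = torch.device("npu:0" if torch_npu.npu.is_available() else "cpu")
model = nn.Linear(16, 4).to(device)

# Save a minimal checkpoint.
torch.save({'state_dict': model.state_dict()}, 'checkpoint.pth.tar')

# Resume: map tensors saved from any device onto this NPU (or the CPU fallback).
loc = 'npu:0' if torch_npu.npu.is_available() else 'cpu'
checkpoint = torch.load('checkpoint.pth.tar', map_location=loc)
model.load_state_dict(checkpoint['state_dict'])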
- Migrate the dataset to the Ascend AI Processor for computation.
Code location: the run_validate() function inside validate() in main.py.
In the original code the data is loaded and computed on the GPU. Original code:
if args.gpu is not None and torch.cuda.is_available():
    images = images.cuda(args.gpu, non_blocking=True)
#if torch.backends.mps.is_available():
#    images = images.to('mps')
#    target = target.to('mps')
if torch.cuda.is_available():
    target = target.cuda(args.gpu, non_blocking=True)
Move the data to the NPU for computation instead. Modified code:
if args.gpu is not None and torch_npu.npu.is_available():
    images = images.npu(args.gpu, non_blocking=True)
#if torch.backends.mps.is_available():
#    images = images.to('mps')
#    target = target.to('mps')
if torch_npu.npu.is_available():
    target = target.npu(args.gpu, non_blocking=True)
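The same pattern applies anywhere a batch is consumed during training or validation. Below is a minimal sketch of such a loop with a placeholder DataLoader; the dataset, shapes, and pin_memory choice are assumptions for illustration, not the ImageNet pipeline from main.py:
import torch
import torch_npu
from torch.utils.data import DataLoader, TensorDataset

on_npu = torch_npu.npu.is_available()
dataset = TensorDataset(torch.randn(64, 3, 224, 224), torch.randint(0, 1000, (64,)))
loader = DataLoader(dataset, batch_size=16, pin_memory=on_npu)  # pinned host memory helps non-blocking copies

for images, target in loader:
    if on_npu:
        images = images.npu(0, non_blocking=True)   # asynchronous host-to-NPU copy
        target = target.npu(0, non_blocking=True)
    # ... forward pass, loss, and metric updates go here ...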
- Replace the CUDA interfaces and device in the loss-averaging function with NPU.
Code location: the all_reduce() function in class AverageMeter(object).
Original code:
def all_reduce(self):
    if torch.cuda.is_available():
        device = torch.device("cuda")
    ……
Modified code:
def all_reduce(self):
    if torch_npu.npu.is_available():
        device = torch.device("npu")
    ……
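For context, in the upstream PyTorch ImageNet example the omitted part of all_reduce() sums the meter's statistics across all workers and recomputes the average. The sketch below shows what the whole method typically looks like after the device switch; the tail is reconstructed from that upstream example (and assumes a distributed process group, e.g. with the HCCL backend, has already been initialized), so treat it as an assumption rather than the verbatim sample code:
import torch
import torch.distributed as dist
import torch_npu

class AverageMeter(object):
    """Tracks a running sum, count, and average of a metric."""
    def __init__(self):
        self.sum = 0.0
        self.count = 0.0
        self.avg = 0.0

    def all_reduce(self):
        if torch_npu.npu.is_available():
            device = torch.device("npu")
        else:
            device = torch.device("cpu")
        # Aggregate the local sum and count across all workers, then refresh the average.
        total = torch.tensor([self.sum, self.count], dtype=torch.float32, device=device)
        dist.all_reduce(total, dist.ReduceOp.SUM, async_op=False)
        self.sum, self.count = total.tolist()
        self.avg = self.sum / self.count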
- Run the training script to launch the training process, for example:
(The parameters below are examples; adjust them to your actual environment.)
python3 main.py /home/data/resnet50/imagenet --batch-size 128 \   # Training batch size; set it to a multiple of the number of processor cores where possible for better performance
    --lr 0.1 \             # Learning rate
    --epochs 90 \          # Number of training epochs
    --arch resnet50 \      # Model architecture
    --world-size 1 \
    --rank 0 \
    --workers 40 \         # Number of data-loading processes
    --momentum 0.9 \       # Momentum
    --weight-decay 1e-4 \  # Weight decay
    --gpu 0                # Device ID; the argument is still named gpu, but after migration the actual training device is set to NPU in the code
- After training, check whether a weight file has been generated; if such a file exists, the migrated training ran successfully.