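This example takes a ResNet50 training script based on the ImageNet dataset and migrates it to the Ascend platform using automatic migration.
With automatic migration, you import a script conversion library into the training script and then launch training as usual. While the training script runs, the interfaces it calls are automatically replaced with NPU interfaces supported by the Ascend AI Processor, so conversion happens during training rather than as a separate step.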
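The idea of replacing interfaces at run time can be illustrated with a small, purely conceptual sketch. It is not the implementation used by the conversion library: `framework`, `gpu_device_name`, `npu_device_name`, and `apply_transfer` are hypothetical names invented for this illustration, and no real torch or torch_npu API is called.

```python
import types

# Hypothetical stand-in for a framework module; NOT the real torch API.
framework = types.SimpleNamespace()
framework.gpu_device_name = lambda index: f"cuda:{index}"


def npu_device_name(index):
    """Hypothetical NPU-flavoured replacement for gpu_device_name."""
    return f"npu:{index}"


def apply_transfer(module):
    """Swap the GPU entry point for the NPU one, in place."""
    module.gpu_device_name = npu_device_name


def training_script():
    # The "training script" is unchanged: it still calls the GPU-named API.
    return framework.gpu_device_name(0)


before = training_script()   # uses the original GPU entry point
apply_transfer(framework)    # analogous to importing the conversion library
after = training_script()    # same call, now routed to the replacement
print(before, after)
```

The point of the sketch is that the calling code is never edited; only the object behind the name changes, which is why the training script can keep its GPU-style calls.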
Search for "torch.backends.mps" and comment out the related code. The original code is as follows:
if not torch.cuda.is_available() and not torch.backends.mps.is_available():
    print('using CPU, this will be slow')
elif args.distributed:
    ...
...
elif args.gpu is not None and torch.cuda.is_available():
    torch.cuda.set_device(args.gpu)
    model = model.cuda(args.gpu)
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    model = model.to(device)
else:
    ...
...
if torch.cuda.is_available():
    if args.gpu:
        device = torch.device('cuda:{}'.format(args.gpu))
    else:
        device = torch.device("cuda")
elif torch.backends.mps.is_available():
    device = torch.device("mps")
else:
    device = torch.device("cpu")
...
...
def run_validate(loader, base_progress=0):
    ...
    if args.gpu is not None and torch.cuda.is_available():
        images = images.cuda(args.gpu, non_blocking=True)
    if torch.backends.mps.is_available():
        images = images.to('mps')
        target = target.to('mps')
    if torch.cuda.is_available():
        target = target.cuda(args.gpu, non_blocking=True)
...
...
def all_reduce(self):
    if torch.cuda.is_available():
        device = torch.device("cuda")
    elif torch.backends.mps.is_available():
        device = torch.device("mps")
    else:
        device = torch.device("cpu")
After modification:
if not torch.cuda.is_available():
    print('using CPU, this will be slow')
elif args.distributed:
    ...
...
elif args.gpu is not None and torch.cuda.is_available():
    torch.cuda.set_device(args.gpu)
    model = model.cuda(args.gpu)
# elif torch.backends.mps.is_available():
#     device = torch.device("mps")
#     model = model.to(device)
else:
    ...
...
if torch.cuda.is_available():
    if args.gpu:
        device = torch.device('cuda:{}'.format(args.gpu))
    else:
        device = torch.device("cuda")
# elif torch.backends.mps.is_available():
#     device = torch.device("mps")
else:
    device = torch.device("cpu")
...
...
def run_validate(loader, base_progress=0):
    ...
    if args.gpu is not None and torch.cuda.is_available():
        images = images.cuda(args.gpu, non_blocking=True)
    # if torch.backends.mps.is_available():
    #     images = images.to('mps')
    #     target = target.to('mps')
    if torch.cuda.is_available():
        target = target.cuda(args.gpu, non_blocking=True)
...
...
def all_reduce(self):
    if torch.cuda.is_available():
        device = torch.device("cuda")
    # elif torch.backends.mps.is_available():
    #     device = torch.device("mps")
    else:
        device = torch.device("cpu")
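To locate the lines to edit, a small helper can list every line of a script that mentions "torch.backends.mps". This is a minimal sketch: `find_mps_references` and the `sample` text are hypothetical names introduced for this illustration and are not part of main.py or any Ascend tool.

```python
def find_mps_references(source_text, needle="torch.backends.mps"):
    """Return (line_number, stripped_line) pairs for lines containing `needle`.

    Line numbers are 1-based so they match what an editor displays.
    """
    return [
        (number, line.strip())
        for number, line in enumerate(source_text.splitlines(), start=1)
        if needle in line
    ]


# Hypothetical excerpt standing in for the contents of the training script.
sample = (
    "if torch.cuda.is_available():\n"
    "    device = torch.device('cuda')\n"
    "elif torch.backends.mps.is_available():\n"
    "    device = torch.device('mps')\n"
)

hits = find_mps_references(sample)
for number, text in hits:
    print(f"line {number}: {text}")
```

The helper reports only the lines that contain the search string. The indented body under each reported branch must be commented out as well. Where the string appears inside a compound condition, as in the first `if` of the excerpt, only that part of the condition is removed.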
Upload the customized main.py file to the server, for example to the "/home/sample" directory.
The dataset can be obtained from the official ImageNet website: https://www.image-net.org/
export PYTHONPATH={CANN installation directory}/ascend-toolkit/latest/tools/ms_fmk_transplt/torch_npu_bridge:$PYTHONPATH
Add the following import to main.py to enable automatic migration:
import torch
import torch_npu
.....
from torch_npu.contrib import transfer_to_npu
In the directory that contains main.py, run the following command to launch the training script. Migration takes place while the script runs.
python3 main.py /home/sample/data/resnet50/imagenet --batch-size 128 --lr 0.1 --epochs 1 --arch resnet50 --world-size 1 --rank 0 --workers 40 --momentum 0.9 --weight-decay 1e-4 --gpu 0
The key parameters are described below:
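As a reading aid, the flags in the command above can be expressed as an `argparse` sketch. It assumes that main.py follows the argument layout of the upstream PyTorch ImageNet example. The help strings are paraphrased for illustration, the defaults are assumptions, and `build_parser` is a hypothetical name; the argument parser in main.py itself remains the authoritative definition.

```python
import argparse


def build_parser():
    """Hypothetical parser mirroring the flags used in the launch command."""
    parser = argparse.ArgumentParser(description="ImageNet training (sketch)")
    parser.add_argument("data", help="path to the dataset root directory")
    parser.add_argument("--arch", default="resnet18", help="model architecture name")
    parser.add_argument("--workers", type=int, default=4, help="number of data-loading workers")
    parser.add_argument("--epochs", type=int, default=90, help="number of epochs to train")
    parser.add_argument("--batch-size", type=int, default=256, help="mini-batch size")
    parser.add_argument("--lr", type=float, default=0.1, help="initial learning rate")
    parser.add_argument("--momentum", type=float, default=0.9, help="SGD momentum")
    parser.add_argument("--weight-decay", type=float, default=1e-4, help="weight decay")
    parser.add_argument("--world-size", type=int, default=-1, help="number of nodes for distributed training")
    parser.add_argument("--rank", type=int, default=-1, help="rank of this node for distributed training")
    parser.add_argument("--gpu", type=int, default=None, help="id of the device to use")
    return parser


# Parse the same arguments that the launch command passes to main.py.
command_line = (
    "/home/sample/data/resnet50/imagenet --batch-size 128 --lr 0.1 --epochs 1 "
    "--arch resnet50 --world-size 1 --rank 0 --workers 40 --momentum 0.9 "
    "--weight-decay 1e-4 --gpu 0"
)
args = build_parser().parse_args(command_line.split())
print(vars(args))
```

In the command above, `--epochs 1` keeps the run short, which suits a check that the migrated script trains at all.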
If the "checkpoint.pth.tar" weight file is generated when training finishes, the migration and training succeeded.
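The success check can be automated with a few lines of Python. This is a minimal sketch that assumes the checkpoint is written to the directory from which main.py was launched; `training_succeeded` is a hypothetical helper name, not part of the training script.

```python
from pathlib import Path


def training_succeeded(workdir, checkpoint_name="checkpoint.pth.tar"):
    """Return True if the checkpoint file exists in `workdir`."""
    return (Path(workdir) / checkpoint_name).is_file()


if __name__ == "__main__":
    # "/home/sample" is the example upload directory used earlier.
    print(training_succeeded("/home/sample"))
```

A checkpoint left over from an earlier run would also satisfy this check, so removing or renaming old checkpoint files before launching training makes the result more reliable.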