Manual Migration for Single-Card Training
Obtain the main.py script as described in the sample code instructions and comment out the code related to the mps module, then perform the following migration steps.
- Import the torch_npu module in the main.py script.
import torch
import torch_npu
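Once both imports are in place, the "npu" device type becomes available to standard torch calls. The following minimal sanity-check sketch is not part of main.py; it assumes torch_npu is installed and at least one NPU is visible:
import torch
import torch_npu  # registers the "npu" device type with torch

print(torch.__version__)
print(torch_npu.npu.is_available())   # True if an NPU device is usable

x = torch.ones(2, 3).npu()            # move a tensor to the default NPU
print(x.device)                       # e.g. npu:0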
- Change the node's GPU count to the NPU count.
Original code:
if torch.cuda.is_available():
    ngpus_per_node = torch.cuda.device_count()
else:
    ngpus_per_node = 1
Modified code:
if torch_npu.npu.is_available():
    ngpus_per_node = torch_npu.npu.device_count()
else:
    ngpus_per_node = 1
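To see this step in isolation, the short sketch below (an illustration, not part of main.py) prints the detected device count and binds the current process to the first NPU, using only the interfaces shown in this section:
import torch
import torch_npu

if torch_npu.npu.is_available():
    ngpus_per_node = torch_npu.npu.device_count()   # NPUs visible on this node
    torch_npu.npu.set_device(0)                     # bind this process to NPU 0
else:
    ngpus_per_node = 1                              # CPU-only fallback

print("devices per node:", ngpus_per_node)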
- Migrate the model and the loss function to the Ascend AI Processor for computation, replacing every CUDA interface with its NPU counterpart.
Original code:
if not torch.cuda.is_available() and not torch.backends.mps.is_available():
    print('using CPU, this will be slow')
elif args.distributed:
    # For multiprocessing distributed, DistributedDataParallel constructor
    # should always set the single device scope, otherwise,
    # DistributedDataParallel will use all available devices.
    if torch.cuda.is_available():
        if args.gpu is not None:
            torch.cuda.set_device(args.gpu)
            model.cuda(args.gpu)
            # When using a single GPU per process and per
            # DistributedDataParallel, we need to divide the batch size
            # ourselves based on the total number of GPUs of the current node.
            args.batch_size = int(args.batch_size / ngpus_per_node)
            args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
            model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
        else:
            model.cuda()
            # DistributedDataParallel will divide and allocate batch_size to all
            # available GPUs if device_ids are not set
            model = torch.nn.parallel.DistributedDataParallel(model)
elif args.gpu is not None and torch.cuda.is_available():
    torch.cuda.set_device(args.gpu)
    model = model.cuda(args.gpu)
#elif torch.backends.mps.is_available():
#    device = torch.device("mps")
#    model = model.to(device)
else:
    # DataParallel will divide and allocate batch_size to all available GPUs
    if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
        model.features = torch.nn.DataParallel(model.features)
        model.cuda()
    else:
        model = torch.nn.DataParallel(model).cuda()

if torch.cuda.is_available():
    if args.gpu:
        device = torch.device('cuda:{}'.format(args.gpu))
    else:
        device = torch.device("cuda")
#elif torch.backends.mps.is_available():
#    device = torch.device("mps")
else:
    device = torch.device("cpu")
# define loss function (criterion), optimizer, and learning rate scheduler
criterion = nn.CrossEntropyLoss().to(device)
Modified code (the mps branches stay commented out, as in the original code above):
if not torch_npu.npu.is_available() and not torch.backends.mps.is_available():
    print('using CPU, this will be slow')
elif args.distributed:
    # For multiprocessing distributed, DistributedDataParallel constructor
    # should always set the single device scope, otherwise,
    # DistributedDataParallel will use all available devices.
    if torch_npu.npu.is_available():
        if args.gpu is not None:
            torch_npu.npu.set_device(args.gpu)
            model.npu(args.gpu)
            # When using a single GPU per process and per
            # DistributedDataParallel, we need to divide the batch size
            # ourselves based on the total number of GPUs of the current node.
            args.batch_size = int(args.batch_size / ngpus_per_node)
            args.workers = int((args.workers + ngpus_per_node - 1) / ngpus_per_node)
            model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[args.gpu])
        else:
            model.npu()
            # DistributedDataParallel will divide and allocate batch_size to all
            # available GPUs if device_ids are not set
            model = torch.nn.parallel.DistributedDataParallel(model)
elif args.gpu is not None and torch_npu.npu.is_available():
    torch_npu.npu.set_device(args.gpu)
    model = model.npu(args.gpu)
#elif torch.backends.mps.is_available():
#    device = torch.device("mps")
#    model = model.to(device)
else:
    # DataParallel will divide and allocate batch_size to all available GPUs
    if args.arch.startswith('alexnet') or args.arch.startswith('vgg'):
        model.features = torch.nn.DataParallel(model.features)
        model.npu()
    else:
        model = torch.nn.DataParallel(model).npu()

if torch_npu.npu.is_available():
    if args.gpu:
        device = torch.device('npu:{}'.format(args.gpu))
    else:
        device = torch.device("npu")
#elif torch.backends.mps.is_available():
#    device = torch.device("mps")
else:
    device = torch.device("cpu")
# define loss function (criterion), optimizer, and learning rate scheduler
criterion = nn.CrossEntropyLoss().to(device)
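For a plain single-card run that does not go through DistributedDataParallel, the same migration boils down to the device-agnostic pattern below. This is a self-contained sketch with a toy model standing in for the real network; it is not taken verbatim from main.py:
import torch
import torch.nn as nn
import torch_npu

# Pick the target device: NPU 0 if available, otherwise fall back to CPU.
if torch_npu.npu.is_available():
    torch_npu.npu.set_device(0)
    device = torch.device("npu:0")
else:
    device = torch.device("cpu")

model = nn.Linear(16, 4).to(device)            # toy model in place of the ResNet
criterion = nn.CrossEntropyLoss().to(device)   # keep the loss on the same device

x = torch.randn(8, 16, device=device)
target = torch.randint(0, 4, (8,), device=device)
print(criterion(model(x), target).item())      # one forward pass plus loss on the NPU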
- Replace the interfaces and device used for resuming training from a checkpoint with their NPU counterparts.
Original code:
if args.gpu is None:
    checkpoint = torch.load(args.resume)
elif torch.cuda.is_available():
    # Map model to be loaded to specified single gpu.
    loc = 'cuda:{}'.format(args.gpu)
    checkpoint = torch.load(args.resume, map_location=loc)
Modified code:
if args.gpu is None:
    checkpoint = torch.load(args.resume)
elif torch_npu.npu.is_available():
    # Map model to be loaded to specified single gpu.
    loc = 'npu:{}'.format(args.gpu)
    checkpoint = torch.load(args.resume, map_location=loc)
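To see the mapping in isolation, here is a self-contained sketch of saving a checkpoint and loading it back onto the NPU with map_location. The filename and dictionary key are illustrative assumptions, not values prescribed by main.py:
import torch
import torch.nn as nn
import torch_npu

device = torch.device("npu:0" if torch_npu.npu.is_available() else "cpu")
model = nn.Linear(16, 4).to(device)

# Save a minimal checkpoint.
torch.save({'state_dict': model.state_dict()}, 'checkpoint.pth.tar')

# Resume: map tensors saved from any device onto this NPU (or the CPU fallback).
loc = 'npu:0' if torch_npu.npu.is_available() else 'cpu'
checkpoint = torch.load('checkpoint.pth.tar', map_location=loc)
model.load_state_dict(checkpoint['state_dict'])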
- Migrate the dataset to the Ascend AI Processor for computation.
Code location: the run_validate() function inside validate() in main.py.
In the original code the data is loaded and computed on the GPU. Original code:
if args.gpu is not None and torch.cuda.is_available():
    images = images.cuda(args.gpu, non_blocking=True)
#if torch.backends.mps.is_available():
#    images = images.to('mps')
#    target = target.to('mps')
if torch.cuda.is_available():
    target = target.cuda(args.gpu, non_blocking=True)
Move the data to the NPU for computation instead. Modified code:
if args.gpu is not None and torch_npu.npu.is_available():
    images = images.npu(args.gpu, non_blocking=True)
#if torch.backends.mps.is_available():
#    images = images.to('mps')
#    target = target.to('mps')
if torch_npu.npu.is_available():
    target = target.npu(args.gpu, non_blocking=True)
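The same pattern applies anywhere a batch is consumed during training or validation. Below is a minimal sketch of such a loop with a placeholder DataLoader; the dataset, shapes, and pin_memory choice are assumptions for illustration, not the ImageNet pipeline from main.py:
import torch
import torch_npu
from torch.utils.data import DataLoader, TensorDataset

on_npu = torch_npu.npu.is_available()
dataset = TensorDataset(torch.randn(64, 3, 224, 224), torch.randint(0, 1000, (64,)))
loader = DataLoader(dataset, batch_size=16, pin_memory=on_npu)  # pinned host memory helps non-blocking copies

for images, target in loader:
    if on_npu:
        images = images.npu(0, non_blocking=True)   # asynchronous host-to-NPU copy
        target = target.npu(0, non_blocking=True)
    # ... forward pass, loss, and metric updates go here ...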
- Replace the CUDA interfaces and device in the loss-averaging function with NPU.
Code location: the all_reduce() function in class AverageMeter(object).
Original code:
def all_reduce(self):
    if torch.cuda.is_available():
        device = torch.device("cuda")
    ……
Modified code:
def all_reduce(self):
    if torch_npu.npu.is_available():
        device = torch.device("npu")
    ……
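For context, in the upstream PyTorch ImageNet example the omitted part of all_reduce() sums the meter's statistics across all workers and recomputes the average. The sketch below shows what the whole method typically looks like after the device switch; the tail is reconstructed from that upstream example (and assumes a distributed process group, e.g. with the HCCL backend, has already been initialized), so treat it as an assumption rather than the verbatim sample code:
import torch
import torch.distributed as dist
import torch_npu

class AverageMeter(object):
    """Tracks a running sum, count, and average of a metric."""
    def __init__(self):
        self.sum = 0.0
        self.count = 0.0
        self.avg = 0.0

    def all_reduce(self):
        if torch_npu.npu.is_available():
            device = torch.device("npu")
        else:
            device = torch.device("cpu")
        # Aggregate the local sum and count across all workers, then refresh the average.
        total = torch.tensor([self.sum, self.count], dtype=torch.float32, device=device)
        dist.all_reduce(total, dist.ReduceOp.SUM, async_op=False)
        self.sum, self.count = total.tolist()
        self.avg = self.sum / self.count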
- Run the training script to launch the training process, for example:
(The parameters below are examples; adjust them to your actual environment.)
python3 main.py /home/data/resnet50/imagenet --batch-size 128 \   # Training batch size; set it to a multiple of the number of processor cores where possible for better performance
    --lr 0.1 \             # Learning rate
    --epochs 90 \          # Number of training epochs
    --arch resnet50 \      # Model architecture
    --world-size 1 \
    --rank 0 \
    --workers 40 \         # Number of data-loading processes
    --momentum 0.9 \       # Momentum
    --weight-decay 1e-4 \  # Weight decay
    --gpu 0                # Device ID; the argument is still named gpu, but after migration the actual training device is set to NPU in the code
- After training, check whether a weight file has been generated; if such a file exists, the migrated training ran successfully.