Compared with x86 servers, ARM servers usually have more CPU cores but weaker single-core performance, so they trigger the kernel's load-balancing policy more easily. This policy relieves pressure on busy processors by migrating processes between cores. Process migration causes extra context switches, lowers cache hit rates, and introduces cross-NUMA memory accesses, all of which degrade training performance. The problem typically appears in multi-card scenarios.
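How much cross-NUMA access matters depends on the machine's topology, so before binding cores it helps to check how many NUMA nodes exist and which cores belong to each. A minimal sketch (lscpu ships with util-linux; numactl may need to be installed separately):

# Total core count and NUMA layout
lscpu | grep -E "^CPU\(s\)|NUMA"
# Per-node core ranges and memory sizes
numactl --hardware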
During multi-card training, use the htop command to check the CPU usage of each core.
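htop gives an interactive per-core view; if a scriptable snapshot is preferred, mpstat from the sysstat package reports per-core utilization (a minimal sketch, with the sampling interval and count chosen arbitrarily):

# Interactive per-core view
htop
# Non-interactive alternative: per-core utilization, sampled every second, 5 samples
mpstat -P ALL 1 5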
In the multi-card scenario, core binding is implemented by launching the worker processes from a shell script. If the original Python script starts its workers internally, e.g. via torch.multiprocessing.spawn (as in the example below) or torch.distributed.launch, the model script must be modified so that each worker can be started individually; the shell script then launches one process per card in a for loop and uses the taskset command to assign each process its own range of CPU cores.
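The PYTHONPATH export and the two import statements below enable the transfer_to_npu automatic-migration bridge shipped with the CANN toolkit, so that the CUDA-oriented training script can run on NPU devices.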
export PYTHONPATH={CANN installation directory}/ascend-toolkit/latest/tools/ms_fmk_transplt/torch_npu_bridge:$PYTHONPATH
import torch_npu
import transfer_to_npu
.....
if args.multiprocessing_distributed:
    # Since we have ngpus_per_node processes per node, the total world_size
    # needs to be adjusted accordingly
    args.world_size = ngpus_per_node * args.world_size
    # Use torch.multiprocessing.spawn to launch distributed processes: the
    # main_worker process function
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
else:
    # Simply call main_worker function
    main_worker(args.gpu, ngpus_per_node, args)
.....
Next, look at the code in the training script's main_worker() function that initializes collective communication.
if args.distributed:
    if args.dist_url == "env://" and args.rank == -1:
        args.rank = int(os.environ["RANK"])
    if args.multiprocessing_distributed:
        # For multiprocessing distributed training, rank needs to be the
        # global rank among all the processes
        args.rank = args.rank * ngpus_per_node + gpu
    dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                            world_size=args.world_size, rank=args.rank)
Here args.rank can be understood as the index of the training device within the job. For single-node multi-card training, the initial args.rank (the node rank) is 0, so after args.rank = args.rank * ngpus_per_node + gpu the rank passed to the collective-communication initialization is simply the gpu argument handed to main_worker(); with 8 cards per node, for instance, the process driving card 5 ends up with rank 0 * 8 + 5 = 5.
.....
if args.multiprocessing_distributed:
    # Since we have ngpus_per_node processes per node, the total world_size
    # needs to be adjusted accordingly
    args.world_size = ngpus_per_node * args.world_size
    # Comment out the original mp.spawn multi-process launch: the worker
    # processes are now started individually by the shell script, so call
    # main_worker() directly
    # mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
    main_worker(args.gpu, ngpus_per_node, args)
else:
    # Simply call main_worker function
    main_worker(args.gpu, ngpus_per_node, args)
.....
We need to add the gpu argument (args.gpu) to the Python command that launches the training script, so that a main() instance is started for each card. For reference, the original launch command of the example is:
python main.py -a resnet50 --dist-url 'tcp://127.0.0.1:FREEPORT' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 [imagenet-folder with train and val folders]
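Compared with this original command, the shell script below starts one process per card in a loop, binds each process to its own range of CPU cores with taskset, switches the communication backend from 'nccl' to 'hccl', and appends --gpu $RANK_ID so that each process drives exactly one card.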
RANK_ID_START=0
RANK_SIZE=8
# Number of CPU cores bound to each process
KERNEL_NUM=$(($(nproc)/8))
for((RANK_ID=$RANK_ID_START;RANK_ID<$((RANK_SIZE+RANK_ID_START));RANK_ID++));
do
    # First CPU core bound to the current Python process
    PID_START=$((KERNEL_NUM * RANK_ID))
    # Last CPU core bound to the current Python process
    PID_END=$((PID_START + KERNEL_NUM - 1))
    # Use taskset to pin the Python process to its CPU range
    taskset -c $PID_START-$PID_END python3 main.py -a resnet50 --dist-url 'tcp://127.0.0.1:FREEPORT' --dist-backend 'hccl' --multiprocessing-distributed --world-size 1 --rank 0 [imagenet-folder with train and val folders] --gpu $RANK_ID &
done
In the command above, the [imagenet-folder with train and val folders] placeholder must be replaced with the path to the user's dataset. After the training-related environment variables have been set, running this shell script launches the multi-process training with each process bound to its own CPU cores.
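To confirm that the binding took effect, taskset can also query the affinity of the running training processes (a minimal sketch; the pgrep pattern assumes the workers were started as "python3 main.py"):

# Print the CPU affinity of every running training process
for pid in $(pgrep -f "python3 main.py"); do
    taskset -cp $pid
done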