Principle

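Compared with x86 servers, ARM servers usually have more CPU cores but weaker single-core performance, so they are more likely to trigger the kernel's load-balancing policy, which relieves busy processors by migrating processes between cores. Process migration causes context switches, lowers the cache hit rate, and leads to cross-NUMA memory access, all of which degrade training performance. The problem typically appears in multi-card scenarios.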
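To make the idea of core binding concrete, here is a minimal Python sketch that pins the calling process to a chosen set of CPUs through `os.sched_setaffinity`, the Linux-only interface that `taskset` also relies on. The helper name `pin_to_cpus` is hypothetical and is not part of the procedure described in this document.

```python
import os

def pin_to_cpus(cpus):
    """Restrict the calling process to the given CPU ids (Linux only)."""
    allowed = os.sched_getaffinity(0)   # CPUs this process may currently use
    target = set(cpus) & allowed        # never request a CPU we are not allowed on
    if not target:
        raise ValueError("none of the requested CPUs are available")
    os.sched_setaffinity(0, target)     # pid 0 means the calling process
    return os.sched_getaffinity(0)

if __name__ == "__main__":
    first_cpu = min(os.sched_getaffinity(0))
    print(pin_to_cpus([first_cpu]))
```

Once a process is pinned this way, the scheduler can no longer migrate it to cores outside the set, which is the effect the rest of this document achieves with `taskset`.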

Identifying the problem

During multi-card training, use the htop command to check the usage of each CPU core.
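Where htop is not available, a similar per-core view can be approximated by sampling `/proc/stat` twice. The sketch below is only an illustration of that idea; the function names `parse_proc_stat` and `core_utilization` are hypothetical, and treating idle plus iowait as "not busy" is an assumption.

```python
import time

def parse_proc_stat(text):
    """Return {core_name: (busy_jiffies, total_jiffies)} from /proc/stat content."""
    stats = {}
    for line in text.splitlines():
        parts = line.split()
        # Per-core lines look like "cpu0 ..."; skip the aggregate "cpu" line.
        if len(parts) < 6 or not parts[0].startswith("cpu") or parts[0] == "cpu":
            continue
        values = [int(v) for v in parts[1:9]]   # user .. steal
        idle = values[3] + values[4]            # idle + iowait
        total = sum(values)
        stats[parts[0]] = (total - idle, total)
    return stats

def core_utilization(before, after):
    """Busy fraction per core between two parse_proc_stat() samples."""
    usage = {}
    for core, (busy1, total1) in after.items():
        busy0, total0 = before.get(core, (0, 0))
        elapsed = total1 - total0
        usage[core] = (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return usage

if __name__ == "__main__":
    with open("/proc/stat") as f:
        first = parse_proc_stat(f.read())
    time.sleep(1.0)
    with open("/proc/stat") as f:
        second = parse_proc_stat(f.read())
    usage = core_utilization(first, second)
    for core, frac in sorted(usage.items(), key=lambda kv: int(kv[0][3:])):
        print(f"{core}: {frac:.0%}")
```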


Procedure

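In multi-card scenarios, core binding is implemented by launching the processes from a shell script. If the original Python script starts its processes with a launcher such as torch.distributed.launch (or, as in the example below, torch.multiprocessing.spawn), the model script must be modified so that the shell script can start it once per process in a for loop, using the taskset command to assign CPU cores to each process.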

The following uses main.py, the ImageNet training script provided by PyTorch, as an example to show how to bind multiple processes to cores.

First, configure the environment variable.

export PYTHONPATH={CANN installation directory}/ascend-toolkit/latest/tools/ms_fmk_transplt/torch_npu_bridge:$PYTHONPATH

Import the following libraries in the training script.
```
import torch_npu
import transfer_to_npu
```

Modify the original script.

The training script launches its processes with torch.multiprocessing.spawn (mp.spawn) in the entry function main(). The relevant code is as follows:

    .....
    if args.multiprocessing_distributed:
        # Since we have ngpus_per_node processes per node, the total world_size
        # needs to be adjusted accordingly
        args.world_size = ngpus_per_node * args.world_size
        # Use torch.multiprocessing.spawn to launch distributed processes: the
        # main_worker process function
        mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
    else:
        # Simply call main_worker function
        main_worker(args.gpu, ngpus_per_node, args)
    .....

Consider this together with the code in main_worker() that initializes collective communication.

    if args.distributed:
        if args.dist_url == "env://" and args.rank == -1:
            args.rank = int(os.environ["RANK"])
        if args.multiprocessing_distributed:
            # For multiprocessing distributed training, rank needs to be the
            # global rank among all the processes
            args.rank = args.rank * ngpus_per_node + gpu
        dist.init_process_group(backend=args.dist_backend, init_method=args.dist_url,
                                world_size=args.world_size, rank=args.rank)

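Here args.rank can be understood as the index of the training machine (node). For single-machine multi-card training it is 0, so args.rank * ngpus_per_node + gpu reduces to gpu, and the rank passed to dist.init_process_group is simply the args.gpu value passed to main_worker().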
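The rank arithmetic can be checked with a small standalone sketch. The function name `global_rank` and the two-machine case are illustrative assumptions; only the formula itself comes from the training script.

```python
def global_rank(node_rank, ngpus_per_node, local_device):
    """Global rank of a process, mirroring args.rank * ngpus_per_node + gpu."""
    return node_rank * ngpus_per_node + local_device

if __name__ == "__main__":
    # Single machine (node_rank 0) with 8 cards: global rank equals the card index.
    print([global_rank(0, 8, gpu) for gpu in range(8)])
    # Second machine (node_rank 1) of a hypothetical two-node job.
    print([global_rank(1, 8, gpu) for gpu in range(8)])
```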

Launching the processes from a shell script therefore amounts to running main() once in every process, so the code in main() needs to be changed as follows:

    .....
    if args.multiprocessing_distributed:
        # Since we have ngpus_per_node processes per node, the total world_size
        # needs to be adjusted accordingly
        args.world_size = ngpus_per_node * args.world_size
        # Use torch.multiprocessing.spawn to launch distributed processes: the
        # main_worker process function
        # Comment out the original mp.spawn multi-process launch
        # mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
        main_worker(args.gpu, ngpus_per_node, args)
    else:
        # Simply call main_worker function
        main_worker(args.gpu, ngpus_per_node, args)
    .....

The Python command that starts the training script must include the --gpu argument (args.gpu) so that main() is started for each card.

Modify the original training launch command.

The original single-machine multi-card launch command is as follows.

python main.py -a resnet50 --dist-url 'tcp://127.0.0.1:FREEPORT' --dist-backend 'nccl' --multiprocessing-distributed --world-size 1 --rank 0 [imagenet-folder with train and val folders]

Create a shell script named run.sh. Based on the changes described in "Modify the original script", its contents are as follows:

RANK_ID_START=0
RANK_SIZE=8
# Number of CPU cores bound to each process
KERNEL_NUM=$(($(nproc)/RANK_SIZE))

for((RANK_ID=$RANK_ID_START;RANK_ID<$((RANK_SIZE+RANK_ID_START));RANK_ID++));
do

# First CPU core bound to the current Python process
PID_START=$((KERNEL_NUM * RANK_ID))
# Last CPU core bound to the current Python process
PID_END=$((PID_START + KERNEL_NUM - 1))

# Use taskset to bind the Python process to this range of CPU cores.
taskset -c $PID_START-$PID_END python3 main.py -a resnet50 --dist-url 'tcp://127.0.0.1:FREEPORT' --dist-backend 'hccl' --multiprocessing-distributed --world-size 1 --rank 0 [imagenet-folder with train and val folders] --gpu $RANK_ID &
done

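In the command above, the [imagenet-folder with train and val folders] argument must be replaced with the dataset path. After the training-related environment variables have been set, run the shell script to start the training with each process bound to its cores.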
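The core-range arithmetic in run.sh can be previewed before launching anything. The following sketch mirrors the shell computation in Python; `core_range` and `taskset_prefix` are hypothetical helper names, and the 192-core machine is an assumed example value.

```python
def core_range(total_cores, rank_size, rank_id):
    """Return the inclusive (start, end) CPU ids assigned to one rank."""
    cores_per_rank = total_cores // rank_size   # same integer division as the shell script
    if cores_per_rank == 0:
        raise ValueError("fewer CPU cores than processes")
    start = cores_per_rank * rank_id
    return start, start + cores_per_rank - 1

def taskset_prefix(total_cores, rank_size, rank_id):
    """Build the 'taskset -c START-END' prefix placed in front of the python3 command."""
    start, end = core_range(total_cores, rank_size, rank_id)
    return f"taskset -c {start}-{end}"

if __name__ == "__main__":
    # 192 cores is an assumed example; on a real machine use os.cpu_count().
    for rank_id in range(8):
        print(rank_id, taskset_prefix(192, 8, rank_id))
```

Because the ranges are contiguous and do not overlap, no two training processes compete for the same core.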

After the training is started, htop shows that CPU usage is evenly balanced across the cores.
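As an additional check beyond htop, the binding of a running process can be read from the `Cpus_allowed_list` field of `/proc/<pid>/status` on Linux. The sketch below is one possible way to do this; the helper names `parse_cpu_list` and `allowed_cpus` are hypothetical.

```python
import os

def parse_cpu_list(text):
    """Expand a CPU list such as '0-3,8,10-11' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            cpus.update(range(int(low), int(high) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def allowed_cpus(pid):
    """Read the Cpus_allowed_list field of /proc/<pid>/status (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        for line in f:
            if line.startswith("Cpus_allowed_list:"):
                return parse_cpu_list(line.split(":", 1)[1])
    return []

if __name__ == "__main__":
    # Inspect the current process; pass the PID of a training process instead.
    print(allowed_cpus(os.getpid()))
```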
