{ASCEND_DRIVER_PATH}/tools/hccn_tool -i {device_id} -ip -g
In the output of the query command, the "ipaddr" field is the IP address of the corresponding NPU card, {device_ip}.
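When the IP addresses of several cards have to be collected, the "ipaddr" field can be extracted programmatically. The following is a minimal sketch, assuming the command output contains a line of the form `ipaddr:<address>`; the function name `parse_device_ip` and the sample text are hypothetical and only illustrate the idea.

```python
import re


def parse_device_ip(hccn_output: str) -> str:
    """Return the value of the 'ipaddr' field found in the given text."""
    # Assumption: the field appears on its own line, e.g. 'ipaddr:192.0.2.101'.
    match = re.search(r"^\s*ipaddr\s*:\s*(\S+)\s*$", hccn_output, re.MULTILINE)
    if match is None:
        raise ValueError("no ipaddr field found in the output")
    return match.group(1)


# Hypothetical sample text, not captured from a real device.
sample = "ipaddr:192.0.2.101\nnetmask:255.255.255.0\n"
print(parse_device_ip(sample))
```

If the real output uses a different layout, only the regular expression needs to be adjusted.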
{
  "server_count": "1",
  "server_list": [
    {
      "device": [
        { "device_id": "0", "device_ip": "{device_0_ip}", "rank_id": "0" },
        { "device_id": "1", "device_ip": "{device_1_ip}", "rank_id": "1" },
        { "device_id": "2", "device_ip": "{device_2_ip}", "rank_id": "2" },
        { "device_id": "3", "device_ip": "{device_3_ip}", "rank_id": "3" },
        { "device_id": "4", "device_ip": "{device_4_ip}", "rank_id": "4" },
        { "device_id": "5", "device_ip": "{device_5_ip}", "rank_id": "5" },
        { "device_id": "6", "device_ip": "{device_6_ip}", "rank_id": "6" },
        { "device_id": "7", "device_ip": "{device_7_ip}", "rank_id": "7" }
      ],
      "server_id": "{host_ip}"
    }
  ],
  "status": "completed",
  "version": "1.0"
}
A sample single-server, single-card configuration is shown below.
{
  "server_count": "1",
  "server_list": [
    {
      "device": [
        { "device_id": "0", "device_ip": "{device_0_ip}", "rank_id": "0" }
      ],
      "server_id": "{host_ip}"
    }
  ],
  "status": "completed",
  "version": "1.0"
}
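Writing these files by hand is error-prone once more cards are involved. The following is a minimal sketch of generating the same structure from a list of device IP addresses. The function name `build_rank_table` and the example addresses are hypothetical, and the sketch assumes that device IDs on each server are numbered contiguously from 0.

```python
import json


def build_rank_table(servers):
    """Build a rank table dict.

    servers: list of (host_ip, [device_ip, ...]) tuples, main node first.
    """
    server_list = []
    rank_id = 0  # rank_id keeps increasing across servers
    for host_ip, device_ips in servers:
        devices = []
        # Assumption: device_id starts at 0 on every server.
        for device_id, device_ip in enumerate(device_ips):
            devices.append({
                "device_id": str(device_id),
                "device_ip": device_ip,
                "rank_id": str(rank_id),
            })
            rank_id += 1
        server_list.append({"device": devices, "server_id": host_ip})
    return {
        "server_count": str(len(server_list)),
        "server_list": server_list,
        "status": "completed",
        "version": "1.0",
    }


# Hypothetical addresses for one server with 8 cards.
table = build_rank_table(
    [("192.0.2.10", ["198.51.100.%d" % i for i in range(1, 9)])]
)
print(json.dumps(table, indent=2))
```

Passing a single device IP yields the single-card layout, and passing two servers yields a two-server layout with rank IDs continuing from the first server.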
bash run.sh main.py
When execution starts normally, output similar to the following is printed.
The ranktable solution
RANK_TABLE_FILE=/xxx/example/little_demo/hccl_json_8p.json
py is main.py
use horovod to start tasks
...
When execution completes, the log shows output similar to the following.
ASC manager has been destroyed.
MPI has been destroyed.
Demo done!
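To start a training task with this scheme, set the following environment variables. For a description of each variable, see "Configuring Environment Variables".
CM_CHIEF_IP={host_ip}
CM_CHIEF_PORT=60000
CM_CHIEF_DEVICE=0
CM_WORKER_IP={host_ip}
CM_WORKER_SIZE=8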
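The variables above can also be produced from a single host IP, which helps keep the chief and worker addresses consistent on a single server. The following is a minimal sketch; the function name `cm_env` and the example address are hypothetical, and the default values simply mirror the list above.

```python
import shlex


def cm_env(host_ip, worker_size=8, chief_port=60000, chief_device=0):
    """Return the CM_* variables for a single-server run as a dict of strings."""
    return {
        "CM_CHIEF_IP": host_ip,
        "CM_CHIEF_PORT": str(chief_port),
        "CM_CHIEF_DEVICE": str(chief_device),
        "CM_WORKER_IP": host_ip,  # same host on a single server
        "CM_WORKER_SIZE": str(worker_size),
    }


# Print export lines that can be pasted into a shell (hypothetical address).
for name, value in cm_env("192.0.2.10").items():
    print(f"export {name}={shlex.quote(value)}")
```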
The following example assumes that 8 cards are visible in the environment, with the logical card ID list [0,1,2,3,4,5,6,7].
bash run.sh main.py {host_ip}
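When execution starts normally, output similar to the following is printed.
ip: {host_ip} available.
The ranktable solution is removed.
CM_CHIEF_IP={host_ip}
CM_CHIEF_PORT=60000
CM_CHIEF_DEVICE=0
CM_WORKER_IP={host_ip}
CM_WORKER_SIZE=8
ASCEND_VISIBLE_DEVICES=0-8
py is main.py
use horovod to start tasks
...
When execution completes, the log shows output similar to the following.
ASC manager has been destroyed.
MPI has been destroyed.
Demo done!
Multi-server training is configured in the following main steps: download the Rec SDK image and create a container, modify the SSH configuration and start the service, configure passwordless login between the physical nodes, and finally configure the little_demo model and launch the training processes. With two servers, either node can be chosen as the main node.
Prerequisites: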
ss -tuln | grep 12345
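If the ss command is not available, install the iproute package.
vi /etc/ssh/sshd_config
Skipping this step does not affect cluster training, but the SSH service then listens on the corresponding port on the host's all-zero IP address. For security reasons, making this change is recommended.
cd /etc/ssh/ && ssh-keygen -A
/usr/sbin/sshd
Run ss -tuln | grep 12345 to check whether the listening port is the configured one.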
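The ss check only inspects the local socket table. To confirm from another node that the port is actually reachable, a plain TCP connection attempt can be used instead. The following is a minimal sketch of that different technique; the function name `is_port_open` is hypothetical, and port 12345 is the example port used in this section.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# Replace 127.0.0.1 with the node address to test from a remote machine.
print(is_port_open("127.0.0.1", 12345))
```

A successful connection only shows that something accepts connections on that port; it does not prove that the listener is the SSH service.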
To stop the SSH service in the container, run the following command inside the container:
kill -9 `ps -ef | grep sshd | grep -v grep | awk '{print $2}'` > /dev/null 2>&1
Add the environment variables that must be configured before training to the ~/.bashrc file in the container. They are then available as soon as the main node logs in to the other nodes' containers without a password, and no further environment setup is needed.
An example follows; configure it as needed:
vi ~/.bashrc
source /etc/profile
source /usr/local/Ascend/ascend-toolkit/set_env.sh
source /usr/local/Ascend/driver/bin/setenv.bash
source /usr/local/Ascend/tfplugin/set_env.sh
export PATH=/usr/local/openmpi/bin:$PATH
export PATH=/usr/local/python3.7.5/bin:$PATH
export PATH=/usr/local/gcc7.3.0/bin:$PATH
export LD_LIBRARY_PATH=/usr/local/python3.7.5/lib:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/gcc7.3.0/lib64:$LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH
for i in {0..7}; do hccn_tool -i $i -ip -g ; done
The example below is the two-server configuration file "hccl_json_16p_2_host.json". The main node's device information must be placed in the first device entry. Replace {device_ip} and {host_ip} with the values from the actual environment, and keep rank_id in ascending order throughout.
{
  "server_count": "2",
  "server_list": [
    {
      "device": [
        { "device_id": "0", "device_ip": "{device_0_ip}", "rank_id": "0" },
        { "device_id": "1", "device_ip": "{device_1_ip}", "rank_id": "1" },
        { "device_id": "2", "device_ip": "{device_2_ip}", "rank_id": "2" },
        { "device_id": "3", "device_ip": "{device_3_ip}", "rank_id": "3" },
        { "device_id": "4", "device_ip": "{device_4_ip}", "rank_id": "4" },
        { "device_id": "5", "device_ip": "{device_5_ip}", "rank_id": "5" },
        { "device_id": "6", "device_ip": "{device_6_ip}", "rank_id": "6" },
        { "device_id": "7", "device_ip": "{device_7_ip}", "rank_id": "7" }
      ],
      "server_id": "{host_1_ip}"
    },
    {
      "device": [
        { "device_id": "0", "device_ip": "{device_8_ip}", "rank_id": "8" },
        { "device_id": "1", "device_ip": "{device_9_ip}", "rank_id": "9" },
        { "device_id": "2", "device_ip": "{device_10_ip}", "rank_id": "10" },
        { "device_id": "3", "device_ip": "{device_11_ip}", "rank_id": "11" },
        { "device_id": "4", "device_ip": "{device_12_ip}", "rank_id": "12" },
        { "device_id": "5", "device_ip": "{device_13_ip}", "rank_id": "13" },
        { "device_id": "6", "device_ip": "{device_14_ip}", "rank_id": "14" },
        { "device_id": "7", "device_ip": "{device_15_ip}", "rank_id": "15" }
      ],
      "server_id": "{host_2_ip}"
    }
  ],
  "status": "completed",
  "version": "1.0"
}
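Before launching a multi-server run, it can help to sanity-check the edited file against the two rules stated here: the server count must match the server list, and rank_id must keep ascending. The following is a minimal sketch; the function name `check_rank_table` and the example addresses are hypothetical, and the checks cover only these two rules.

```python
import json


def check_rank_table(table):
    """Return a list of problems found in a rank table dict (empty if none)."""
    problems = []
    servers = table["server_list"]
    if int(table["server_count"]) != len(servers):
        problems.append("server_count does not match the number of servers")
    # Collect rank IDs in file order, across all servers.
    rank_ids = [int(dev["rank_id"]) for srv in servers for dev in srv["device"]]
    if any(a >= b for a, b in zip(rank_ids, rank_ids[1:])):
        problems.append("rank_id values are not strictly ascending")
    return problems


# Hypothetical two-server table with one card per server.
example = json.loads("""
{
  "server_count": "2",
  "server_list": [
    {"device": [{"device_id": "0", "device_ip": "198.51.100.1", "rank_id": "0"}],
     "server_id": "192.0.2.10"},
    {"device": [{"device_id": "0", "device_ip": "198.51.100.2", "rank_id": "1"}],
     "server_id": "192.0.2.11"}
  ],
  "status": "completed",
  "version": "1.0"
}
""")
print(check_rank_table(example))
```

To check a real file, load it with `json.load` and pass the resulting dict to the same function.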
export RANK_TABLE_FILE="${cur_path}/hccl_json_${local_rank_size}p.json"
Change it to:
export RANK_TABLE_FILE="${cur_path}/hccl_json_16p_2_host.json"
xxx --mpi-args "${mpi_args}" --mpi -H localhost:${local_rank_size}
Change it to:
xxx --mpi-args "${mpi_args}" -p 12345 --mpi -H {host_1_ip}:8,{host_2_ip}:8
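The value passed to -H is a comma-separated list of host:slots pairs, one per node. The following is a minimal sketch of assembling that string for an arbitrary number of nodes; the function name `mpi_host_arg` and the example addresses are hypothetical, and 8 slots per host matches the 8-card servers used in this section.

```python
def mpi_host_arg(hosts, slots_per_host=8):
    """Join host IPs into the 'host:slots,host:slots' form used after -H."""
    return ",".join(f"{host}:{slots_per_host}" for host in hosts)


# Main node first, then the other nodes (hypothetical addresses).
print(mpi_host_arg(["192.0.2.10", "192.0.2.11"]))
```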
ssh-copy-id -i ~/.ssh/id_rsa.pub {target_host_user}@{target_host_ip}
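If a message reports that id_rsa.pub does not exist, generate it with the following command:
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
ssh-keygen -R {target_host_ip}  # Remove the cached host key for the target IP on the current node
ssh-keygen -R "[{target_host_ip}]:12345"  # Remove the cached host key for the target IP and port on the current host
ssh {target_host_user}@{target_host_ip}  # Verify passwordless login to the target IP
ssh {target_host_user}@{target_host_ip} -p 12345  # Verify passwordless login to the target IP on the specified port
bash run.sh main.py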
bash run.sh main.py {host_1_ip}