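This section uses HCCL-Controller as an example to show how to build images for the containerized deployment of MindX DL components. The Dockerfile in the software package is for reference only; you can build customized images based on this example. You can also obtain prebuilt images of the MindX DL components, built by following the steps in this example, from the Ascend image repository; for details, see "Pulling Images from the Ascend Image Repository". After building an image, apply security hardening promptly, for example by fixing vulnerabilities in the base image and vulnerabilities introduced by installed third-party dependencies.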
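Component | Image name | Image tag | Node that pulls the image | Notes |
---|---|---|---|---|
Resilience-Controller | resilience-controller | v3.0.0 | Management node | A pulled image must be renamed before it can be deployed with the component's startup YAML; see step 3. |
Volcano | vc-controller-manager, vc-scheduler | v1.4.0-v3.0.0 | Management node | A pulled image must be renamed before it can be deployed with the component's startup YAML; see step 3. |
HCCL-Controller | hccl-controller | v3.0.0 | Management node | A pulled image must be renamed before it can be deployed with the component's startup YAML; see step 3. |
NodeD | noded | v3.0.0 | Compute node | A pulled image must be renamed before it can be deployed with the component's startup YAML; see step 3. |
NPU-Exporter | npu-exporter | v3.0.0 | Compute node | A pulled image must be renamed before it can be deployed with the component's startup YAML; see step 3. |
Ascend Device Plugin | ascend-k8sdeviceplugin | v3.0.0 | Compute node | A pulled image must be renamed before it can be deployed with the component's startup YAML; see step 3. |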
docker tag ascendhub.huawei.com/public-ascendhub/resilience-controller:v3.0.0 resilience-controller:v3.0.0
docker tag ascendhub.huawei.com/public-ascendhub/npu-exporter:v3.0.0 npu-exporter:v3.0.0
docker tag ascendhub.huawei.com/public-ascendhub/ascend-k8sdeviceplugin:v3.0.0 ascend-k8sdeviceplugin:v3.0.0
docker tag ascendhub.huawei.com/public-ascendhub/vc-controller-manager:v1.4.0-v3.0.0 volcanosh/vc-controller-manager:v1.4.0
docker tag ascendhub.huawei.com/public-ascendhub/vc-scheduler:v1.4.0-v3.0.0 volcanosh/vc-scheduler:v1.4.0
docker tag ascendhub.huawei.com/public-ascendhub/noded:v3.0.0 noded:v3.0.0
docker tag ascendhub.huawei.com/public-ascendhub/hccl-controller:v3.0.0 hccl-controller:v3.0.0
docker rmi ascendhub.huawei.com/public-ascendhub/resilience-controller:v3.0.0
docker rmi ascendhub.huawei.com/public-ascendhub/npu-exporter:v3.0.0
docker rmi ascendhub.huawei.com/public-ascendhub/ascend-k8sdeviceplugin:v3.0.0
docker rmi ascendhub.huawei.com/public-ascendhub/vc-controller-manager:v1.4.0-v3.0.0
docker rmi ascendhub.huawei.com/public-ascendhub/vc-scheduler:v1.4.0-v3.0.0
docker rmi ascendhub.huawei.com/public-ascendhub/noded:v3.0.0
docker rmi ascendhub.huawei.com/public-ascendhub/hccl-controller:v3.0.0
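The rename and clean-up commands above follow the same pattern for every image, so they can be generated rather than typed by hand. The following is a minimal sketch under that assumption; `REGISTRY` and `retag_commands` are illustrative names that are not part of MindX DL, and the function only prints the commands instead of running them.

```shell
#!/bin/sh
# Sketch: print the rename ("docker tag") and clean-up ("docker rmi") commands
# for one pulled image. Nothing is executed here.
# REGISTRY and retag_commands are illustrative names, not part of MindX DL.
REGISTRY="ascendhub.huawei.com/public-ascendhub"

retag_commands() {
    src="$REGISTRY/$1"   # image as pulled, e.g. noded:v3.0.0
    dst="$2"             # local name expected by the component's startup YAML
    printf 'docker tag %s %s\n' "$src" "$dst"
    printf 'docker rmi %s\n' "$src"
}

# Most components keep their name and tag; the Volcano images change both.
retag_commands "noded:v3.0.0" "noded:v3.0.0"
retag_commands "vc-scheduler:v1.4.0-v3.0.0" "volcanosh/vc-scheduler:v1.4.0"
```

Printing the commands first makes it possible to review them before piping the output to `sh`.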
root@ubuntu:/home/ascend-hccl-controller# ll
total 66328
drwxr-xr-x 3 root root     4096 Jun 24 20:24 ./
drwxr-x--- 9 root root     4096 Jun 24 20:24 ../
-r-x------ 1 root root 31313408 Jun 22 04:00 cert-importer*
-r-------- 1 root root      677 Jun 22 04:00 Dockerfile
-r-x------ 1 root root 36578912 Jun 22 04:00 hccl-controller*
-r-------- 1 root root     2493 Jun 22 04:00 hccl-controller-v3.0.0.yaml
-r-------- 1 root root     1611 Jun 22 04:00 hccl-controller-without-token-v3.0.0.yaml
dr-xr-x--- 2 root root     4096 Jun 22 04:00 lib/
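If NPU-Exporter and Ascend Device Plugin are deployed as containers on an Atlas 200I Soc A1 core board, check the gid and uid of the HwHiAiUser, HwDmUser, and HwBaseUser users on the host when building the images, and record those values. Then check whether the gid and uid specified when these users are created in Dockerfile-310P-1usoc match the values on the host. If they do not match, edit Dockerfile-310P-1usoc manually so that they do. The gid and uid of HwHiAiUser, HwDmUser, and HwBaseUser must also be identical on every host.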
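One possible way to collect the host values is to read them from the passwd database. The sketch below assumes the three users are local accounts listed in `/etc/passwd`; `passwd_ids` is an illustrative helper name, and the gid it reports is each user's primary group.

```shell
#!/bin/sh
# Sketch: list uid and gid for the given users from a passwd-format file.
# passwd_ids is an illustrative helper; fields are name:passwd:uid:gid:...
passwd_ids() {
    file="$1"
    shift
    for user in "$@"; do
        awk -F: -v u="$user" '
            $1 == u { print $1, "uid=" $3, "gid=" $4; found = 1 }
            END     { if (!found) print u, "not found" }
        ' "$file"
    done
}

# Prints one line per user; "not found" means the host has no such account.
passwd_ids /etc/passwd HwHiAiUser HwDmUser HwBaseUser
```

The printed values can then be compared with the user-creation lines in Dockerfile-310P-1usoc, and the same check repeated on each host.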
root@ubuntu:# docker images | grep ubuntu
ubuntu     18.04     6526a1858e5d     2 years ago     64.2MB
root@ubuntu:# docker images | grep alpine
alpine     latest    a24bb4013296     2 years ago     5.57MB
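If the base images above are not present, pull them with the commands in the table "Commands for obtaining base images" (pulling images requires the server to have Internet access).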
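The check-then-pull logic can be sketched as follows. This is an illustration only: `ensure_image` is a hypothetical helper, and the image names in the commented example calls are taken from the `docker images` output above rather than from the referenced table, whose commands take precedence.

```shell
#!/bin/sh
# Sketch: pull a base image only when it is not already present locally.
# ensure_image is an illustrative helper name.
ensure_image() {
    # "docker image inspect" exits non-zero when the image is not found locally.
    if docker image inspect "$1" >/dev/null 2>&1; then
        echo "present: $1"
    else
        echo "pulling: $1"
        docker pull "$1"
    fi
}

# Example calls, commented out so that loading this file has no side effects:
# ensure_image ubuntu:18.04
# ensure_image alpine:latest
```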
Node type | Component | Image build command | Notes |
---|---|---|---|
Other types | Ascend Device Plugin | docker build --no-cache -t ascend-k8sdeviceplugin:{tag} ./ | Set {tag} according to the version of the software package. For example, if the package version is 3.0.0, {tag} is v3.0.0. |
Atlas 200I Soc A1 core board | Ascend Device Plugin | docker build --no-cache -t ascend-k8sdeviceplugin:{tag} -f Dockerfile-310P-1usoc ./ | |
Other types | NPU-Exporter | docker build --no-cache -t npu-exporter:{tag} ./ | |
Atlas 200I Soc A1 core board | NPU-Exporter | docker build --no-cache -t npu-exporter:{tag} -f Dockerfile-310P-1usoc ./ | |
Other types | HCCL-Controller | docker build --no-cache -t hccl-controller:{tag} ./ | |
Other types | Resilience-Controller | docker build --no-cache -t resilience-controller:{tag} ./ | |
Other types | NodeD | docker build --no-cache -t noded:{tag} ./ | |
Other types | Volcano | | - |
root@ubuntu:/home/ascend-hccl-controller# docker build --no-cache -t hccl-controller:v3.0.0 .
Sending build context to Docker daemon 75.82MB
Step 1/7 : FROM ubuntu:18.04 as build
 ---> 6526a1858e5d
Step 2/7 : RUN useradd -d /home/hwMindX -u 9000 -m -s /usr/sbin/nologin hwMindX && usermod root -s /usr/sbin/nologin
 ---> Running in 982af8c5c938
Removing intermediate container 982af8c5c938
 ---> ef8b3fbb5755
Step 3/7 : COPY ./hccl-controller /usr/local/bin/
 ---> 44672aedff7d
Step 4/7 : COPY ./lib /usr/local/lib
 ---> e99324c9bac7
Step 5/7 : RUN chown -R hwMindX:hwMindX /usr/local/bin && chown -R hwMindX:hwMindX /usr/local/lib && chmod 750 /home/hwMindX && chmod 550 /usr/local/bin/ && chmod 550 /usr/local/lib/ && chmod 500 /usr/local/lib/* && chmod 500 /usr/local/bin/hccl-controller && echo 'umask 027' >> /etc/profile && echo 'source /etc/profile' >> /home/hwMindX/.bashrc
 ---> Running in 75bd6c80f11b
Removing intermediate container 75bd6c80f11b
 ---> 8e86333dee49
Step 6/7 : ENV LD_LIBRARY_PATH=/usr/local/lib
 ---> Running in a26ff750f047
Removing intermediate container a26ff750f047
 ---> 37d48e8ec218
Step 7/7 : USER hwMindX
 ---> Running in d441db379be8
Removing intermediate container d441db379be8
 ---> ef2acbb5a335
Successfully built ef2acbb5a335
Successfully tagged hccl-controller:v3.0.0
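The {tag} rule and the two command variants in the build table can be captured in a small helper. The sketch below assumes the package version is already known as a string such as 3.0.0; `build_command` is an illustrative name, and it prints the command rather than running it.

```shell
#!/bin/sh
# Sketch: derive the image tag from the package version and print the build command.
# build_command is an illustrative helper; it does not invoke docker itself.
build_command() {
    image="$1"        # e.g. hccl-controller
    version="$2"      # version of the software package, e.g. 3.0.0
    dockerfile="$3"   # optional, e.g. Dockerfile-310P-1usoc
    tag="v${version#v}"   # prefix with "v" unless the version already has it
    if [ -n "$dockerfile" ]; then
        echo "docker build --no-cache -t $image:$tag -f $dockerfile ./"
    else
        echo "docker build --no-cache -t $image:$tag ./"
    fi
}

build_command hccl-controller 3.0.0
build_command npu-exporter 3.0.0 Dockerfile-310P-1usoc
```

Run the printed command from the directory that contains the component's Dockerfile and binaries, as in the HCCL-Controller example above.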
docker save hccl-controller:v3.0.0 > hccl-controller-v3.0.0-linux-aarch64.tar
scp hccl-controller-v3.0.0-linux-aarch64.tar root@{target-node-IP}:{destination-path}
docker load < hccl-controller-v3.0.0-linux-aarch64.tar
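The save, copy, and load sequence above can be parameterized for other images and nodes. The following sketch only prints the three commands; `distribute_commands`, its arguments, and the archive naming scheme are assumptions made for illustration, and 192.0.2.10 is a placeholder address.

```shell
#!/bin/sh
# Sketch: print the save / copy / load commands for one image.
# distribute_commands and the archive naming scheme are illustrative.
distribute_commands() {
    image="$1"     # e.g. hccl-controller:v3.0.0
    arch="$2"      # e.g. aarch64 or x86_64
    target="$3"    # scp destination, e.g. root@<node-ip>:/path
    # Replace ":" and "/" so the image reference is usable as a file name.
    archive="$(echo "$image" | sed 's|[:/]|-|g')-linux-$arch.tar"
    echo "docker save $image > $archive"
    echo "scp $archive $target"
    echo "docker load < $archive"
}

# The first two commands run on the build node, the last one on the target node.
distribute_commands hccl-controller:v3.0.0 "$(uname -m)" root@192.0.2.10:/tmp
```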