Creating an Image
This section uses HCCL-Controller as an example to describe how to create images for deploying MindX DL components in containers. The Dockerfile in the software package is for reference only; you can create a customized image based on this example. Alternatively, you can obtain prebuilt MindX DL component images from Ascend Hub. For details, see Pulling an Image from Ascend Hub. After an image is created, perform security hardening in a timely manner: for example, fix vulnerabilities in the base image and in third-party dependencies.
Pulling an Image from Ascend Hub
- Ensure that the server can access the Internet and access Ascend Hub.
- In the left navigation bar, choose MindX DL. Then, select the image corresponding to the component according to the following table.
Table 1 Image list

| Component | Image Name | Image Tag | Node from Which Images Are Pulled | Description |
| --- | --- | --- | --- | --- |
| Resilience-Controller | resilience-controller | v3.0.0 | Master node | The pulled image can be deployed using the component startup YAML file only after it is renamed. For details, see 3. (Applies to all components in this table.) |
| Volcano | volcanosh/vc-scheduler, volcanosh/vc-controller-manager | v1.4.0-v3.0.0 | Master node | |
| HCCL-Controller | hccl-controller | v3.0.0 | Master node | |
| NodeD | noded | v3.0.0 | Worker node | |
| NPU-Exporter | npu-exporter | v3.0.0 | Worker node | |
| Ascend Device Plugin | ascend-k8sdeviceplugin | v3.0.0 | Worker node | |
- If you do not have the download permission, apply for the permission as prompted. After your application is approved by the administrator, you can download the images.
- If the docker login command fails to be executed, the possible cause is that no proxy is configured. For details about how to rectify the fault, see Configuring a Proxy for Logging In to the Ascend Hub.
- If the name of the MindX DL image pulled from Ascend Hub is different from that in the component startup YAML file, rename the pulled image before starting it. Perform the following steps to rename the image obtained in 2. You are advised to delete the image with the original name.
- Rename the image (select a corresponding command based on the component in use).
docker tag ascendhub.huawei.com/public-ascendhub/resilience-controller:v3.0.0 resilience-controller:v3.0.0
docker tag ascendhub.huawei.com/public-ascendhub/npu-exporter:v3.0.0 npu-exporter:v3.0.0
docker tag ascendhub.huawei.com/public-ascendhub/ascend-k8sdeviceplugin:v3.0.0 ascend-k8sdeviceplugin:v3.0.0
docker tag ascendhub.huawei.com/public-ascendhub/vc-controller-manager:v1.4.0-v3.0.0 volcanosh/vc-controller-manager:v1.4.0
docker tag ascendhub.huawei.com/public-ascendhub/vc-scheduler:v1.4.0-v3.0.0 volcanosh/vc-scheduler:v1.4.0
docker tag ascendhub.huawei.com/public-ascendhub/noded:v3.0.0 noded:v3.0.0
docker tag ascendhub.huawei.com/public-ascendhub/hccl-controller:v3.0.0 hccl-controller:v3.0.0
- (Optional) Delete the original image (select the corresponding command based on the component in use).
docker rmi ascendhub.huawei.com/public-ascendhub/resilience-controller:v3.0.0
docker rmi ascendhub.huawei.com/public-ascendhub/npu-exporter:v3.0.0
docker rmi ascendhub.huawei.com/public-ascendhub/ascend-k8sdeviceplugin:v3.0.0
docker rmi ascendhub.huawei.com/public-ascendhub/vc-controller-manager:v1.4.0-v3.0.0
docker rmi ascendhub.huawei.com/public-ascendhub/vc-scheduler:v1.4.0-v3.0.0
docker rmi ascendhub.huawei.com/public-ascendhub/noded:v3.0.0
docker rmi ascendhub.huawei.com/public-ascendhub/hccl-controller:v3.0.0
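The rename-and-clean-up pattern above can be scripted so that all seven components are handled in one pass. This is a sketch, not part of the official package: the `rename_one` helper and the `DOCKER` dry-run variable are illustrative; `DOCKER` defaults to `echo` so the script only prints the commands until you set `DOCKER=docker` on the real host.

```shell
#!/bin/sh
# Sketch: batch-rename the images pulled from Ascend Hub and drop the originals.
# DOCKER defaults to echo (dry run); set DOCKER=docker to actually apply.
DOCKER="${DOCKER:-echo}"
REGISTRY="ascendhub.huawei.com/public-ascendhub"

rename_one() {            # usage: rename_one <source-name:tag> <target-name:tag>
    $DOCKER tag "$REGISTRY/$1" "$2"
    $DOCKER rmi "$REGISTRY/$1"    # optional: delete the image with the original name
}

rename_one resilience-controller:v3.0.0        resilience-controller:v3.0.0
rename_one npu-exporter:v3.0.0                 npu-exporter:v3.0.0
rename_one ascend-k8sdeviceplugin:v3.0.0       ascend-k8sdeviceplugin:v3.0.0
rename_one vc-controller-manager:v1.4.0-v3.0.0 volcanosh/vc-controller-manager:v1.4.0
rename_one vc-scheduler:v1.4.0-v3.0.0          volcanosh/vc-scheduler:v1.4.0
rename_one noded:v3.0.0                        noded:v3.0.0
rename_one hccl-controller:v3.0.0              hccl-controller:v3.0.0
```

Skip the `rmi` line inside the helper if you want to keep the originally named images.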
Creating an Image
- Obtain the cluster scheduling component software package to be installed. For details, see Software Package Description.
- Upload the software package to any directory on the image creation server and decompress it. Take HCCL-Controller as an example. Save the package to the /home/ascend-hccl-controller directory. The directory structure is as follows:
root@ubuntu:/home/ascend-hccl-controller# ll
total 66328
drwxr-xr-x 3 root root     4096 Jun 24 20:24 ./
drwxr-x--- 9 root root     4096 Jun 24 20:24 ../
-r-x------ 1 root root 31313408 Jun 22 04:00 cert-importer*
-r-------- 1 root root      677 Jun 22 04:00 Dockerfile
-r-x------ 1 root root 36578912 Jun 22 04:00 hccl-controller*
-r-------- 1 root root     2493 Jun 22 04:00 hccl-controller-v3.0.0.yaml
-r-------- 1 root root     1611 Jun 22 04:00 hccl-controller-without-token-v3.0.0.yaml
dr-xr-x--- 2 root root     4096 Jun 22 04:00 lib/
If NPU-Exporter and Ascend Device Plugin are deployed as containers on the Atlas 200I Soc A1 core board, check and record the GID and UID of the HwHiAiUser, HwDmUser, and HwBaseUser users on the host before creating the images. Verify that the GID and UID values assigned to these users in Dockerfile-310P-1usoc are the same as those on the host; if they differ, manually modify Dockerfile-310P-1usoc to make them consistent. In addition, ensure that the GID and UID values of these users are identical on every host.
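The host-side check described above can be sketched as a small script. The helper names (`flag_value`, `check_user`) and the `useradd -u <uid> -g <gid>` line format are assumptions for illustration; adapt the expected values to what your Dockerfile-310P-1usoc actually specifies.

```shell
#!/bin/sh
# Sketch: compare the UID/GID a Dockerfile assigns to a user against the host.
# flag_value extracts the numeric value following a flag (e.g. -u or -g)
# from a useradd/groupadd line; check_user compares host values with expected ones.

flag_value() {            # usage: flag_value -u "useradd -u 1000 -g 1000 HwHiAiUser"
    printf '%s\n' "$2" | sed -n "s/.*$1 *\([0-9][0-9]*\).*/\1/p"
}

check_user() {            # usage: check_user HwHiAiUser <uid-in-dockerfile> <gid-in-dockerfile>
    user="$1"; want_uid="$2"; want_gid="$3"
    have_uid=$(id -u "$user" 2>/dev/null)
    have_gid=$(id -g "$user" 2>/dev/null)
    if [ "$have_uid" = "$want_uid" ] && [ "$have_gid" = "$want_gid" ]; then
        echo "$user: OK ($have_uid:$have_gid)"
    else
        echo "$user: MISMATCH host=$have_uid:$have_gid dockerfile=$want_uid:$want_gid"
    fi
}
```

Run `check_user` once per user (HwHiAiUser, HwDmUser, HwBaseUser) on each host; any MISMATCH line means the Dockerfile must be edited before building.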
- Run the following commands to check whether the required base images exist on the nodes where the cluster scheduling component images are created:
- Check the Ubuntu image. The image size of the ARM architecture is different from that of the x86 architecture.
root@ubuntu:# docker images | grep ubuntu
ubuntu   18.04   6526a1858e5d   2 years ago   64.2MB
- If Volcano needs to be installed, check whether the alpine image exists. The following is an example. The image size of the ARM architecture is different from that of the x86 architecture.
root@ubuntu:# docker images | grep alpine
alpine   latest   a24bb4013296   2 years ago   5.57MB
If the preceding base images do not exist, use commands in Table 2 to pull the base images. (To pull images, ensure that the server can connect to the Internet.)
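The check-then-pull logic above can be sketched as a short script. The `ensure_image` helper and the `DOCKER` dry-run variable are illustrative, not part of the official procedure; `DOCKER` defaults to `echo` so the pull commands are only printed until you set `DOCKER=docker` on a connected host.

```shell
#!/bin/sh
# Sketch: pull a base image only if it is missing from the local cache.
# The presence check uses docker directly (harmlessly empty if docker is absent);
# the pull action goes through $DOCKER, which defaults to echo (dry run).
DOCKER="${DOCKER:-echo}"

ensure_image() {          # usage: ensure_image ubuntu:18.04
    if [ -z "$(docker images -q "$1" 2>/dev/null)" ]; then
        $DOCKER pull "$1"
    fi
}

ensure_image ubuntu:18.04      # required for all components
ensure_image alpine:latest     # required only when building Volcano
```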
- Go to the extracted component directories one by one and run the docker build commands to create images. For details about the commands, see the following table.
Table 3 Commands for creating node images

| Node Type | Component | Image Creation Command | Description |
| --- | --- | --- | --- |
| Others | Ascend Device Plugin | docker build --no-cache -t ascend-k8sdeviceplugin:{tag} ./ | {tag} must be consistent with the software package version. For example, if the software package version is 3.0.0, the value of {tag} is v3.0.0. (Applies to all commands that use {tag}.) |
| Atlas 200I Soc A1 core board | Ascend Device Plugin | docker build --no-cache -t ascend-k8sdeviceplugin:{tag} -f Dockerfile-310P-1usoc ./ | |
| Others | NPU-Exporter | docker build --no-cache -t npu-exporter:{tag} ./ | |
| Atlas 200I Soc A1 core board | NPU-Exporter | docker build --no-cache -t npu-exporter:{tag} -f Dockerfile-310P-1usoc ./ | |
| Others | HCCL-Controller | docker build --no-cache -t hccl-controller:{tag} ./ | |
| Others | Resilience-Controller | docker build --no-cache -t resilience-controller:{tag} ./ | |
| Others | NodeD | docker build --no-cache -t noded:{tag} ./ | |
| Others | Volcano | docker build --no-cache -t volcanosh/vc-scheduler:v1.4.0 ./ -f ./Dockerfile-scheduler; docker build --no-cache -t volcanosh/vc-controller-manager:v1.4.0 ./ -f ./Dockerfile-controller | |
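The build commands in the table share one pattern, so they can be wrapped in a small helper. This is a sketch: the `build_component` and `ver_to_tag` names are illustrative, the tag derivation follows the table's note that package version 3.0.0 maps to tag v3.0.0, and `DOCKER` defaults to `echo` so nothing is built until you set `DOCKER=docker`.

```shell
#!/bin/sh
# Sketch: derive the {tag} value from a software package version and build an image.
# DOCKER defaults to echo (dry run); set DOCKER=docker in the component directory.
DOCKER="${DOCKER:-echo}"

ver_to_tag() { printf 'v%s\n' "$1"; }   # 3.0.0 -> v3.0.0, per the table note

build_component() {       # usage: build_component <name> <version> [dockerfile]
    name="$1"; tag=$(ver_to_tag "$2"); df="${3:-Dockerfile}"
    $DOCKER build --no-cache -t "$name:$tag" -f "$df" ./
}

build_component hccl-controller 3.0.0
build_component npu-exporter 3.0.0 Dockerfile-310P-1usoc   # Atlas 200I Soc A1 variant
```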
The following uses the HCCL-Controller component as an example to describe how to create an image.

root@ubuntu:/home/ascend-hccl-controller# docker build --no-cache -t hccl-controller:v3.0.0 .
Sending build context to Docker daemon  75.82MB
Step 1/7 : FROM ubuntu:18.04 as build
 ---> 6526a1858e5d
Step 2/7 : RUN useradd -d /home/hwMindX -u 9000 -m -s /usr/sbin/nologin hwMindX && usermod root -s /usr/sbin/nologin
 ---> Running in 982af8c5c938
Removing intermediate container 982af8c5c938
 ---> ef8b3fbb5755
Step 3/7 : COPY ./hccl-controller /usr/local/bin/
 ---> 44672aedff7d
Step 4/7 : COPY ./lib /usr/local/lib
 ---> e99324c9bac7
Step 5/7 : RUN chown -R hwMindX:hwMindX /usr/local/bin && chown -R hwMindX:hwMindX /usr/local/lib && chmod 750 /home/hwMindX && chmod 550 /usr/local/bin/ && chmod 550 /usr/local/lib/ && chmod 500 /usr/local/lib/* && chmod 500 /usr/local/bin/hccl-controller && echo 'umask 027' >> /etc/profile && echo 'source /etc/profile' >> /home/hwMindX/.bashrc
 ---> Running in 75bd6c80f11b
Removing intermediate container 75bd6c80f11b
 ---> 8e86333dee49
Step 6/7 : ENV LD_LIBRARY_PATH=/usr/local/lib
 ---> Running in a26ff750f047
Removing intermediate container a26ff750f047
 ---> 37d48e8ec218
Step 7/7 : USER hwMindX
 ---> Running in d441db379be8
Removing intermediate container d441db379be8
 ---> ef2acbb5a335
Successfully built ef2acbb5a335
Successfully tagged hccl-controller:v3.0.0
- Skip this step if the following conditions are met:
- The created cluster scheduling component image has been uploaded to the private image repository. Each node can pull the image from the private image repository.
- The component image has been created on each node where the cluster scheduling component is installed. You can use Table 1 to determine the component installation position.
If the preceding conditions are not met, you need to manually distribute the image to each node. The following uses the HCCL-Controller component as an example to describe how to distribute an image to other nodes using an offline image package.
- Save the created image as an offline image package.
docker save hccl-controller:v3.0.0 > hccl-controller-v3.0.0-linux-aarch64.tar
- Copy the image package to the other nodes.
scp hccl-controller-v3.0.0-linux-aarch64.tar root@{IP_address_of_the_target_node}:storage_path
- Log in to each node as the root user and load the offline image.
docker load < hccl-controller-v3.0.0-linux-aarch64.tar
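When several worker nodes need the same image, the save/copy/load steps can be looped. This is a sketch under stated assumptions: the node IPs, remote directory, and package filename are placeholders, passwordless root SSH is assumed, and `SCP`/`SSH` default to `echo` so the script only prints the commands until you unset the dry-run defaults.

```shell
#!/bin/sh
# Sketch: distribute an offline image package to several nodes and load it there.
# SCP/SSH default to echo (dry run); clear them (SCP=scp SSH=ssh) to actually run.
SCP="${SCP:-echo scp}"; SSH="${SSH:-echo ssh}"
PKG=hccl-controller-v3.0.0-linux-aarch64.tar   # package created with docker save
NODES="192.0.2.10 192.0.2.11"                  # placeholder addresses, replace with real ones
REMOTE_DIR=/home/images                        # placeholder storage path

distribute() {
    for ip in $NODES; do
        $SCP "$PKG" "root@$ip:$REMOTE_DIR/"
        $SSH "root@$ip" "docker load < $REMOTE_DIR/$PKG"
    done
}

distribute
```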