Creating an Image

This section uses HCCL-Controller as an example to describe how to create images for deploying MindX DL components in containers. The Dockerfile in the software package is for reference only; you can create a customized image based on this example. Alternatively, you can obtain prebuilt MindX DL component images from Ascend Hub. For details, see Pulling an Image from Ascend Hub. After the image is created, perform security hardening in a timely manner, for example, by fixing vulnerabilities in the base image and in third-party dependencies.

Pulling an Image from Ascend Hub

  1. Ensure that the server can access the Internet and access Ascend Hub.
  2. In the left navigation bar, choose MindX DL. Then, select the image corresponding to the component according to the following table.
    Table 1 Image list

    | Component | Image Name | Image Tag | Node from Which Images Are Pulled |
    | --- | --- | --- | --- |
    | Resilience-Controller | resilience-controller | v3.0.0 | Master node |
    | Volcano | volcanosh/vc-scheduler, volcanosh/vc-controller-manager | v1.4.0-v3.0.0 | Master node |
    | HCCL-Controller | hccl-controller | v3.0.0 | Master node |
    | NodeD | noded | v3.0.0 | Worker node |
    | NPU-Exporter | npu-exporter | v3.0.0 | Worker node |
    | Ascend Device Plugin | ascend-k8sdeviceplugin | v3.0.0 | Worker node |

    Description (applies to every image in the table): a pulled image can be deployed using the component startup YAML file only after it is renamed. For details, see 3.

    • If you do not have the download permission, apply for it as prompted. After the administrator approves your application, you can download the images.
    • If the docker login command fails, the possible cause is that no proxy is configured. For details about how to rectify the fault, see Configuring a Proxy for Logging In to the Ascend Hub. A typical login-and-pull sequence is sketched below.
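    For reference, a minimal login-and-pull sketch follows. The repository path is assumed from the rename commands in 3; the path shown on the image's Ascend Hub page is authoritative.
      # Log in to Ascend Hub, then pull a component image ({user_name} is a placeholder).
      docker login ascendhub.huawei.com -u {user_name}
      docker pull ascendhub.huawei.com/public-ascendhub/hccl-controller:v3.0.0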
  3. If the name of the MindX DL image pulled from Ascend Hub differs from the name in the component startup YAML file, rename the pulled image before deploying it. Perform the following steps to rename the image obtained in 2. You are advised to delete the image with the original name afterwards.
    1. Rename the image (select a corresponding command based on the component in use).
      docker tag ascendhub.huawei.com/public-ascendhub/resilience-controller:v3.0.0 resilience-controller:v3.0.0
      
      docker tag ascendhub.huawei.com/public-ascendhub/npu-exporter:v3.0.0 npu-exporter:v3.0.0
      
      docker tag ascendhub.huawei.com/public-ascendhub/ascend-k8sdeviceplugin:v3.0.0 ascend-k8sdeviceplugin:v3.0.0
      
      docker tag ascendhub.huawei.com/public-ascendhub/vc-controller-manager:v1.4.0-v3.0.0 volcanosh/vc-controller-manager:v1.4.0
      
      docker tag ascendhub.huawei.com/public-ascendhub/vc-scheduler:v1.4.0-v3.0.0 volcanosh/vc-scheduler:v1.4.0
      
      docker tag ascendhub.huawei.com/public-ascendhub/noded:v3.0.0 noded:v3.0.0
      
      docker tag ascendhub.huawei.com/public-ascendhub/hccl-controller:v3.0.0 hccl-controller:v3.0.0
    2. (Optional) Delete the original image (select the corresponding command based on the component in use).
      docker rmi ascendhub.huawei.com/public-ascendhub/resilience-controller:v3.0.0
      docker rmi ascendhub.huawei.com/public-ascendhub/npu-exporter:v3.0.0
      docker rmi ascendhub.huawei.com/public-ascendhub/ascend-k8sdeviceplugin:v3.0.0
      docker rmi ascendhub.huawei.com/public-ascendhub/vc-controller-manager:v1.4.0-v3.0.0
      docker rmi ascendhub.huawei.com/public-ascendhub/vc-scheduler:v1.4.0-v3.0.0
      docker rmi ascendhub.huawei.com/public-ascendhub/noded:v3.0.0
      docker rmi ascendhub.huawei.com/public-ascendhub/hccl-controller:v3.0.0
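    3. (Optional) Verify that the renamed images exist (a quick check; the image names are those listed in Table 1).
      docker images | grep -E 'resilience-controller|npu-exporter|ascend-k8sdeviceplugin|volcanosh|noded|hccl-controller'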

Creating an Image

  1. Obtain the cluster scheduling component software package to be installed. For details, see Software Package Description.
  2. Upload the software package to any directory on the image creation server and decompress it. Take HCCL-Controller as an example. Save the package to the /home/ascend-hccl-controller directory. The directory structure is as follows:
    root@ubuntu:/home/ascend-hccl-controller# ll
    total 66328
    drwxr-xr-x 3 root root     4096 Jun 24 20:24 ./
    drwxr-x--- 9 root root     4096 Jun 24 20:24 ../
    -r-x------ 1 root root 31313408 Jun 22 04:00 cert-importer*
    -r-------- 1 root root      677 Jun 22 04:00 Dockerfile
    -r-x------ 1 root root 36578912 Jun 22 04:00 hccl-controller*
    -r-------- 1 root root     2493 Jun 22 04:00 hccl-controller-v3.0.0.yaml
    -r-------- 1 root root     1611 Jun 22 04:00 hccl-controller-without-token-v3.0.0.yaml
    dr-xr-x--- 2 root root     4096 Jun 22 04:00 lib/

    If NPU-Exporter and Ascend Device Plugin are to be deployed as containers on the Atlas 200I Soc A1 core board, check and record the UID and GID of the HwHiAiUser, HwDmUser, and HwBaseUser users on the host before creating the images. Compare them with the UID and GID values used when these users are created in Dockerfile-310P-1usoc. If the values differ, modify Dockerfile-310P-1usoc so that they match the host, and ensure that the UID and GID of these users are identical on every host.
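    The comparison can be done with commands along the following lines (a sketch; it assumes the users are created with useradd/groupadd instructions in Dockerfile-310P-1usoc):
      # Host-side UID and GID of the three users
      for u in HwHiAiUser HwDmUser HwBaseUser; do id "$u"; done
      # UID and GID values declared in the Dockerfile
      grep -nE 'useradd|groupadd' Dockerfile-310P-1usoc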

  3. Run the following commands to check whether the required base images exist on the nodes where the cluster scheduling component images are created:
    • Check the Ubuntu image. The image size of the ARM architecture is different from that of the x86 architecture.
      root@ubuntu:# docker images | grep ubuntu
      ubuntu              18.04               6526a1858e5d        2 years ago         64.2MB
    • If Volcano needs to be installed, check whether the alpine image exists. The following is an example. The image size of the ARM architecture is different from that of the x86 architecture.
      root@ubuntu:# docker images | grep alpine
      alpine              latest              a24bb4013296        2 years ago         5.57MB

    If the preceding base images do not exist, use commands in Table 2 to pull the base images. (To pull images, ensure that the server can connect to the Internet.)

    Table 2 Commands for obtaining base images

    | Base Image | Image Pulling Command | Description |
    | --- | --- | --- |
    | ubuntu:18.04 | docker pull ubuntu:18.04 | The system architecture is automatically identified during image pulling. |
    | alpine:latest | x86: docker pull alpine:latest / ARM: docker pull arm64v8/alpine:latest, then docker tag arm64v8/alpine:latest alpine:latest | - |
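    The existence check and pull can also be scripted, for example (a sketch; the inspect call prints the OS and architecture so you can confirm that the correct variant was pulled):
      # Pull ubuntu:18.04 only if it is not already present locally
      docker images -q ubuntu:18.04 | grep -q . || docker pull ubuntu:18.04
      # Confirm the OS/architecture of the local image
      docker inspect -f '{{.Os}}/{{.Architecture}}' ubuntu:18.04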

  4. Go to the extracted component directories one by one and run the docker build commands to create images. For details about the commands, see the following table.
    Table 3 Commands for creating node images

    | Node Type | Component | Image Creation Command |
    | --- | --- | --- |
    | Others | Ascend Device Plugin | docker build --no-cache -t ascend-k8sdeviceplugin:{tag} ./ |
    | Atlas 200I Soc A1 core board | Ascend Device Plugin | docker build --no-cache -t ascend-k8sdeviceplugin:{tag} -f Dockerfile-310P-1usoc ./ |
    | Others | NPU-Exporter | docker build --no-cache -t npu-exporter:{tag} ./ |
    | Atlas 200I Soc A1 core board | NPU-Exporter | docker build --no-cache -t npu-exporter:{tag} -f Dockerfile-310P-1usoc ./ |
    | Others | HCCL-Controller | docker build --no-cache -t hccl-controller:{tag} ./ |
    | Others | Resilience-Controller | docker build --no-cache -t resilience-controller:{tag} ./ |
    | Others | NodeD | docker build --no-cache -t noded:{tag} ./ |
    | Others | Volcano | docker build --no-cache -t volcanosh/vc-scheduler:v1.4.0 ./ -f ./Dockerfile-scheduler and docker build --no-cache -t volcanosh/vc-controller-manager:v1.4.0 ./ -f ./Dockerfile-controller |

    In the commands above, {tag} must be consistent with the software package version. For example, if the software package version is 3.0.0, the value of {tag} is v3.0.0. (The Volcano commands use fixed image tags.)

    The following uses the HCCL-Controller component as an example to describe how to create an image.
    root@ubuntu:/home/ascend-hccl-controller# docker build --no-cache -t hccl-controller:v3.0.0 .
    Sending build context to Docker daemon  75.82MB
    Step 1/7 : FROM ubuntu:18.04 as build
     ---> 6526a1858e5d
    Step 2/7 : RUN useradd -d /home/hwMindX -u 9000 -m -s /usr/sbin/nologin hwMindX &&     usermod root -s /usr/sbin/nologin
     ---> Running in 982af8c5c938
    Removing intermediate container 982af8c5c938
     ---> ef8b3fbb5755
    Step 3/7 : COPY ./hccl-controller  /usr/local/bin/
     ---> 44672aedff7d
    Step 4/7 : COPY ./lib  /usr/local/lib
     ---> e99324c9bac7
    Step 5/7 : RUN chown -R hwMindX:hwMindX /usr/local/bin  &&    chown -R hwMindX:hwMindX /usr/local/lib &&    chmod 750 /home/hwMindX &&    chmod 550 /usr/local/bin/ &&    chmod 550 /usr/local/lib/ &&    chmod 500 /usr/local/lib/* &&    chmod 500 /usr/local/bin/hccl-controller &&    echo 'umask 027' >> /etc/profile &&     echo 'source /etc/profile' >> /home/hwMindX/.bashrc
     ---> Running in 75bd6c80f11b
    Removing intermediate container 75bd6c80f11b
     ---> 8e86333dee49
    Step 6/7 : ENV LD_LIBRARY_PATH=/usr/local/lib
     ---> Running in a26ff750f047
    Removing intermediate container a26ff750f047
     ---> 37d48e8ec218
    Step 7/7 : USER hwMindX
     ---> Running in d441db379be8
    Removing intermediate container d441db379be8
     ---> ef2acbb5a335
    Successfully built ef2acbb5a335
    Successfully tagged hccl-controller:v3.0.0
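    For reference, the seven steps in the preceding log correspond to a Dockerfile along the following lines (reconstructed from the log for illustration only; the Dockerfile shipped in the software package is authoritative):
      # Reconstructed from the build log above; for illustration only.
      FROM ubuntu:18.04 as build
      RUN useradd -d /home/hwMindX -u 9000 -m -s /usr/sbin/nologin hwMindX && \
          usermod root -s /usr/sbin/nologin
      COPY ./hccl-controller /usr/local/bin/
      COPY ./lib /usr/local/lib
      RUN chown -R hwMindX:hwMindX /usr/local/bin && \
          chown -R hwMindX:hwMindX /usr/local/lib && \
          chmod 750 /home/hwMindX && chmod 550 /usr/local/bin/ && \
          chmod 550 /usr/local/lib/ && chmod 500 /usr/local/lib/* && \
          chmod 500 /usr/local/bin/hccl-controller && \
          echo 'umask 027' >> /etc/profile && \
          echo 'source /etc/profile' >> /home/hwMindX/.bashrc
      ENV LD_LIBRARY_PATH=/usr/local/lib
      USER hwMindX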
  5. Skip this step if either of the following conditions is met:
    • The created cluster scheduling component image has been uploaded to a private image repository from which every node can pull it.
    • The component image has already been created on each node where the cluster scheduling component is to be installed. (Use Table 1 to determine where each component runs.)
    If neither condition is met, manually distribute the image to each node. The following uses the HCCL-Controller component as an example to describe how to distribute an image to other nodes using an offline image package.
    1. Save the created image as an offline image.
      docker save hccl-controller:v3.0.0 > hccl-controller-v3.0.0-linux-aarch64.tar
    2. Copy the image to other nodes.
      scp hccl-controller-v3.0.0-linux-aarch64.tar root@{IP_address_of_the_target_node}:{storage_path}
    3. Log in to each node as the root user to load the offline image.
      docker load < hccl-controller-v3.0.0-linux-aarch64.tar
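    If several nodes need the image, the copy-and-load steps can be scripted, for example (a sketch; the node IP list and the /tmp storage path are illustrative placeholders):
      # Distribute the offline image to each node and load it there.
      NODES="192.168.0.11 192.168.0.12"
      for node in $NODES; do
          scp hccl-controller-v3.0.0-linux-aarch64.tar root@"$node":/tmp/
          ssh root@"$node" "docker load < /tmp/hccl-controller-v3.0.0-linux-aarch64.tar"
      done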