Rec SDK TensorFlow Training Image Building

This section describes how to build a Rec SDK TensorFlow training image from an existing base image.

Prerequisites

  1. The driver and firmware that match the CANN version have been installed on the physical machine. For details, see Dependency Installation.
  2. Docker has been installed on the physical machine, and Docker can access the network.
  3. Prepare a base image. You can obtain a base image from Ascend Hub or use an existing base image.
    • (Recommended) Obtain the Rec SDK TensorFlow training image from Ascend Hub. This image comes with core dependencies such as GCC and CMake preinstalled, so you only need to update the CANN and Rec SDK software packages within the image.
    • Obtain a CentOS 7.6.1810 image from Ascend Hub. If you provide your own base image, CentOS 7.6.1810 is recommended.
  4. Run the following command to load the base image into Docker:
    docker load --input xxx.tar
  5. Create a directory for image building (for example, build_images).
    1. Place only the necessary files for the build process into this directory, such as the architecture-specific Ascend-cann-toolkit_*.run, tfplugin, and Rec SDK packages.
    2. If you need to install the tfplugin package, also copy /usr/local/Ascend/driver/version.info and /etc/ascend_install.info to the build_images directory.

      (Do not place unrelated files in the build_images directory, because the entire directory is sent to Docker as the build context.)

  6. The build process involves Docker commands and file copying. Ensure you have the required execution and file access permissions.
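
The directory preparation in step 5 can be sketched as a small shell function. This is a minimal sketch, not part of the Rec SDK tooling: the function name prepare_build_dir and its arguments are hypothetical, and the package directory is assumed to be wherever you downloaded the .run, tfplugin, and Rec SDK files. The file patterns are taken from the Dockerfiles in this section.

```shell
#!/bin/sh
# Hypothetical helper: stage build inputs into a dedicated build directory.
# Usage: prepare_build_dir <build_dir> <package_dir> [version_info] [install_info]
prepare_build_dir() {
    build_dir=$1
    pkg_dir=$2
    # Defaults are the paths named in step 5; pass other paths to override them.
    version_info=${3:-/usr/local/Ascend/driver/version.info}
    install_info=${4:-/etc/ascend_install.info}

    mkdir -p "$build_dir" || return 1

    # Copy only the package types the build uses; patterns with no match are skipped.
    for f in "$pkg_dir"/Ascend-cann-toolkit*.run \
             "$pkg_dir"/Ascend-cann-*-ops*.run \
             "$pkg_dir"/npu_bridge-*.whl \
             "$pkg_dir"/npu_device-*.whl \
             "$pkg_dir"/Ascend-mindxsdk-mxrec*.tar.gz; do
        [ -f "$f" ] && cp -f "$f" "$build_dir"/
    done

    # These two files are needed when the tfplugin package is installed.
    for f in "$version_info" "$install_info"; do
        if [ -f "$f" ]; then
            cp -f "$f" "$build_dir"/
        else
            echo "warning: $f not found" >&2
        fi
    done
    return 0
}
```

For example, prepare_build_dir build_images ~/downloads would stage the packages from a (hypothetical) ~/downloads directory. Copying by pattern, rather than copying the whole download directory, keeps unrelated files out of build_images.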

Using the Rec SDK Base Image to Build a Training Image

  1. Obtain the Rec SDK package, CANN package, and Ascend-adapted TensorFlow plugin by referring to Downloading the Rec SDK TensorFlow Package.
  2. Create a Dockerfile configuration file (for example, Dockerfile) in the build_images directory, run the vi Dockerfile command to edit the file, and insert the following content:
    # Change the base image name and image tag accordingly.
    FROM rec_sdk-tf1:7.2.RC1
    
    # Set CANN parameters.
    ARG TOOLKIT_PKG=Ascend-cann-toolkit*.run
    ARG KERNEL_PKG=Ascend-cann-*-ops*.run
    ARG TF1_PLUGIN=npu_bridge-1.15.0-*.whl
    ARG TF2_PLUGIN=npu_device-2.6.5-*.whl
    # Rec SDK TensorFlow package
    ARG REC_SDK_PKG=Ascend-mindxsdk-mxrec*.tar.gz
    
    # Set the environment variable of the installation path.
    ARG ASCEND_BASE=/usr/local/Ascend
    
    # Delete the old CANN.
    RUN rm -rf $ASCEND_BASE/ascend-toolkit
    RUN rm -rf $ASCEND_BASE/cann*
    
    # Copy the packages to be installed. Delete the COPY line of any package that you do not need.
    WORKDIR /tmp
    COPY $TOOLKIT_PKG .
    COPY $KERNEL_PKG .
    COPY $TF1_PLUGIN .
    COPY $TF2_PLUGIN .
    COPY $REC_SDK_PKG .
    COPY version.info .
    COPY ascend_install.info .
    
    # Install the ascend-toolkit and operator packages.
    RUN umask 0027 && \
        mkdir -p $ASCEND_BASE/driver && \
        /usr/bin/cp -f version.info $ASCEND_BASE/driver/ && \
        /usr/bin/cp -f ascend_install.info /etc/ && \
        chmod +x $TOOLKIT_PKG && \
        echo Y | bash $TOOLKIT_PKG --quiet --install --install-path=$ASCEND_BASE && \
        chmod +x $KERNEL_PKG && \
        echo Y | bash $KERNEL_PKG --quiet --install && \
        source $ASCEND_BASE/cann/set_env.sh && \
        rm -rf /root/.cache/pip && \
        rm -f $TOOLKIT_PKG && \
        rm -f $KERNEL_PKG && \
        rm -rf $ASCEND_BASE/driver && \
        rm -rf /etc/ascend_install.info
    
    # Install the Rec SDK TensorFlow package. The following uses the tf1 version as an example. (To install the Rec SDK package of the tf2 version, change tf1 to tf2.)
    ARG SDK_TF_VERSION=tf1
    RUN pip3.7 install $TF1_PLUGIN --force-reinstall && \
        pip3.7 install $TF2_PLUGIN --force-reinstall && \
        tar -zxvf Ascend-mindxsdk-mxrec*.tar.gz && \
        pip3.7 install mindxsdk-mxrec/${SDK_TF_VERSION}_whl/mx_rec-*.whl --force-reinstall && \
        rm -rf /root/.cache/pip
    
  3. Go to the build_images directory and run the following command to build a Rec SDK TensorFlow image:
    docker build -t {Image name}:{Image tag} -f Dockerfile .
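
Because the Dockerfile above declares SDK_TF_VERSION with ARG, the tf1/tf2 choice for the Rec SDK wheel can also be made at build time with Docker's --build-arg option instead of editing the file. The following is a minimal sketch that only assembles and prints the command; the function name rec_sdk_build_cmd and the example image name are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the docker build command for a given image name,
# image tag, and Rec SDK TensorFlow variant (tf1 or tf2).
rec_sdk_build_cmd() {
    image=$1
    tag=$2
    variant=${3:-tf1}
    case $variant in
        tf1|tf2) ;;
        *) echo "error: variant must be tf1 or tf2" >&2; return 1 ;;
    esac
    # The trailing "." is the build context (the build_images directory).
    printf 'docker build --build-arg SDK_TF_VERSION=%s -t %s:%s -f Dockerfile .\n' \
        "$variant" "$image" "$tag"
}

# Example with a hypothetical image name and tag.
rec_sdk_build_cmd my_rec_sdk 1.0 tf2
```

Printing the command first makes it easy to review the image name, tag, and variant before running the build in the build_images directory.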

Building a Training Image Based on CentOS 7.6.1810 or a User Image

  1. Confirm the following dependencies and download the packages of any that are not yet installed into the same directory.

    Dependency                                                             Download URL
    ---------------------------------------------------------------------  -----------------------
    GCC 7.3.0                                                              Link
    CMake 3.20.6                                                           Link
    UCX                                                                    Link
    OpenMPI 4.1.5                                                          Link
    Python 3.7.5                                                           Link
    HDF5 1.10.5                                                            Link
    CANN package, Ascend-adapted TensorFlow plugin, and Rec SDK package    Dependency Installation
    TensorFlow (1.15.0/2.6.5)                                              Link

  2. Create a Dockerfile configuration file (for example, Dockerfile) in the build_images directory, run the vi Dockerfile command to edit the file, and insert the following content:
    # Change the base image name and image tag accordingly.
    FROM swr.cn-south-1.myhuaweicloud.com/ascendhub/centos:7.6.1810
    
    WORKDIR /tmp
    
    
    # Select the dependencies to be installed as required. If some dependencies are not required, delete or comment out the corresponding code. In addition, ensure that the name of the downloaded dependency package is the same as that in the following code.
    # Otherwise, an error indicating that the file cannot be found may occur during dependency installation.
    COPY gcc-7.3.0.tar.gz ./
    COPY cmake-3.20.6.tar.gz ./
    COPY ucx-master.zip ./
    COPY openmpi-4.1.5.tar.gz ./
    COPY Python-3.7.5.tar.xz ./
    COPY hdf5-1.10.5.tar.gz ./
    
    COPY Ascend-cann-toolkit*.run ./
    COPY Ascend-cann-*-ops*.run ./
    COPY version.info ./
    COPY ascend_install.info ./
    COPY ./npu_bridge-1.15.0-*.whl ./
    COPY ./npu_device-2.6.5-*.whl ./
    COPY Ascend-mindxsdk-mxrec*.tar.gz ./
    
    # 1. Install the compilation environment.
    RUN yum makecache && \
        yum -y install centos-release-scl && \
        yum -y install devtoolset-7 && \
        yum -y install devtoolset-7-gcc-c++ && \
        yum -y install epel-release && \
        yum -y install wget zlib-devel bzip2 bzip2-devel openssl-devel ncurses-devel openssh-clients openssh-server sqlite-devel openmpi-devel \
        readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libffi-devel hdf5-devel patch pciutils lcov vim dos2unix gcc-c++ \
        autoconf automake libtool git net-tools make sudo unzip && \
        yum clean all && \
        rm -rf /var/cache/yum && \
        echo "source /opt/rh/devtoolset-7/enable" >> /etc/profile
    # Note: openssh-server is required for two-server training. You can remove it if you only run single-server training.
    
    # 2. Install GCC 7.3.0.
    RUN source /etc/profile && \
        tar -zxvf gcc-7.3.0.tar.gz && \
        cd gcc-7.3.0 && \
        wget https://mirrors.huaweicloud.com/gnu/gmp/gmp-6.1.0.tar.bz2 && \
        wget https://mirrors.huaweicloud.com/gnu/mpfr/mpfr-3.1.4.tar.bz2 && \
        wget https://mirrors.huaweicloud.com/gnu/mpc/mpc-1.0.3.tar.gz && \
        wget https://mindx.obs.cn-south-1.myhuaweicloud.com/opensource/isl-0.16.1.tar.bz2 && \
        sed -i "246s/tar -xf "${ar}"/tar --no-same-owner -xf "${ar}"/" contrib/download_prerequisites && \
        ./contrib/download_prerequisites && \
        ./configure --enable-languages=c,c++ --disable-multilib --with-system-zlib --prefix=/usr/local/gcc7.3.0 && \
        make -j && make -j install && cd .. && \
        find gcc-7.3.0/ -name libstdc++.so.6.0.24 -exec cp {} /lib64/ \; && \
        rm -rf gcc-7.3.0*
    
    ENV LD_LIBRARY_PATH=/usr/local/gcc7.3.0/lib64:$LD_LIBRARY_PATH \
        PATH=/usr/local/gcc7.3.0/bin:$PATH
    
    # 3. Install CMake.
    RUN source /etc/profile && gcc -v && tar -zxf cmake-3.20.6.tar.gz && \
        cd cmake-3.20.6 && \
        ./bootstrap && make && make install && cd .. && \
        rm -rf cmake-3.20.6*
    
    # 4. Install UCX.
    RUN source /etc/profile && gcc -v && unzip ucx-master.zip && \
        cd ucx-master && \
        ./autogen.sh && \
        ./contrib/configure-release --prefix=/usr/local/ucx && \
        make && make install && cd .. && \
        rm -rf ucx-master*
    
    # 5. Install OpenMPI and configure UCX.
    RUN source /etc/profile && gcc -v && tar -zxvf openmpi-4.1.5.tar.gz && \
        cd openmpi-4.1.5 && \
        ./configure --enable-orterun-prefix-by-default --prefix=/usr/local/openmpi --with-ucx=/usr/local/ucx && \
        make -j 16 && make install && cd .. && \
        rm -rf openmpi-4.1.5*
    
    ENV LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH \
        PATH=/usr/local/openmpi/bin:$PATH
    
    # 6. Install Python 3.7.5.
    RUN source /etc/profile && gcc -v && tar -xvf Python-3.7.5.tar.xz && \
        cd Python-3.7.5 && \
        mkdir -p build && cd build && \
        ../configure --enable-shared --prefix=/usr/local/python3.7.5 && \
        make -j && make install && \
        cd ../../ && rm -rf Python-3.7.5* && \
        ldconfig
    
    ENV PATH=$PATH:/usr/local/python3.7.5/bin \
        LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/python3.7.5/lib
    
    # Configure the Python source.
    RUN mkdir ~/.pip && touch ~/.pip/pip.conf && \
        echo "[global]" > ~/.pip/pip.conf && \
        echo "trusted-host=pypi.douban.com" >> ~/.pip/pip.conf && \
        echo "index-url=http://pypi.douban.com/simple/" >> ~/.pip/pip.conf && \
        echo "timeout=200" >> ~/.pip/pip.conf
    
    # 7. Install HDF5.
    RUN source /etc/profile && gcc -v && tar -zxvf hdf5-1.10.5.tar.gz && \
        cd hdf5-1.10.5 && \
        ./configure --prefix=/usr/local/hdf5 && \
        make && make install && cd .. && rm -rf hdf5-1.10.5*
    
    ENV CPATH=/usr/local/hdf5/include/:/usr/local/hdf5/lib/
    
    RUN ln -s /usr/local/hdf5/lib/libhdf5.so /usr/lib/libhdf5.so && \
        ln -s /usr/local/hdf5/lib/libhdf5_hl.so /usr/lib/libhdf5_hl.so
    
    # 8. Install the Python package.
    # This environment variable is used when installing mpi4py. Unset it after the installation is complete.
    ENV CC=/usr/lib64/openmpi/bin/mpicc
    
    RUN pip3.7 install -U pip && \
        pip3.7 install numpy && \
        pip3.7 install decorator && \
        pip3.7 install sympy==1.4 && \
        pip3.7 install cffi==1.12.3 && \
        pip3.7 install pyyaml && \
        pip3.7 install pathlib2 && \
        pip3.7 install grpcio && \
        pip3.7 install grpcio-tools && \
        pip3.7 install protobuf==3.20.0 && \
        pip3.7 install scipy && \
        pip3.7 install requests && \
        pip3.7 install mpi4py && \
        pip3.7 install scikit-learn && \
        pip3.7 install easydict && \
        pip3.7 install attrs && \
        pip3.7 install pytest==7.1.1 && \
        pip3.7 install pytest-cov==4.1.0 && \
        pip3.7 install pytest-html && \
        pip3.7 install Cython && \
        pip3.7 install h5py==3.1.0 && \
        pip3.7 install pandas && \
        pip3.7 install funcsigs && \
        pip3.7 install tqdm && \
        pip3.7 install portalocker && \
        rm -rf /root/.cache/pip
    
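    # Note: unset in a RUN instruction affects only that build step. The CC value set with ENV above remains in effect for later steps.
    RUN unset CC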
    
    # 9. Set the environment variable of the driver path.
    ARG ASCEND_BASE=/usr/local/Ascend
    ENV LD_LIBRARY_PATH=$ASCEND_BASE/driver/lib64:$ASCEND_BASE/driver/lib64/common:$ASCEND_BASE/driver/lib64/driver:$LD_LIBRARY_PATH
    
    # 10. Set CANN-related parameters.
    ARG TOOLKIT_PKG=Ascend-cann-toolkit*.run
    ARG KERNEL_PKG=Ascend-cann-*-ops*.run
    
    # 11. Install the ascend-toolkit and operator packages.
    RUN umask 0027 && \
        mkdir -p $ASCEND_BASE/driver && \
        /usr/bin/cp -f version.info $ASCEND_BASE/driver/ && \
        /usr/bin/cp -f ascend_install.info /etc/ && \
        chmod +x $TOOLKIT_PKG && \
        echo Y | bash $TOOLKIT_PKG --quiet --install --install-path=$ASCEND_BASE && \
        chmod +x $KERNEL_PKG && \
        echo Y | bash $KERNEL_PKG --quiet --install && \
        source $ASCEND_BASE/cann/set_env.sh && \
        rm -rf /root/.cache/pip && \
        rm -f $TOOLKIT_PKG && \
        rm -f $KERNEL_PKG && \
        rm -rf $ASCEND_BASE/driver && \
        rm -rf /etc/ascend_install.info
    
    # 12. Install the TensorFlow-related Python packages and Rec SDK.
    # By default, the tf1 image is built. To build the tf2 image, change TF_VER to 2.6.5 and SDK_TF_VERSION to tf2.
    ARG TF_VER=1.15.0
    ARG SDK_TF_VERSION=tf1
    ARG TF1_PLUGIN=npu_bridge-1.15.0-*.whl
    ARG TF2_PLUGIN=npu_device-2.6.5-*.whl
    RUN pip3.7 install tensorflow==${TF_VER} && \
        pip3.7 install tf_slim && \
        HOROVOD_WITH_MPI=1 HOROVOD_WITH_TENSORFLOW=1 pip3.7 install horovod --no-cache-dir && \
        pip3.7 install $TF1_PLUGIN --force-reinstall && \
        pip3.7 install $TF2_PLUGIN --force-reinstall && \
        tar -zxvf Ascend-mindxsdk-mxrec*.tar.gz && \
        pip3.7 install mindxsdk-mxrec/${SDK_TF_VERSION}_whl/mx_rec-*.whl --force-reinstall && \
        rm -rf /root/.cache/pip
    
    # 13. Clear the temporary directory.
    RUN rm -rf ./*
    
  3. Go to the build_images directory and run the following command to build a Rec SDK TensorFlow image:
    docker build -t {Image name}:{Image tag} -f Dockerfile .
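
Two checks from the steps above lend themselves to a quick script before the build: confirming which dependencies the base image already provides (step 1), and confirming that every file named in a COPY line exists in build_images (the comment at the top of the Dockerfile warns that a mismatched name causes a file-not-found error). The following is a minimal sketch under stated assumptions: both function names are hypothetical, the probed command names (gcc, cmake, mpirun, python3.7) are assumed to be the usual ones, and the COPY parser assumes the simple "COPY <src> <dest>" form without options such as --from.

```shell
#!/bin/sh
# Hypothetical pre-build checks; neither function is part of the Rec SDK tooling.

# Print every command from the argument list that is not found on PATH.
# Prints nothing if all commands are present.
report_missing() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
    return 0
}

# For each "COPY <src> <dest>" line in a Dockerfile, print the source patterns
# that match no file in the build context directory.
list_missing_copy_sources() {
    dockerfile=$1
    context=${2:-.}
    awk '$1 == "COPY" { print $2 }' "$dockerfile" |
    while IFS= read -r src; do
        case $src in
            *'$'*) continue ;;   # skip sources given through ARG variables
        esac
        found=no
        # Expand the pattern relative to the context directory.
        for match in "$context"/$src; do
            if [ -e "$match" ]; then
                found=yes
                break
            fi
        done
        [ "$found" = yes ] || echo "$src"
    done
}
```

For example, running report_missing gcc cmake mpirun python3.7 inside the base image lists the tools that still need to be built from the downloaded packages, and running list_missing_copy_sources Dockerfile . from the build_images directory lists the COPY sources that are absent. Libraries such as UCX and HDF5 do not necessarily install a command, so they need a separate check.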