Rec SDK TensorFlow Training Image Building
This section describes how to build a Rec SDK TensorFlow training image on top of an existing base image.
Prerequisites
- The driver and firmware that match the CANN version have been installed on the physical machine. For details, see Dependency Installation.
- Docker has been installed on the physical machine, and Docker can access the network.
- Prepare a base image. You can obtain a base image from Ascend Hub or use an existing base image.
  - (Recommended) Obtain the Rec SDK TensorFlow training image from Ascend Hub. This image comes with core dependencies such as GCC and CMake preinstalled, so you only need to update the CANN and Rec SDK software packages in it.
  - Alternatively, obtain a CentOS 7.6.1810 image from Ascend Hub. If you provide your own base image, CentOS 7.6.1810 is recommended.
- Run the following command to load the base image into Docker:
docker load --input xxx.tar
- Create a directory for image building (for example, build_images).
  - Place only the files required for the build in this directory, such as the architecture-specific Ascend-cann-toolkit_*.run, tfplugin, and Rec SDK packages.
  - If you need to install the tfplugin package, also copy /usr/local/Ascend/driver/version.info and /etc/ascend_install.info to the build_images directory.
  - Do not place unrelated files in the build_images directory, because all of its contents are copied into the image during the build.
- The build process involves Docker commands and file copying. Ensure you have the required execution and file access permissions.
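The directory preparation described above can be sketched as a small shell function. This is a minimal sketch, not an official tool: the function name stage_build_context and the override variables VERSION_INFO and INSTALL_INFO are hypothetical, and the default paths are the two driver files named in the prerequisites. It only stages those two files; the CANN, tfplugin, and Rec SDK packages still have to be copied in separately.

```shell
#!/usr/bin/env bash
# Sketch: create the build directory and copy in the two driver metadata
# files that the tfplugin installation needs.
# VERSION_INFO and INSTALL_INFO are hypothetical overrides for the default paths.

stage_build_context() {
    local build_dir="${1:-build_images}"
    local version_info="${VERSION_INFO:-/usr/local/Ascend/driver/version.info}"
    local install_info="${INSTALL_INFO:-/etc/ascend_install.info}"

    mkdir -p "$build_dir" || return 1

    local f
    for f in "$version_info" "$install_info"; do
        # Fail early if a file is missing or the current user cannot read it.
        if [ ! -r "$f" ]; then
            echo "missing or unreadable: $f" >&2
            return 1
        fi
        cp -f "$f" "$build_dir/" || return 1
    done

    # List the directory so that stray files are easy to spot before building.
    ls -1 "$build_dir"
}
```

Calling stage_build_context build_images on the physical machine creates the directory if needed and prints its contents, which makes it easy to confirm that nothing unrelated is in it.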
Using the Rec SDK Base Image to Build a Training Image
- Obtain the Rec SDK package, CANN package, and Ascend-adapted TensorFlow plugin by referring to Downloading the Rec SDK TensorFlow Package.
- Create a Dockerfile configuration file (for example, Dockerfile) in the build_images directory, run the vi Dockerfile command to edit the file, and insert the following content:
# Change the base image name and image tag accordingly.
FROM rec_sdk-tf1:7.2.RC1

# Set CANN parameters.
ARG TOOLKIT_PKG=Ascend-cann-toolkit*.run
ARG KERNEL_PKG=Ascend-cann-*-ops*.run
ARG TF1_PLUGIN=npu_bridge-1.15.0-*.whl
ARG TF2_PLUGIN=npu_device-2.6.5-*.whl
# Rec SDK TensorFlow package
ARG REC_SDK_PKG=Ascend-mindxsdk-mxrec*.tar.gz

# Set the environment variable of the installation path.
ARG ASCEND_BASE=/usr/local/Ascend

# Delete the old CANN.
RUN rm -rf $ASCEND_BASE/ascend-toolkit
RUN rm -rf $ASCEND_BASE/cann*

# Copy and install the dependencies as required. If you do not need to perform this operation, delete the command line.
WORKDIR /tmp
COPY $TOOLKIT_PKG .
COPY $KERNEL_PKG .
COPY $TF1_PLUGIN .
COPY $TF2_PLUGIN .
COPY $REC_SDK_PKG .
COPY version.info .
COPY ascend_install.info .

# Install the ascend-toolkit and operator packages.
RUN umask 0027 && \
    mkdir -p $ASCEND_BASE/driver && \
    /usr/bin/cp -f version.info $ASCEND_BASE/driver/ && \
    /usr/bin/cp -f ascend_install.info /etc/ && \
    chmod +x $TOOLKIT_PKG && \
    echo Y | bash $TOOLKIT_PKG --quiet --install --install-path=$ASCEND_BASE && \
    chmod +x $KERNEL_PKG && \
    echo Y | bash $KERNEL_PKG --quiet --install && \
    source $ASCEND_BASE/cann/set_env.sh && \
    rm -rf /root/.cache/pip && \
    rm -f $TOOLKIT_PKG && \
    rm -f $KERNEL_PKG && \
    rm -rf $ASCEND_BASE/driver && \
    rm -rf /etc/ascend_install.info

# Install the Rec SDK TensorFlow package. The following uses the tf1 version as an example. (To install the Rec SDK package of the tf2 version, change tf1 to tf2.)
ARG SDK_TF_VERSION=tf1
RUN pip3.7 install $TF1_PLUGIN --force-reinstall && \
    pip3.7 install $TF2_PLUGIN --force-reinstall && \
    tar -zxvf Ascend-mindxsdk-mxrec*.tar.gz && \
    pip3.7 install mindxsdk-mxrec/${SDK_TF_VERSION}_whl/mx_rec-*.whl --force-reinstall && \
    rm -rf /root/.cache/pip
- Go to the build_images directory and run the following command to build a Rec SDK TensorFlow image:
docker build -t {Image name}:{Image tag} -f Dockerfile .
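Because every COPY instruction in the Dockerfile above fails the build if its source is missing, it can help to check the build directory first. The following is a minimal sketch under stated assumptions: check_context is a hypothetical helper, not part of Docker or the Rec SDK, and the file patterns are the ones used by the ARG and COPY lines above.

```shell
#!/usr/bin/env bash
# Sketch: report every file pattern that has no match in the build directory.
# check_context is a hypothetical helper name.

check_context() {
    local dir="$1"
    shift
    local pattern missing=0
    for pattern in "$@"; do
        # find prints a line per matching file; grep -q . succeeds if any line exists.
        if ! find "$dir" -maxdepth 1 -type f -name "$pattern" | grep -q .; then
            echo "no file matches: $pattern" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example call (run from the parent of build_images):
# check_context build_images \
#     'Ascend-cann-toolkit*.run' 'Ascend-cann-*-ops*.run' \
#     'npu_bridge-1.15.0-*.whl' 'npu_device-2.6.5-*.whl' \
#     'Ascend-mindxsdk-mxrec*.tar.gz' version.info ascend_install.info Dockerfile
```

The function returns a non-zero status if any pattern is unmatched, so it can gate the docker build command in a script. If you delete a COPY line from the Dockerfile, drop the corresponding pattern from the call as well.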
Building a Training Image Based on CentOS 7.6.1810 or a User Image
- Check whether the following dependencies are already installed in the base image, and download the packages for any that are missing into the same directory.
Dependency
Download URL
GCC 7.3.0
CMake 3.20.6
UCX
OpenMPI 4.1.5
Python 3.7.5
HDF5 1.10.5
CANN package, Ascend-adapted TensorFlow plugin, and Rec SDK package
TensorFlow (1.15.0/2.6.5)
- Create a Dockerfile configuration file (for example, Dockerfile) in the build_images directory, run the vi Dockerfile command to edit the file, and insert the following content:
# Change the base image name and image tag accordingly.
FROM swr.cn-south-1.myhuaweicloud.com/ascendhub/centos:7.6.1810

WORKDIR /tmp
# Select the dependencies to be installed as required. If some dependencies are not required, delete or comment out the corresponding code.
# Ensure that the name of each downloaded dependency package matches the name used in the following code.
# Otherwise, an error indicating that the file cannot be found may occur during dependency installation.
COPY gcc-7.3.0.tar.gz ./
COPY cmake-3.20.6.tar.gz ./
COPY ucx-master.zip ./
COPY openmpi-4.1.5.tar.gz ./
COPY Python-3.7.5.tar.xz ./
COPY hdf5-1.10.5.tar.gz ./
COPY Ascend-cann-toolkit*.run ./
COPY Ascend-cann-*-ops*.run ./
COPY version.info ./
COPY ascend_install.info ./
COPY ./npu_bridge-1.15.0-*.whl ./
COPY ./npu_device-2.6.5-*.whl ./
COPY Ascend-mindxsdk-mxrec*.tar.gz ./

# 1. Install the compilation environment.
RUN yum makecache && \
    yum -y install centos-release-scl && \
    yum -y install devtoolset-7 && \
    yum -y install devtoolset-7-gcc-c++ && \
    yum -y install epel-release && \
    yum -y install wget zlib-devel bzip2 bzip2-devel openssl-devel ncurses-devel openssh-clients openssh-server sqlite-devel openmpi-devel \
    readline-devel tk-devel gdbm-devel db4-devel libpcap-devel xz-devel libffi-devel hdf5-devel patch pciutils lcov vim dos2unix gcc-c++ \
    autoconf automake libtool git net-tools make sudo unzip && \
    yum clean all && \
    rm -rf /var/cache/yum && \
    echo "source /opt/rh/devtoolset-7/enable" >> /etc/profile
# Note: openssh-server is required for two-server training. It can be removed only if you run single-server training.

# 2. Install GCC 7.3.0.
RUN source /etc/profile && \
    tar -zxvf gcc-7.3.0.tar.gz && \
    cd gcc-7.3.0 && \
    wget https://mirrors.huaweicloud.com/gnu/gmp/gmp-6.1.0.tar.bz2 && \
    wget https://mirrors.huaweicloud.com/gnu/mpfr/mpfr-3.1.4.tar.bz2 && \
    wget https://mirrors.huaweicloud.com/gnu/mpc/mpc-1.0.3.tar.gz && \
    wget https://mindx.obs.cn-south-1.myhuaweicloud.com/opensource/isl-0.16.1.tar.bz2 && \
    sed -i "246s/tar -xf "${ar}"/tar --no-same-owner -xf "${ar}"/" contrib/download_prerequisites && \
    ./contrib/download_prerequisites && \
    ./configure --enable-languages=c,c++ --disable-multilib --with-system-zlib --prefix=/usr/local/gcc7.3.0 && \
    make -j && make -j install && cd .. && \
    find gcc-7.3.0/ -name libstdc++.so.6.0.24 -exec cp {} /lib64/ \; && \
    rm -rf gcc-7.3.0*
ENV LD_LIBRARY_PATH=/usr/local/gcc7.3.0/lib64:$LD_LIBRARY_PATH \
    PATH=/usr/local/gcc7.3.0/bin:$PATH

# 3. Install CMake.
RUN source /etc/profile && gcc -v && tar -zxf cmake-3.20.6.tar.gz && \
    cd cmake-3.20.6 && \
    ./bootstrap && make && make install && cd .. && \
    rm -rf cmake-3.20.6*

# 4. Install UCX.
RUN source /etc/profile && gcc -v && unzip ucx-master.zip && \
    cd ucx-master && \
    ./autogen.sh && \
    ./contrib/configure-release --prefix=/usr/local/ucx && \
    make && make install && cd .. && \
    rm -rf ucx-master*

# 5. Install OpenMPI and configure UCX.
RUN source /etc/profile && gcc -v && tar -zxvf openmpi-4.1.5.tar.gz && \
    cd openmpi-4.1.5 && \
    ./configure --enable-orterun-prefix-by-default --prefix=/usr/local/openmpi --with-ucx=/usr/local/ucx && \
    make -j 16 && make install && cd .. && \
    rm -rf openmpi-4.1.5*
ENV LD_LIBRARY_PATH=/usr/local/openmpi/lib:$LD_LIBRARY_PATH \
    PATH=/usr/local/openmpi/bin:$PATH

# 6. Install Python 3.7.5.
RUN source /etc/profile && gcc -v && tar -xvf Python-3.7.5.tar.xz && \
    cd Python-3.7.5 && \
    mkdir -p build && cd build && \
    ../configure --enable-shared --prefix=/usr/local/python3.7.5 && \
    make -j && make install && \
    cd ../../ && rm -rf Python-3.7.5* && \
    ldconfig
ENV PATH=$PATH:/usr/local/python3.7.5/bin \
    LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/python3.7.5/lib

# Configure the Python source.
RUN mkdir ~/.pip && touch ~/.pip/pip.conf && \
    echo "[global]" > ~/.pip/pip.conf && \
    echo "trusted-host=pypi.douban.com" >> ~/.pip/pip.conf && \
    echo "index-url=http://pypi.douban.com/simple/" >> ~/.pip/pip.conf && \
    echo "timeout=200" >> ~/.pip/pip.conf

# 7. Install HDF5.
RUN source /etc/profile && gcc -v && tar -zxvf hdf5-1.10.5.tar.gz && \
    cd hdf5-1.10.5 && \
    ./configure --prefix=/usr/local/hdf5 && \
    make && make install && cd .. && rm -rf hdf5-1.10.5*
ENV CPATH=/usr/local/hdf5/include/:/usr/local/hdf5/lib/
RUN ln -s /usr/local/hdf5/lib/libhdf5.so /usr/lib/libhdf5.so && \
    ln -s /usr/local/hdf5/lib/libhdf5_hl.so /usr/lib/libhdf5_hl.so

# 8. Install the Python packages. When installing mpi4py, use this environment variable. After the installation is complete, unset it.
ENV CC=/usr/lib64/openmpi/bin/mpicc
RUN pip3.7 install -U pip && \
    pip3.7 install numpy && \
    pip3.7 install decorator && \
    pip3.7 install sympy==1.4 && \
    pip3.7 install cffi==1.12.3 && \
    pip3.7 install pyyaml && \
    pip3.7 install pathlib2 && \
    pip3.7 install grpcio && \
    pip3.7 install grpcio-tools && \
    pip3.7 install protobuf==3.20.0 && \
    pip3.7 install scipy && \
    pip3.7 install requests && \
    pip3.7 install mpi4py && \
    pip3.7 install scikit-learn && \
    pip3.7 install easydict && \
    pip3.7 install attrs && \
    pip3.7 install pytest==7.1.1 && \
    pip3.7 install pytest-cov==4.1.0 && \
    pip3.7 install pytest-html && \
    pip3.7 install Cython && \
    pip3.7 install h5py==3.1.0 && \
    pip3.7 install pandas && \
    pip3.7 install funcsigs && \
    pip3.7 install tqdm && \
    pip3.7 install portalocker && \
    rm -rf /root/.cache/pip
RUN unset CC

# 9. Set the environment variable of the driver path.
ARG ASCEND_BASE=/usr/local/Ascend
ENV LD_LIBRARY_PATH=$ASCEND_BASE/driver/lib64:$ASCEND_BASE/driver/lib64/common:$ASCEND_BASE/driver/lib64/driver:$LD_LIBRARY_PATH

# 10. Set CANN-related parameters.
ARG TOOLKIT_PKG=Ascend-cann-toolkit*.run
ARG KERNEL_PKG=Ascend-cann-*-ops*.run

# 11. Install the ascend-toolkit and operator packages.
RUN umask 0027 && \
    mkdir -p $ASCEND_BASE/driver && \
    /usr/bin/cp -f version.info $ASCEND_BASE/driver/ && \
    /usr/bin/cp -f ascend_install.info /etc/ && \
    chmod +x $TOOLKIT_PKG && \
    echo Y | bash $TOOLKIT_PKG --quiet --install --install-path=$ASCEND_BASE && \
    chmod +x $KERNEL_PKG && \
    echo Y | bash $KERNEL_PKG --quiet --install && \
    source $ASCEND_BASE/cann/set_env.sh && \
    rm -rf /root/.cache/pip && \
    rm -f $TOOLKIT_PKG && \
    rm -f $KERNEL_PKG && \
    rm -rf $ASCEND_BASE/driver && \
    rm -rf /etc/ascend_install.info

# 12. Install the TensorFlow-related Python packages and Rec SDK.
# By default, the tf1 image is built. To build the tf2 image, set TF_VER to 2.6.5 and SDK_TF_VERSION to tf2.
ARG TF_VER=1.15.0
ARG SDK_TF_VERSION=tf1
ARG TF1_PLUGIN=npu_bridge-1.15.0-*.whl
ARG TF2_PLUGIN=npu_device-2.6.5-*.whl
RUN pip3.7 install tensorflow==${TF_VER} && \
    pip3.7 install tf_slim && \
    HOROVOD_WITH_MPI=1 HOROVOD_WITH_TENSORFLOW=1 pip3.7 install horovod --no-cache-dir && \
    pip3.7 install $TF1_PLUGIN --force-reinstall && \
    pip3.7 install $TF2_PLUGIN --force-reinstall && \
    tar -zxvf Ascend-mindxsdk-mxrec*.tar.gz && \
    pip3.7 install mindxsdk-mxrec/${SDK_TF_VERSION}_whl/mx_rec-*.whl --force-reinstall && \
    rm -rf /root/.cache/pip

# 13. Clear the temporary directory.
RUN rm -rf ./*
- Go to the build_images directory and run the following command to build a Rec SDK TensorFlow image:
docker build -t {Image name}:{Image tag} -f Dockerfile .
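Because TF_VER is declared with ARG in step 12 of the Dockerfile above, its default can also be overridden at build time with docker build --build-arg instead of editing the file. The sketch below only prints the resulting command so that it can be reviewed before it is run. The helper names tf_ver_for and print_build_cmd are hypothetical, and the version numbers are the two TensorFlow versions listed in the dependency table. Overriding TF_VER does not by itself select the matching Rec SDK wheel directory, so the last pip3.7 install line in step 12 must still agree with the chosen variant.

```shell
#!/usr/bin/env bash
# Sketch: map a variant name to the TensorFlow version and print the build command.
# tf_ver_for and print_build_cmd are hypothetical helper names.

tf_ver_for() {
    case "$1" in
        tf1) echo "1.15.0" ;;
        tf2) echo "2.6.5" ;;
        *)
            echo "unknown variant: $1 (expected tf1 or tf2)" >&2
            return 1
            ;;
    esac
}

print_build_cmd() {
    local variant="$1" image="$2" ver
    ver=$(tf_ver_for "$variant") || return 1
    # Printed, not executed, so the command can be checked first.
    echo "docker build --build-arg TF_VER=$ver -t $image -f Dockerfile ."
}

# Example call, with a made-up image name and tag:
# print_build_cmd tf2 my-rec-sdk:tf2
```

Run the printed command from the build_images directory once it looks right.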