Creating a MindFormers Training Image (MindSpore)

MindSpore Transformers (MindFormers for short) aims to provide a full-process development suite for foundation model training, fine-tuning, evaluation, inference, and deployment. It provides mainstream Transformer-based pre-trained models and industry SOTA downstream task applications, and supports various parallel features, helping users train foundation models and carry out innovative R&D more easily.

The MindSpore Transformers documentation covers software installation and a quick start, which you can refer to during image creation.

You can create a training image based on a base training image and the MindFormers documentation. For details about how to create the base training image, see Creating a Container Image Using a Dockerfile (MindSpore).

This section describes how to create a training image based on Ubuntu 20.04.

Obtaining Software Packages

Obtain the software packages of the corresponding OS and prepare the Dockerfile and script file required by the image by referring to Table 1. In the software package name, {version} indicates the version number, {arch} indicates the architecture, and {chip_type} indicates the processor type.
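As an illustration of how the placeholders resolve, the following minimal sketch expands a package-name template into a concrete file name. The helper name expand_pkg_name is hypothetical and is not part of any Ascend tool; the sample values (8.5.0, aarch64, 910b) are the ones used in the example Dockerfile in this section and may differ in your environment.

```shell
#!/bin/sh
# Hypothetical helper: expand {version}, {arch}, and {chip_type} in a
# package-name template. Not part of any Ascend or MindSpore tool.
expand_pkg_name() {
    # $1: template, $2: version, $3: architecture, $4: chip type (optional)
    printf '%s\n' "$1" | sed \
        -e "s/{version}/$2/g" \
        -e "s/{arch}/$3/g" \
        -e "s/{chip_type}/${4:-}/g"
}

# Sample values taken from the example Dockerfile in this section.
expand_pkg_name 'Ascend-cann-toolkit_{version}_linux-{arch}.run' 8.5.0 aarch64
expand_pkg_name 'Ascend-cann-{chip_type}-ops_{version}_linux-{arch}.run' 8.5.0 aarch64 910b
```

The resulting names can be compared against the files you downloaded before they are referenced in the Dockerfile ARG lines.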

Table 1 Required software packages

Software Package: MindFormers code repository
Mandatory (Yes/No): Yes
Description: Used to build a full-process development suite for foundation model training, fine-tuning, evaluation, inference, and deployment. It provides mainstream Transformer-based pre-trained models and SOTA downstream task applications in the industry, covering various parallel features.
How to Obtain:
git clone https://gitee.com/mindspore/mindformers.git
cd mindformers
git checkout f06a946af29c8c7e002a6c49458f513d47b642e5

Software Package: requirements.txt
Mandatory (Yes/No): No
Description: When MindSpore is installed using pip, an error may be reported during dependency installation. In this case, you can install the dependencies first.
How to Obtain:
wget https://gitee.com/mindspore/mindspore/raw/r2.4.1/requirements.txt
NOTE:
MindSpore must be used with an Atlas training product. For details, see MindSpore Installation Guide.

Software Package: mindspore-{version}-cp3x-cp3x-linux_aarch64.whl
Mandatory (Yes/No): Yes
Description: MindSpore .whl package.
How to Obtain: Link

Software Package: mindio_ttp-{version}-py3-none-linux_{arch}.whl
Mandatory (Yes/No): Yes
Description: MindIO TFT installation package.
How to Obtain: Link

Software Package: Ascend-cann-{chip_type}-ops_{version}_linux-{arch}.run
Mandatory (Yes/No): Yes
Description: CANN operator package. For versions earlier than CANN 8.5.0, the package name is Ascend-cann-kernels-{chip_type}_{version}_linux-{arch}.run.
How to Obtain: Link
NOTE:
Obtain a software package that matches the server model.

Software Package: Ascend-cann-toolkit_{version}_linux-{arch}.run
Mandatory (Yes/No): Yes
Description: CANN Toolkit package.
How to Obtain: Link
NOTE:
Obtain a software package that matches the server model.

Software Package: taskd-{version}-py3-none-linux_{arch}.whl
Mandatory (Yes/No): Yes
Description: .whl package of the resumable training component.
How to Obtain: Link
NOTE:
  • The .whl package must be installed if you want to use graceful fault tolerance, pod-level rescheduling, process-level rescheduling, and process-level online recovery in the MindSpore scenario.
  • The link points to the download page of the Ascend-mindxdl-taskd_{version}_linux-{arch}.zip package. You need to decompress the package to obtain the required .whl package.

Software Package: version.info
Mandatory (Yes/No): Yes
Description: CANN installation dependency. Driver version information file.
How to Obtain: Copy the /usr/local/Ascend/driver/version.info file from the host.

Software Package: ascend_install.info
Mandatory (Yes/No): Yes
Description: CANN installation dependency. Driver installation information file.
How to Obtain: Copy the /etc/ascend_install.info file from the host.

Software Package: get-pip.py
Mandatory (Yes/No): Yes
Description: Required for installing the pip module.
How to Obtain:
curl -k https://bootstrap.pypa.io/get-pip.py -o get-pip.py

Software Package: Dockerfile
Mandatory (Yes/No): Yes
Description: Required for creating an image.
How to Obtain: -
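Because the example Dockerfile in this section copies the whole build directory into the image, every item in Table 1 needs to be present in that directory before the build starts. The following is a minimal sketch of a pre-build check. The helper name check_build_context is hypothetical, and the file names in the usage comment are the example values from the Dockerfile in this section; adjust them to the packages you actually downloaded.

```shell
#!/bin/sh
# Hypothetical helper: report which expected files or directories are
# missing from a build directory. Returns 0 only if all are present.
check_build_context() {
    dir=$1
    shift
    missing=0
    for f in "$@"; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $f" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (example file names from the Dockerfile in this section):
# check_build_context . Dockerfile get-pip.py version.info ascend_install.info \
#     requirements.txt mindformers \
#     Ascend-cann-toolkit_8.5.0_linux-aarch64.run \
#     Ascend-cann-910b-ops_8.5.0_linux-aarch64.run \
#     mindio_ttp-1.0.0-py3-none-linux_aarch64.whl \
#     mindspore-2.5.0-cp310-cp310-linux_aarch64.whl \
#     taskd-7.0.RC1-py3-none-linux_aarch64.whl
```

Running the check first is cheaper than discovering a missing file partway through a long image build.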

To avoid using a software package that has been tampered with during transmission or storage, download its digital signature file together with the software package and verify the signature before using the package for installation or upgrade.

After the software package is downloaded from the Support website, verify its PGP digital signature by referring to the OpenPGP Signature Verification Guide. If the verification fails, do not use the software package, and contact Huawei technical support.

For carriers, visit https://support.huawei.com/carrier/digitalSignatureAction.

For enterprises, visit https://support.huawei.com/enterprise/en/tool/pgp-verify-TL1000000054.
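The verification step can be sketched as a small wrapper around GnuPG. This is only an illustration under stated assumptions: the helper name verify_pkg is hypothetical, the detached signature is assumed to be stored next to the package as <package>.asc, and the vendor public key is assumed to have been imported into the local keyring already. The OpenPGP Signature Verification Guide remains the authoritative procedure.

```shell
#!/bin/sh
# Hypothetical helper: verify a package against its detached PGP signature.
# Assumptions: the signature is <package>.asc unless passed as $2, and the
# vendor public key is already in the local GnuPG keyring.
verify_pkg() {
    pkg=$1
    sig=${2:-$pkg.asc}
    if [ ! -f "$pkg" ] || [ ! -f "$sig" ]; then
        echo "package or signature file not found: $pkg" >&2
        return 2
    fi
    # gpg --verify <signature> <signed file> exits non-zero when the
    # signature does not match the file.
    gpg --verify "$sig" "$pkg"
}

# Usage (example file name from the Dockerfile in this section):
# verify_pkg Ascend-cann-toolkit_8.5.0_linux-aarch64.run || echo "do not use this package"
```

Returning a distinct code (2) for a missing file keeps that case separate from a failed signature check.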

This section uses a single Atlas 800T A2 training server running on Ubuntu 20.04 with Python 3.10 as an example to describe how to create an image. Modify the related steps as required.

Procedure

  1. Prepare software packages on the host.
  2. Create a Dockerfile as follows.
    FROM ubuntu:20.04
     
    WORKDIR /root
     
    COPY . .
     
    ARG HOST_ASCEND_BASE=/usr/local/Ascend
    ARG TOOLKIT_PATH=/usr/local/Ascend/cann
    # The following uses CANN 8.5.0 as an example. Modify the package names and versions based on the actual situation.
    ARG TOOLKIT=Ascend-cann-toolkit_8.5.0_linux-aarch64.run
    ARG OPS=Ascend-cann-910b-ops_8.5.0_linux-aarch64.run
    ARG MINDIO_TTP_WHL=mindio_ttp-1.0.0-py3-none-linux_aarch64.whl
    ARG MINDFORMERS=mindformers
    ARG MINDSPORE_REQUIREMENTS=requirements.txt
    ARG MINDSPORE_WHL=mindspore-2.5.0-cp310-cp310-linux_aarch64.whl
    ARG TASKD_WHL=taskd-7.0.RC1-py3-none-linux_aarch64.whl    
     
    RUN echo "nameserver 114.114.114.114" > /etc/resolv.conf
     
    RUN echo "deb http://repo.huaweicloud.com/ubuntu-ports/ focal main restricted universe multiverse\n\
    deb http://repo.huaweicloud.com/ubuntu-ports/ focal-updates main restricted universe multiverse\n\
    deb http://repo.huaweicloud.com/ubuntu-ports/ focal-backports main restricted universe multiverse\n\
    deb http://ports.ubuntu.com/ubuntu-ports/ focal-security main restricted universe multiverse" > /etc/apt/sources.list
     
     
    ARG DEBIAN_FRONTEND=noninteractive
     
    RUN umask 0022 && apt update && \
        apt-get install -y --no-install-recommends \
        software-properties-common
    RUN umask 0022 && add-apt-repository ppa:deadsnakes/ppa && \
        apt update && \
        apt autoremove -y python python3 && \
        apt install -y python3.10 python3.10-dev
     
    # Create a Python soft link.
    RUN ln -s /usr/bin/python3.10 /usr/bin/python
    RUN ln -s /usr/bin/python3.10 /usr/bin/python3
    RUN ln -s /usr/bin/python3.10-config /usr/bin/python-config
    RUN ln -s /usr/bin/python3.10-config /usr/bin/python3-config
     
    # System packages
    RUN umask 0022 && apt update && \
        apt-get install -y --no-install-recommends \
            gcc g++ make cmake vim \
            zlib1g zlib1g-dev \
            openssl libsqlite3-dev libssl-dev \
            libffi-dev unzip pciutils \
            net-tools libblas-dev \
            gfortran libblas3 libopenblas-dev \
            curl liblapack3 liblapack-dev \
            libhdf5-dev libxml2 patch
     
    # Time zone
    # RUN ln -sf /usr/share/zoneinfo/Asia/Shanghai /etc/localtime
    RUN ln -sf /usr/share/zoneinfo/UTC /etc/localtime
     
    # Configure the pip mirror.
    RUN mkdir -p ~/.pip \
    && echo '[global] \n\
    index-url=https://mirrors.huaweicloud.com/repository/pypi/simple\n\
    trusted-host=mirrors.huaweicloud.com' >> ~/.pip/pip.conf
     
    # pip3.10
    RUN cd /tmp && \
        apt-get download python3-distutils && \
        dpkg-deb -x python3-distutils_*.deb / && \
        rm python3-distutils_*.deb && \
        cd - && \
        python get-pip.py && \
        rm get-pip.py
     
    RUN umask 0022 && \
        pip install sympy==1.4 && \
        pip install cffi && \
        pip install pathlib2 && \
        pip install grpcio && \
        pip install grpcio-tools && \
        pip install absl-py && \
        pip install datasets && \
        pip install tokenizers==0.20.1 && \
        pip install pyOpenSSL
     
    # Create the HwHiAiUser user and group. The UID and GID must be the same as those on the
    # physical machine to avoid generating ownerless files. In this example, the user and the
    # corresponding group are created automatically, and the UID and GID are both 1000.
    RUN useradd -d /home/HwHiAiUser -u 1000 -m -s /bin/bash HwHiAiUser
     
    # Ascend packages
    # Copy the /usr/local/Ascend/driver/version.info and /etc/ascend_install.info files on the host to the current directory first.
    RUN umask 0022 &&  \
        cp ascend_install.info /etc/ && \
        mkdir -p /usr/local/Ascend/driver/ && \
        cp version.info /usr/local/Ascend/driver/ && \
        chmod +x $TOOLKIT && \
        chmod +x $OPS
     
    RUN umask 0022 && ./$TOOLKIT --install-path=/usr/local/Ascend/ --install --quiet
    RUN echo "source /usr/local/Ascend/cann/set_env.sh" >> ~/.bashrc
    RUN umask 0022 && ./$OPS --install --quiet
     
    # The following files are needed only for installing the toolkit package, so remove them afterward. During container startup, the driver is mounted from the host by Ascend Docker.
    RUN rm -f version.info && \
        rm -rf /usr/local/Ascend/driver/
     
    # Install MindSpore.
    RUN umask 0022 && pip uninstall te topi hccl -y && \
             pip install sympy && \
             pip install /usr/local/Ascend/cann/lib64/hccl-*-py3-none-any.whl
    RUN umask 0022 && \
        pip install -r $MINDSPORE_REQUIREMENTS && \
        pip install $MINDSPORE_WHL
     
    # Install MindFormers.
    RUN umask 0022 && cd $MINDFORMERS && \
        pip install -r requirements.txt
     
    # Install the MindCluster adaptation packages for lossless resumable training (the MindIO TFT and TaskD .whl packages).
    RUN umask 0022 && \
        pip install $MINDIO_TTP_WHL --target=$(pip show mindspore | awk '/Location:/ {print $2}') && \
        pip install $TASKD_WHL
     
     
    # Environment variable
    ENV HCCL_WHITELIST_DISABLE=1
     
    # Create /lib64/ld-linux-aarch64.so.1.
    RUN umask 0022 && \
        if [ ! -d "/lib64" ]; \
        then \
            mkdir /lib64 && ln -sf /lib/ld-linux-aarch64.so.1 /lib64/ld-linux-aarch64.so.1; \
        fi
     
    # Install the job scheduling dependency library.
    RUN pip install apscheduler
     
     
    RUN rm -rf tmp && \
        rm -f $TOOLKIT && \
        rm -f $OPS && \
        rm -f $MINDIO_TTP_WHL && \
        rm -f $MINDSPORE_REQUIREMENTS && \
        rm -f $MINDSPORE_WHL
    ## Pack the preceding content into the image mindformers-dl:v1.
  3. Build the image. Run the following command to generate the image. Do not omit the period (.) at the end of the command. To make the image more secure, you can add a HEALTHCHECK instruction to the Dockerfile based on service requirements. The HEALTHCHECK [OPTIONS] CMD instruction makes Docker run the specified command inside the container periodically to check the running status of the container.
    docker build -t mindformers-dl:v1 .
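The build step can also be wrapped so that a missing Dockerfile is caught early and the resulting tag is confirmed afterward. The following is a minimal sketch; the helper name build_image is hypothetical, and it assumes the docker CLI is installed on the host.

```shell
#!/bin/sh
# Hypothetical wrapper around the build command above.
build_image() {
    tag=$1
    ctx=${2:-.}
    if [ ! -f "$ctx/Dockerfile" ]; then
        echo "no Dockerfile in $ctx" >&2
        return 1
    fi
    # Build the image, then confirm that the tag exists in the local image store.
    docker build -t "$tag" "$ctx" && docker image inspect "$tag" >/dev/null
}

# Usage:
# build_image mindformers-dl:v1 .
```

The function returns 0 only when both the build and the follow-up inspection succeed, so it can be used directly in scripts that go on to push or run the image.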