Creating a MindSpeed-LLM Training Image (PyTorch)

MindSpeed-LLM is the Ascend model training framework. It offers an end-to-end training solution for Ascend processors, supporting distributed pre-training, instruction fine-tuning, preference alignment, and a comprehensive development toolchain. The MindSpeed-LLM User Guide covers repository pulling, environment setup, and model training. Refer to that guide together with this section to create a MindSpeed-LLM training image.

Resumable training can be implemented based on a base training image. For details about how to create the base training image, see Creating a Container Image Using a Dockerfile (PyTorch).

This section describes how to create a training image based on Ubuntu 20.04.

The following example uses the MindSpeed-LLM 2.3.0 branch.

Obtaining Software Packages

Obtain the software packages of the corresponding OS and prepare the Dockerfile and script file required by the image by referring to Table 1. In the software package name, {version} indicates the version number, {arch} indicates the architecture, and {chip_type} indicates the processor type.

Table 1 Required software packages

| Software Package | Mandatory (Yes/No) | Description | How to Obtain |
| --- | --- | --- | --- |
| taskd-{version}-py3-none-linux_{arch}.whl | Yes | .whl package of the resumable training component. NOTE: Before installing TaskD, ensure that PyTorch has been correctly installed. The supported PyTorch versions are 2.1.0, 2.3.0, 2.4.0, 2.5.0, 2.6.0, and 2.7.1. | Link. NOTE: The link points to the download page of the Ascend-mindxdl-taskd_{version}_linux-{arch}.zip package. You need to decompress the package to obtain the required .whl package. |
| mindio_ttp-{version}-py3-none-linux_{arch}.whl | Yes | MindIO TFT installation package. | Link |
| apex-0.1+ascend-cp3x-cp3x-linux_{arch}.whl | Yes | MindSpeed-LLM dependency. Mixed precision training refers to training that uses both single-precision (float32) and half-precision (float16) data types and the same hyperparameters to achieve nearly the same precision as training using only float32. x can be 10 or 11. Currently, Python 3.10 and Python 3.11 are supported. | Compile the Apex package as required. |
| torch_npu-2.7.1.{version}-cp3x-cp3x-manylinux_2_28_{arch}.whl | Yes | MindSpeed-LLM dependency. Ascend Extension for PyTorch is a deep learning plugin that adapts Ascend NPUs to the PyTorch framework to allow PyTorch users to access the superb computing power of Ascend AI processors. x can be 10 or 11. Currently, Python 3.10 and Python 3.11 are supported. | Link. NOTE: The PyTorch model in the MindSpeed-LLM code repository requires Ascend Extension for PyTorch 2.6.0 or later. |
| x86_64: torch-2.7.1+cpu.cxx11.abi-cp3x-cp3x-linux_x86_64.whl; ARM: torch-2.7.1+cpu-cp3x-cp3x-manylinux_2_28_aarch64.whl | Yes | MindSpeed-LLM dependency. Official PyTorch package. x can be 10 or 11. Currently, Python 3.10 and Python 3.11 are supported. | Link |
| Ascend-cann-{chip_type}-ops_{version}_linux-{arch}.run | Yes | CANN operator package. For versions earlier than CANN 8.5.0, the package name is Ascend-cann-kernels-{chip_type}_{version}_linux-{arch}.run. | Link. NOTE: Obtain a software package that matches the server model. |
| Ascend-cann-toolkit_{version}_linux-{arch}.run | Yes | CANN ToolKit package. | Link. NOTE: Obtain a software package that matches the server model. |
| MindSpeed | Yes | MindSpeed is a foundation model acceleration library for Ascend devices. | git clone https://gitcode.com/Ascend/MindSpeed.git && cd MindSpeed && git checkout 2.3.0_core_r0.12.1 |
| version.info | Yes | CANN installation dependency. Driver version information file. | Copy the /usr/local/Ascend/driver/version.info file from the host. |
| ascend_install.info | Yes | CANN installation dependency. Driver installation information file. | Copy the /etc/ascend_install.info file from the host. |
| DLLogger code repository | Yes | PyTorch log tool. | git clone https://github.com/NVIDIA/dllogger.git |
| get-pip.py | Yes | Required for installing the pip module. | curl -k https://bootstrap.pypa.io/get-pip.py -o get-pip.py |
| Dockerfile | Yes | Required for creating an image. | - |
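
Before building, it can help to confirm that every item from Table 1 is actually present in the build directory, because the Dockerfile copies the whole directory into the image. The following is a minimal sketch and not part of the official procedure; the function name check_build_context is a hypothetical helper, and the file names in the example call are only illustrations.

```shell
#!/bin/sh
# Hypothetical helper (not from the MindCluster documentation):
# print every required entry that is missing from a directory and
# return non-zero if at least one entry is missing.
check_build_context() {
    dir="$1"; shift
    missing=0
    for name in "$@"; do
        if [ ! -e "$dir/$name" ]; then
            echo "missing: $name"
            missing=1
        fi
    done
    return "$missing"
}

# Example call (names follow Table 1; add the .whl and .run files you downloaded):
# check_build_context . Dockerfile get-pip.py version.info ascend_install.info MindSpeed dllogger
```

The function only checks for existence; it does not validate versions or file contents.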

To avoid using a software package that has been tampered with during transmission or storage, download its digital signature file for integrity check while downloading the software package.

After the software package is downloaded from the Support website, verify its PGP digital signature by referring to the OpenPGP Signature Verification Guide. If the verification fails, do not use the software package, and contact Huawei technical support.

Before using software for installation or upgrade, verify the digital signature to ensure that the software has not been tampered with.

For carriers, visit https://support.huawei.com/carrier/digitalSignatureAction.

For enterprises, visit https://support.huawei.com/enterprise/en/tool/pgp-verify-TL1000000054.
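
For readers who verify signatures with GnuPG, the check described above might look like the following sketch. It assumes that the vendor's public key has already been imported as described in the OpenPGP Signature Verification Guide and that the signature is a detached file next to the package. The function name verify_signature and the .asc file name in the comment are illustrative assumptions.

```shell
#!/bin/sh
# Hypothetical wrapper around `gpg --verify`. Assumes the signing key is
# already in the local keyring; this sketch does not import any key.
verify_signature() {
    sig="$1"   # detached signature file, for example package.run.asc (assumed name)
    pkg="$2"   # the downloaded software package
    if [ ! -f "$sig" ] || [ ! -f "$pkg" ]; then
        echo "signature or package file not found" >&2
        return 2
    fi
    # gpg exits with a non-zero status when the signature does not match.
    gpg --verify "$sig" "$pkg"
}
```

If the function returns a non-zero status, treat the package as unverified and do not install it.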

The following uses a single Atlas 800T A2 training server running Ubuntu 20.04 (Arm) with Python 3.10 as an example to describe how to create a training image. Modify the steps as required.
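
When adapting the example to another architecture or Python version, the torch wheel name changes according to the pattern in Table 1. The sketch below derives that name; torch_whl_name is a hypothetical helper, and the only facts it encodes are the two file name patterns listed in the table.

```shell
#!/bin/sh
# Hypothetical helper: map an architecture and a Python minor version
# (10 or 11) to the torch 2.7.1 wheel name listed in Table 1.
torch_whl_name() {
    arch="$1"; py_minor="$2"
    tag="cp3${py_minor}"
    case "$arch" in
        x86_64)  echo "torch-2.7.1+cpu.cxx11.abi-${tag}-${tag}-linux_x86_64.whl" ;;
        aarch64) echo "torch-2.7.1+cpu-${tag}-${tag}-manylinux_2_28_aarch64.whl" ;;
        *)       echo "unsupported architecture: $arch" >&2; return 1 ;;
    esac
}

# On the build host, the architecture can be read with `uname -m`:
# torch_whl_name "$(uname -m)" 10
```

The result can be used as the value of the PYTORCH_WHL argument in the Dockerfile.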

Procedure

  1. Prepare software packages on the host by referring to Table 1.
  2. Create a Dockerfile.
    FROM ubuntu:20.04 
    WORKDIR /root 
    COPY . . 
      
    ARG PYTORCH_WHL=torch-2.7.1+cpu-cp310-cp310-manylinux_2_28_aarch64.whl 
    ARG PYTORCH_NPU_WHL=torch_npu-2.7.1.{version}-cp310-cp310-manylinux_2_28_aarch64.whl 
    ARG APEX_WHL=apex-0.1+ascend-cp310-cp310-linux_aarch64.whl 
    ARG HOST_ASCEND_BASE=/usr/local/Ascend 
    ARG TOOLKIT_PATH=/usr/local/Ascend/cann 
    # The following uses CANN 8.5.0 as an example. Modify the following values based on the actual situation.
    ARG TOOLKIT=Ascend-cann-toolkit_8.5.0_linux-aarch64.run    
    ARG OPS=Ascend-cann-910b-ops_8.5.0_linux-aarch64.run 
    ARG TASKD_WHL=taskd-7.3.0-py3-none-linux_aarch64.whl   
    ARG MINDIO_TTP_WHL=mindio_ttp-1.0.0-py3-none-linux_aarch64.whl 
    ARG MINDSPEED=MindSpeed 
    ARG DLLOGGER=dllogger 
      
    RUN echo "nameserver 114.114.114.114" > /etc/resolv.conf 
      
    RUN echo "deb http://repo.huaweicloud.com/ubuntu-ports/ focal main restricted universe multiverse\n\ 
    deb http://repo.huaweicloud.com/ubuntu-ports/ focal-updates main restricted universe multiverse\n\ 
    deb http://repo.huaweicloud.com/ubuntu-ports/ focal-backports main restricted universe multiverse\n\ 
    deb http://ports.ubuntu.com/ubuntu-ports/ focal-security main restricted universe multiverse" > /etc/apt/sources.list 
      
    ARG DEBIAN_FRONTEND=noninteractive 
      
    # System packages
    RUN umask 0022 && apt update && \
        apt-get install -y --no-install-recommends \
        software-properties-common
    RUN umask 0022 && add-apt-repository -y ppa:deadsnakes/ppa && \
        apt update && \
        apt autoremove -y python python3 && \
        apt install -y python3.10 python3.10-dev
    # Create Python soft links.
    RUN ln -s /usr/bin/python3.10 /usr/bin/python
    RUN ln -s /usr/bin/python3.10 /usr/bin/python3
    RUN ln -s /usr/bin/python3.10-config /usr/bin/python-config
    RUN ln -s /usr/bin/python3.10-config /usr/bin/python3-config
    # System dependencies
    RUN umask 0022 && apt update && \
            apt-get install -y --no-install-recommends \
            gcc g++ make cmake vim \
            zlib1g zlib1g-dev \
            openssl libsqlite3-dev libssl-dev \
            libffi-dev unzip pciutils \
            net-tools libblas-dev \
            gfortran libblas3 libopenblas-dev \
            curl liblapack3 liblapack-dev \
            libhdf5-dev libxml2 patch
    # Time zone
    RUN ln -sf /usr/share/zoneinfo/UTC /etc/localtime
    # Configure the pip mirror.
    RUN mkdir -p ~/.pip \
    && echo '[global] \n\
    index-url=https://mirrors.huaweicloud.com/repository/pypi/simple\n\
    trusted-host=mirrors.huaweicloud.com' >> ~/.pip/pip.conf
    # pip3.10
    RUN cd /tmp && \
        apt-get download python3-distutils && \
        dpkg-deb -x python3-distutils_*.deb / && \
        rm python3-distutils_*.deb && \
        cd - && \
        python get-pip.py && \
        rm get-pip.py
    RUN umask 0022 && \
        pip install sympy==1.4 && \
        pip install cffi && \
        pip install pathlib2 && \
        pip install grpcio && \
        pip install grpcio-tools && \
        pip install torchvision==0.22.1 && \
        pip install transformers==4.51.0 && \
        pip install absl-py && \
        pip install datasets && \
        pip install tokenizers==0.20.1 && \
        pip install pyOpenSSL
    RUN useradd -d /home/HwHiAiUser -u 1000 -m -s /bin/bash HwHiAiUser
    # Install torch, torch_npu, and Apex.
    RUN umask 0022 && pip install $PYTORCH_WHL && \
        pip install $PYTORCH_NPU_WHL && \
        pip install $APEX_WHL
      
    # Ascend packages
    # Copy /usr/local/Ascend/driver/version.info and /etc/ascend_install.info from the host to the current directory first.
    RUN umask 0022 &&  \ 
        cp ascend_install.info /etc/ && \ 
        mkdir -p /usr/local/Ascend/driver/ && \ 
        cp version.info /usr/local/Ascend/driver/ && \ 
        chmod +x $TOOLKIT && \ 
        chmod +x $OPS 
      
    RUN umask 0022 && ./$TOOLKIT --install-path=/usr/local/Ascend/ --install --quiet 
    RUN echo "source /usr/local/Ascend/cann/set_env.sh" >> ~/.bashrc 
    RUN umask 0022 && ./$OPS --install --quiet 
      
    # After the toolkit package is installed, clear the following files. During container startup, the toolkit package is mounted by Ascend Docker Runtime.
    RUN rm -f version.info && rm -f ascend_install.info && \
        rm -rf /usr/local/Ascend/driver/
      
    RUN umask 0022 && cd $MINDSPEED && \ 
        pip install -r requirements.txt && \ 
        pip install -e . && \ 
        echo "export PYTHONPATH=/root/MindSpeed:\$PYTHONPATH" >> ~/.bashrc 
      
    RUN umask 0022 && cd $DLLOGGER && \ 
        python setup.py build && \ 
        python setup.py install 
      
    # Import the following environment variable.
    ENV HCCL_WHITELIST_DISABLE=1 
      
    # Create /lib64/ld-linux-aarch64.so.1.
    RUN umask 0022 && \ 
        if [ ! -d "/lib64" ]; \ 
        then \ 
            mkdir /lib64 && ln -sf /lib/ld-linux-aarch64.so.1 /lib64/ld-linux-aarch64.so.1; \ 
        fi 
      
    # Install the MindCluster resumable training components (TaskD and MindIO TFT).
    RUN umask 0022 && \ 
        pip install $TASKD_WHL && \ 
        pip install $MINDIO_TTP_WHL 
      
      
    # (Optional) Required only for graceful fault tolerance, pod-level rescheduling, or process-level rescheduling. The command patches torch/distributed/run.py so that it imports the TaskD adaptor.
    RUN sed -i '/import os/i import taskd.python.adaptor.patch' $(pip3 show torch | grep Location | awk -F ' ' '{print $2}')/torch/distributed/run.py  
    
    # Install the job scheduling dependency library.
    RUN pip install apscheduler 
    
    RUN rm -rf tmp && \ 
        rm -f $PYTORCH_WHL && \ 
        rm -f $PYTORCH_NPU_WHL && \ 
        rm -f $APEX_WHL && \ 
        rm -f $TOOLKIT && \ 
        rm -f $OPS && \ 
        rm -f $TASKD_WHL && \ 
        rm -f $MINDIO_TTP_WHL && \ 
        rm -rf $DLLOGGER && \ 
        rm -rf Dockerfile 
    ## Pack the preceding content into the image mindspeed-dl:v1.
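
The optional sed command in the Dockerfile inserts the TaskD import before every line that matches "import os", so running it twice would insert the line twice. The following sketch shows one way to make the same patch idempotent. It is an illustration rather than the documented method: patch_run_py is a hypothetical name, and the sketch assumes GNU sed and that the target file contains a line that is exactly "import os".

```shell
#!/bin/sh
# Hypothetical idempotent variant of the Dockerfile's sed patch.
# Assumes GNU sed (for -i and the one-line "i" command).
patch_run_py() {
    f="$1"
    # Skip the edit if the import is already present.
    if grep -q '^import taskd\.python\.adaptor\.patch$' "$f"; then
        return 0
    fi
    # Insert the TaskD import immediately before the exact line "import os".
    sed -i '/^import os$/i import taskd.python.adaptor.patch' "$f"
}

# Example (path lookup taken from the Dockerfile above):
# patch_run_py "$(pip3 show torch | grep Location | awk '{print $2}')/torch/distributed/run.py"
```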

    If Python 3.10 cannot be installed from the deadsnakes PPA (for example, because the PPA is unreachable or does not provide Python 3.10 packages), download the Python source code and compile and install it manually.
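
A source build of Python could be sketched as follows. This is an assumption-laden outline, not the documented procedure: the patch release 3.10.14, the /usr/local prefix, and the function names are choices made for illustration, and the build relies on the compiler and development packages that the Dockerfile installs as system dependencies.

```shell
#!/bin/sh
# Hypothetical sketch of a Python source build; the patch release is an assumption.
PY_VERSION="${PY_VERSION:-3.10.14}"

# python.org publishes source tarballs under /ftp/python/<version>/.
python_src_url() {
    echo "https://www.python.org/ftp/python/$1/Python-$1.tgz"
}

# Download, compile, and install Python. The function is only defined here,
# not called: it needs network access and root privileges.
build_python() {
    curl -fL "$(python_src_url "$PY_VERSION")" -o /tmp/python.tgz &&
    tar -xzf /tmp/python.tgz -C /tmp &&
    cd "/tmp/Python-$PY_VERSION" &&
    ./configure --prefix=/usr/local &&
    make -j"$(nproc)" &&
    make altinstall
}
```

With this prefix, make altinstall places the interpreter at /usr/local/bin/python3.10, so the soft links created in the Dockerfile would have to point there instead of /usr/bin/python3.10.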

  3. Build the image. Run the following command in the directory that contains the Dockerfile. Do not omit the period (.) at the end of the command. To make the Dockerfile more secure, you can add a HEALTHCHECK instruction to it based on service requirements. The HEALTHCHECK [OPTIONS] CMD instruction makes Docker run the specified command inside the container to check the running status of the container.
    docker build -t mindspeed-dl:v1 .
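
As an illustration of the HEALTHCHECK mechanism mentioned in this step, the sketch below shows a small probe script that a HEALTHCHECK instruction could call. The script path, the option values in the comment, and the suggested probe command are all assumptions; choose a check that reflects your own service.

```shell
#!/bin/sh
# Hypothetical probe for a HEALTHCHECK instruction, for example:
#   HEALTHCHECK --interval=60s --timeout=10s --retries=3 CMD /root/healthcheck.sh
# Docker treats exit status 0 as healthy and 1 as unhealthy.
probe() {
    # Run the given command quietly and translate its result into 0 or 1.
    if "$@" >/dev/null 2>&1; then
        echo "healthy"
        return 0
    fi
    echo "unhealthy"
    return 1
}

# Assumed check: the framework can still be imported inside the container.
# probe python -c "import torch_npu"
```

Wrapping the check in a function keeps the exit status within the two values that Docker interprets, whatever the underlying command returns.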