Creating a MindSpeed-LLM Training Image (PyTorch)

MindSpeed-LLM is the Ascend model training framework. It offers an end-to-end training solution for Ascend processors, covering distributed pre-training, instruction fine-tuning, preference alignment, and a comprehensive development toolchain. The MindSpeed-LLM User Guide covers pulling the repository, setting up the environment, and training models. Use that guide together with this section to create a MindSpeed-LLM training image.

Resumable training can be implemented based on a base training image. For details about how to create the base training image, see Creating a Container Image Using a Dockerfile (PyTorch).

This section describes how to create a training image based on Ubuntu 20.04.

The following example uses the MindSpeed-LLM 2.3.0 branch.

Obtaining Software Packages

Obtain the software packages of the corresponding OS and prepare the Dockerfile and script file required by the image by referring to Table 1. In the software package name, {version} indicates the version number, {arch} indicates the architecture, and {chip_type} indicates the processor type.

**Table 1** Required software packages
Software Package	Mandatory (Yes/No)	Description	How to Obtain
taskd-{version}-py3-none-linux_{arch}.whl	Yes	.whl package of the resumable training component. NOTE: Before installing TaskD, ensure that PyTorch has been correctly installed. The supported PyTorch versions are 2.1.0, 2.3.0, 2.4.0, 2.5.0, 2.6.0, and 2.7.1.	Link NOTE: The link points to the download page of the *Ascend-mindxdl-taskd_{version}_linux-{arch}.zip* package. You need to decompress the package to obtain the required .whl package.
mindio_ttp-{version}-py3-none-linux_{arch}.whl	Yes	MindIO TFT installation package	Link
apex-0.1+ascend-cp3x-cp3x-linux_{arch}.whl	Yes MindSpeed-LLM dependency	Mixed precision module. Mixed precision training uses both single-precision (float32) and half-precision (float16) data types with the same hyperparameters to achieve nearly the same precision as training that uses only float32. x can be 10 or 11. Currently, Python 3.10 and Python 3.11 are supported.	Compile the Apex package as required.
torch_npu-2.7.1.{version}-cp3x-cp3x-manylinux_2_28_{arch}.whl	Yes MindSpeed-LLM dependency	Ascend Extension for PyTorch is a deep learning plugin that adapts Ascend NPUs to the PyTorch framework to allow PyTorch users to access the superb computing power of Ascend AI processors. x can be 10 or 11. Currently, Python 3.10 and Python 3.11 are supported.	Link NOTE: The PyTorch model in the MindSpeed-LLM code repository requires Ascend Extension for PyTorch 2.6.0 or later.
x86_64: torch-2.7.1+cpu.cxx11.abi-cp3x-cp3x-linux_x86_64.whl ARM: torch-2.7.1+cpu-cp3x-cp3x-manylinux_2_28_aarch64.whl	Yes MindSpeed-LLM dependency	Official PyTorch package. x can be 10 or 11. Currently, Python 3.10 and Python 3.11 are supported.	Link
Ascend-cann-{chip_type}-ops_{version}_linux-{arch}.run	Yes For versions earlier than CANN 8.5.0, the package name is *Ascend-cann-kernels-{chip_type}_{version}_linux-{arch}.run*.	CANN operator package.	Link NOTE: Obtain a software package that matches the server model.
Ascend-cann-toolkit_{version}_linux-{arch}.run	Yes	CANN ToolKit package	Link NOTE: Obtain a software package that matches the server model.
MindSpeed	Yes	MindSpeed is a foundation model acceleration library for Ascend devices.	git clone https://gitcode.com/Ascend/MindSpeed.git && cd MindSpeed && git checkout 2.3.0_core_r0.12.1
version.info	Yes CANN installation dependency	Driver version information file.	Copy the /usr/local/Ascend/driver/version.info file from the host.
ascend_install.info	Yes CANN installation dependency	Driver installation information file.	Copy the /etc/ascend_install.info file from the host.
DLLogger code repository	Yes	PyTorch log tool.	git clone https://github.com/NVIDIA/dllogger.git
get-pip.py	Yes	Required for installing the pip module.	curl -k https://bootstrap.pypa.io/get-pip.py -o get-pip.py
Dockerfile	Yes	Required for creating an image.	-
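
The placeholders in the package names in Table 1 can be expanded mechanically. The following is a minimal shell sketch, not part of the official procedure: the function names `pkg_arch`, `taskd_whl_name`, and `toolkit_run_name` are hypothetical helpers, and the mapping of `arm64` and `amd64` to the architecture tokens is an assumption for hosts that report those names.

```shell
# Map a machine name (for example, the output of `uname -m`) to the {arch}
# token used in the package names. The arm64/amd64 aliases are an assumption.
pkg_arch() {
    case "$1" in
        aarch64|arm64) echo aarch64 ;;
        x86_64|amd64)  echo x86_64 ;;
        *) echo "unsupported architecture: $1" >&2; return 1 ;;
    esac
}

# Expand the TaskD wheel name pattern: taskd-{version}-py3-none-linux_{arch}.whl
taskd_whl_name() {
    printf 'taskd-%s-py3-none-linux_%s.whl\n' "$1" "$(pkg_arch "$2")"
}

# Expand the CANN toolkit name pattern: Ascend-cann-toolkit_{version}_linux-{arch}.run
toolkit_run_name() {
    printf 'Ascend-cann-toolkit_%s_linux-%s.run\n' "$1" "$(pkg_arch "$2")"
}

# Example: taskd_whl_name 7.3.0 "$(uname -m)"
```

On an Arm host, the example call yields the same TaskD file name that the sample Dockerfile in this section uses.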

To avoid using a software package that has been tampered with during transmission or storage, download its digital signature file for integrity check while downloading the software package.

After the software package is downloaded from the Support website, verify its PGP digital signature by referring to the OpenPGP Signature Verification Guide. If the verification fails, do not use the software package, and contact Huawei technical support.

Before using software for installation or upgrade, verify the digital signature to ensure that the software has not been tampered with.

For carriers, visit https://support.huawei.com/carrier/digitalSignatureAction.

For enterprises, visit https://support.huawei.com/enterprise/en/tool/pgp-verify-TL1000000054.
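
As a rough illustration of the check described above, the following shell sketch lists downloaded packages that have no signature file next to them and wraps the GnuPG verification call. It assumes that the signature for a package is saved as `<package>.asc` in the same directory and that the signer's public key has already been imported as described in the OpenPGP Signature Verification Guide. The function names are hypothetical, and packages that do not come from the Support website may not have a signature in this form.

```shell
# List packages in a directory that have no detached signature beside them.
# Assumption: the signature for <package> is stored as <package>.asc.
unsigned_packages() {
    for pkg in "$1"/*.run "$1"/*.whl "$1"/*.zip; do
        [ -e "$pkg" ] || continue              # skip glob patterns that matched nothing
        [ -e "$pkg.asc" ] || printf '%s\n' "${pkg##*/}"
    done
}

# Verify one package against its detached signature with GnuPG.
# A non-zero exit status means that the package must not be used.
verify_package() {
    gpg --verify "$1.asc" "$1"
}

# Example: unsigned_packages . ; verify_package Ascend-cann-toolkit_8.5.0_linux-aarch64.run
```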

The following uses a single Atlas 800T A2 training server running Ubuntu 20.04 (Arm) with Python 3.10 as an example to describe how to create a training image. Modify the steps as required.

Procedure

Prepare software packages on the host by referring to Table 1.
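
The two driver information files in Table 1 come from the host rather than from a download page. A minimal sketch of staging them into the build directory is shown below. The function name `stage_driver_info` is a hypothetical helper, and the first argument exists only so that the host root can be overridden for a dry run.

```shell
# Copy the driver information files required by the CANN installer into the
# build directory. Arguments: host root (normally "/") and build directory.
stage_driver_info() {
    host_root="${1%/}"
    context="$2"
    cp "$host_root/usr/local/Ascend/driver/version.info" "$context/" &&
    cp "$host_root/etc/ascend_install.info" "$context/"
}

# Example: stage_driver_info / .
```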

Create a Dockerfile.

FROM ubuntu:20.04 
WORKDIR /root 
COPY . . 
  
ARG PYTORCH_WHL=torch-2.7.1+cpu-cp310-cp310-manylinux_2_28_aarch64.whl 
# Replace {version} with the version in the name of the torch_npu package that you obtained.
ARG PYTORCH_NPU_WHL=torch_npu-2.7.1.{version}-cp310-cp310-manylinux_2_28_aarch64.whl
ARG APEX_WHL=apex-0.1+ascend-cp310-cp310-linux_aarch64.whl 
ARG HOST_ASCEND_BASE=/usr/local/Ascend 
ARG TOOLKIT_PATH=/usr/local/Ascend/cann 
# The following uses CANN 8.5.0 as an example. Modify the package names based on the actual situation.
ARG TOOLKIT=Ascend-cann-toolkit_8.5.0_linux-aarch64.run    
ARG OPS=Ascend-cann-910b-ops_8.5.0_linux-aarch64.run 
ARG TASKD_WHL=taskd-7.3.0-py3-none-linux_aarch64.whl   
ARG MINDIO_TTP_WHL=mindio_ttp-1.0.0-py3-none-linux_aarch64.whl 
ARG MINDSPEED=MindSpeed 
ARG DLLOGGER=dllogger 
  
RUN echo "nameserver 114.114.114.114" > /etc/resolv.conf 
  
RUN echo "deb http://repo.huaweicloud.com/ubuntu-ports/ focal main restricted universe multiverse\n\ 
deb http://repo.huaweicloud.com/ubuntu-ports/ focal-updates main restricted universe multiverse\n\ 
deb http://repo.huaweicloud.com/ubuntu-ports/ focal-backports main restricted universe multiverse\n\ 
deb http://ports.ubuntu.com/ubuntu-ports/ focal-security main restricted universe multiverse" > /etc/apt/sources.list 
  
ARG DEBIAN_FRONTEND=noninteractive 
  
# System packages
RUN umask 0022 && apt update && \
    apt-get install -y --no-install-recommends \
    software-properties-common
RUN umask 0022 && add-apt-repository ppa:deadsnakes/ppa && \
    apt update && \
    apt autoremove -y python python3 && \
    apt install -y python3.10 python3.10-dev
# Create Python soft links.
RUN ln -s /usr/bin/python3.10 /usr/bin/python
RUN ln -s /usr/bin/python3.10 /usr/bin/python3
RUN ln -s /usr/bin/python3.10-config /usr/bin/python-config
RUN ln -s /usr/bin/python3.10-config /usr/bin/python3-config
# System dependencies
RUN umask 0022 && apt update && \
        apt-get install -y --no-install-recommends \
        gcc g++ make cmake vim \
        zlib1g zlib1g-dev \
        openssl libsqlite3-dev libssl-dev \
        libffi-dev unzip pciutils \
        net-tools libblas-dev \
        gfortran libblas3 libopenblas-dev \
        curl liblapack3 liblapack-dev \
        libhdf5-dev libxml2 patch
# Time zone
RUN ln -sf /usr/share/zoneinfo/UTC /etc/localtime
# Configure the pip mirror.
RUN mkdir -p ~/.pip \
&& echo '[global] \n\
index-url=https://mirrors.huaweicloud.com/repository/pypi/simple\n\
trusted-host=mirrors.huaweicloud.com' >> ~/.pip/pip.conf
# pip3.10
RUN cd /tmp && \
    apt-get download python3-distutils && \
    dpkg-deb -x python3-distutils_*.deb / && \
    rm python3-distutils_*.deb && \
    cd - && \
    python get-pip.py && \
    rm get-pip.py
RUN umask 0022 && \
    pip install sympy==1.4 && \
    pip install cffi && \
    pip install pathlib2 && \
    pip install grpcio && \
    pip install grpcio-tools && \
    pip install torchvision==0.22.1 && \
    pip install transformers==4.51.0 && \
    pip install absl-py && \
    pip install datasets && \
    pip install tokenizers==0.20.1 && \
    pip install pyOpenSSL
RUN useradd -d /home/HwHiAiUser -u 1000 -m -s /bin/bash HwHiAiUser
# Install torch, torch_npu, and Apex.
RUN umask 0022 && pip install $PYTORCH_WHL && \
    pip install $PYTORCH_NPU_WHL && \
    pip install $APEX_WHL
  
# Ascend packages
# Copy the /usr/local/Ascend/driver/version.info and /etc/ascend_install.info files on the host to the current directory first.
RUN umask 0022 &&  \ 
    cp ascend_install.info /etc/ && \ 
    mkdir -p /usr/local/Ascend/driver/ && \ 
    cp version.info /usr/local/Ascend/driver/ && \ 
    chmod +x $TOOLKIT && \ 
    chmod +x $OPS 
  
RUN umask 0022 && ./$TOOLKIT --install-path=/usr/local/Ascend/ --install --quiet 
RUN echo "source /usr/local/Ascend/cann/set_env.sh" >> ~/.bashrc 
RUN umask 0022 && ./$OPS --install --quiet 
  
# The driver information files are needed only to install the toolkit package. Remove them afterward. During container startup, the driver is mounted by Ascend Docker Runtime.
RUN rm -f version.info && rm -f ascend_install.info && \
    rm -rf /usr/local/Ascend/driver/
  
RUN umask 0022 && cd $MINDSPEED && \ 
    pip install -r requirements.txt && \ 
    pip install -e . && \ 
    echo "export PYTHONPATH=/root/MindSpeed:\$PYTHONPATH" >> ~/.bashrc 
  
RUN umask 0022 && cd $DLLOGGER && \ 
    python setup.py build && \ 
    python setup.py install 
  
# Import the following environment variable.
ENV HCCL_WHITELIST_DISABLE=1 
  
# Create /lib64/ld-linux-aarch64.so.1.
RUN umask 0022 && \ 
    if [ ! -d "/lib64" ]; \ 
    then \ 
        mkdir /lib64 && ln -sf /lib/ld-linux-aarch64.so.1 /lib64/ld-linux-aarch64.so.1; \ 
    fi 
  
# Install the MindCluster resumable training components (TaskD and MindIO TFT).
RUN umask 0022 && \ 
    pip install $TASKD_WHL && \ 
    pip install $MINDIO_TTP_WHL 
  
  
# (Optional) If graceful fault tolerance, pod-level rescheduling, or process-level rescheduling is required, keep the following command, which adds the TaskD patch import to torch/distributed/run.py.
RUN sed -i '/import os/i import taskd.python.adaptor.patch' $(pip3 show torch | grep Location | awk -F ' ' '{print $2}')/torch/distributed/run.py  

# Install the job scheduling dependency library.
RUN pip install apscheduler 

RUN rm -rf tmp && \ 
    rm -f $PYTORCH_WHL && \ 
    rm -f $PYTORCH_NPU_WHL && \ 
    rm -f $APEX_WHL && \ 
    rm -f $TOOLKIT && \ 
    rm -f $OPS && \ 
    rm -f $TASKD_WHL && \ 
    rm -f $MINDIO_TTP_WHL && \ 
    rm -rf $DLLOGGER && \ 
    rm -rf Dockerfile 
# The preceding content is packed into the image mindspeed-dl:v1.
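
The optional sed step in the Dockerfile inserts the line `import taskd.python.adaptor.patch` before each line of `torch/distributed/run.py` that matches `import os`. The following sketch reproduces that edit on an arbitrary file so that its effect can be inspected outside the image. The function name `patch_run_py` is hypothetical, the guard that skips an already patched file is an addition that the Dockerfile does not have, and the one-line `i` form relies on GNU sed.

```shell
# Insert the TaskD patch import before the line that imports os.
# The grep guard (an addition to the Dockerfile command) keeps the edit
# from being applied twice. Requires GNU sed.
patch_run_py() {
    grep -q 'import taskd.python.adaptor.patch' "$1" && return 0
    sed -i '/import os/i import taskd.python.adaptor.patch' "$1"
}

# Example: patch_run_py /path/to/site-packages/torch/distributed/run.py
```

Because the pattern is a plain substring match, any other line that contains `import os` would also receive the inserted line.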

If Python 3.10 cannot be installed from the deadsnakes PPA, or the PPA does not provide Python 3.10 packages for your platform, download the Python source code and compile and install it manually.
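
A source build could look roughly like the following shell sketch. It is an illustration under stated assumptions rather than a tested part of this procedure: the function names are hypothetical, 3.10.14 is only an example patch release, and the compiler and the development libraries listed under "System dependencies" in the Dockerfile must already be installed.

```shell
# Build the download URL of a CPython source tarball on python.org.
python_src_url() {
    printf 'https://www.python.org/ftp/python/%s/Python-%s.tgz\n' "$1" "$1"
}

# Download, compile, and install CPython under /usr/local.
# The body runs in a subshell so that the caller's working directory is unchanged.
build_python() (
    version="$1"
    curl -fsSLO "$(python_src_url "$version")" &&
    tar -xzf "Python-$version.tgz" &&
    cd "Python-$version" &&
    ./configure --prefix=/usr/local &&
    make -j"$(nproc)" &&
    make altinstall        # installs python3.10 without replacing the system python3
)

# Example: build_python 3.10.14
```

With this prefix the interpreter is installed as /usr/local/bin/python3.10, so the soft links created in the Dockerfile would need to point to that path instead of /usr/bin/python3.10.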

Build the image. Run the following command in the directory that contains the Dockerfile and the software packages. Do not omit the period (.) at the end of the command. To make the image more secure, you can add a HEALTHCHECK instruction to the Dockerfile based on service requirements. The HEALTHCHECK [OPTIONS] CMD command instruction makes Docker run the specified command inside the container to check the running status of the container.
```
docker build -t mindspeed-dl:v1 .
```
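
A common cause of build failures is an ARG value that does not match the name of the file actually downloaded. The following sketch lists every .whl or .run file named in an ARG line of the Dockerfile that is missing from the build directory. The function name `missing_arg_files` is a hypothetical helper, and the check covers only ARG lines written as `ARG NAME=value`.

```shell
# Print each .whl or .run file that an ARG line refers to but that does not
# exist in the build directory. Arguments: Dockerfile path and build directory.
missing_arg_files() {
    # Extract the value after "=", drop trailing whitespace, keep package files.
    sed -n 's/^ARG [A-Za-z_][A-Za-z0-9_]*=\(.*\)$/\1/p' "$1" |
    sed 's/[[:space:]]*$//' |
    grep -E '\.(whl|run)$' |
    while IFS= read -r name; do
        [ -e "$2/$name" ] || printf '%s\n' "$name"
    done
}

# Example: missing_arg_files Dockerfile .
```

An ARG value that still contains a placeholder such as {version} is reported as missing, which serves as a reminder to replace it before the build.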
