Creating a MindSpeed-LLM Training Image (PyTorch)
MindSpeed-LLM is an Ascend model training framework that offers an end-to-end training solution for Ascend processors. It supports distributed pre-training, instruction fine-tuning, and preference alignment, and provides a comprehensive development toolchain. The MindSpeed-LLM User Guide covers repository pulling, environment setup, and model training. Refer to that user guide together with this section to create a MindSpeed-LLM training image.
Resumable training can be implemented based on a base training image. For details about how to create the base training image, see Creating a Container Image Using a Dockerfile (PyTorch).
This section describes how to create a training image based on Ubuntu 20.04.
The following example uses the MindSpeed-LLM 2.3.0 branch.
Obtaining Software Packages
Obtain the software packages of the corresponding OS and prepare the Dockerfile and script file required by the image by referring to Table 1. In the software package name, {version} indicates the version number, {arch} indicates the architecture, and {chip_type} indicates the processor type.
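The placeholder convention above can be sketched as a small shell helper that expands a package name template and then checks that the expected files are present in the build directory. This is only an illustration: the function names expand_pkg_name and check_build_context are hypothetical, and the templates in the example are inferred from the default ARG values in the Dockerfile in this section, not taken from Table 1.

```shell
#!/bin/sh
# Expand the {version}, {arch}, and {chip_type} placeholders in a package name.
# $1: template, $2: version, $3: architecture, $4: processor type
expand_pkg_name() {
    printf '%s\n' "$1" |
        sed -e "s/{version}/$2/g" -e "s/{arch}/$3/g" -e "s/{chip_type}/$4/g"
}

# Print every expected file that is missing from directory $1.
# Returns 1 if at least one file is missing, 0 otherwise.
check_build_context() {
    dir=$1
    shift
    missing=0
    for f in "$@"; do
        if [ ! -e "$dir/$f" ]; then
            echo "missing: $f"
            missing=1
        fi
    done
    return "$missing"
}

# Example templates (assumed, modeled on the Dockerfile defaults in this section).
toolkit=$(expand_pkg_name 'Ascend-cann-toolkit_{version}_linux-{arch}.run' 8.5.0 aarch64 910b)
ops=$(expand_pkg_name 'Ascend-cann-{chip_type}-ops_{version}_linux-{arch}.run' 8.5.0 aarch64 910b)
echo "$toolkit"   # prints Ascend-cann-toolkit_8.5.0_linux-aarch64.run
echo "$ops"       # prints Ascend-cann-910b-ops_8.5.0_linux-aarch64.run
```

Before building, a call such as check_build_context . "$toolkit" "$ops" Dockerfile get-pip.py ascend_install.info version.info lists anything that still has to be downloaded or copied into the build directory.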
To avoid using a software package that has been tampered with during transmission or storage, download its digital signature file for integrity check while downloading the software package.
After the software package is downloaded from the Support website, verify its PGP digital signature by referring to the OpenPGP Signature Verification Guide. If the verification fails, do not use the software package, and contact Huawei technical support.
Before using software for installation or upgrade, verify the digital signature to ensure that the software has not been tampered with.
For carriers, visit https://support.huawei.com/carrier/digitalSignatureAction.
For enterprises, visit https://support.huawei.com/enterprise/en/tool/pgp-verify-TL1000000054.
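As a rough sketch of the integrity check described above, the following helper verifies a package against a detached PGP signature with GnuPG. It assumes that the signature file sits next to the package as <package>.asc and that the vendor public key has already been imported into the local keyring as described in the OpenPGP Signature Verification Guide; the function name verify_pkg is hypothetical.

```shell
#!/bin/sh
# Verify a downloaded package against its detached PGP signature.
# $1: package file, $2: signature file (assumed default: <package>.asc)
# Returns 0 on success, 1 if verification fails, 2 if a prerequisite is missing.
verify_pkg() {
    pkg=$1
    sig=${2:-$pkg.asc}
    if ! command -v gpg >/dev/null 2>&1; then
        echo "gpg not found" >&2
        return 2
    fi
    if [ ! -f "$pkg" ] || [ ! -f "$sig" ]; then
        echo "package or signature file missing: $pkg" >&2
        return 2
    fi
    # gpg --verify <signature> <data> checks a detached signature.
    if gpg --verify "$sig" "$pkg"; then
        echo "OK: $pkg"
    else
        echo "FAILED: $pkg (do not use this package)" >&2
        return 1
    fi
}
```

For example, verify_pkg Ascend-cann-toolkit_8.5.0_linux-aarch64.run would check the toolkit package used later in this section. A non-zero return code means the package must not be installed.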
The following uses a single Atlas 800T A2 training server running Ubuntu 20.04 (Arm) with Python 3.10 as an example to describe how to create a training image. Modify the steps as required.
Procedure
- Prepare software packages on the host by referring to Table 1.
- Create a Dockerfile.
FROM ubuntu:20.04
WORKDIR /root
COPY . .

ARG PYTORCH_WHL=torch-2.7.1+cpu-cp310-cp310-manylinux_2_28_aarch64.whl
ARG PYTORCH_NPU_WHL=torch_npu-2.7.1.{version}-cp310-cp310-manylinux_2_28_aarch64.whl
ARG APEX_WHL=apex-0.1+ascend-cp310-cp310-linux_aarch64.whl
ARG HOST_ASCEND_BASE=/usr/local/Ascend
ARG TOOLKIT_PATH=/usr/local/Ascend/cann
# The following uses CANN 8.5.0 as an example. Modify the following information based on the actual situation.
ARG TOOLKIT=Ascend-cann-toolkit_8.5.0_linux-aarch64.run
ARG OPS=Ascend-cann-910b-ops_8.5.0_linux-aarch64.run
ARG TASKD_WHL=taskd-7.3.0-py3-none-linux_aarch64.whl
ARG MINDIO_TTP_WHL=mindio_ttp-1.0.0-py3-none-linux_aarch64.whl
ARG MINDSPEED=MindSpeed
ARG DLLOGGER=dllogger

RUN echo "nameserver 114.114.114.114" > /etc/resolv.conf
RUN echo "deb http://repo.huaweicloud.com/ubuntu-ports/ focal main restricted universe multiverse\n\
deb http://repo.huaweicloud.com/ubuntu-ports/ focal-updates main restricted universe multiverse\n\
deb http://repo.huaweicloud.com/ubuntu-ports/ focal-backports main restricted universe multiverse\n\
deb http://ports.ubuntu.com/ubuntu-ports/ focal-security main restricted universe multiverse" > /etc/apt/sources.list
ARG DEBIAN_FRONTEND=noninteractive

# System packages
RUN umask 0022 && apt update && \
    apt-get install -y --no-install-recommends \
    software-properties-common
RUN umask 0022 && add-apt-repository ppa:deadsnakes/ppa && \
    apt update && \
    apt autoremove -y python python3 && \
    apt install -y python3.10 python3.10-dev

# Create Python soft links.
RUN ln -s /usr/bin/python3.10 /usr/bin/python
RUN ln -s /usr/bin/python3.10 /usr/bin/python3
RUN ln -s /usr/bin/python3.10-config /usr/bin/python-config
RUN ln -s /usr/bin/python3.10-config /usr/bin/python3-config

# System dependencies
RUN umask 0022 && apt update && \
    apt-get install -y --no-install-recommends \
    gcc g++ make cmake vim \
    zlib1g zlib1g-dev \
    openssl libsqlite3-dev libssl-dev \
    libffi-dev unzip pciutils \
    net-tools libblas-dev \
    gfortran libblas3 libopenblas-dev \
    curl unzip liblapack3 liblapack-dev \
    libhdf5-dev libxml2 patch

# Time zone
RUN ln -sf /usr/share/zoneinfo/UTC /etc/localtime

# Configure the pip mirror.
RUN mkdir -p ~/.pip \
&& echo '[global] \n\
index-url=https://mirrors.huaweicloud.com/repository/pypi/simple\n\
trusted-host=mirrors.huaweicloud.com' >> ~/.pip/pip.conf

# pip3.10
RUN cd /tmp && \
    apt-get download python3-distutils && \
    dpkg-deb -x python3-distutils_*.deb / && \
    rm python3-distutils_*.deb && \
    cd - && \
    python get-pip.py && \
    rm get-pip.py

RUN umask 0022 && \
    pip install sympy==1.4 && \
    pip install cffi && \
    pip install pathlib2 && \
    pip install grpcio && \
    pip install grpcio-tools && \
    pip install torchvision==0.22.1 && \
    pip install transformers==4.51.0 && \
    pip install absl-py && \
    pip install datasets && \
    pip install tokenizers==0.20.1 && \
    pip install pyOpenSSL

RUN useradd -d /home/HwHiAiUser -u 1000 -m -s /bin/bash HwHiAiUser

# Install torch, torch_npu, and Apex.
RUN umask 0022 && pip install $PYTORCH_WHL && \
    pip install $PYTORCH_NPU_WHL && \
    pip install $APEX_WHL

# Ascend packages
# Copy the /usr/local/Ascend/driver/version.info file on the host to the current directory first.
RUN umask 0022 && \
    cp ascend_install.info /etc/ && \
    mkdir -p /usr/local/Ascend/driver/ && \
    cp version.info /usr/local/Ascend/driver/ && \
    chmod +x $TOOLKIT && \
    chmod +x $OPS

RUN umask 0022 && ./$TOOLKIT --install-path=/usr/local/Ascend/ --install --quiet
RUN echo "source /usr/local/Ascend/cann/set_env.sh" >> ~/.bashrc
RUN umask 0022 && ./$OPS --install --quiet

# After the toolkit package is installed, clear the following files. During container startup, the toolkit package is mounted by Ascend Docker Runtime.
RUN rm -f version.info && rm -f ascend_install.info && \
    rm -rf /usr/local/Ascend/driver/

RUN umask 0022 && cd $MINDSPEED && \
    pip install -r requirements.txt && \
    pip install -e . && \
    echo "export PYTHONPATH=/root/MindSpeed:\$PYTHONPATH" >> ~/.bashrc

RUN umask 0022 && cd $DLLOGGER && \
    python setup.py build && \
    python setup.py install

# Import the following environment variable.
ENV HCCL_WHITELIST_DISABLE=1

# Create /lib64/ld-linux-aarch64.so.1.
RUN umask 0022 && \
    if [ ! -d "/lib64" ]; \
    then \
        mkdir /lib64 && ln -sf /lib/ld-linux-aarch64.so.1 /lib64/ld-linux-aarch64.so.1; \
    fi

# MindCluster resumable training adaptation script
RUN umask 0022 && \
    pip install $TASKD_WHL && \
    pip install $MINDIO_TTP_WHL

# (Optional) If graceful fault tolerance, pod-level rescheduling, or process-level rescheduling is required, configure the following command.
RUN sed -i '/import os/i import taskd.python.adaptor.patch' $(pip3 show torch | grep Location | awk -F ' ' '{print $2}')/torch/distributed/run.py

# Install the job scheduling dependency library.
RUN pip install apscheduler

RUN rm -rf tmp && \
    rm -f $PYTORCH_WHL && \
    rm -f $PYTORCH_NPU_WHL && \
    rm -f $APEX_WHL && \
    rm -f $TOOLKIT && \
    rm -f $OPS && \
    rm -f $TASKD_WHL && \
    rm -f $MINDIO_TTP_WHL && \
    rm -rf $DLLOGGER && \
    rm -rf Dockerfile
## Pack the preceding content into the image mindspeed-dl:v1.
If Python 3.10 cannot be installed from the deadsnakes PPA, for example because the PPA is unreachable or does not provide Python 3.10 packages, download the Python source code and compile and install it manually.
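A manual source build could look roughly like the sketch below. It is not part of the documented procedure: the patch release 3.10.14, the /usr/local prefix, and the function names run and build_python are assumptions chosen for illustration. It also assumes that the build dependencies from the Dockerfile (gcc, make, zlib1g-dev, libssl-dev, libffi-dev, libsqlite3-dev, and curl) are installed first. Setting DRY_RUN=1 prints the commands without running them.

```shell
#!/bin/sh
# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Download, compile, and install CPython from source.
# $1: Python version (assumed example: 3.10.14), $2: install prefix
build_python() {
    ver=${1:-3.10.14}
    prefix=${2:-/usr/local}
    src="Python-$ver"
    run curl -fsSLO "https://www.python.org/ftp/python/$ver/$src.tgz" || return 1
    run tar -xzf "$src.tgz" || return 1
    # "make altinstall" installs python3.10 without replacing an existing python3.
    run sh -c "cd $src && ./configure --prefix=$prefix && make -j\"\$(nproc)\" && make altinstall"
}
```

With the default prefix, the interpreter is installed as /usr/local/bin/python3.10, so the soft links in the Dockerfile, which point at /usr/bin/python3.10, would need to be adjusted accordingly.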
- Build the image. To make the image more secure, you can add a HEALTHCHECK instruction to the Dockerfile based on service requirements. The HEALTHCHECK [OPTIONS] CMD instruction runs a command in the container to check the running status of the container. Run the following command to generate the image. Do not omit the period (.) at the end of the command.
docker build -t mindspeed-dl:v1 .
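If a HEALTHCHECK instruction is added to the Dockerfile, the probe it runs can be a small script. The sketch below is one possible shape, not a recommended check: the script name healthcheck.sh, its location under /root, the option values in the comment, and the two things it tests are all assumptions that should be replaced by checks matching the actual service requirements.

```shell
#!/bin/sh
# healthcheck.sh: hypothetical probe for a Dockerfile line such as
#   HEALTHCHECK --interval=60s --timeout=10s --retries=3 CMD /root/healthcheck.sh
# Docker treats exit code 0 as healthy and exit code 1 as unhealthy.
#
# $1: CANN environment script expected in the image (default taken from the Dockerfile)
# $2: interpreter that must be on PATH (default: python)
health_check() {
    env_script=${1:-/usr/local/Ascend/cann/set_env.sh}
    interpreter=${2:-python}
    if [ ! -r "$env_script" ]; then
        echo "unreadable: $env_script"
        return 1
    fi
    if ! command -v "$interpreter" >/dev/null 2>&1; then
        echo "not on PATH: $interpreter"
        return 1
    fi
    echo "healthy"
}

# When saved as a standalone script, finish with:  health_check "$@"
```

After the container starts, docker inspect reports the result of the most recent probes in the container's health status.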