Creating a Resumable Training Container Image Using a Dockerfile (MindSpore)
Prerequisites
Obtain the software package of the corresponding OS and the Dockerfile and script files required for packaging images. For details, see Table 1. {version} in the name of the software package indicates the version number.
| Software Package | Description | How to Obtain |
|---|---|---|
| mindx_elastic-{version}-py3x-none-linux_{arch}.whl | WHL package of the cluster scheduling component, which provides the dying gasp feature of resumable training. {arch} indicates the CPU architecture, which can be AArch64 or x86. Python 3.7 and Python 3.9 are currently supported, so x in the package name is 7 or 9. Select the package that matches your environment. | |
| Dockerfile | Required for creating an image. | Prepared by users. |
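The naming rule in Table 1 can be turned into a small helper that assembles the expected file name for a given environment. This is only a sketch: the function name elastic_wheel_name is hypothetical, the version string must be taken from the package you actually downloaded, and the assumption that the {arch} suffix equals the output of uname -m (aarch64 or x86_64) is inferred from the package names used in the Dockerfile examples in this section.

```shell
#!/bin/sh
# Hypothetical helper: build the expected wheel file name from the pattern
# mindx_elastic-{version}-py3x-none-linux_{arch}.whl.
#   $1: package version (placeholder, use the version you downloaded)
#   $2: Python minor version, 7 or 9
#   $3: CPU architecture as reported by `uname -m` (aarch64 or x86_64)
elastic_wheel_name() {
    version=$1
    py_minor=$2
    arch=$3

    # Only Python 3.7 and Python 3.9 are listed as supported.
    case "$py_minor" in
        7|9) ;;
        *) echo "unsupported Python version: 3.$py_minor" >&2; return 1 ;;
    esac

    # Assumption: the suffix in the file name matches `uname -m`.
    case "$arch" in
        aarch64|x86_64) ;;
        *) echo "unsupported architecture: $arch" >&2; return 1 ;;
    esac

    printf 'mindx_elastic-%s-py3%s-none-linux_%s.whl\n' "$version" "$py_minor" "$arch"
}
```

For example, elastic_wheel_name "{version}" 9 "$(uname -m)" prints the file name to look for on the current machine when Python 3.9 is used.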
To prevent a software package from being maliciously tampered with during transmission or storage, download the corresponding digital signature file for integrity verification when downloading the software package.
After the software package is downloaded, verify its PGP digital signature according to the OpenPGP Signature Verification Guide. If the software package fails the verification, do not use the software package, and contact Huawei technical support.
Before a software package is used for installation or upgrade, verify its digital signature again according to the OpenPGP Signature Verification Guide to ensure that the package has not been tampered with.
For enterprise users, visit https://support.huawei.com/enterprise/en/tool/pgp-verify-TL1000000054.
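The verification described above can be scripted with GnuPG. The following is a minimal sketch, not the procedure from the OpenPGP Signature Verification Guide itself. It assumes that the signature is a detached file stored next to the package and named <package>.asc, and that the vendor's public key has already been imported into the local keyring. The function name verify_package is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: verify a detached OpenPGP signature with GnuPG.
# Assumptions: the signature file is named <package>.asc and the signer's
# public key is already present in the local keyring.
verify_package() {
    pkg=$1
    sig="$pkg.asc"

    if [ ! -f "$pkg" ] || [ ! -f "$sig" ]; then
        echo "missing package or signature file: $pkg" >&2
        return 2
    fi

    # gpg exits with a non-zero status if the signature does not match.
    if gpg --verify "$sig" "$pkg"; then
        echo "signature OK: $pkg"
    else
        echo "signature verification FAILED, do not use: $pkg" >&2
        return 1
    fi
}
```

A call such as verify_package /home/test/mindx_elastic-{version}-py39-none-linux_aarch64.whl returns 0 only when the signature is valid, so it can gate the later build steps in a script.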
This section uses Ubuntu as an example.
Procedure
- Upload the software package mindx_elastic-{version}-py3x-none-linux_{arch}.whl to any directory (for example, /home/test) on the server.
- Log in to the server as the root user.
- Perform the following steps to prepare the Dockerfile:
- Go to the directory where the software package is stored and run the following command to create the Dockerfile:
vim Dockerfile
- Write the content by referring to Compilation Examples. Then press Esc and enter :wq to save the file and exit.
- Go to the directory where the software packages are stored and run the following command to create a container image. Do not omit the period (.) at the end of the command.
docker build [OPTIONS] -t Image name_System architecture:Image tag .
See the following example:
docker build -t test_train_arm64:v1.0 .
Table 2 describes the parameters in the command.
Table 2 Parameters in the command
| Parameter | Description |
|---|---|
| -t | Specifies the image name. |
| OPTIONS | --disable-content-trust: skips image verification. This option is enabled by default, that is, verification is skipped. For security purposes, you are advised to disable it (--disable-content-trust=false) so that images are verified. |
| Image name_System architecture:Image tag | Image name and tag. Change them based on the actual situation. |
If "Successfully built xxx" is displayed, the image has been created.
- After the image is created, run the following command to view the image information:
docker images
Example:
REPOSITORY          TAG     IMAGE ID        CREATED           SIZE
test_train_arm64    v1.0    d82746acd7f0    27 minutes ago    749MB
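The build and check steps of the procedure can be wrapped in a short script. The sketch below is illustrative only: the function names image_ref and build_training_image are hypothetical, and the build context is passed as a directory argument instead of changing into it and using the trailing period. It also passes --disable-content-trust=false to turn image verification on. The build is expected to fail with that option if the base image is not signed, so adjust it to your environment.

```shell
#!/bin/sh
# Hypothetical helper: compose "Image name_System architecture:Image tag".
image_ref() {
    printf '%s_%s:%s\n' "$1" "$2" "$3"
}

# Hypothetical wrapper around the build and check steps.
#   $1: directory that contains the Dockerfile and the .whl package
#   $2: image reference, for example test_train_arm64:v1.0
build_training_image() {
    context_dir=$1
    ref=$2

    if [ ! -f "$context_dir/Dockerfile" ]; then
        echo "no Dockerfile in $context_dir" >&2
        return 2
    fi

    # Build with image verification turned on (content trust not disabled).
    docker build --disable-content-trust=false -t "$ref" "$context_dir" || return 1

    # Show the image that was just built.
    docker images "$ref"
}
```

For the example in this section, build_training_image /home/test "$(image_ref test_train arm64 v1.0)" builds and then lists test_train_arm64:v1.0.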
Compilation Examples
Modify the software package version and architecture based on the actual situation.
- Dockerfile compilation sample
- Dockerfile example for Ubuntu ARM
# Base training image
FROM xxx
ARG MINDX_ELASTIC_PKG=mindx_elastic-{version}-py39-none-linux_aarch64.whl
WORKDIR /tmp
COPY . ./
ENV http_proxy xxx
ENV https_proxy xxx
# Configure the Python pip source.
RUN mkdir -p ~/.pip \
&& echo '[global] \n\
index-url=https://pypi.doubanio.com/simple/\n\
trusted-host=pypi.doubanio.com' >> ~/.pip/pip.conf
# Install the adaptation package for MindX DL resumable training.
# Use the pip that matches the Python version in the package name (pip3.9 for a py39 package, pip3.7 for a py37 package).
RUN pip3.9 install $MINDX_ELASTIC_PKG
# To restrict the permission on the installed files, change it based on the Python installation path.
# The recommended permission on the program directory and files is 550.
# Command example: chmod 550 -R <mindx_elastic installation path>
ENV http_proxy ""
ENV https_proxy ""
RUN rm -f /tmp/$MINDX_ELASTIC_PKG
- Dockerfile example for Ubuntu x86
# Base training image
FROM xxx
ARG MINDX_ELASTIC_PKG=mindx_elastic-{version}-py39-none-linux_x86_64.whl
WORKDIR /tmp
COPY . ./
ENV http_proxy xxx
ENV https_proxy xxx
# Configure the Python pip source.
RUN mkdir -p ~/.pip \
&& echo '[global] \n\
index-url=https://pypi.doubanio.com/simple/\n\
trusted-host=pypi.doubanio.com' >> ~/.pip/pip.conf
# Install the adaptation package for MindX DL resumable training.
# Use the pip that matches the Python version in the package name (pip3.9 for a py39 package, pip3.7 for a py37 package).
RUN pip3.9 install $MINDX_ELASTIC_PKG
# To restrict the permission on the installed files, change it based on the Python installation path.
# The recommended permission on the program directory and files is 550.
# Command example: chmod 550 -R <mindx_elastic installation path>
ENV http_proxy ""
ENV https_proxy ""
RUN rm -f /tmp/$MINDX_ELASTIC_PKG
To make the image more secure, you can define a HEALTHCHECK instruction in the Dockerfile based on your services. The HEALTHCHECK [OPTIONS] CMD instruction makes Docker run the specified command inside the container periodically to check the running status of the container.
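As an illustration of the HEALTHCHECK note, the following sketch appends a HEALTHCHECK instruction to an existing Dockerfile. The function name append_healthcheck, the interval, timeout, and retry values, and the check command are all assumptions chosen for the example. Pick a check command that reflects the health of your own training service and that exists in your base image.

```shell
#!/bin/sh
# Hypothetical helper: append a HEALTHCHECK instruction to a Dockerfile.
#   $1: path to the Dockerfile
#   $2: command run inside the container; exit status 0 means healthy
append_healthcheck() {
    dockerfile=$1
    check_cmd=$2

    if [ ! -f "$dockerfile" ]; then
        echo "Dockerfile not found: $dockerfile" >&2
        return 2
    fi

    # Only the last HEALTHCHECK in a Dockerfile takes effect, so refuse to
    # add a second one instead of silently overriding the first.
    if grep -q '^HEALTHCHECK' "$dockerfile"; then
        echo "HEALTHCHECK already defined in $dockerfile" >&2
        return 1
    fi

    # Example values only: probe every 30s, 10s timeout, 3 retries.
    printf 'HEALTHCHECK --interval=30s --timeout=10s --retries=3 CMD %s || exit 1\n' \
        "$check_cmd" >> "$dockerfile"
}
```

For instance, append_healthcheck Dockerfile "pgrep -f train.py" marks the container unhealthy when no process matching the hypothetical script name train.py is running, provided that pgrep is available in the image.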