Creating a Resumable Training Container Image on ModelArts Using a Dockerfile (MindSpore)
Prerequisites
Obtain the software packages for your OS and the Dockerfile required for building the image, as listed in the following table. In the file names, {version} indicates the version number and {arch} indicates the architecture. Currently, ModelArts with Ascend AI Processors supports only AArch64 images.
| Software Package | Description | How to Obtain |
|---|---|---|
| mindspore_ascend-{version}-cp37-cp37m-linux_{arch}.whl | WHL package of the MindSpore framework. Select the AArch64 architecture. | |
| Ascend-cann-toolkit_{version}_linux-{arch}.run | CANN development suite package. Select the AArch64 architecture. | |
| Dockerfile | Required for creating an image. | See the following example. |
| mindx_elastic-{version}-py37-none-linux_{arch}.whl | WHL package of the cluster scheduling component, which provides the dying gasp feature of resumable training. Select the AArch64 architecture. | |
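Before building, it can help to confirm that every file listed in the table is present in the build directory. The following is a minimal shell sketch, not part of the ModelArts tooling. The function name `check_packages` is hypothetical, and the glob patterns assume that {arch} is spelled `aarch64` in the file names.

```shell
#!/bin/sh
# check_packages DIR: report which required build inputs are missing from DIR.
# Assumption: {arch} in the package file names is spelled "aarch64".
check_packages() {
    dir=${1:-.}
    missing=0
    for pattern in \
        'mindspore_ascend-*-cp37-cp37m-linux_aarch64.whl' \
        'Ascend-cann-toolkit_*_linux-aarch64.run' \
        'mindx_elastic-*-py37-none-linux_aarch64.whl' \
        'Dockerfile'
    do
        # Expand the glob inside DIR. An unmatched glob stays literal,
        # so test whether the first result actually exists.
        set -- "$dir"/$pattern
        if [ ! -e "$1" ]; then
            echo "missing: $pattern"
            missing=1
        fi
    done
    return "$missing"
}
```

Calling `check_packages .` in the build directory prints one `missing:` line per absent file and returns a nonzero status if anything is missing.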
Procedure
- Upload the preceding software packages to any directory on the server.
- Log in to the server as the root user.
- Prepare the Dockerfile in the directory where the software packages are stored. Example content:

      # ModelArts base image. Obtain the base image path by referring to ModelArts documentation.
      # The base image version must be V2.
      ARG base
      FROM ${base}

      USER root
      COPY ./* ./tmp/

      # Install MindSpore and mindx_elastic, then upgrade the CANN toolkit.
      # Note: the trailing "exit 0" makes this step report success even if a
      # command fails, so check the build log for errors.
      RUN cd ./tmp \
          && /home/ma-user/anaconda3/envs/MindSpore/bin/pip3.7 install mindspore*.whl \
          && /home/ma-user/anaconda3/envs/MindSpore/bin/pip3.7 install mindx_elastic*.whl \
          && chmod +x ./*.run \
          && ./*toolkit*.run --upgrade \
          && cd ../ ; rm -rf ./tmp; exit 0

      USER ma-user
      WORKDIR /home/ma-user
- Go to the directory where the software packages are stored and run the following command to create a container image. Do not omit the period (.) at the end of the command.
docker build [OPTIONS] --build-arg base=Base_image_path -t Image name_System architecture:Image tag .
Example:
docker build --build-arg base=swr.cn-north-4.myhuaweicloud.com/modelarts-job-dev-image/mindspore-ascend910-cp37-euleros2.8-aarch64-training:1.3.0-3.3.0-roma -t test_train_arm64:v1.0 .
The following table describes the command options.
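| Option | Overview |
|---|---|
| --build-arg | Sets a build-time variable declared with ARG in the Dockerfile. Here, base specifies the base image path. |
| -t | Specifies the image name and, optionally, a tag, in name:tag format. |
| OPTIONS | --disable-content-trust: skips image verification. It is enabled by default. For security purposes, you are advised to disable this function (for example, by specifying --disable-content-trust=false) so that images are verified. |
| Image name_System architecture:Image tag | Image name and tag. Change them based on the actual situation. |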
If "Successfully built xxx" is displayed, the image has been created.
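The build command above can be wrapped in a small helper so that the image reference is always assembled as name_architecture:tag. This is only a sketch under stated assumptions: the function name `build_image` and the `DRY_RUN` variable are hypothetical conveniences, not part of Docker or ModelArts.

```shell
#!/bin/sh
# build_image BASE NAME ARCH TAG: build NAME_ARCH:TAG from the Dockerfile in
# the current directory, passing BASE as the "base" build argument.
# Set DRY_RUN=1 (hypothetical switch) to print the command instead of running it.
build_image() {
    if [ "$#" -ne 4 ]; then
        echo "usage: build_image BASE NAME ARCH TAG" >&2
        return 2
    fi
    image="${2}_${3}:${4}"
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "docker build --build-arg base=$1 -t $image ."
    else
        # The trailing period is the build context (the current directory).
        docker build --build-arg "base=$1" -t "$image" .
    fi
}
```

For instance, running `DRY_RUN=1 build_image Base_image_path test_train arm64 v1.0` prints the command that would be executed, which makes it easy to review the image name and tag before starting a long build.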
- After the image is created, run the following command to view the image information:
docker images
The following command output is displayed:

    REPOSITORY          TAG     IMAGE ID    CREATED           SIZE
    test_train_arm64    v1.0    xxxxxxx     XX minutes ago    XXMB
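In scripts, scanning the table by eye can be replaced with an exact match on the repository:tag pair. The sketch below assumes the list is produced with the `--format` option of `docker images`; the helper name `image_listed` is hypothetical. The match is case-sensitive, so `v1.0` and `V1.0` are treated as different tags.

```shell
#!/bin/sh
# image_listed REPO:TAG: succeed if REPO:TAG appears on standard input,
# which is expected to hold one "repository:tag" pair per line.
image_listed() {
    # -F: fixed string, -x: match the whole line, -q: no output.
    grep -Fxq -- "$1"
}

# Typical use (requires a running Docker daemon):
#   docker images --format '{{.Repository}}:{{.Tag}}' | image_listed test_train_arm64:v1.0
```

The function returns 0 when the image is listed and a nonzero status otherwise, so it can be used directly in an `if` statement.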