Creating a Resumable Training Container Image on ModelArts Using a Dockerfile (MindSpore)

Prerequisites

Obtain the software packages for your OS and the Dockerfile required for building the image, as listed in the following table. In the file names, {version} indicates the version number and {arch} indicates the architecture. Currently, ModelArts with Ascend AI Processors supports only AArch64 images.

Table 1 Required software packages

Software Package | Description | How to Obtain
---------------- | ----------- | -------------
mindspore_ascend-{version}-cp37-cp37m-linux_{arch}.whl | WHL package of the MindSpore framework. Select the AArch64 architecture. | Link
Ascend-cann-toolkit_{version}_linux-{arch}.run | CANN development suite package. Select the AArch64 architecture. | Link
Dockerfile | Required for creating an image. | See the example in the procedure.
mindx_elastic-{version}-py37-none-linux_{arch}.whl | WHL package of the cluster scheduling component, which provides the dying gasp feature for resumable training. Select the AArch64 architecture. | Link
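
Before building, it can help to confirm that every file from Table 1 is present in the build directory. The following is a minimal sketch and not part of the official procedure. The function name check_packages is hypothetical, and the wildcard patterns are assumptions derived from the file name templates in the table, with {arch} fixed to aarch64.

```shell
#!/bin/sh
# Hypothetical helper: report which files required by Table 1 are missing
# from a directory. The patterns are assumptions based on the file name
# templates in the table ({version} is matched by a wildcard).
check_packages() {
    dir="$1"
    missing=0
    for pattern in \
        'mindspore_ascend-*-cp37-cp37m-linux_aarch64.whl' \
        'Ascend-cann-toolkit_*_linux-aarch64.run' \
        'mindx_elastic-*-py37-none-linux_aarch64.whl' \
        'Dockerfile'
    do
        # Expand the pattern inside the target directory. If nothing
        # matches, the first result does not exist and the -e test fails.
        set -- "$dir"/$pattern
        if [ ! -e "$1" ]; then
            echo "missing: $pattern"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, running check_packages . in the package directory prints one "missing:" line for each absent file and returns a non-zero status if any file is missing.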

Procedure

  1. Upload the preceding software packages to any directory on the server.
  2. Log in to the server as the root user.
  3. Create a Dockerfile in the same directory as the software packages. The following is an example:
    # ModelArts base image. Obtain the base image path by referring to ModelArts documentation. The base image version must be V2.
    ARG base
    FROM ${base}
     
    USER root
     
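    # Copy the software packages from the build context to ./tmp in the image.
    COPY ./* ./tmp/
    # Install MindSpore and mindx_elastic, upgrade the CANN toolkit, and then delete the copied packages.
    # Note: the trailing "exit 0" makes this RUN step succeed even if an installation command fails,
    # so check the build log for installation errors.
    RUN cd ./tmp \
             && /home/ma-user/anaconda3/envs/MindSpore/bin/pip3.7 install mindspore*.whl \
             && /home/ma-user/anaconda3/envs/MindSpore/bin/pip3.7 install mindx_elastic*.whl \
             && chmod +x ./*.run \
             && ./*toolkit*.run --upgrade \
             && cd ../ ; rm -rf ./tmp; exit 0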
     
    USER ma-user
    WORKDIR /home/ma-user
  4. Go to the directory where the software packages are stored and run the following command to create a container image. Do not omit the period (.) at the end of the command.
    docker build [OPTIONS] --build-arg base=Base_image_path -t Image name_System architecture:Image tag .

    Example:

    docker build --build-arg base=swr.cn-north-4.myhuaweicloud.com/modelarts-job-dev-image/mindspore-ascend910-cp37-euleros2.8-aarch64-training:1.3.0-3.3.0-roma -t test_train_arm64:v1.0 .

    The following table describes the command options.

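    Option | Overview
    ------ | --------
    --build-arg | Sets the value of a build-time variable declared with ARG in the Dockerfile. In this procedure, it sets base to the path of the base image.
    -t | Specifies the name and tag of the image to be created.
    OPTIONS | Optional parameters. For example, --disable-content-trust skips image verification and is enabled by default. For security purposes, you are advised to disable it (--disable-content-trust=false) so that images are verified.
    Image name_System architecture:Image tag | Image name and tag. Change them based on the actual situation.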

    If "Successfully built xxx" is displayed, the image has been created.

  5. After the image is created, run the following command to view the image information:
    docker images
    Command output similar to the following is displayed:
    REPOSITORY          TAG    IMAGE ID  CREATED         SIZE
    test_train_arm64    v1.0   xxxxxxx   XX minutes ago  XXMB
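
Steps 4 and 5 can also be wrapped in a small script. The following is a sketch under stated assumptions rather than part of the official procedure: the function names build_image and image_exists and the DRY_RUN variable are hypothetical, and only the docker build and docker images commands shown above are used.

```shell
#!/bin/sh
# Sketch only: build_image, image_exists, and DRY_RUN are hypothetical names.

# Build the image from the Dockerfile in the current directory.
# When DRY_RUN=1, only print the command so that it can be reviewed first.
build_image() {
    base="$1"; image="$2"; tag="$3"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "docker build --build-arg base=$base -t $image:$tag ."
    else
        docker build --build-arg "base=$base" -t "$image:$tag" .
    fi
}

# Return 0 if the image is listed locally. "docker images -q" prints the
# ID of a matching image, or nothing if no image matches.
image_exists() {
    [ -n "$(docker images -q "$1:$2")" ]
}
```

A possible invocation, run from the directory that contains the software packages, is build_image Base_image_path test_train_arm64 v1.0 followed by image_exists test_train_arm64 v1.0, where Base_image_path is the ModelArts base image path used in step 4.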