Process of Starting a Job

  1. Log in to ModelArts. On the Training Management page, select Training Job V2.
  2. Click Create Training Job.
  3. Set or select mandatory options as prompted.
    • Create a Pangu-alpha training job with models stored in OBS. For details about parameter configurations, see the related documents on the ROMA platform and Table 1.
    • Create a Pangu-alpha training job with models stored in EFS. For details about parameter configurations, see the related documents on the ROMA platform and Table 1.
    Table 1 Parameter description and configuration

    Parameter

    Description and Configuration

    Region

    Select the option corresponding to the training resource.

    Job Name/ID

    User-defined

    Data Source

    Select private data.

    Job Cluster Type

    Select Homogeneous cluster.

    Algorithm

    Select Defined Image.

    Defined Image

    Upload a custom image that has the mindx_elastic package pre-installed.

    1. Download the specified training image from the Ascend image repository, or build a training image by following the instructions in Creating a Resumable Training Container Image on ModelArts Using a Dockerfile (MindSpore). Run the command docker save image_name > xxx.tar (replace image_name with the name of the image) to save the image file to the local PC.
    2. Upload the local custom image to OBS storage on the ROMA platform. For details, see the related documents.
    3. Create an image on the image management page of ModelArts. Select the corresponding region, configure the image name and version, set the image type to Custom Upload, and set the storage location to the OBS storage path used in the previous step.

    Select the uploaded image and the corresponding version.

    Run Command

    Select Preset Command.

    AI Engine

    Select Ascend-Powered-Engine and mindspore_1.6.1-cann_5.0.4-py_3.7-euler_2.8.3-aarch64.

    Code Directory

    Select the root directory for storing the training script in OBS. For a code example, see Script Adaptation.

    Boot File

    Select the startup script of the training job in OBS, for example, train.py.

    Running Parameter

    Set this parameter based on the specific training job.

    Environment Variable

    Configure the following environment variables:

    1. MA_TERMINATION_GRACE_PERIOD_SECONDS: graceful exit time of the training job. Set this option based on the model scale.
    2. CHECKPOINT_PATH: If the storage type is EFS, set this option to the path where the model is stored on EFS, located under the local mount address that you configured. If the storage type is OBS, you do not need to set this option.
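
    A training script can pick up these two variables at startup. The following is a minimal sketch, not the mindx_elastic implementation: the helper name read_resume_settings is hypothetical, and it assumes the grace period is given as a whole number of seconds, as the variable name suggests.

```python
import os


def read_resume_settings(env=None):
    """Read the two environment variables described above.

    Returns (grace_period_seconds, checkpoint_path). Either value is None
    when the corresponding variable is unset or empty, for example
    CHECKPOINT_PATH when the storage type is OBS.
    """
    # Allow a plain dict to be passed in for testing; default to the real environment.
    env = os.environ if env is None else env

    raw_grace = env.get("MA_TERMINATION_GRACE_PERIOD_SECONDS")
    # Assumption: the value is an integer number of seconds.
    grace_period = int(raw_grace) if raw_grace else None

    checkpoint_path = env.get("CHECKPOINT_PATH") or None
    return grace_period, checkpoint_path
```

    Calling read_resume_settings() with no argument reads the environment of the running job; passing a dictionary is convenient for local testing.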

    Training Input

    Select the Data path and set it to the OBS path for storing the dataset.

    The code path parameter defaults to data_url. Retain this value.

    If the storage type is OBS, click + to add an input parameter and set the parameter name to checkpoint_url.

    Training Output

    If the storage type is EFS, do not select this option.

    If the storage type is OBS, select the OBS path to save the training output, for example, checkpoints.

    The code path parameter defaults to train_url. Its OBS path must be the same as the OBS path of checkpoint_url in Training Input.
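
    The boot file then needs to accept these paths. The following is a minimal sketch, assuming the platform passes each code path parameter to the boot file as a command-line option of the same name (for example --data_url=...); the function name parse_path_args is hypothetical. Unknown options are ignored so that other running parameters do not cause a parsing error.

```python
import argparse


def parse_path_args(argv=None):
    """Parse the input and output path options named in Table 1."""
    parser = argparse.ArgumentParser(description="Path options for the training job")
    parser.add_argument("--data_url", type=str, default=None,
                        help="location of the dataset (Training Input)")
    parser.add_argument("--checkpoint_url", type=str, default=None,
                        help="location to load checkpoints from (Training Input, OBS only)")
    parser.add_argument("--train_url", type=str, default=None,
                        help="location to save training output (Training Output, OBS only)")
    # parse_known_args tolerates additional options that this sketch does not define.
    args, _unknown = parser.parse_known_args(argv)
    return args
```

    When argv is omitted, the options are read from the command line of the running script.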

    NAS Mounted

    If the storage type is EFS, select the configured EFS resource.

    If the storage type is OBS, you do not need to set this option.

    Local Mount Address

    If the storage type is EFS, set this option to the absolute path mounted to the local host.

    If the storage type is OBS, you do not need to set this option.

    Job Log Path

    Set this option to a user-defined OBS path for storing logs.

    Select Resource

    Select a resource pool based on the configured resources.

    Worker Nodes

    Select the number of nodes based on the training job requirements.

    Fault Auto Restart

    If this option is enabled, the training job is rescheduled when a fault occurs.

    Priority

    User-defined

  4. Click Confirm.
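
Several options in Table 1 depend on whether models are stored in OBS or EFS. The rules can be checked with a small script before the job is submitted. The following is only a sketch: the function check_storage_settings and its argument names are hypothetical and are not part of ModelArts or mindx_elastic, and any paths passed to it are placeholders.

```python
def check_storage_settings(storage_type, checkpoint_path=None, checkpoint_url=None,
                           train_url=None, local_mount_address=None):
    """Return a list of problems found in the storage-dependent settings.

    An empty list means the settings follow the OBS/EFS rules in Table 1.
    """
    kind = storage_type.strip().upper()
    problems = []

    if kind == "EFS":
        if not checkpoint_path:
            problems.append("CHECKPOINT_PATH must be set when the storage type is EFS")
        if not local_mount_address:
            problems.append("Local Mount Address must be set when the storage type is EFS")
        elif not local_mount_address.startswith("/"):
            problems.append("Local Mount Address must be an absolute path")
        if train_url:
            problems.append("Training Output should not be selected when the storage type is EFS")
    elif kind == "OBS":
        if not checkpoint_url:
            problems.append("Training Input needs a checkpoint_url parameter when the storage type is OBS")
        if not train_url:
            problems.append("Training Output needs an OBS path when the storage type is OBS")
        # Compare the two paths while ignoring a trailing slash.
        if checkpoint_url and train_url and checkpoint_url.rstrip("/") != train_url.rstrip("/"):
            problems.append("train_url and checkpoint_url must point to the same OBS path")
    else:
        raise ValueError("storage_type must be 'OBS' or 'EFS'")

    return problems
```

An empty result indicates that the storage-dependent options are consistent with each other; it does not verify that the paths exist.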