Process of Starting a Job
- Log in to ModelArts. On the Training Management page, select .
- Click Create Training Job.
- Set or select mandatory options as prompted.
- Create a Pangu-alpha training job with models stored in OBS. For details about parameter configurations, see the related documents on the ROMA platform and Table 1.
- Create a Pangu-alpha training job with models stored in EFS. For details about parameter configurations, see the related documents on the ROMA platform and Table 1.
Table 1 Description and configuration
Parameter
Description and Configuration
Region
Select the option corresponding to the training resource.
Job Name/ID
User-defined
Data Source
Select private data.
Job Cluster Type
Select Homogeneous cluster.
Algorithm
Select defined image.
Defined Image
Upload a custom image that has the mindx_elastic package pre-installed.
- Download the specified training image from the Ascend image repository, or build a training image by following the instructions in Creating a Resumable Training Container Image on ModelArts Using a Dockerfile (MindSpore). Run the docker save <image name> > xxx.tar command to save the image file to the local PC.
- Upload the local image file to OBS on the ROMA platform. For details, see related documents.
- Create an image on the image management page of ModelArts. Select the corresponding region, configure the image name and version, set the image type to Custom Upload, and set the storage location to the OBS path used in the previous step.
Select the uploaded image and the corresponding version.
Run Command
Select Preset Command.
AI Engine
Select Ascend-Powered-Engine and mindspore_1.6.1-cann_5.0.4-py_3.7-euler_2.8.3-aarch64.
Code Directory
Select the root directory for storing the training script in OBS. For a code example, see Script Adaptation.
Boot File
Select the startup script of the training job in OBS, for example, train.py.
Running Parameter
Set this parameter based on the specific training job.
Environment variable
Configure the following environment variables:
- MA_TERMINATION_GRACE_PERIOD_SECONDS: graceful exit period of the training job, in seconds. Set this option based on the model scale.
- CHECKPOINT_PATH: If the storage type is EFS, set this option to the path where the model is stored in EFS, under the local mount address that you configured. If the storage type is OBS, you do not need to set this option.
Training Input
Select the Data path and set it to the OBS path for storing the dataset.
The code path parameter defaults to data_url. Retain this value.
If the storage type is OBS, click + to add an input parameter and set the parameter name to checkpoint_url.
Training Output
If the storage type is EFS, do not select this option.
If the storage type is OBS, select the OBS path to save the training output, for example, checkpoints.
The code path parameter is set to train_url by default. Its OBS path must be consistent with the OBS path of checkpoint_url in Training Input.
NAS Mounted
If the storage type is EFS, select the configured EFS resource.
If the storage type is OBS, you do not need to set this option.
Local Mount Address
If the storage type is EFS, set this option to the absolute path mounted to the local host.
If the storage type is OBS, you do not need to set this option.
Job Log Path
Set it to the user-defined OBS path for storing logs.
Select Resource
Select a resource pool based on the configured resources.
Worker Nodes
Select the number of nodes based on the training job requirements.
Fault Auto Restart
If this option is enabled, the training job is rescheduled when a fault occurs.
Priority
User-defined
- Click Confirm.
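The following minimal sketch shows how a boot file such as train.py might pick up the settings from Table 1. It assumes that the code path parameters (data_url, train_url, and the added checkpoint_url) reach the script as command-line arguments and that CHECKPOINT_PATH and MA_TERMINATION_GRACE_PERIOD_SECONDS are visible as environment variables. The function name resolve_job_settings, the returned dictionary keys, and the example paths are hypothetical and are not part of mindx_elastic or ModelArts.

```python
import argparse
import os


def resolve_job_settings(argv=None, environ=None):
    """Collect checkpoint location and grace period for a training job.

    Hypothetical helper: reads the code path parameters from the command
    line and the two environment variables described in Table 1.
    """
    environ = os.environ if environ is None else environ

    parser = argparse.ArgumentParser(description="Illustrative train.py arguments")
    parser.add_argument("--data_url", default=None)        # Training Input (dataset)
    parser.add_argument("--train_url", default=None)       # Training Output (OBS only)
    parser.add_argument("--checkpoint_url", default=None)  # extra Training Input (OBS only)
    # Other running parameters of the job are ignored here.
    args, _unknown = parser.parse_known_args(argv)

    # Assumption: CHECKPOINT_PATH is set only when models are stored in EFS.
    efs_checkpoint_path = environ.get("CHECKPOINT_PATH")
    if efs_checkpoint_path:
        storage_type = "EFS"
        checkpoint_location = efs_checkpoint_path
    else:
        storage_type = "OBS"
        checkpoint_location = args.checkpoint_url

    # Leave the grace period as None when the variable is not set,
    # rather than guessing a default.
    grace_raw = environ.get("MA_TERMINATION_GRACE_PERIOD_SECONDS")
    grace_period_seconds = int(grace_raw) if grace_raw else None

    return {
        "storage_type": storage_type,
        "data_url": args.data_url,
        "train_url": args.train_url,
        "checkpoint_location": checkpoint_location,
        "grace_period_seconds": grace_period_seconds,
    }


# Example with hypothetical paths for the OBS case.
settings = resolve_job_settings(
    argv=["--data_url", "/cache/data",
          "--train_url", "/cache/output",
          "--checkpoint_url", "/cache/output"],
    environ={},
)
print(settings["storage_type"], settings["checkpoint_location"])
```

Passing argv and environ explicitly keeps the helper easy to check outside a real job. In an actual training job, both arguments would be omitted so that the values come from the command line and the process environment.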