Preparing Model Scripts
Select an example based on the model framework.
TensorFlow
- Download the ResNet50_ID0360_for_TensorFlow2.X model of the master branch from the TensorFlow code repository as the training code.
- Upload the dataset to the storage node as an administrator.
- Go to the /data/atlas_dls/public directory and upload the dataset to any directory, for example, /data/atlas_dls/public/dataset/resnet50/imagenet_TF.
root@ubuntu:/data/atlas_dls/public/dataset/resnet50/imagenet_TF# pwd
/data/atlas_dls/public/dataset/resnet50/imagenet_TF
- Run the du -sh command to check the dataset size.
root@ubuntu:/data/atlas_dls/public/dataset/resnet50/imagenet_TF# du -sh
42G
- Decompress the training code downloaded in step 1 on the local host and rename the ResNet50_ID0360_for_TensorFlow2.X directory in ModelZoo-TensorFlow-master/TensorFlow2/built-in/cv/image_classification/ to ResNet50_for_TensorFlow_2.6_code/.
- Go to the MindXDL-deploy repository and select the master branch. Obtain the train_start.sh, rank_table.sh, and utils.sh files from the samples/train directory. Using the ResNet50_for_TensorFlow_2.6_code directory from step 3, construct the following directory structure in the /data/atlas_dls/public/code directory on the host.
/data/atlas_dls/public/code/ResNet50_for_TensorFlow_2.6_code/
├── scripts
│   ├── train_start.sh
│   ├── utils.sh
│   ├── rank_table.sh
│   ...
├── tensorflow
│   ├── resnet_ctl_imagenet_main.py
│   ├── resnet_model.py
│   ├── resnet_runnable.py
│   ...
├── benchmark.sh
├── modelzoo_level.txt
...
└── requirements.txt
- Go to the /data/atlas_dls/public/code/ResNet50_for_TensorFlow_2.6_code/tensorflow/ directory and modify the resnet_ctl_imagenet_main.py file as follows:
...
import json
import npu_device
import os  # Add this line.
flags.DEFINE_boolean(name='use_tf_function', default=True,
...
...
  checkpoint_manager = tf.train.CheckpointManager(
      runnable.checkpoint,
      directory=flags_obj.model_dir + "/tf-checkpoint/ckpt-" + os.getenv("RANK_ID"),  # Modify this line.
      max_to_keep=10,
      step_counter=runnable.global_step,
      checkpoint_interval=checkpoint_interval)
...
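The modified directory argument gives each device its own checkpoint directory, keyed by the RANK_ID environment variable exported by the training environment, so that multiple ranks sharing one model_dir do not overwrite each other's checkpoints. A minimal sketch of the resulting path logic (the model_dir value and the "0" fallback are illustrative assumptions, not part of the original script):

```python
import os

def rank_checkpoint_dir(model_dir: str) -> str:
    """Build a per-rank checkpoint directory, mirroring the modified line in
    resnet_ctl_imagenet_main.py: model_dir + "/tf-checkpoint/ckpt-" + RANK_ID."""
    rank_id = os.getenv("RANK_ID", "0")  # set by the training environment; "0" is a fallback
    return model_dir + "/tf-checkpoint/ckpt-" + rank_id

# Example: with RANK_ID=3, this rank's checkpoints land in their own directory.
os.environ["RANK_ID"] = "3"
print(rank_checkpoint_dir("/job/output"))  # /job/output/tf-checkpoint/ckpt-3
```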
PyTorch
- Download the ResNet50_for_PyTorch model of the master branch from the PyTorch code repository as the training code.
- Prepare a dataset for ResNet-50 and comply with the corresponding specifications when using it.
- Upload the dataset to the storage node as an administrator.
- Go to the /data/atlas_dls/public directory and upload the dataset to any directory, for example, /data/atlas_dls/public/dataset/resnet50/imagenet.
root@ubuntu:/data/atlas_dls/public/dataset/resnet50/imagenet# pwd
/data/atlas_dls/public/dataset/resnet50/imagenet
- Run the du -sh command to check the dataset size.
root@ubuntu:/data/atlas_dls/public/dataset/resnet50/imagenet# du -sh
11G
- Decompress the training code downloaded in step 1 on the local host and rename the ModelZoo-PyTorch/PyTorch/built-in/cv/classification/ResNet50_for_PyTorch directory in the decompressed training code to ResNet50_for_PyTorch_1.5_code/.
- Go to the MindXDL-deploy repository and select the master branch. Obtain the train_start.sh, rank_table.sh, and utils.sh files from the samples/train directory. Using the ResNet50_for_PyTorch_1.5_code directory from step 4, construct the following directory structure in the /data/atlas_dls/public/code directory on the host.
root@ubuntu:/data/atlas_dls/public/code/ResNet50_for_PyTorch_1.5_code/#
ResNet50_for_PyTorch_1.5_code/
├── DistributedResnet50
├── infer
├── test
├── ...
├── Dockerfile
├── eval.sh
├── python2onx.py
├── pytorch_resnet50_apex.py
└── scripts
    ├── train_start.sh
    ├── utils.sh
    └── rank_table.sh
- (Optional) Go to the /data/atlas_dls/public/code/ResNet50_for_PyTorch_1.5_code/scripts/ directory and modify the train_start.sh file as follows. If the file has already been modified, skip this step.
...
if [ "${framework}" == "PyTorch" ]; then
    get_env_for_pytorch_multi_node_job
    ${DLS_PROGRAM_EXECUTOR} ${boot_file_path}${boot_file} ${train_param} --addr=${MASTER_ADDR} --world-size=${WORLD_SIZE} --rank=${RANK} && tee ${log_url}  # Modify this line.
    check_return_code
    if [[ $@ =~ need_freeze ]]; then
        ${DLS_PROGRAM_EXECUTOR} ${boot_file_path}${freeze_cmd} --addr=${MASTER_ADDR} --world-size=${WORLD_SIZE} --rank=${RANK} && tee ${log_url}
        check_return_code
    fi
...
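train_start.sh appends --addr, --world-size, and --rank (substituted from MASTER_ADDR, WORLD_SIZE, and RANK) to the training command, and the PyTorch boot file is expected to consume them. A hedged sketch of how such a script might parse these flags (the flag names follow the snippet above; the parser itself is an illustrative assumption, not code from the repository):

```python
import argparse

def parse_dist_args(argv):
    """Parse the distributed-training flags that train_start.sh appends
    (--addr, --world-size, --rank)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--addr", type=str, help="IP address of the master (rank 0) node")
    parser.add_argument("--world-size", type=int, help="total number of participating processes")
    parser.add_argument("--rank", type=int, help="rank of this process, in [0, world_size)")
    # parse_known_args ignores the model-specific flags also present in ${train_param}
    args, _ = parser.parse_known_args(argv)
    return args

# Example: values train_start.sh would substitute from MASTER_ADDR, WORLD_SIZE, RANK.
args = parse_dist_args(["--addr=192.168.1.10", "--world-size=8", "--rank=0"])
print(args.addr, args.world_size, args.rank)  # 192.168.1.10 8 0
```

In a typical PyTorch launcher these values then feed torch.distributed.init_process_group; the exact call depends on the training script.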
MindSpore
- Download the ResNet code of the r1.9 branch from the MindSpore code repository as the training code.
- Prepare a dataset for ResNet-50 and comply with the corresponding specifications when using it.
- Upload the dataset to the storage node as an administrator.
- Go to the /data/atlas_dls/public directory and upload the dataset to any directory, for example, /data/atlas_dls/public/dataset/imagenet.
root@ubuntu:/data/atlas_dls/public/dataset/imagenet# pwd
/data/atlas_dls/public/dataset/imagenet
- Run the du -sh command to check the dataset size.
root@ubuntu:/data/atlas_dls/public/dataset/imagenet# du -sh
11G
- Decompress the training code downloaded in step 1 on the local host and rename the resnet directory in models-r1.9/models-r1.9/official/cv/ to ResNet50_for_MindSpore_1.9_code. The ResNet50_for_MindSpore_1.9_code directory is used as an example in the following steps.
- Go to the MindXDL-deploy repository and select the master branch. Obtain the train_start.sh, utils.sh, and rank_table.sh files from the samples/train directory and, together with the scripts directory in the training code, construct the following directory structure on the host.
root@ubuntu:/data/atlas_dls/public/code/ResNet50_for_MindSpore_1.9_code/scripts/#
scripts/
├── docker_start.sh
├── run_standalone_train_gpu.sh
├── run_standalone_train.sh
...
├── rank_table.sh
├── utils.sh
└── train_start.sh
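Across all three frameworks, the job expects train_start.sh, utils.sh, and rank_table.sh to be present in the code's scripts directory. A small sketch that verifies the layout before submitting a job (the helper name and the temporary-directory example are illustrative assumptions, not part of the repositories):

```python
import os
import tempfile

REQUIRED_SCRIPTS = ("train_start.sh", "utils.sh", "rank_table.sh")

def missing_scripts(scripts_dir: str):
    """Return the required launch scripts that are absent from scripts_dir."""
    return [s for s in REQUIRED_SCRIPTS
            if not os.path.isfile(os.path.join(scripts_dir, s))]

# Example against a freshly created layout (hypothetical path, rank_table.sh omitted).
d = tempfile.mkdtemp()
for name in ("train_start.sh", "utils.sh"):
    open(os.path.join(d, name), "w").close()
print(missing_scripts(d))  # ['rank_table.sh']
```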
Parent topic: Training Job