Preparing Model Scripts
Select an example based on the model framework.
TensorFlow
- Download the ResNet50_ID0360_for_TensorFlow2.X model of the master branch from the TensorFlow code repository as the training code.
- Upload the dataset to the storage node as an administrator.
- Go to the /data/atlas_dls/public directory and upload the dataset to any directory, for example, /data/atlas_dls/public/dataset/resnet50/imagenet_TF.
root@ubuntu:/data/atlas_dls/public/dataset/resnet50/imagenet_TF# pwd
/data/atlas_dls/public/dataset/resnet50/imagenet_TF
- Run the du -sh command to check the dataset size.
root@ubuntu:/data/atlas_dls/public/dataset/resnet50/imagenet_TF# du -sh
42G
- Decompress the training code downloaded in step 1 on the local host and rename the ResNet50_ID0360_for_TensorFlow2.X directory in ModelZoo-TensorFlow-master/TensorFlow2/built-in/cv/image_classification/ to ResNet50_for_TensorFlow_2.6_code/.
- Go to the MindXDL-deploy repository and select the master branch. Obtain the train_start.sh, rank_table.sh, and utils.sh files from the samples/train directory. Using the ResNet50_for_TensorFlow_2.6_code directory from step 3, construct the following directory structure in the /data/atlas_dls/public/code directory on the host.
/data/atlas_dls/public/code/ResNet50_for_TensorFlow_2.6_code/
├── scripts
│   ├── train_start.sh
│   ├── utils.sh
│   ├── rank_table.sh
│   ...
├── tensorflow
│   ├── resnet_ctl_imagenet_main.py
│   ├── resnet_model.py
│   ├── resnet_runnable.py
│   ...
├── benchmark.sh
├── modelzoo_level.txt
...
└── requirements.txt
- Go to the /data/atlas_dls/public/code/ResNet50_for_TensorFlow_2.6_code/tensorflow/ directory and modify the resnet_ctl_imagenet_main.py file as follows:
...
import json
import npu_device
import os  # Add this line.
flags.DEFINE_boolean(name='use_tf_function', default=True,
...
...
  checkpoint_manager = tf.train.CheckpointManager(
      runnable.checkpoint,
      directory=flags_obj.model_dir + "/tf-checkpoint/ckpt-" + os.getenv("RANK_ID"),  # Modify this line.
      max_to_keep=10,
      step_counter=runnable.global_step,
      checkpoint_interval=checkpoint_interval)
...
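The modified directory argument gives each device its own checkpoint directory, keyed by the RANK_ID environment variable exported by the training environment, so that multiple ranks sharing one model_dir do not overwrite each other's checkpoints. A minimal sketch of the resulting path logic (the model_dir value and the "0" fallback are illustrative assumptions, not part of the original script):

```python
import os

def rank_checkpoint_dir(model_dir: str) -> str:
    """Build a per-rank checkpoint directory, mirroring the modified line in
    resnet_ctl_imagenet_main.py: model_dir + "/tf-checkpoint/ckpt-" + RANK_ID."""
    rank_id = os.getenv("RANK_ID", "0")  # set by the training environment; "0" is a fallback
    return model_dir + "/tf-checkpoint/ckpt-" + rank_id

# Example: with RANK_ID=3, this rank's checkpoints land in their own directory.
os.environ["RANK_ID"] = "3"
print(rank_checkpoint_dir("/job/output"))  # /job/output/tf-checkpoint/ckpt-3
```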
PyTorch
- Download the ResNet50_for_PyTorch model of the master branch from the PyTorch code repository as the training code.
- Prepare a dataset for ResNet-50 and comply with the corresponding specifications when using it.
- Upload the dataset to the storage node as an administrator.
- Go to the /data/atlas_dls/public directory and upload the dataset to any directory, for example, /data/atlas_dls/public/dataset/resnet50/imagenet.
root@ubuntu:/data/atlas_dls/public/dataset/resnet50/imagenet# pwd
/data/atlas_dls/public/dataset/resnet50/imagenet
- Run the du -sh command to check the dataset size.
root@ubuntu:/data/atlas_dls/public/dataset/resnet50/imagenet# du -sh
11G
- Decompress the training code downloaded in step 1 on the local host and rename the ModelZoo-PyTorch/PyTorch/built-in/cv/classification/ResNet50_for_PyTorch directory in the decompressed training code to ResNet50_for_PyTorch_1.5_code/.
- Go to the MindXDL-deploy repository and select the master branch. Obtain the train_start.sh, rank_table.sh, and utils.sh files from the samples/train directory. Using the ResNet50_for_PyTorch_1.5_code directory from step 4, construct the following directory structure in the /data/atlas_dls/public/code directory on the host.
root@ubuntu:/data/atlas_dls/public/code/ResNet50_for_PyTorch_1.5_code/#
ResNet50_for_PyTorch_1.5_code/
├── DistributedResnet50
├── infer
├── test
├── ...
├── Dockerfile
├── eval.sh
├── python2onx.py
├── pytorch_resnet50_apex.py
└── scripts
    ├── train_start.sh
    ├── utils.sh
    └── rank_table.sh
- (Optional) Go to the /data/atlas_dls/public/code/ResNet50_for_PyTorch_1.5_code/scripts/ directory and modify the train_start.sh file as follows. If the file has already been modified, skip this step.
...
if [ "${framework}" == "PyTorch" ]; then
    get_env_for_pytorch_multi_node_job
    ${DLS_PROGRAM_EXECUTOR} ${boot_file_path}${boot_file} ${train_param} --addr=${MASTER_ADDR} --world-size=${WORLD_SIZE} --rank=${RANK} && tee ${log_url}  # Modify this line.
    check_return_code
    if [[ $@ =~ need_freeze ]]; then
        ${DLS_PROGRAM_EXECUTOR} ${boot_file_path}${freeze_cmd} --addr=${MASTER_ADDR} --world-size=${WORLD_SIZE} --rank=${RANK} && tee ${log_url}
        check_return_code
    fi
...
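train_start.sh appends --addr, --world-size, and --rank (substituted from MASTER_ADDR, WORLD_SIZE, and RANK) to the training command, and the PyTorch boot file is expected to consume them. A hedged sketch of how such a script might parse these flags (the flag names follow the snippet above; the parser itself is an illustrative assumption, not code from the repository):

```python
import argparse

def parse_dist_args(argv):
    """Parse the distributed-training flags that train_start.sh appends
    (--addr, --world-size, --rank)."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--addr", type=str, help="IP address of the master (rank 0) node")
    parser.add_argument("--world-size", type=int, help="total number of participating processes")
    parser.add_argument("--rank", type=int, help="rank of this process, in [0, world_size)")
    # parse_known_args ignores the model-specific flags also present in ${train_param}
    args, _ = parser.parse_known_args(argv)
    return args

# Example: values train_start.sh would substitute from MASTER_ADDR, WORLD_SIZE, RANK.
args = parse_dist_args(["--addr=192.168.1.10", "--world-size=8", "--rank=0"])
print(args.addr, args.world_size, args.rank)  # 192.168.1.10 8 0
```

In a typical PyTorch launcher these values then feed torch.distributed.init_process_group; the exact call depends on the training script.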
MindSpore
- Download the ResNet code of the r1.9 branch from the MindSpore code repository as the training code.
- Prepare a dataset for ResNet-50 and comply with the corresponding specifications when using it.
- Upload the dataset to the storage node as an administrator.
- Go to the /data/atlas_dls/public directory and upload the dataset to any directory, for example, /data/atlas_dls/public/dataset/imagenet.
root@ubuntu:/data/atlas_dls/public/dataset/imagenet# pwd
/data/atlas_dls/public/dataset/imagenet
- Run the du -sh command to check the dataset size.
root@ubuntu:/data/atlas_dls/public/dataset/imagenet# du -sh
11G
- Decompress the training code downloaded in step 1 on the local host and rename the resnet directory in models-r1.9/models-r1.9/official/cv/ to ResNet50_for_MindSpore_1.9_code. The ResNet50_for_MindSpore_1.9_code directory is used as an example in the following steps.
- Go to the MindXDL-deploy repository and select the master branch. Obtain the train_start.sh, utils.sh, and rank_table.sh files from the samples/train directory and, together with the scripts directory in the training code, construct the following directory structure on the host.
root@ubuntu:/data/atlas_dls/public/code/ResNet50_for_MindSpore_1.9_code/scripts/#
scripts/
├── docker_start.sh
├── run_standalone_train_gpu.sh
├── run_standalone_train.sh
...
├── rank_table.sh
├── utils.sh
└── train_start.sh
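Across all three frameworks, the job expects train_start.sh, utils.sh, and rank_table.sh to be present in the code's scripts directory. A small sketch that verifies the layout before submitting a job (the helper name and the temporary-directory example are illustrative assumptions, not part of the repositories):

```python
import os
import tempfile

REQUIRED_SCRIPTS = ("train_start.sh", "utils.sh", "rank_table.sh")

def missing_scripts(scripts_dir: str):
    """Return the required launch scripts that are absent from scripts_dir."""
    return [s for s in REQUIRED_SCRIPTS
            if not os.path.isfile(os.path.join(scripts_dir, s))]

# Example against a freshly created layout (hypothetical path, rank_table.sh omitted).
d = tempfile.mkdtemp()
for name in ("train_start.sh", "utils.sh"):
    open(os.path.join(d, name), "w").close()
print(missing_scripts(d))  # ['rank_table.sh']
```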
Parent topic: Training Job