Resource Information Configuration Using Configuration Files

You can configure resource information through configuration files when creating the following objects: acjob, vcjob, and deploy. This section uses vcjob and deploy as examples to describe how to adapt the scripts.

  • The dataset used in this section is ImageNet2012. (Note: Use the dataset according to the usage specifications of the dataset provider.) For TensorFlow, preprocess the dataset by referring to "Samples" > "Preparations" in the TensorFlow 1.15 Model Porting Guide.
  • The sample model code shown below may differ from the version you obtain. Use the code of the actual version.
  • The following TensorFlow and MindSpore examples require CANN versions earlier than 8.5.0.

TensorFlow

  1. Download ResNet50_ID0360_for_TensorFlow2.X from the master branch of the TensorFlow code repository and use it as the training code. Select the TensorFlow version package in the training image based on the TensorFlow version of the model code.
  2. Upload the dataset to the storage node as an administrator.
    1. Go to the /data/atlas_dls/public directory and upload the dataset to any directory, for example, /data/atlas_dls/public/dataset/resnet50/imagenet_TF.
      root@ubuntu:/data/atlas_dls/public/dataset/resnet50/imagenet_TF# pwd
      Command output:
      /data/atlas_dls/public/dataset/resnet50/imagenet_TF
      
    2. Run the du -sh command to check the dataset size.
      root@ubuntu:/data/atlas_dls/public/dataset/resnet50/imagenet_TF# du -sh
      Command output:
      42G
      
  3. Decompress the training code downloaded in step 1 on the local host and rename the ResNet50_ID0360_for_TensorFlow2.X directory in ModelZoo-TensorFlow-master/TensorFlow2/built-in/cv/image_classification/ to ResNet50_for_TensorFlow_2.6_code.
  4. Upload the ResNet50_for_TensorFlow_2.6_code directory to the /data/atlas_dls/public/code/ directory in the environment.
  5. Go to the mindcluster-deploy repository and switch to the branch that matches your version, as described in the mindcluster-deploy Version Description. Obtain the train_start.sh, rank_table.sh, and utils.sh files from the samples/train/basic-training/ranktable directory and combine them with the ResNet50_for_TensorFlow_2.6_code directory from step 3 to construct the following directory structure in the /data/atlas_dls/public/code directory on the host:
    /data/atlas_dls/public/code/ResNet50_for_TensorFlow_2.6_code/
    ├──  scripts
    │   ├──  train_start.sh
    │   ├──  utils.sh
    │   ├──  rank_table.sh
    │    ...
    │        ...
    ├──  tensorflow
    │   ├──  resnet_ctl_imagenet_main.py
    │   ├──  resnet_model.py
    │   ├──  resnet_runnable.py
    │    ...
    │        ...
    ├──  benchmark.sh
    ├──  modelzoo_level.txt
     ...
    └──  requirements.txt
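
Steps 3 to 5 can be sketched as a small shell function. This is only an illustration under stated assumptions: it assumes the training code archive has already been unpacked on the host itself (rather than renamed locally and uploaded), and that a mindcluster-deploy checkout on the correct branch is available. The function name stage_tf_code and its three arguments are hypothetical and not part of any MindCluster tooling.

```shell
# stage_tf_code MODELZOO_DIR DEPLOY_DIR CODE_ROOT   (hypothetical helper)
#   MODELZOO_DIR  unpacked ModelZoo-TensorFlow-master directory (assumed location)
#   DEPLOY_DIR    mindcluster-deploy checkout, already on the matching branch
#   CODE_ROOT     code root on the host, e.g. /data/atlas_dls/public/code
# The body runs in a subshell, so its variables do not leak to the caller.
stage_tf_code() (
    src="$1/TensorFlow2/built-in/cv/image_classification/ResNet50_ID0360_for_TensorFlow2.X"
    samples="$2/samples/train/basic-training/ranktable"
    dst="$3/ResNet50_for_TensorFlow_2.6_code"

    if [ ! -d "$src" ]; then
        echo "missing training code: $src" >&2
        exit 1
    fi
    if [ -e "$dst" ]; then
        # Refuse to touch an existing copy instead of merging into it.
        echo "target already exists: $dst" >&2
        exit 1
    fi

    mkdir -p "$3" || exit 1
    # Copy the model directory under its new name (step 3 and step 4).
    cp -r "$src" "$dst" || exit 1
    # Add the three ranktable scripts under scripts/ (step 5).
    mkdir -p "$dst/scripts" || exit 1
    for f in train_start.sh utils.sh rank_table.sh; do
        cp "$samples/$f" "$dst/scripts/" || exit 1
    done
)

# Example call (paths other than the code root are placeholders):
#   stage_tf_code /tmp/ModelZoo-TensorFlow-master /tmp/mindcluster-deploy /data/atlas_dls/public/code
```

The function stops at the first failed step and does not overwrite an existing ResNet50_for_TensorFlow_2.6_code directory, so it is safe to rerun after fixing a path.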

PyTorch

  1. Download ResNet50_ID4149_for_PyTorch from the master branch in the PyTorch code repository and use it as the training code.
  2. Prepare a dataset for ResNet-50 and comply with the dataset provider's usage specifications.
  3. Upload the dataset to the storage node as an administrator.
    1. Go to the /data/atlas_dls/public directory and upload the dataset to any directory, for example, /data/atlas_dls/public/dataset/resnet50/imagenet.
      root@ubuntu:/data/atlas_dls/public/dataset/resnet50/imagenet# pwd
      Command output:
      /data/atlas_dls/public/dataset/resnet50/imagenet
      
    2. Run the du -sh command to check the dataset size.
      root@ubuntu:/data/atlas_dls/public/dataset/resnet50/imagenet# du -sh
      Command output:
      11G
      
  4. Decompress the training code downloaded in step 1 on the local host, and upload the ModelZoo-PyTorch/PyTorch/built-in/cv/classification/ResNet50_ID4149_for_PyTorch directory from the decompressed training code to a directory in the environment, for example, /data/atlas_dls/public/code/.
  5. Go to the mindcluster-deploy repository and switch to the branch that matches your version, as described in the mindcluster-deploy Version Description. Obtain the train_start.sh, rank_table.sh, and utils.sh files from the samples/train/basic-training/ranktable directory, and construct the following directory structure in the /data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/scripts directory:
    root@ubuntu:/data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/scripts#
    scripts/
         ├── train_start.sh
         ├── utils.sh
         └── rank_table.sh
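
Before submitting a job, it can help to confirm that the dataset and the three scripts are where the steps above put them. The following is a minimal sketch of such a check. The function name check_layout is hypothetical, and the check only looks for a non-empty dataset directory and the three script files; it does not validate the dataset contents.

```shell
# check_layout DATASET_DIR CODE_DIR   (hypothetical helper)
#   DATASET_DIR  e.g. /data/atlas_dls/public/dataset/resnet50/imagenet
#   CODE_DIR     e.g. /data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch
# Reports every problem it finds and returns non-zero if there was at least one.
check_layout() (
    dataset="$1"
    code_dir="$2"
    status=0

    # The dataset directory must exist and contain at least one entry.
    if [ ! -d "$dataset" ] || [ -z "$(ls -A "$dataset" 2>/dev/null)" ]; then
        echo "dataset directory missing or empty: $dataset" >&2
        status=1
    fi

    # The three ranktable scripts must be present under scripts/.
    for f in train_start.sh utils.sh rank_table.sh; do
        if [ ! -f "$code_dir/scripts/$f" ]; then
            echo "missing script: $code_dir/scripts/$f" >&2
            status=1
        fi
    done

    exit "$status"
)

# Example call using the paths from this section:
#   check_layout /data/atlas_dls/public/dataset/resnet50/imagenet \
#                /data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch
```

The same check applies to the TensorFlow and MindSpore layouts if the corresponding dataset and code directories are passed instead.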

MindSpore

  1. Download the ResNet code of the master branch from the MindSpore code repository and use it as the training code.
  2. Prepare a dataset for ResNet-50 and comply with the dataset provider's usage specifications.
  3. Upload the dataset to the storage node as an administrator.
    1. Go to the /data/atlas_dls/public directory and upload the dataset to any directory, for example, /data/atlas_dls/public/dataset/imagenet.
      root@ubuntu:/data/atlas_dls/public/dataset/imagenet# pwd
      Command output:
      /data/atlas_dls/public/dataset/imagenet
      
    2. Run the du -sh command to check the dataset size.
      root@ubuntu:/data/atlas_dls/public/dataset/imagenet# du -sh
      Command output:
      11G
      
  4. Decompress the training code downloaded in step 1 on the local host and rename the ResNet directory in models/official/cv/ to ResNet50_for_MindSpore_2.0_code. The ResNet50_for_MindSpore_2.0_code directory is used as an example in the following steps.
  5. Upload the ResNet50_for_MindSpore_2.0_code directory to the /data/atlas_dls/public/code/ directory in the environment.
  6. Go to the mindcluster-deploy repository and switch to the branch that matches your version, as described in the mindcluster-deploy Version Description. Obtain the train_start.sh, utils.sh, and rank_table.sh files from the samples/train/basic-training/ranktable directory, and combine them with the scripts directory in the training code to construct the following directory structure on the host:
    root@ubuntu:/data/atlas_dls/public/code/ResNet50_for_MindSpore_2.0_code/scripts/#
    scripts/
    ├── cache_util.sh
    ├── docker_start.sh
    ├── run_standalone_train_gpu.sh
    ├── run_standalone_train.sh
     ...
    ├── rank_table.sh
    ├── utils.sh
    └── train_start.sh
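
Because the MindSpore scripts directory already contains files, copying the three ranktable scripts into it is a merge rather than a fresh copy. The sketch below shows one cautious way to do that. It is an assumption on our part that a same-named file might already exist there, and the backup behavior (saving the old file with a .bak suffix) is an illustrative choice, not something the source prescribes. The function name merge_ranktable_scripts is hypothetical.

```shell
# merge_ranktable_scripts SAMPLES_DIR SCRIPTS_DIR   (hypothetical helper)
#   SAMPLES_DIR  samples/train/basic-training/ranktable in a mindcluster-deploy checkout
#   SCRIPTS_DIR  e.g. /data/atlas_dls/public/code/ResNet50_for_MindSpore_2.0_code/scripts
merge_ranktable_scripts() (
    samples="$1"
    scripts="$2"

    if [ ! -d "$scripts" ]; then
        echo "no scripts directory: $scripts" >&2
        exit 1
    fi

    for f in train_start.sh utils.sh rank_table.sh; do
        if [ ! -f "$samples/$f" ]; then
            echo "missing sample script: $samples/$f" >&2
            exit 1
        fi
        # If a different file with the same name is already present,
        # keep it as <name>.bak instead of silently overwriting it.
        if [ -e "$scripts/$f" ] && ! cmp -s "$samples/$f" "$scripts/$f"; then
            cp -p "$scripts/$f" "$scripts/$f.bak" || exit 1
        fi
        cp "$samples/$f" "$scripts/$f" || exit 1
    done
)

# Example call (the checkout path is a placeholder):
#   merge_ranktable_scripts /tmp/mindcluster-deploy/samples/train/basic-training/ranktable \
#       /data/atlas_dls/public/code/ResNet50_for_MindSpore_2.0_code/scripts
```

Existing files such as cache_util.sh and run_standalone_train.sh are left untouched, since only the three named scripts are copied.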