Resource Information Configuration Using Environment Variables
Select a guide example based on the model framework.
- TensorFlow
- PyTorch
- MindSpore
- The dataset used in this section is ImageNet2012. (Note: Use the dataset according to the usage specifications of the dataset provider.) For TensorFlow, preprocess the dataset by referring to "Samples" > "Preparations" in TensorFlow 1.15 Model Porting Guide.
- The sample model code shown below may differ from the code in the current release. Use the code from the actual release.
- The following TensorFlow and MindSpore examples require CANN versions earlier than 8.5.0.
TensorFlow
- Download ResNet50_ID0360_for_TensorFlow2.X from the master branch of the TensorFlow code repository and use it as the training code. In the training image, select the TensorFlow package that matches the TensorFlow version of the model code.
- Upload the dataset to the storage node as an administrator.
- Go to the /data/atlas_dls/public directory and upload the dataset to any directory, for example, /data/atlas_dls/public/dataset/resnet50/imagenet_TF.
root@ubuntu:/data/atlas_dls/public/dataset/resnet50/imagenet_TF# pwd
Command output: /data/atlas_dls/public/dataset/resnet50/imagenet_TF
- Run the du -sh command to check the dataset size.
root@ubuntu:/data/atlas_dls/public/dataset/resnet50/imagenet_TF# du -sh
Command output: 142G
- Decompress the training code downloaded in step 1 on the local host and rename the ResNet50_ID0360_for_TensorFlow2.X directory in ModelZoo-TensorFlow-master/TensorFlow2/built-in/cv/image_classification/ to ResNet50_for_TensorFlow_2.6_code.
- Upload the ResNet50_for_TensorFlow_2.6_code directory to the /data/atlas_dls/public/code/ directory in the environment.
- Go to the mindcluster-deploy repository and access the corresponding branch based on mindcluster-deploy Version Description. Obtain the train_start.sh file from the samples/train/basic-training/without-ranktable/tensorflow directory, and combine the file with the ResNet50_for_TensorFlow_2.6_code directory from step 3 to construct the following directory structure in the /data/atlas_dls/public/code directory on the host:
/data/atlas_dls/public/code/ResNet50_for_TensorFlow_2.6_code/
├── scripts
│   ├── train_start.sh
│   ...
│   ...
├── tensorflow
│   ├── resnet_ctl_imagenet_main.py
│   ├── resnet_model.py
│   ├── resnet_runnable.py
│   ...
│   ...
├── benchmark.sh
├── modelzoo_level.txt
...
└── requirements.txt
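Before starting a training job, it can help to confirm that the code directory matches the tree above. The following is a minimal sketch of such a check, not part of the official tooling. The helper name find_missing_files and the EXPECTED_FILES list are hypothetical; the list contains only the files named in the tree, and the elided entries ("...") are not checked.

```python
from pathlib import Path

# Relative paths copied from the directory tree above.
# Entries elided with "..." in the tree are intentionally not checked.
EXPECTED_FILES = [
    "scripts/train_start.sh",
    "tensorflow/resnet_ctl_imagenet_main.py",
    "tensorflow/resnet_model.py",
    "tensorflow/resnet_runnable.py",
    "benchmark.sh",
    "modelzoo_level.txt",
    "requirements.txt",
]


def find_missing_files(code_root, expected=EXPECTED_FILES):
    """Return the expected relative paths that are not regular files under code_root."""
    root = Path(code_root)
    return [rel for rel in expected if not (root / rel).is_file()]


if __name__ == "__main__":
    # Path used in this guide; adjust it if the code was uploaded elsewhere.
    code_root = "/data/atlas_dls/public/code/ResNet50_for_TensorFlow_2.6_code"
    missing = find_missing_files(code_root)
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Directory layout matches the expected structure.")
```

The same function could be reused for the PyTorch and MindSpore layouts by passing a different expected list.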
PyTorch
- Download ResNet50_ID4149_for_PyTorch from the master branch in the PyTorch code repository and use it as the training code.
- Prepare a dataset corresponding to ResNet-50, and comply with corresponding specifications when using the dataset.
- Upload the dataset to the storage node as an administrator.
- Go to the /data/atlas_dls/public directory and upload the dataset to any directory, for example, /data/atlas_dls/public/dataset/resnet50/imagenet.
root@ubuntu:/data/atlas_dls/public/dataset/resnet50/imagenet# pwd
Command output: /data/atlas_dls/public/dataset/resnet50/imagenet
- Run the du -sh command to check the dataset size.
root@ubuntu:/data/atlas_dls/public/dataset/resnet50/imagenet# du -sh
Command output: 111G
- Decompress the training code downloaded in step 1 on the local host, and upload the ModelZoo-PyTorch/PyTorch/built-in/cv/classification/ResNet50_ID4149_for_PyTorch directory from the decompressed training code to a directory in the environment, for example, /data/atlas_dls/public/code/.
- In the /data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch directory, comment out or delete the line in the main.py file that sets MASTER_PORT, as marked in the following code.
def main():
    args = parser.parse_args()
    os.environ['MASTER_ADDR'] = args.addr
    #os.environ['MASTER_PORT'] = '29501'    # Comment out or delete this line.
    if os.getenv('ALLOW_FP32', False) and os.getenv('ALLOW_HF32', False):
        raise RuntimeError('ALLOW_FP32 and ALLOW_HF32 cannot be set at the same time!')
    elif os.getenv('ALLOW_HF32', False):
        torch.npu.conv.allow_hf32 = True
    elif os.getenv('ALLOW_FP32', False):
        torch.npu.conv.allow_hf32 = False
        torch.npu.matmul.allow_hf32 = False
- Go to the mindcluster-deploy repository and access the corresponding branch based on mindcluster-deploy Version Description. Obtain the train_start.sh file from the samples/train/basic-training/without-ranktable/pytorch directory, and construct the following directory structure in the /data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/scripts directory:
root@ubuntu:/data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/scripts#
scripts/
├── train_start.sh
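The guide does not state why the hard-coded MASTER_PORT assignment in main.py is removed. One plausible reading, given that this section configures resource information through environment variables, is that a MASTER_PORT value already present in the process environment should not be overwritten. The sketch below illustrates that idea only. The helper resolve_master_endpoint is hypothetical and does not appear in the training code.

```python
import os


def resolve_master_endpoint(cli_addr, env=None):
    """Return (addr, port), keeping any MASTER_PORT already set in the environment.

    Hypothetical helper for illustration; it is not part of main.py.
    """
    env = os.environ if env is None else env
    # Mirrors the retained line: os.environ['MASTER_ADDR'] = args.addr
    env["MASTER_ADDR"] = cli_addr
    # No hard-coded '29501' here: the port must come from the environment.
    port = env.get("MASTER_PORT")
    if port is None:
        raise RuntimeError("MASTER_PORT is not set in the environment")
    return env["MASTER_ADDR"], int(port)


if __name__ == "__main__":
    # Example with an explicit mapping instead of the real process environment.
    example_env = {"MASTER_PORT": "29500"}
    print(resolve_master_endpoint("127.0.0.1", example_env))
```

Passing a plain dictionary as env keeps the example independent of the real process environment.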
MindSpore
- Download the ResNet code of the master branch from the MindSpore code repository and use it as the training code.
- Prepare a dataset corresponding to ResNet-50, and comply with corresponding specifications when using the dataset.
- Upload the dataset to the storage node as an administrator.
- Go to the /data/atlas_dls/public directory and upload the dataset to any directory, for example, /data/atlas_dls/public/dataset/imagenet.
root@ubuntu:/data/atlas_dls/public/dataset/imagenet# pwd
Command output: /data/atlas_dls/public/dataset/imagenet
- Run the du -sh command to check the dataset size.
root@ubuntu:/data/atlas_dls/public/dataset/imagenet# du -sh
Command output: 111G
- Decompress the training code downloaded in step 1 on the local host and rename the ResNet directory in models/official/cv/ to ResNet50_for_MindSpore_2.0_code. The ResNet50_for_MindSpore_2.0_code directory is used as an example in the following steps.
- Upload the ResNet50_for_MindSpore_2.0_code directory to the /data/atlas_dls/public/code/ directory in the environment.
- Go to the mindcluster-deploy repository and access the corresponding branch based on mindcluster-deploy Version Description. Obtain the train_start.sh file from the samples/train/basic-training/without-ranktable/mindspore directory, and combine the file with the scripts directory in the training code to construct the following directory structure on the host:
root@ubuntu:/data/atlas_dls/public/code/ResNet50_for_MindSpore_2.0_code/scripts/#
scripts/
├── docker_start.sh
├── run_standalone_train_gpu.sh
├── run_standalone_train.sh
...
└── train_start.sh
- In the /data/atlas_dls/public/code/ResNet50_for_MindSpore_2.0_code directory, modify the train.py file as follows:
...
    if config.run_distribute:
        if target == "Ascend":
            #device_id = int(os.getenv('DEVICE_ID', '0'))    # Comment out this line of code.
            #ms.set_context(device_id=device_id)    # Comment out this line of code.
            ms.set_auto_parallel_context(device_num=config.device_num, parallel_mode=ms.ParallelMode.DATA_PARALLEL, gradients_mean=True)
            set_algo_parameters(elementwise_op_strategy_follow=True)
            if config.net_name == "resnet50" or config.net_name == "se-resnet50":
                if config.boost_mode not in ["O1", "O2"]:
                    ms.set_auto_parallel_context(all_reduce_fusion_config=config.all_reduce_fusion_config)
            elif config.net_name in ["resnet101", "resnet152"]:
                ms.set_auto_parallel_context(all_reduce_fusion_config=config.all_reduce_fusion_config)
            init()
        # GPU target
...
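The train.py change above is a manual edit. If the same edit has to be repeated on several copies of the code, it could be scripted. The following is a minimal sketch under that assumption. The helper comment_out_lines and the NEEDLES list are hypothetical, and the sample text is a shortened stand-in for the real train.py. The helper matches plain substrings, so review the result before using it.

```python
def comment_out_lines(source, needles):
    """Prefix '#' to each non-comment line of source that contains one of needles.

    Indentation is preserved, and lines that are already comments are left alone,
    so running the function twice gives the same result as running it once.
    """
    result = []
    for line in source.splitlines(keepends=True):
        stripped = line.lstrip()
        is_target = any(needle in stripped for needle in needles)
        if is_target and not stripped.startswith("#"):
            indent = line[: len(line) - len(stripped)]
            result.append(indent + "#" + stripped)
        else:
            result.append(line)
    return "".join(result)


# Shortened stand-in for the relevant part of train.py (illustration only).
SAMPLE = (
    'if target == "Ascend":\n'
    "    device_id = int(os.getenv('DEVICE_ID', '0'))\n"
    "    ms.set_context(device_id=device_id)\n"
    "    init()\n"
)

# Substrings identifying the two lines that this guide says to comment out.
NEEDLES = [
    "device_id = int(os.getenv('DEVICE_ID'",
    "ms.set_context(device_id=device_id)",
]

if __name__ == "__main__":
    print(comment_out_lines(SAMPLE, NEEDLES))
```

To apply it to a real file, read train.py into a string, pass it through the function, and write the result back after keeping a backup of the original.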