Use on the CLI (Other Schedulers)

The procedure for using other schedulers on the CLI is the same as for Volcano; only the job YAML files differ. Prepare and use the corresponding YAML files by referring to Use on the CLI (Volcano).

Procedure

  1. Upload the YAML file to any directory on the management node and modify the file content as required.

    No YAML example is provided for cluster scheduling with other schedulers. Instead, obtain the Volcano YAML example and modify it as follows.

    The following uses tensorflow_standalone_acjob.yaml as an example to show how to create a single-server training job that uses eight processors on an Atlas 800T A2 training server. The required changes are noted in the comments:
    apiVersion: mindxdl.gitee.com/v1
    kind: AscendJob
    metadata:
      name: default-test-tensorflow
      labels:
        framework: tensorflow
        ring-controller.atlas: ascend-{xxx}b   
    spec:
      schedulerName: volcano        # Delete this field when using other schedulers.
      runPolicy:                    # Delete this field when using other schedulers.
        schedulingPolicy:           
          minAvailable: 1
          queue: default
      successPolicy: AllWorkers
      replicaSpecs:
        Chief:
          replicas: 1
          restartPolicy: Never
          template:
            metadata:
              labels:
                ring-controller.atlas: ascend-{xxx}b   
            spec:
              nodeSelector:
                host-arch: huawei-arm
                accelerator-type: module-{xxx}b-8
              containers:
              - name: ascend                    
    ...
                env:
    ...
              # ASCEND_VISIBLE_DEVICES is not supported when other schedulers are used. Delete the following env entry (four lines):
              - name: ASCEND_VISIBLE_DEVICES                       
                valueFrom:
                  fieldRef:
                    fieldPath: metadata.annotations['huawei.com/Ascend910']               
    ...
                resources:
                  limits:
                    huawei.com/Ascend910: 8
                  requests:
                    huawei.com/Ascend910: 8
                volumeMounts:
    ...
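    For reference, the fragment below sketches what the example above might look like once the Volcano-specific parts are removed. It assumes that deleting runPolicy also removes its nested schedulingPolicy block. Parts elided in the original are only indicated by comments, so treat this as an illustration rather than a complete job file.

    ```yaml
    apiVersion: mindxdl.gitee.com/v1
    kind: AscendJob
    metadata:
      name: default-test-tensorflow
      labels:
        framework: tensorflow
        ring-controller.atlas: ascend-{xxx}b
    spec:
      # schedulerName and runPolicy removed
      # (assumption: runPolicy is deleted together with its schedulingPolicy block).
      successPolicy: AllWorkers
      replicaSpecs:
        Chief:
          replicas: 1
          restartPolicy: Never
          template:
            metadata:
              labels:
                ring-controller.atlas: ascend-{xxx}b
            spec:
              nodeSelector:
                host-arch: huawei-arm
                accelerator-type: module-{xxx}b-8
              containers:
              - name: ascend
                # Other container fields unchanged (elided in the original).
                # env: keep the other entries; the ASCEND_VISIBLE_DEVICES entry is removed.
                resources:
                  limits:
                    huawei.com/Ascend910: 8
                  requests:
                    huawei.com/Ascend910: 8
                # volumeMounts unchanged (elided in the original).
    ```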
  2. To configure CPU and memory resources, manually add the cpu and memory parameters and their values by referring to the following example. Set the specific values as required.
    ...
              resources:  
                requests:
                  huawei.com/Ascend910: 8
                  cpu: 100m            
                  memory: 100Gi      
                limits:
                  huawei.com/Ascend910: 8
                  cpu: 100m
                  memory: 100Gi
    ...
  3. Modify the mount paths of the training script and code.

    The base image pulled from the Ascend image repository does not contain files such as the training script and code. During training, these files are usually mounted into the container.

              volumeMounts:
              - name: ascend-910-config
                mountPath: /user/serverid/devindex/config
              - name: code
                mountPath: /job/code                     # Path of the training script in the container.
              - name: data
                mountPath: /job/data                      # Path of the training dataset in the container.
              - name: output
                mountPath: /job/output                    # Path of the training output in the container.
  4. As shown below, the three parameters following bash train_start.sh in the YAML file are, in order: the directory of the training code in the container, the output directory (which holds the redirected log file and the TensorFlow model file), and the path of the startup script relative to the code directory. The subsequent parameters starting with -- are required by the training script. For details about how to modify the single-server and distributed training scripts and their parameters, see the model description in the model script source.
    • TensorFlow command parameters
      command:
      - "/bin/bash"
      - "-c"
      - "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ tensorflow/resnet_ctl_imagenet_main.py --data_dir=/job/data/imagenet_TF --distribution_strategy=one_device --use_tf_while_loop=true --epochs_between_evals=1 --skip_eval --enable_checkpoint_and_export;"
      ...
    • PyTorch command parameters
      command:
      - "/bin/bash"
      - "-c"
      - "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --lr=1.6 --world-size=1 --dist-backend='hccl' --multiprocessing-distributed --epochs=90 --batch-size=1024;"
      ...
    • MindSpore command parameters
      command:
      - "/bin/bash"
      - "-c"
      - "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ train.py  --config_path=/job/code/config/resnet50_imagenet2012_config.yaml --output_dir=/job/output --run_distribute=True --device_num=8 --data_path=/job/data/imagenet/train"
      ...
    The TensorFlow command parameters are used as an example.
    • /job/code/: path of the training script in the container, which is defined in Step 3.
    • /job/output/: path of the training output in the container, which is defined in Step 3.
    • tensorflow/resnet_ctl_imagenet_main.py: path of the training startup script, relative to the code directory.
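    To make the three positional parameters easier to spot, the TensorFlow command can also be written as a YAML folded scalar with one token per line. This is only a layout sketch using the values from the example: the folded form (>-) joins the lines with single spaces, so the resulting shell command should be equivalent to the one-line string.

    ```yaml
    command:
    - "/bin/bash"
    - "-c"
    # After train_start.sh, in order: code directory, output directory,
    # startup script path relative to the code directory, then the -- script parameters.
    - >-
      cd /job/code/scripts;
      chmod +x train_start.sh;
      bash train_start.sh
      /job/code/
      /job/output/
      tensorflow/resnet_ctl_imagenet_main.py
      --data_dir=/job/data/imagenet_TF
      --distribution_strategy=one_device
      --use_tf_while_loop=true
      --epochs_between_evals=1
      --skip_eval
      --enable_checkpoint_and_export;
    ```

    Keep every line of the folded scalar at the same indentation; more deeply indented lines are not folded and would introduce line breaks into the command.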
  5. If NFS is used, specify the NFS server address, training dataset path, script path, and training output path in the YAML file as required. If NFS is not used, modify the volume configuration by referring to the Kubernetes documentation.
    ...
              volumeMounts:
              - name: ascend-910-config
                mountPath: /user/serverid/devindex/config
              - name: code
                mountPath: /job/code                     # Path of the training script in the container.
              - name: data
                mountPath: /job/data                      # Path of the training dataset in the container.
              - name: output
                mountPath: /job/output                    # Path of the training output in the container.
    ...
            volumes:
    ...
            - name: code
              nfs:
                server: 127.0.0.1        # IP address of the NFS server.
                path: "xxxxxx"           # Training script path.
            - name: data
              nfs:
                server: 127.0.0.1
                path: "xxxxxx"           # Training dataset path.
            - name: output
              nfs:
                server: 127.0.0.1
                path: "xxxxxx"           # Training output path (where the trained model is saved).
    ...
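    If NFS is not used, one option described in the Kubernetes documentation is a hostPath volume, which mounts a directory from the node's file system. The sketch below is an assumption for illustration only: the /data/... paths are hypothetical and, apart from the output directory, must already exist on the node where the pod runs.

    ```yaml
    volumes:
    - name: code
      hostPath:
        path: /data/code           # Hypothetical node directory holding the training script.
        type: Directory
    - name: data
      hostPath:
        path: /data/dataset        # Hypothetical node directory holding the training dataset.
        type: Directory
    - name: output
      hostPath:
        path: /data/output         # Hypothetical node directory for the training output.
        type: DirectoryOrCreate    # Created on the node if it does not exist.
    ```

    The volume names must still match the names used under volumeMounts. Because hostPath reads from the local node, this approach is better suited to single-server jobs. For distributed training, every node would need the same directories.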