Use on the CLI (Other Schedulers)

Using other schedulers on the CLI follows the same process as using Volcano; only the job YAML file differs. Prepare and use the YAML file by referring to Use on the CLI (Volcano).

Procedure

Upload the YAML file to any directory on the management node and modify the file content as required.

No YAML example is provided for other schedulers. Instead, obtain the Volcano YAML example and modify it as follows.

The following uses tensorflow_standalone_acjob.yaml as an example to describe how to create a single-server training job on an Atlas 800T A2 training server. The job uses eight processors. The modification example is as follows:

apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: default-test-tensorflow
  labels:
    framework: tensorflow
    ring-controller.atlas: ascend-{xxx}b   
spec:
  schedulerName: volcano        # Delete this field when using other schedulers.
  runPolicy:                    # Delete this field, including its schedulingPolicy subfields, when using other schedulers.
    schedulingPolicy:           
      minAvailable: 1
      queue: default
  successPolicy: AllWorkers
  replicaSpecs:
    Chief:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-{xxx}b   
        spec:
          nodeSelector:
            host-arch: huawei-arm
            accelerator-type: module-{xxx}b-8
          containers:
          - name: ascend                    
...
            env:
...
          # ASCEND_VISIBLE_DEVICES is not supported when other schedulers are used. Delete the following env entry (the four lines from name to fieldPath):
          - name: ASCEND_VISIBLE_DEVICES                       
            valueFrom:
              fieldRef:
                fieldPath: metadata.annotations['huawei.com/Ascend910']               
...
            resources:
              limits:
                huawei.com/Ascend910: 8
              requests:
                huawei.com/Ascend910: 8
            volumeMounts:
...
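
For reference, applying the deletions marked above gives roughly the following fragment. This is only a sketch derived from the Volcano example: schedulerName, runPolicy, and the ASCEND_VISIBLE_DEVICES entry are removed, everything else is unchanged, and the parts elided in the example above stay elided here as comment lines.

```yaml
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: default-test-tensorflow
  labels:
    framework: tensorflow
    ring-controller.atlas: ascend-{xxx}b
spec:
  # schedulerName and runPolicy have been deleted for other schedulers.
  successPolicy: AllWorkers
  replicaSpecs:
    Chief:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          labels:
            ring-controller.atlas: ascend-{xxx}b
        spec:
          nodeSelector:
            host-arch: huawei-arm
            accelerator-type: module-{xxx}b-8
          containers:
          - name: ascend
            # ... (elided, as in the example above)
            env:
            # ... (elided; the ASCEND_VISIBLE_DEVICES entry has been deleted)
            resources:
              limits:
                huawei.com/Ascend910: 8
              requests:
                huawei.com/Ascend910: 8
            volumeMounts:
            # ... (elided, as in the example above)
```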

To configure CPU and memory resources, manually add the cpu and memory parameters and their values by referring to the following example. Set the specific values as required.

...
          resources:  
            requests:
              huawei.com/Ascend910: 8
              cpu: 100m            
              memory: 100Gi      
            limits:
              huawei.com/Ascend910: 8
              cpu: 100m
              memory: 100Gi
...

Modify the mount paths of the training script and code.

The base image pulled from the Ascend image repository does not contain files such as the training script and code. During training, these files are usually mounted into the container.

          volumeMounts:
          - name: ascend-910-config
            mountPath: /user/serverid/devindex/config
          - name: code
            mountPath: /job/code                     # Path of the training script in the container.
          - name: data
            mountPath: /job/data                      # Path of the training dataset in the container.
          - name: output
            mountPath: /job/output                    # Path of the training output in the container.

As shown below, the three parameters following bash train_start.sh in the YAML file are, in order: the directory of the training code in the container, the output directory (which holds the redirected log file and the TensorFlow model file), and the path of the startup script relative to the code directory. The subsequent parameters starting with -- are required by the training script. For details about how to modify the single-server and distributed training scripts and script parameters, see the model description in the model script source.

TensorFlow command parameters

command:
- "/bin/bash"
- "-c"
- "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ tensorflow/resnet_ctl_imagenet_main.py --data_dir=/job/data/imagenet_TF --distribution_strategy=one_device --use_tf_while_loop=true --epochs_between_evals=1 --skip_eval --enable_checkpoint_and_export;"
...

PyTorch command parameters

command:
- "/bin/bash"
- "-c"
- "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --lr=1.6 --world-size=1 --dist-backend='hccl' --multiprocessing-distributed --epochs=90 --batch-size=1024;"
...

MindSpore command parameters

command:
- "/bin/bash"
- "-c"
- "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ train.py --config_path=/job/code/config/resnet50_imagenet2012_config.yaml --output_dir=/job/output --run_distribute=True --device_num=8 --data_path=/job/data/imagenet/train"
...

The TensorFlow command parameters are used as an example.

/job/code/: path of the training script in the container, which is defined in Step 3.
/job/output/: path of the training output in the container, which is defined in Step 3.
tensorflow/resnet_ctl_imagenet_main.py: path of the training startup script relative to the code directory.
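
If the single-line command is hard to read, the same string can be laid out with a YAML folded block scalar (>-), which joins the lines with single spaces when the file is parsed. The following is a sketch of the TensorFlow command written this way, not a layout taken from the Volcano example. The three lines after bash train_start.sh are the code directory, the output directory, and the startup script path.

```yaml
command:
- "/bin/bash"
- "-c"
# Folded scalar: the lines below are joined into one shell command string.
- >-
  cd /job/code/scripts;
  chmod +x train_start.sh;
  bash train_start.sh
  /job/code/
  /job/output/
  tensorflow/resnet_ctl_imagenet_main.py
  --data_dir=/job/data/imagenet_TF
  --distribution_strategy=one_device
  --use_tf_while_loop=true
  --epochs_between_evals=1
  --skip_eval
  --enable_checkpoint_and_export;
```

Comments cannot be placed inside the folded scalar, because they would become part of the command string.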

If NFS is used, specify the NFS server address, training dataset path, script path, and training output path in the YAML file as required. If NFS is not used, modify the volume configuration based on the Kubernetes guide.

...
          volumeMounts:
          - name: ascend-910-config
            mountPath: /user/serverid/devindex/config
          - name: code
            mountPath: /job/code                     # Path of the training script in the container.
          - name: data
            mountPath: /job/data                      # Path of the training dataset in the container.
          - name: output
            mountPath: /job/output                    # Path of the training output in the container.
...
        volumes:
...
        - name: code
          nfs:
            server: 127.0.0.1        # IP address of the NFS server.
            path: "xxxxxx"           # Training script path.
        - name: data
          nfs:
            server: 127.0.0.1
            path: "xxxxxx"           # Training dataset path.
        - name: output
          nfs:
            server: 127.0.0.1
            path: "xxxxxx"           # Path for saving the training output, such as the model generated by the script.
...
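
As one possibility when NFS is not used, Kubernetes also supports hostPath volumes, which mount a directory from the file system of the node that runs the pod. The following is a minimal sketch under that assumption. It is not part of the Volcano example, and the /data/atlas/... paths are hypothetical placeholders to be replaced with real directories on the node.

```yaml
        volumes:
        # hostPath variant (assumption, not taken from the Volcano example).
        - name: code
          hostPath:
            path: /data/atlas/code       # Hypothetical node directory holding the training script.
            type: Directory              # The directory must already exist on the node.
        - name: data
          hostPath:
            path: /data/atlas/dataset    # Hypothetical node directory holding the training dataset.
            type: Directory
        - name: output
          hostPath:
            path: /data/atlas/output     # Hypothetical node directory for the training output.
            type: DirectoryOrCreate      # Created on the node if it does not exist.
```

The volume names stay the same as in the NFS example, so the volumeMounts entries do not need to change.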
