Job YAML Configuration Example

For details about the rescheduling mode and graceful fault tolerance mode, see Procedure. If subHealthyStrategy is set to graceExit, adapt the startup script by referring to (Optional) Modifying the Training Script to ensure that the training framework can work with rescheduling.

Prerequisites

You have created a mount path for the hccl.json file. For details, see Step 4.

Procedure

  1. Upload the YAML file to any directory on the management node and modify the file content as required.
    • Take a800_AscendJob_{xxx}b.yaml as an example. Create a distributed training job on the Atlas 200T A2 Box16 heterogeneous subrack node. The job uses 2 x 4 processors. The modification example is as follows:
      apiVersion: mindxdl.gitee.com/v1
      kind: AscendJob
      metadata:
        name: default-test-mindspore
        labels:
          framework: mindspore     # Training framework name
          fault-scheduling: "grace" # Enable the graceful deletion mode
          ring-controller.atlas: ascend-{xxx}b
          fault-retry-times: "3"            # Enable unconditional retry upon service plane faults and set restartPolicy to Never.
          tor-affinity: "normal-schema"      # This label determines whether a job uses the switch affinity scheduling feature. If the value is null or the label is not specified, this feature is not used. large-model-schema indicates a foundation model job or padding job, and normal-schema indicates a common job.
          pod-rescheduling: "on"     # Enable pod-level rescheduling
          subHealthyStrategy: "ignore"  # Ignore nodes in subHealthy status. These nodes will not be used for affinity scheduling.
      spec:
        schedulerName: volcano   # This parameter is valid only when the startup parameter enableGangScheduling of Ascend Operator is set to true.
        runPolicy:
          backoffLimit: 3      # Number of job rescheduling times
          schedulingPolicy:
            minAvailable: 3 # Total number of job replicas
            queue: default     # Queue to which a job belongs
        successPolicy: AllWorkers   # Prerequisites for a successful job
        replicaSpecs:
          Scheduler:
            replicas: 1             # The value can only be 1.
            restartPolicy: Never   # Container restart policy
            template:
              metadata:
                labels:
                  ring-controller.atlas: ascend-{xxx}b  # Product type
              spec:
                terminationGracePeriodSeconds: 360  # Duration from the time when the container receives SIGTERM to the time when the container is forcibly stopped by Kubernetes
                nodeSelector:                       
                  host-arch: huawei-x86          # Atlas 200T A2 Box16 heterogeneous subrack has only the x86_64 architecture
                  accelerator-type: module-{xxx}b-16   # Node type
                containers:
                - name: ascend     # The value cannot be changed
      ...
                  ports:                     # (Optional) Collective communication port for distributed training
                    - containerPort: 2222    
                      name: ascendjob-port 
                  volumeMounts:
      ...
        
          Worker:
            replicas: 2
            restartPolicy: Never  # Container restart policy
            template:
              metadata:
                labels:
                  ring-controller.atlas: ascend-{xxx}b  # Product type
              spec:
                terminationGracePeriodSeconds: 360  # Duration from the time when the container receives SIGTERM to the time when the container is forcibly stopped by Kubernetes
                affinity:
      ...
                nodeSelector:           
                  host-arch: huawei-x86      # Atlas 200T A2 Box16 heterogeneous subrack has only the x86_64 architecture
                  accelerator-type: module-{xxx}b-16   # Node type
                containers:
                - name: ascend    # The value cannot be changed
      ...
                  env:
                  - name: ASCEND_VISIBLE_DEVICES
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.annotations['huawei.com/Ascend910']         # The resource name must match the one used under resources (requests and limits) below.
      ...
      
                  ports:        # (Optional) Collective communication port for distributed training
                    - containerPort: 2222    
                      name: ascendjob-port  
                  resources:
                    limits:
                      huawei.com/Ascend910: 4      # The number of required NPUs is 4
                    requests:
                      huawei.com/Ascend910: 4      # The value must be the same as that of limits.
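      As a reading aid, the fragment below repeats only the count-related fields from the example above and annotates how they relate. The arithmetic in the comments is an interpretation of this particular example (1 Scheduler plus 2 Workers, 4 NPUs per Worker), not an additional rule taken from the product documentation.

      ```yaml
      spec:
        runPolicy:
          schedulingPolicy:
            minAvailable: 3        # 1 Scheduler replica + 2 Worker replicas
        replicaSpecs:
          Scheduler:
            replicas: 1            # The value can only be 1.
          Worker:
            replicas: 2            # 2 Worker pods ...
            template:
              spec:
                containers:
                - name: ascend
                  resources:
                    limits:
                      huawei.com/Ascend910: 4   # ... x 4 NPUs each = the "2 x 4 processors" of this job
                    requests:
                      huawei.com/Ascend910: 4   # Same value as limits
      ```

      If the number of Workers changes, minAvailable in this reading would change with it so that it still equals the total number of replicas.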
    • The following uses a800_vcjob.yaml as an example to describe how to create a single-server training job on an Atlas 800 training server node. The job uses eight processors. The modification example is as follows:
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: rings-config-mindx-dls-test     # The name after rings-config- must be the same as the job name.
      ...
        labels:
          ring-controller.atlas: ascend-910  # Identifies the type of the product
      ...
      ---
      apiVersion: batch.volcano.sh/v1alpha1   # The value cannot be changed. This Volcano API must be used.
      kind: Job                               # Only the job type is supported at present.
      metadata:
        name: mindx-dls-test                  # Job name, which can be customized.
        labels:
          ring-controller.atlas: ascend-910   
          fault-scheduling: "force"        # Enable the forcible deletion mode.
          fault-retry-times: "3"            # Enable unconditional retry upon service plane faults. Set restartPolicy to Never at the same time, and set event to PodFailed and action to Ignore under policies.
          tor-affinity: "normal-schema" # This label determines whether a job uses the switch affinity scheduling feature. If the value is null or the label is not specified, this feature is not used. large-model-schema indicates a foundation model job or padding job, and normal-schema indicates a common job.
          pod-rescheduling: "on"     # Enable pod-level rescheduling
          subHealthyStrategy: "ignore"   # Ignore nodes in subHealthy status. These nodes will not be used for affinity scheduling.
      ...
      spec:
        policies:  # To use pod-level rescheduling, delete policies and its sub-parameters of event and action.
          - event: PodEvicted   # If unconditional retry upon a service plane fault is used (or both pod-level rescheduling and unconditional retry upon a service plane fault are used), set event to PodFailed.
            action: RestartJob  # If unconditional retry upon a service plane fault is used (or both pod-level rescheduling and unconditional retry upon a service plane fault are used), set action to Ignore.
      ...
        minAvailable: 1                  # The value is 1 for a single server.
      ...
        maxRetry: 3              # Number of rescheduling times
      ...
          - name: "default-test"
            replicas: 1                  # The value is 1 for a single server.
            template:
              metadata:
      ...
              spec:
                terminationGracePeriodSeconds: 360  # Duration from the time when the container receives SIGTERM to the time when the container is forcibly stopped by Kubernetes
      ...
                  env:
      ...
                  - name: ASCEND_VISIBLE_DEVICES                       # This field is used by Ascend Docker Runtime.
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.annotations['huawei.com/Ascend910']               # The resource name must match the one used under resources (requests and limits) below.
      ...
                  resources:  
                    requests:
                      huawei.com/Ascend910: 8          # The number of required NPUs is 8. You can add lines below to configure resources such as memory and CPU.
                    limits:
                      huawei.com/Ascend910: 8          # The value must be consistent with that in requests.
      ...
                nodeSelector:
                  host-arch: huawei-arm               # Optional value. Set it as required.
                  accelerator-type: module      # Schedule to the Atlas 800 training server node
      ...
                restartPolicy: Never   # Container restart policy
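      The comments on policies above describe a substitution that is easy to misread. As a minimal sketch, a job that uses unconditional retry upon service plane faults (alone or together with pod-level rescheduling) would carry the following values. The fragment only restates the comments in the example above; it is not copied from a shipped sample file, and all other fields are omitted.

      ```yaml
      metadata:
        labels:
          fault-retry-times: "3"   # Unconditional retry upon service plane faults
      spec:
        policies:
          - event: PodFailed       # Replaces PodEvicted
            action: Ignore         # Replaces RestartJob
      ```

      restartPolicy stays at Never in this case, as noted in the fault-retry-times comment. If only pod-level rescheduling is used, the comments above call for deleting policies altogether.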
  2. Configure the communication address of MindIO by adding the POD_IP environment variable to the job YAML file:
    ...
       Master:
    ...
                env:        
                  - name: POD_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.podIP             # Used for MindIO communication. If this parameter is not set, the training job cannot be started properly.
  3. (Optional) If dying gasp is enabled, add the port information for dying gasp communication to the training YAML file. The following uses pytorch_multinodes_acjob_{xxx}b.yaml as an example. Add the following information in bold:
    ...
       Master:
    ...
                env:
                  - name: TTP_PORT
                    value: "8000"    # Used for dying gasp communication. Ensure that the value is consistent.
    ...
                ports:
                  - containerPort: 2222
                    name: ascendjob-port
                  - containerPort: 8000         # Used for dying gasp communication. Ensure that the value is consistent.
                    name: ttp-port
                  - containerPort: 9601     # Port for communication between TaskD pods.
                    name: taskd-port
    ...
       Worker:
    ...
                env:
                  - name: TTP_PORT
                    value: "8000"             # Used for dying gasp communication. Ensure that the value is consistent.
    ...
                ports:
                  - containerPort: 2222
                    name: ascendjob-port
                  - containerPort: 8000         # Used for dying gasp communication. Ensure that the value is consistent.
                    name: ttp-port
                  - containerPort: 9601     # Port for communication between TaskD pods.
                    name: taskd-port
    
    ...
  4. (Optional) If dying gasp and process-level recovery are used, add information such as the dying gasp communication port and the process-level recovery switch to the training YAML file. The following uses pytorch_multinodes_acjob_{xxx}b.yaml as an example. Add the following content in bold:
    ...
      labels:    
           framework: pytorch   
           ring-controller.atlas: ascend-{xxx}b    
           fault-scheduling: "grace"    
           fault-retry-times: "10"   # Enable the unconditional retry function.
           pod-rescheduling: "on"   # Enable pod-level rescheduling.
           tor-affinity: "null" # This label determines whether a job uses the switch affinity scheduling feature. If the value is null or the label is not specified, this feature is not used. large-model-schema indicates a foundation model job or padding job, and normal-schema indicates a common job.
    ...
      annotations:  
         ...  
         recover-strategy: "recover,dump"
      replicaSpecs:    
          Master:     
            replicas: 1      
            restartPolicy: Never      
            template:        
                metadata:
    ...
                - name: TTP_PORT
                  value: "8000"  # Used for MindIO communication. Ensure that the value is consistent.
             command:                           # Training command, which can be modified
               - /bin/bash
               - -c
             args:
               - |
                 cd /job/code;
                 chmod +x scripts/train_start.sh;
                 bash scripts/train_start.sh
             ports:                          # Defaults to containerPort 2222 with name ascendjob-port if not set
               - containerPort: 2222
                 name: ascendjob-port
               - containerPort: 8000    # Used for MindIO communication. Ensure that the value is consistent.
                 name: ttp-port
               - containerPort: 9601    # Port for communication between TaskD pods.
                 name: taskd-port
    ...
    
    ...
      replicaSpecs:    
          Worker:     
            replicas: 1      
            restartPolicy: Never      
            template:        
                metadata:
    ...
                - name: TTP_PORT
                  value: "8000"  # Used for MindIO communication. Ensure that the value is consistent.
             command:                           # Training command, which can be modified
               - /bin/bash
               - -c
             args:
               - |
                 cd /job/code;
                 chmod +x scripts/train_start.sh;
                 bash scripts/train_start.sh
             ports:                          # Defaults to containerPort 2222 with name ascendjob-port if not set
               - containerPort: 2222
                 name: ascendjob-port
               - containerPort: 8000    # Used for MindIO communication. Ensure that the value is consistent.
                 name: ttp-port
               - containerPort: 9601    # Port for communication between TaskD pods.
                 name: taskd-port
    ...
  5. To use the resumable training function, you are advised to expand the shared memory (/dev/shm) by adding the parameters marked in the comments. The following is an example:
    ...
              volumeMounts:                             # Expanded shared memory for resumable training
              - name: shm
                mountPath: /dev/shm
            volumes:
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: 16Gi
    ...
  6. To configure CPU and memory resources, manually add the cpu and memory parameters and their values by referring to the following example. Set the specific values as required.
    ...
              resources:  
                requests:
                  huawei.com/Ascend910: 8
                  cpu: 100m               
                  memory: 100Gi           
                limits:
                  huawei.com/Ascend910: 8
                  cpu: 100m
                  memory: 100Gi
    ...
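    For comparison, the sketch below sets different request and limit values for cpu and memory. Under standard Kubernetes semantics this is allowed for cpu and memory, whereas an extended resource such as huawei.com/Ascend910 must have equal requests and limits. All cpu and memory figures here are placeholders chosen for illustration, and whether such a setting suits a given training job is not covered by this procedure.

    ```yaml
    resources:
      requests:
        huawei.com/Ascend910: 8   # Extended resource: requests must equal limits
        cpu: "4"                  # Placeholder value
        memory: 32Gi              # Placeholder value
      limits:
        huawei.com/Ascend910: 8
        cpu: "8"                  # Placeholder value; may be higher than the request
        memory: 64Gi              # Placeholder value; may be higher than the request
    ```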
  7. Modify the mount paths of the training script and code.

    The base image pulled from the Ascend image repository does not contain the training script, code, or related files. For training, these files are usually mounted into the container.

              volumeMounts:
              - name: ascend-910-config
                mountPath: /user/serverid/devindex/config
              - name: code
                mountPath: /job/code                     # Path of the training script in the container.
              - name: data
                mountPath: /job/data                      # Path of the training dataset in the container.
              - name: output
                mountPath: /job/output                    # Path of the training output in the container.
  8. (Optional) In the YAML file, the bash train_start.sh training command takes three positional parameters, in this order: the directory of the training code in the container, the output directory (which holds the generated log redirection file and the TensorFlow model file), and the path of the startup script relative to the code directory (the startup script is not involved for PyTorch command parameters). The subsequent parameters starting with -- are required by the training script. For details about how to modify the single-server and distributed training scripts and their parameters, see the model description in the model script source.

    Skip this step if you use the graceful fault tolerance mode.

    • TensorFlow command parameters
      command:
      - "/bin/bash"
      - "-c"
      - "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ tensorflow/resnet_ctl_imagenet_main.py --data_dir=/job/data/imagenet_TF --distribution_strategy=one_device --use_tf_while_loop=true --epochs_between_evals=1 --skip_eval --enable_checkpoint_and_export;"
      ...
    • PyTorch command parameters
      command:
      - "/bin/bash"
      - "-c"
      - "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --lr=1.6 --dist-backend='hccl' --multiprocessing-distributed --epochs=90 --batch-size=1024 --resume=true;"
      ...
    • Skip this step for models that use the MindSpore framework, including the ResNet-50 and Pangu_alpha models.
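    To make the parameter order concrete, the fragment below annotates a shortened form of the TensorFlow command from this step. The mapping in the comments follows the description at the start of this step; the command keeps only one of the original -- options for brevity and is a sketch, not a replacement for the full command.

    ```yaml
    command:
    - "/bin/bash"
    - "-c"
    # Positional parameters of train_start.sh, in order:
    #   1. /job/code/                               directory of the training code in the container
    #   2. /job/output/                             output directory (log redirection file, model file)
    #   3. tensorflow/resnet_ctl_imagenet_main.py   startup script, relative to the code directory
    # Everything from the first "--" option onward is meant for the training script.
    - "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ tensorflow/resnet_ctl_imagenet_main.py --data_dir=/job/data/imagenet_TF;"
    ```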
  9. Set a storage mode as required.
    • (Optional) If NFS is used, specify the NFS server address, training dataset path, script path, and training output path in the YAML file as required. If NFS is not used, modify the configuration based on the Kubernetes guide.

      Do not use the ConfigMap to mount the RankTable file. Otherwise, job rescheduling may fail.

      ...
                volumeMounts:
                - name: ascend-910-config
                  mountPath: /user/serverid/devindex/config
                - name: code
                  mountPath: /job/code                     # Path of the training script in the container.
                - name: data
                  mountPath: /job/data                      # Path of the training dataset in the container.
                - name: output
                  mountPath: /job/output                    # Path of the training output in the container.
      ...
                 # Optional. Use Ascend Operator to generate the RankTable file for a training job. Add the following fields in bold to set the path for storing the hccl.json file in the container. The path cannot be modified.
                - name: ranktable        
                  mountPath: /user/serverid/devindex/config
      ...
              volumes:
      ...
              - name: code
                nfs:
                  server: 127.0.0.1        # IP address of the NFS server.
                  path: "xxxxxx"           # Training script path.
              - name: data
                nfs:
                  server: 127.0.0.1
                  path: "xxxxxx"           # Training dataset path.
              - name: output
                nfs:
                  server: 127.0.0.1
                  path: "xxxxxx"           # Path for saving the script-related model.
      ...
              # Optional. Generate a RankTable file for the PyTorch framework through the necessary component. Add the following fields in bold to set the path for storing the hccl.json file.
              - name: ranktable         # Do not change the default value of this parameter. Ascend Operator checks whether the hccl.json file is mounted.
                hostPath:                   # Use a host path or NFS for mounting.
                  path: /user/mindx-dl/ranktable/default.default-test-pytorch   # Shared storage or local storage path. /user/mindx-dl/ranktable/ is the prefix of the path, which must be the same as the RankTable root directory mounted to the Ascend Operator. default.default-test-pytorch is the suffix of the path. You are advised to change it to namespace.job-name.
      ...
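      The namespace.job-name suffix recommended in the ranktable comment above can be illustrated as follows. The namespace training and the job name my-pytorch-job are hypothetical and exist only for this sketch; the prefix /user/mindx-dl/ranktable/ is the one used in the example above and must still match the RankTable root directory mounted to the Ascend Operator.

      ```yaml
      # Hypothetical job: namespace "training", job name "my-pytorch-job"
      volumes:
      - name: ranktable                  # Name kept unchanged, as required above
        hostPath:
          path: /user/mindx-dl/ranktable/training.my-pytorch-job   # <prefix>/<namespace>.<job-name>
      ```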
    • (Optional) If the local storage mounting mode is used, change the NFS mode in the YAML file to hostPath.
                volumes:
                - name: code
                  hostPath:                                                        # Change it to local storage.
                    path: "/data/atlas_dls/code/resnet/"
                - name: data
                  hostPath:                                                        # Change it to local storage.
                    path: "/data/atlas_dls/public/dataset/"
                - name: output
                  hostPath:                                                        # Change it to local storage.
                    path: "/data/atlas_dls/output/"
                - name: ascend-driver
                  hostPath:
                    path: /usr/local/Ascend/driver
                - name: dshm
                  emptyDir:
                    medium: Memory
                - name: localtime
                  hostPath:
                    path: /etc/localtime
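      Standard Kubernetes hostPath volumes also accept an optional type field. The sketch below adds type: Directory to the code volume from the example above, which makes Kubernetes require that the directory already exists on the node. This field is not part of the original sample and is shown only as an option under that assumption.

      ```yaml
      volumes:
      - name: code
        hostPath:
          path: "/data/atlas_dls/code/resnet/"
          type: Directory    # Optional Kubernetes field, not in the original sample: the path must already exist as a directory
      ```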

(Optional) Modifying the Training Script

If the graceExit policy is enabled, modify the job YAML file and set the fault recovery policy to dump to ensure that TaskD and ClusterD can be used properly.

...  
  labels:  
     ... 
     subHealthyStrategy: "graceExit"
...
   annotations:  
     ...  
     recover-strategy: "dump"
...