YAML Configuration

This section describes how to configure YAML files for full NPU scheduling or static vNPU scheduling. For details about how to configure resource information using environment variables or files, see Resource Information Configuration Using Environment Variables or Resource Information Configuration Using a File.

Resource Information Configuration Using Environment Variables

In this scenario, you can perform the following operations only after creating the mount path of the hccl.json file. For details, see Step 4.

  1. Upload the YAML file to any directory on the management node and modify the file content as required.
    • Refer to this configuration when using the full NPU scheduling feature. The following uses tensorflow_standalone_acjob.yaml as an example to describe how to create a single-server training job on an Atlas 800 training server. The job uses eight processors. The modification example is as follows.
      apiVersion: mindxdl.gitee.com/v1
      kind: AscendJob
      metadata:
        name: default-test-tensorflow
        labels:
          framework: tensorflow  # Training framework
      spec:
        schedulerName: volcano        # This field is valid only when the startup parameter enableGangScheduling of Ascend Operator is set to true.
        runPolicy:
          schedulingPolicy:           # This field is valid only when the startup parameter enableGangScheduling of Ascend Operator is set to true.
            minAvailable: 1        # Total number of job replicas
            queue: default         # Queue to which a job belongs
        successPolicy: AllWorkers    # Prerequisites for a successful job
        replicaSpecs:
          Chief:
            replicas: 1      # Number of job replicas
            restartPolicy: Never
            template:
              spec:
                nodeSelector:
                  host-arch: huawei-arm               # (Optional) Set it as required.
                  accelerator-type: module           # Node type
                containers:
                - name: ascend                          # The value must be ascend and cannot be changed.
                  image: tensorflow-test:latest        # Image name
      ...
                  env:
      ...
                  - name: ASCEND_VISIBLE_DEVICES                       # This field is used by Ascend Docker Runtime.
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.annotations['huawei.com/Ascend910']               # The value must be consistent with resources.requests below.
      ...
                  ports:                           # Collective communication port for distributed training
                    - containerPort: 2222
                      name: ascendjob-port
                  resources:
                    limits:
                      huawei.com/Ascend910: 8 # Number of applied processors
                    requests:
                      huawei.com/Ascend910: 8 # The value is the same as that of limits.
      ...

      After the modification is complete, go to Step 2 to configure other fields of the YAML file.
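
The comment on fieldPath in the example above states that the annotation must be consistent with resources.requests. The following is a minimal Python sketch of that check, assuming the container entry has already been parsed into a dictionary (for example, with a YAML library). The function names are hypothetical and are not part of any Ascend tooling.

```python
import re

# Matches field paths such as metadata.annotations['huawei.com/Ascend910'].
_ANNOTATION_PATH = re.compile(r"^metadata\.annotations\['([^']+)'\]$")


def visible_devices_resource(container):
    """Return the resource name referenced by ASCEND_VISIBLE_DEVICES, or None."""
    for env in container.get("env", []):
        if env.get("name") == "ASCEND_VISIBLE_DEVICES":
            path = env.get("valueFrom", {}).get("fieldRef", {}).get("fieldPath", "")
            match = _ANNOTATION_PATH.match(path)
            return match.group(1) if match else None
    return None


def visible_devices_consistent(container):
    """True if the referenced resource also appears under resources.requests."""
    resource = visible_devices_resource(container)
    requests = container.get("resources", {}).get("requests", {})
    return resource is not None and resource in requests
```

A container that requests huawei.com/Ascend910 but references a different annotation key in fieldPath would fail this check.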

    • Refer to this configuration when using the full NPU scheduling feature. The switch affinity scheduling feature is added to PyTorch and MindSpore. This feature supports both foundation model jobs and common jobs. The following uses pytorch_standalone_acjob.yaml as an example to describe how to create a single-server training job on an Atlas 800 training server. The job uses one processor. The modification example is as follows.
      apiVersion: mindxdl.gitee.com/v1
      kind: AscendJob
      metadata:
        name: default-test-pytorch
        labels:
          framework: pytorch   # Training framework
          tor-affinity: "normal-schema" # This label determines whether a job uses the switch affinity scheduling feature. If the value is null or the label is not specified, this feature is not used. large-model-schema indicates a foundation model job or padding job, and normal-schema indicates a common job.
      spec:
        schedulerName: volcano  # This field is valid only when the startup parameter enableGangScheduling of Ascend Operator is set to true.
        runPolicy:
          schedulingPolicy:           # This field is valid only when the startup parameter enableGangScheduling of Ascend Operator is set to true.
            minAvailable: 1    # Total number of job replicas
            queue: default    # Queue to which a job belongs
        successPolicy: AllWorkers   # Prerequisites for a successful job
        replicaSpecs:
          Master:
            replicas: 1      # Number of job replicas
            restartPolicy: Never
            template:
              spec:
                nodeSelector:
                  host-arch: huawei-arm               # (Optional) Set it as required.
                  accelerator-type: module         # Node type
                containers:
                - name: ascend                    # The value must be ascend and cannot be changed.
                  image: pytorch-test:latest       # Image name
      ...
                  env:
      ...
                  - name: ASCEND_VISIBLE_DEVICES                       # This field is used by Ascend Docker Runtime.
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.annotations['huawei.com/Ascend910']               # The value must be consistent with resources.requests below.
      ...
                  ports:                          # Collective communication port for distributed training
                    - containerPort: 2222         
                      name: ascendjob-port
                  resources:
                    limits:
                      huawei.com/Ascend910: 1    # Number of applied processors for a job
                    requests:
                      huawei.com/Ascend910: 1   # The value is the same as that of limits.
      ...

      After the modification is complete, go to Step 2 to configure other fields of the YAML file.

      In the TensorFlow, PyTorch, and MindSpore frameworks, the value of replicas of Chief, Master, and Scheduler cannot exceed 1. For a single-server job, TensorFlow and PyTorch do not require Worker. For a single-processor job, MindSpore does not require Scheduler.
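
The replica limit in the note above can be checked before a job is submitted. The following is a minimal Python sketch, assuming spec.replicaSpecs has already been parsed into a dictionary. The function name and error format are hypothetical.

```python
# Replica types whose count may not exceed 1, according to the note above.
SINGLETON_ROLES = {"Chief", "Master", "Scheduler"}


def check_singleton_replicas(replica_specs):
    """Return one error string per role that exceeds a single replica.

    `replica_specs` is the parsed spec.replicaSpecs mapping, for example
    {"Master": {"replicas": 1}, "Worker": {"replicas": 3}}.
    """
    errors = []
    for role, spec in replica_specs.items():
        replicas = spec.get("replicas")
        if role in SINGLETON_ROLES and isinstance(replicas, int) and replicas > 1:
            errors.append(f"{role}: replicas is {replicas}, but must not exceed 1")
    return errors
```

An empty list means that no Chief, Master, or Scheduler entry exceeds the limit. Worker entries are not restricted by this check.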

    • Refer to this configuration when using the full NPU scheduling feature. The following uses tensorflow_multinodes_acjob_{xxx}b.yaml as an example to describe how to create a distributed training job on two Atlas 800T A2 training servers. The job uses 2 × 8 processors, and its pods are scheduled to different nodes. The modification example is as follows.
      apiVersion: mindxdl.gitee.com/v1
      kind: AscendJob
      metadata:
        name: default-test-tensorflow        # Job name
        labels:
          framework: tensorflow     # Training framework name
          ring-controller.atlas: ascend-{xxx}b  # Product type
      spec:
        schedulerName: volcano  # This field is valid only when the startup parameter enableGangScheduling of Ascend Operator is set to true.
        runPolicy:
          schedulingPolicy:           # This field is valid only when the startup parameter enableGangScheduling of Ascend Operator is set to true.
            minAvailable: 2   # Total number of job replicas
            queue: default     # Queue to which a job belongs
        successPolicy: AllWorkers  # Prerequisites for a successful job
        replicaSpecs:
          Chief:
            replicas: 1   # Number of job replicas
            restartPolicy: Never
            template:
              metadata:
                labels:
                  ring-controller.atlas: ascend-{xxx}b  # Product type
              spec:
                affinity:                                         # This configuration indicates that pods of the distributed job are scheduled to different nodes.
                  podAntiAffinity:
                    requiredDuringSchedulingIgnoredDuringExecution:
                      - labelSelector:
                          matchExpressions:
                            - key: job-name
                              operator: In
                              values:
                                - default-test-tensorflow         # The value must be consistent with the preceding job name.
                        topologyKey: kubernetes.io/hostname
                nodeSelector:
                  host-arch: huawei-arm               # (Optional) Set it as required.
                  accelerator-type: module-{xxx}b-8   # Node type
                containers:
                - name: ascend                                          # The value must be ascend and cannot be changed.
                  image: tensorflow-test:latest  # Image name
      ...
                  resources:
                    limits:
                      huawei.com/Ascend910: 8     # Number of allocated processors
                    requests:
                      huawei.com/Ascend910: 8     # The value is the same as that of limits.
                  volumeMounts:
      ...
                volumes:
      ...
          Worker:
            replicas: 1   # Number of job replicas
            restartPolicy: Never
            template:
              metadata:
                labels:
                  ring-controller.atlas: ascend-{xxx}b   # Product type
              spec:
                affinity:            # This configuration indicates that pods of the distributed job are scheduled to different nodes.
                  podAntiAffinity:
                    requiredDuringSchedulingIgnoredDuringExecution:
                      - labelSelector:
                          matchExpressions:
                            - key: job-name
                              operator: In
                              values:
                                - default-test-tensorflow        # The value must be consistent with the preceding job name.
                        topologyKey: kubernetes.io/hostname
                nodeSelector:
                  host-arch: huawei-arm               # (Optional) Set it as required.
                  accelerator-type: module-{xxx}b-8  # Node type
                containers:
                - name: ascend                                   # The value must be ascend and cannot be changed.
      ...
                  env:
      ...
                  - name: ASCEND_VISIBLE_DEVICES                       # This field is used by Ascend Docker Runtime.
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.annotations['huawei.com/Ascend910']               # The value must be consistent with resources.requests below.
      ...
                  ports:                          # Collective communication port for distributed training
                    - containerPort: 2222         
                      name: ascendjob-port
                  resources:
                    limits:
                      huawei.com/Ascend910: 8   # Number of allocated processors for the job
                    requests:
                      huawei.com/Ascend910: 8   # The value is the same as that of limits.
                  volumeMounts:
      ...
                volumes:
      ...

      After the modification is complete, go to Step 2 to configure other fields of the YAML file.
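
In the distributed example above, minAvailable is described as the total number of job replicas, and the job-name value in podAntiAffinity must match the job name. The following is a minimal Python sketch of both checks, assuming the AscendJob has been parsed into a dictionary and that the minAvailable comment is read as a requirement. All function names are hypothetical.

```python
def total_replicas(replica_specs):
    """Sum the replicas of every role in spec.replicaSpecs."""
    return sum(spec.get("replicas", 0) for spec in replica_specs.values())


def anti_affinity_job_names(pod_spec):
    """Collect the job-name values used by required pod anti-affinity terms."""
    terms = (
        pod_spec.get("affinity", {})
        .get("podAntiAffinity", {})
        .get("requiredDuringSchedulingIgnoredDuringExecution", [])
    )
    names = []
    for term in terms:
        for expr in term.get("labelSelector", {}).get("matchExpressions", []):
            if expr.get("key") == "job-name":
                names.extend(expr.get("values", []))
    return names


def check_distributed_job(job):
    """Return a list of error strings; an empty list means both checks pass."""
    errors = []
    spec = job["spec"]
    job_name = job["metadata"]["name"]
    expected = total_replicas(spec["replicaSpecs"])
    actual = spec["runPolicy"]["schedulingPolicy"]["minAvailable"]
    if actual != expected:
        errors.append(f"minAvailable is {actual}, but replicas total {expected}")
    for role, replica in spec["replicaSpecs"].items():
        names = anti_affinity_job_names(replica["template"]["spec"])
        # Roles without an anti-affinity rule are skipped.
        if names and names != [job_name]:
            errors.append(f"{role}: anti-affinity job-name {names} != {job_name!r}")
    return errors
```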

    • Refer to this configuration when using the full NPU scheduling feature. The following uses pytorch_standalone_acjob_super_pod.yaml as an example to describe how to create a single-server training job on an Atlas 900 A3 SuperPoD. The modification example is as follows.
      apiVersion: mindxdl.gitee.com/v1
      kind: AscendJob
      metadata:
        name: default-test-pytorch
        labels:
          framework: pytorch    # Framework type
          ring-controller.atlas: ascend-{xxx}b  # Product type
          podgroup-sched-enable: "true" # Configured only when the openFuyao-customized Kubernetes and volcano-ext are used in the cluster. If the value is true, batch scheduling is enabled. If another value is used or this parameter is not set, batch scheduling is disabled and common scheduling is used.
        annotations:
          sp-block: "16"  # The value must be the same as the number of allocated processors.
      spec:
        schedulerName: volcano  # This field is valid only when the startup parameter enableGangScheduling of Ascend Operator is set to true.
        runPolicy:
          schedulingPolicy:           # This field is valid only when the startup parameter enableGangScheduling of Ascend Operator is set to true.
            minAvailable: 1     # Total number of job replicas
            queue: default  # Queue to which a job belongs
        successPolicy: AllWorkers     # Prerequisites for a successful job
        replicaSpecs:
          Master:
            replicas: 1   # Number of job replicas
            restartPolicy: Never
            template:
              metadata:
                labels:
                  ring-controller.atlas: ascend-{xxx}b
              spec:
                nodeSelector:
                  host-arch: huawei-arm      # (Optional) Set it as required.
                  accelerator-type: module-a3-16-super-pod    # Node type
                containers:
                - name: ascend  # The value must be ascend and cannot be changed.
                  image: pytorch-test:latest      # Training base image
                  imagePullPolicy: IfNotPresent
                  env:
      ...
                    - name: ASCEND_VISIBLE_DEVICES     # This field is used by Ascend Docker Runtime.
                      valueFrom:
                        fieldRef:
                          fieldPath: metadata.annotations['huawei.com/Ascend910']               
      ...
                  ports:                     # Collective communication port for distributed training
                    - containerPort: 2222         # determined by user
                      name: ascendjob-port        # do not modify
                  resources:
                    limits:
                      huawei.com/Ascend910: 16 # Number of allocated processors for the job
                    requests:
                      huawei.com/Ascend910: 16   # The value is the same as that of limits.
      ...

      After the modification is complete, go to Step 2 to configure other fields of the YAML file.

    • Refer to this configuration when using the static vNPU scheduling feature. Static vNPU scheduling supports only single-server training jobs. The following uses tensorflow_standalone_acjob.yaml as an example to describe how to create a single-server training job on an Atlas 800 training server, assuming that two AI Cores are allocated to the job. The modification example is as follows.
      apiVersion: mindxdl.gitee.com/v1
      kind: AscendJob
      metadata:
        name: default-test-tensorflow
        labels:
          framework: tensorflow  # Training framework
          ring-controller.atlas: ascend-910  # Processor type
      spec:
        schedulerName: volcano        # This field is valid only when the startup parameter enableGangScheduling of Ascend Operator is set to true.
        runPolicy:
          schedulingPolicy:           # This field is valid only when the startup parameter enableGangScheduling of Ascend Operator is set to true.
            minAvailable: 1   # Total number of job replicas
            queue: default   # Queue to which a job belongs
        successPolicy: AllWorkers  # Prerequisites for a successful job
        replicaSpecs:
          Chief:
            replicas: 1 # Number of job replicas
            restartPolicy: Never
            template:
              metadata:
                labels:
                  ring-controller.atlas: ascend-910   # Processor type
              spec:
                nodeSelector:
                  host-arch: huawei-arm               # (Optional) Set it as required.
                  accelerator-type: module-{xxx}b-8  # Node type
                containers:
                - name: ascend                          # The value must be ascend and cannot be changed.
                  image: tensorflow-test:latest       # Image name
      ...
                  env:
      ...
                  # ASCEND_VISIBLE_DEVICES is not supported by static vNPU scheduling. Delete the following ASCEND_VISIBLE_DEVICES entry:
                  - name: ASCEND_VISIBLE_DEVICES
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.annotations['huawei.com/Ascend910']
      ...
                  ports:                 # Collective communication port for distributed training
                    - containerPort: 2222         
                      name: ascendjob-port
                  resources:
                    limits:
                      huawei.com/Ascend910-2c: 1 # The number must be 1 for vNPU scheduling.
                    requests:
                      huawei.com/Ascend910-2c: 1 # The number must be 1 for vNPU scheduling.
                  volumeMounts:
      ...

      After the modification is complete, go to Step 2 to configure other fields of the YAML file.
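
The two constraints of the static vNPU example above (a request of exactly 1 and no ASCEND_VISIBLE_DEVICES variable) can be checked with a short script. The following is a minimal Python sketch, assuming the container entry has been parsed into a dictionary. The function name is hypothetical, and the default resource name is simply the one used in the example.

```python
def check_static_vnpu_container(container, vnpu_resource="huawei.com/Ascend910-2c"):
    """Return a list of error strings for a container that uses static vNPU scheduling."""
    errors = []
    resources = container.get("resources", {})
    for section in ("requests", "limits"):
        count = resources.get(section, {}).get(vnpu_resource)
        if count != 1:
            errors.append(f"resources.{section}[{vnpu_resource!r}] is {count!r}, expected 1")
    env_names = [env.get("name") for env in container.get("env", [])]
    if "ASCEND_VISIBLE_DEVICES" in env_names:
        errors.append("ASCEND_VISIBLE_DEVICES must be removed for static vNPU scheduling")
    return errors
```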

    • Refer to this configuration when using the full NPU scheduling feature. The following uses mindspore_multinodes_acjob_{xxx}b.yaml as an example to describe how to execute a training job by mounting processors to a scheduler on an Atlas 800T A2 training server. The job uses 2 × 8 processors. The modification example is as follows.
      apiVersion: mindxdl.gitee.com/v1
      kind: AscendJob
      metadata:
        name: default-test-mindspore
        labels:
          framework: mindspore     # Training framework name
          ring-controller.atlas: ascend-{xxx}b  # Product type
      spec:
        schedulerName: volcano    # This field is valid when the startup parameter enableGangScheduling of Ascend Operator is set to true.
        runPolicy:
          schedulingPolicy:      # This field is valid when the startup parameter enableGangScheduling of Ascend Operator is set to true.
            minAvailable: 2  # Total number of job replicas
            queue: default      # Queue to which a job belongs
        successPolicy: AllWorkers  # Prerequisites for a successful job
        replicaSpecs:
          Scheduler:
            replicas: 1   # Number of job replicas
            restartPolicy: Never
            template:
              metadata:
                labels:
                  ring-controller.atlas: ascend-{xxx}b  # Product type
              spec:
                hostNetwork: true    # Optional. Set this parameter as required. true indicates that the host IP can be used to create a pod, and false indicates that the host IP cannot be used to create a pod.
                affinity:                                         #  This configuration indicates that pods of a distributed job are scheduled to different nodes.
                  podAntiAffinity:
                    requiredDuringSchedulingIgnoredDuringExecution:
                      - labelSelector:
                          matchExpressions:
                            - key: job-name
                              operator: In
                              values:
                                - default-test-mindspore         # The value must be consistent with the preceding job name.
                        topologyKey: kubernetes.io/hostname
                nodeSelector:
                  host-arch: huawei-arm              # (Optional) Set it as required.
                  accelerator-type: module-{xxx}b-8   # Node type
                containers:
                - name: ascend                                     # The value must be ascend and cannot be changed.
                  image: mindspore-test:latest  # Image name
                  imagePullPolicy: IfNotPresent
      ...
                  env:                                    
                    - name: HCCL_IF_IP                    # (Optional) Set this parameter as required.
                      valueFrom:                          # If hostNetwork is set to true, also configure the HCCL_IF_IP environment variable.
                        fieldRef:                         # If hostNetwork is not configured or is set to false, the HCCL_IF_IP environment variable cannot be configured.
                          fieldPath: status.hostIP        # 
      ...            
                  ports:                          # Collective communication port for distributed training
                    - containerPort: 2222         
                      name: ascendjob-port
                  resources:
                    limits:
                      huawei.com/Ascend910: 8 # Number of allocated processors
                    requests:
                      huawei.com/Ascend910: 8 # The value is the same as that of limits.
                  volumeMounts:
      ...            
                volumes:
      ...            
          Worker:
            replicas: 1   # Number of job replicas
            restartPolicy: Never
            template:
              metadata:
                labels:
                  ring-controller.atlas: ascend-{xxx}b   # Product type
              spec:
                hostNetwork: true    # Optional. Set this parameter as required. true indicates that the host IP can be used to create a pod, and false indicates that the host IP cannot be used to create a pod.
                affinity:            # This configuration indicates that pods of a distributed job are scheduled to different nodes.
                  podAntiAffinity:
                    requiredDuringSchedulingIgnoredDuringExecution:
                      - labelSelector:
                          matchExpressions:
                            - key: job-name
                              operator: In
                              values:
                                - default-test-mindspore        # The value must be consistent with the preceding job name.
                        topologyKey: kubernetes.io/hostname
                nodeSelector:
                  host-arch: huawei-arm              # (Optional) Set it as required.
                  accelerator-type: module-{xxx}b-8  # Node type
                containers:
                - name: ascend                            # The value must be ascend and cannot be changed.
      ...
                  env:                                    
                    - name: HCCL_IF_IP                    # (Optional) Set this parameter as required.
                      valueFrom:                          # If hostNetwork is set to true, also configure the HCCL_IF_IP environment variable.
                        fieldRef:                         # If hostNetwork is not configured or is set to false, the HCCL_IF_IP environment variable cannot be configured.
                          fieldPath: status.hostIP        # 
      ...
                    - name: ASCEND_VISIBLE_DEVICES                       # This field is used by Ascend Docker Runtime.
                      valueFrom:
                        fieldRef:
                          fieldPath: metadata.annotations['huawei.com/Ascend910']                # The value must be consistent with resources.requests below.
      ...
                  ports:                          # Collective communication port for distributed training
                    - containerPort: 2222         
                      name: ascendjob-port
                  resources:
                    limits:
                      huawei.com/Ascend910: 8 # Number of allocated processors
                    requests:
                      huawei.com/Ascend910: 8 # The value is the same as that of limits.
                  volumeMounts:
      ...
                volumes:
      ...

    The operations for configuring the YAML files for full NPU scheduling or static vNPU scheduling are different only in step 1. The operations after step 1 are the same for both scheduling features.
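
The MindSpore example in step 1 ties HCCL_IF_IP to hostNetwork: the variable is configured when hostNetwork is true and is not configured otherwise. The following is a minimal Python sketch of that pairing, assuming the pod spec has been parsed into a dictionary. The function name is hypothetical.

```python
def check_host_network_env(pod_spec):
    """Return error strings when hostNetwork and HCCL_IF_IP are not set together."""
    # A missing hostNetwork field is treated the same as false.
    host_network = pod_spec.get("hostNetwork", False) is True
    errors = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        has_ip = any(env.get("name") == "HCCL_IF_IP" for env in container.get("env", []))
        if host_network and not has_ip:
            errors.append(f"{name}: hostNetwork is true, but HCCL_IF_IP is not configured")
        if not host_network and has_ip:
            errors.append(f"{name}: HCCL_IF_IP is configured, but hostNetwork is not true")
    return errors
```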

  2. To configure CPU and memory resources, manually add the cpu and memory parameters and their values by referring to the following example. Set the specific values as required.
    ...
              resources:  
                requests:
                  huawei.com/Ascend910: 8
                  cpu: 100m            
                  memory: 100Gi      
                limits:
                  huawei.com/Ascend910: 8
                  cpu: 100m
                  memory: 100Gi
    ...
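
The examples in step 1 state that the NPU value under requests is the same as that under limits. The following is a minimal Python sketch of that comparison, assuming the resources block has been parsed into a dictionary and that NPU resource names start with huawei.com/ as in the examples. The function names are hypothetical.

```python
def npu_resources(section, prefix="huawei.com/"):
    """Keep only the entries whose resource name starts with `prefix`."""
    return {name: count for name, count in section.items() if name.startswith(prefix)}


def npu_requests_match_limits(resources, prefix="huawei.com/"):
    """True if every NPU resource has the same count under requests and limits."""
    requests = npu_resources(resources.get("requests", {}), prefix)
    limits = npu_resources(resources.get("limits", {}), prefix)
    return requests == limits
```

The cpu and memory entries are ignored by this check, so they can differ between requests and limits without affecting the result.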
  3. Modify the mount paths of the training script and code.

    The base image pulled from the Ascend image repository does not contain files such as the training script and code. During training, these files are usually mounted into the container.

              volumeMounts:
              - name: ascend-server-config
                mountPath: /user/serverid/devindex/config
              - name: code
                mountPath: /job/code                     # Path of the training script in the container.
              - name: data
                mountPath: /job/data                      # Path of the training dataset in the container.
              - name: output
                mountPath: /job/output                    # Path of the training output in the container.
  4. As shown below, the three parameters following bash train_start.sh in the YAML file are, in order, the directory of the training code in the container, the output directory (which contains the generated log redirection file and the TensorFlow model file), and the path of the startup script relative to the code directory. The subsequent parameters starting with -- are required by the training script. For details about how to modify the single-server and distributed training scripts and script parameters, see the model description in the model script source.
    • TensorFlow command parameters
      command:
        - /bin/bash
        - -c
      args: [ "cd /job/code/scripts; chmod +x train_start.sh; bash train_start.sh /job/code/ /job/output/ tensorflow/resnet_ctl_imagenet_main.py --data_dir=/job/data/resnet50/imagenet_TF/ --distribution_strategy=one_device --use_tf_while_loop=true --epochs_between_evals=1 --skip_eval --enable_checkpoint_and_export" ]
      ...
    • PyTorch command parameters
      command:
        - /bin/bash
        - -c
      args: ["cd /job/code/scripts; chmod +x train_start.sh; bash train_start.sh /job/code /job/output main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --epochs=90 --batch-size=512"]
      ...
    • MindSpore command parameters
      command:
        - /bin/bash
        - -c
      args: ["cd /job/code/scripts; chmod +x train_start.sh; bash train_start.sh /job/code/ /job/code/output train.py  --data_path=/job/data/resnet50/imagenet/train --config=/job/code/config/resnet50_imagenet2012_config.yaml"]
      ...
      The TensorFlow command parameters are used as an example.
      • /job/code/: path of the training script in the container, which is defined in Step 3.
      • /job/output/: path of the training output in the container, which is defined in Step 3.
      • tensorflow/resnet_ctl_imagenet_main.py: path of the training startup script.
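
To make the positional layout of the train_start.sh parameters concrete, the following minimal Python sketch splits an args string into the code directory, the output directory, the startup script, and the remaining script parameters. It assumes the commands are chained with semicolons as in the examples above and that no semicolon appears inside quotes. The function name is hypothetical.

```python
import shlex


def parse_train_start(args_string):
    """Return the train_start.sh parameters as a dict, or None if the call is not found."""
    # The args string chains several shell commands with ';'.
    for command in args_string.split(";"):
        tokens = shlex.split(command)
        if len(tokens) >= 5 and tokens[0] == "bash" and tokens[1] == "train_start.sh":
            return {
                "code_dir": tokens[2],       # Directory of the training code in the container
                "output_dir": tokens[3],     # Output directory
                "script": tokens[4],         # Startup script, relative to the code directory
                "script_args": tokens[5:],   # Parameters passed on to the training script
            }
    return None
```

For the PyTorch example above, code_dir is /job/code, output_dir is /job/output, and script is main.py.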
  5. If NFS is used, specify the NFS server address, training dataset path, script path, and training output path in the YAML file as required. If NFS is not used, modify the configuration based on the Kubernetes guide.
    ...
              volumeMounts:
              - name: ascend-server-config
                mountPath: /user/serverid/devindex/config
              - name: code
                mountPath: /job/code                     # Path of the training script in the container.
              - name: data
                mountPath: /job/data                      # Path of the training dataset in the container.
              - name: output
                mountPath: /job/output                    # Path of the training output in the container.
    ...
              # Optional. To generate a RankTable file for the training job through the necessary component, add the following ranktable entry, which sets the path for storing the hccl.json file in the container. The path cannot be modified.
              - name: ranktable
                mountPath: /user/serverid/devindex/config
    ...
            volumes:
    ...
            - name: code
              nfs:
                server: 127.0.0.1        # IP address of the NFS server.
                path: "xxxxxx"           # Training script path.
            - name: data
              nfs:
                server: 127.0.0.1
                path: "xxxxxx"           # Training dataset path.
            - name: output
              nfs:
                server: 127.0.0.1
                path: "xxxxxx"           # Set the path for saving the script-related model.
    ...
            # Optional. To generate RankTable files for the PyTorch and MindSpore frameworks through the necessary component, add the following ranktable volume, which sets the path for storing the hccl.json file.
            - name: ranktable           # Do not change the default value of this parameter. Ascend Operator is used to check whether the hccl.json file is mounted.
              hostPath:                 # Use a host path or NFS for mounting.
                path: /user/mindx-dl/ranktable/default.default-test-pytorch   # shared storage or local storage path. /user/mindx-dl/ranktable/ is the prefix of the path, which must be the same as the RankTable root directory mounted by Ascend Operator. default.default-test-pytorch is the suffix of the path. You are advised to change it to namespace.job-name.
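
The host path of the ranktable volume above consists of a fixed prefix and a namespace.job-name suffix. The following is a minimal Python sketch that builds the path, assuming the prefix shown in the example matches the RankTable root directory mounted by Ascend Operator. The function name is hypothetical.

```python
import posixpath

# Prefix taken from the example above. It must match the RankTable root
# directory mounted by Ascend Operator.
RANKTABLE_ROOT = "/user/mindx-dl/ranktable"


def ranktable_host_path(namespace, job_name, root=RANKTABLE_ROOT):
    """Build <root>/<namespace>.<job-name> for the ranktable volume."""
    return posixpath.join(root, f"{namespace}.{job_name}")
```

For a job named default-test-pytorch in the default namespace, the function returns the path used in the example.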

Resource Information Configuration Using a File

  1. Upload the YAML file to any directory on the management node and modify the file content as required.
    Table 2 Operation reference

    Feature                   Operation Example
    Full NPU scheduling       Creating a Single-Server Job on an Atlas 800 Training Server
                              Creating a Distributed Job on an Atlas 800 Training Server
    Full NPU scheduling       Creating a Distributed Job on an Atlas 800T A2 Training Server
                              NOTE: To use switch affinity scheduling supported by the PyTorch or MindSpore framework, see the configuration example of switch affinity scheduling below.
    Static vNPU scheduling    Creating a Single-Server Job on an Atlas 800 Training Server

    • Refer to this configuration when using the full NPU scheduling feature. The following uses a800_tensorflow_vcjob.yaml as an example to describe how to create a single-server training job on an Atlas 800 training server. The job uses eight processors. The modification example is as follows.
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: rings-config-mindx-dls-test     # The name after rings-config- must be the same as the job name.
      ...
        labels:
          ring-controller.atlas: ascend-910    # Processor type used by a job
      ...
      ---
      apiVersion: batch.volcano.sh/v1alpha1   # The value cannot be changed. The Volcano API must be used.
      kind: Job                               # The type can only be Job.
      metadata:
        name: mindx-dls-test                  # Job name, which can be customized.
      ...
      spec:
        minAvailable: 1                  # The value is 1 for a single server.
      ...
          - name: "default-test"
            replicas: 1                  # The value is 1 for a single server.
            template:
              metadata:
      ...
              spec:
      ...
                containers:
                - image: tensorflow-test:latest   # Image name
      ...
                  env:
      ...
                  - name: ASCEND_VISIBLE_DEVICES                       # This field is used by Ascend Docker Runtime.
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.annotations['huawei.com/Ascend910']               # The value must be consistent with resources.requests below.
      ...
                  resources:  
                    requests:
                      huawei.com/Ascend910: 8          # The number of required NPUs is 8.
                    limits:
                      huawei.com/Ascend910: 8          # The value must be consistent with that in requests.
      ...
                  nodeSelector:
                    host-arch: huawei-arm              # (Optional) Set it as required.
                    accelerator-type: module        # Schedule to Atlas 800 training server.
      ...

      After the modification is complete, go to Step 2 to configure other fields of the YAML file.
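The comments in the example above state two consistency rules: the ConfigMap name must be rings-config- followed by the job name, and the NPU count in limits must equal the count in requests. The following is a minimal sketch of how these rules could be checked, assuming the YAML file has already been loaded into plain Python dictionaries. The function name check_single_server_job is hypothetical and is not part of any MindX DL tool.

```python
def check_single_server_job(configmap, job, npu="huawei.com/Ascend910"):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []
    job_name = job["metadata"]["name"]
    expected = "rings-config-" + job_name
    # Rule 1: the ConfigMap name is rings-config-<job name>.
    if configmap["metadata"]["name"] != expected:
        problems.append(f"ConfigMap name should be {expected}")
    # Rule 2: the NPU count in limits equals the count in requests.
    for task in job["spec"]["tasks"]:
        for container in task["template"]["spec"]["containers"]:
            resources = container.get("resources", {})
            requested = resources.get("requests", {}).get(npu)
            limit = resources.get("limits", {}).get(npu)
            if requested != limit:
                problems.append(f"{npu}: requests={requested} but limits={limit}")
    return problems


# Dictionaries mirroring the fields shown in the example above.
configmap = {"metadata": {"name": "rings-config-mindx-dls-test"}}
job = {
    "metadata": {"name": "mindx-dls-test"},
    "spec": {
        "minAvailable": 1,
        "tasks": [{
            "name": "default-test",
            "replicas": 1,
            "template": {"spec": {"containers": [{
                "image": "tensorflow-test:latest",
                "resources": {
                    "requests": {"huawei.com/Ascend910": 8},
                    "limits": {"huawei.com/Ascend910": 8},
                },
            }]}},
        }],
    },
}
print(check_single_server_job(configmap, job))  # prints []
```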

    • Refer to this configuration when using the full NPU scheduling feature. The following uses a800_tensorflow_vcjob.yaml as an example to describe how to create a distributed training job on two Atlas 800 training servers. The distributed job uses 2 × 8 processors, and its pods can only be scheduled to different nodes. The modification example is as follows.
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: rings-config-mindx-dls-test     # The name after rings-config- must be the same as the job name.
      ...
        labels:
          ring-controller.atlas: ascend-910    # Processor type used by a job
      ...
      ---
      apiVersion: batch.volcano.sh/v1alpha1   # The value cannot be changed. The Volcano API must be used.
      kind: Job                               # The type can only be job.
      metadata:
        name: mindx-dls-test                  # Job name, which can be customized.
      ...
      spec:
        minAvailable: 2                  # Set the value to 2 in a two-server scenario and to N in an N-server scenario. This parameter is not required for jobs of the Deployment type.
      ...
        - name: "default-test"
          replicas: 2                  # The value is N in an N-node distributed scenario.
          template:
            metadata:
      ...
            spec:
              affinity:                            # This configuration indicates that pods of a distributed job are scheduled to different nodes.
                podAntiAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    - labelSelector:
                        matchExpressions:
                          - key: volcano.sh/job-name      # Fixed field for vcjob. When the job type is Deployment, the key is deploy-name.
                            operator: In                   # Fixed field.
                            values:
                              - mindx-dls-test             # The value must be consistent with the preceding job name.
                      topologyKey: kubernetes.io/hostname
              containers:
              - image: tensorflow-test:latest  # Image name
      ...
                env:
      ...
                - name: ASCEND_VISIBLE_DEVICES                       # This field is used by Ascend Docker Runtime.
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['huawei.com/Ascend910']               # The value must be consistent with resources.requests below.
      ...
                resources:
                  requests:
                    huawei.com/Ascend910: 8          # The number of required NPUs is 8. You can add lines below to configure resources such as memory and CPU.
                  limits:
                    huawei.com/Ascend910: 8          # The value must be consistent with that in requests.
      ...
              nodeSelector:
                host-arch: huawei-arm              # (Optional) Set it as required.
                accelerator-type: module     # Schedule to Atlas 800 training server.
      ...

      After the modification is complete, go to Step 2 to configure other fields of the YAML file.
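For an N-server distributed job, the example above sets minAvailable and replicas to N and requires the anti-affinity rule to select the job's own name. The sketch below checks those three points on a job that has already been loaded into a Python dictionary. It is an illustration only, and the function name check_distributed_job is hypothetical.

```python
def check_distributed_job(job, nodes):
    """Return a list of problems for an N-node distributed vcjob (N = nodes)."""
    problems = []
    job_name = job["metadata"]["name"]
    if job["spec"].get("minAvailable") != nodes:
        problems.append(f"minAvailable should be {nodes}")
    for task in job["spec"]["tasks"]:
        if task.get("replicas") != nodes:
            problems.append(f"task {task['name']}: replicas should be {nodes}")
        affinity = task["template"]["spec"].get("affinity", {})
        terms = affinity.get("podAntiAffinity", {}).get(
            "requiredDuringSchedulingIgnoredDuringExecution", [])
        # Collect every value selected through the volcano.sh/job-name key.
        names = [
            value
            for term in terms
            for expr in term.get("labelSelector", {}).get("matchExpressions", [])
            if expr.get("key") == "volcano.sh/job-name"
            for value in expr.get("values", [])
        ]
        if job_name not in names:
            problems.append(f"task {task['name']}: anti-affinity does not select {job_name}")
    return problems


# Dictionary mirroring the two-server example above.
job = {
    "metadata": {"name": "mindx-dls-test"},
    "spec": {
        "minAvailable": 2,
        "tasks": [{
            "name": "default-test",
            "replicas": 2,
            "template": {"spec": {"affinity": {"podAntiAffinity": {
                "requiredDuringSchedulingIgnoredDuringExecution": [{
                    "labelSelector": {"matchExpressions": [{
                        "key": "volcano.sh/job-name",
                        "operator": "In",
                        "values": ["mindx-dls-test"],
                    }]},
                    "topologyKey": "kubernetes.io/hostname",
                }],
            }}}},
        }],
    },
}
print(check_distributed_job(job, nodes=2))  # prints []
```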

    • Refer to this configuration when using the full NPU scheduling feature. The following uses a800_tensorflow_vcjob.yaml as an example to describe how to create a distributed training job on two Atlas 800T A2 training servers. The distributed job uses 2 × 8 processors, and its pods can only be scheduled to different nodes. The modification example is as follows.
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: rings-config-mindx-dls-test     # The name after rings-config- must be the same as the job name.
      ...
        labels:
          ring-controller.atlas: ascend-{xxx}b   # Product type
      ...
      ---
      apiVersion: batch.volcano.sh/v1alpha1   # The value cannot be changed. The Volcano API must be used.
      kind: Job                               # The type can only be job.
      metadata:
        name: mindx-dls-test                  # Job name
      ...
        labels:
          ring-controller.atlas: ascend-{xxx}b   # The value must be the same as the label in the ConfigMap and cannot be changed.
      ...
      spec:
        minAvailable: 2                      # It is recommended that the value be the same as the number of nodes.
        schedulerName: volcano                # Use Volcano for scheduling.
      ...
        tasks:
        - name: "default-test"
          replicas: 2                         # Number of nodes
          template:
            metadata:
              labels:
                app: tf
                ring-controller.atlas: ascend-{xxx}b  # The value must be the same as the label in the ConfigMap and cannot be changed.
            spec:
              affinity:                                   # This configuration indicates that pods of the distributed job are scheduled to different nodes.
                podAntiAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    - labelSelector:
                        matchExpressions:
                          - key: volcano.sh/job-name      # Fixed field for vcjob. When the job type is Deployment, the key is deploy-name.
                            operator: In                   # Fixed field.
                            values:
                              - mindx-dls-test             # The value must be consistent with the preceding job name.
                      topologyKey: kubernetes.io/hostname
              containers:
              - image: tensorflow-test:latest               # Training framework image. Change it as required.
      ...
                env:
      ...
                - name: XDL_IP                 # This field is fixed.
                  valueFrom:
                    fieldRef:
                      fieldPath: status.hostIP
                - name: framework
                  value: "Tensorflow"          # Change the value based on the specific framework.
                - name: ASCEND_VISIBLE_DEVICES                       # This field is required by Ascend Docker Runtime.
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['huawei.com/Ascend910']               # The value must be consistent with resources.requests below.
      ...
                resources:
                  requests:
                    huawei.com/Ascend910: 8    # Each Atlas 800T A2 training server supports a maximum of eight processors.
                  limits:
                    huawei.com/Ascend910: 8    # Each Atlas 800T A2 training server supports a maximum of eight processors.
      ...
              nodeSelector:
                host-arch: huawei-arm              # (Optional) Set it as required.
                accelerator-type: module-{xxx}b-8          # Scheduling to Atlas 800T A2 training server.
      ...

      After the modification is complete, go to Step 2 to configure other fields of the YAML file.

    • Refer to this configuration when using the full NPU scheduling feature. The switch affinity scheduling feature is added to the PyTorch and MindSpore frameworks. This feature supports both foundation model jobs and common jobs. The following uses a800_pytorch_vcjob.yaml as an example to describe how to create a distributed training job on an Atlas 800T A2 training server. The job uses eight processors. The modification example is as follows.
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: rings-config-mindx-dls-test     # The name after rings-config- must be the same as the job name.
        namespace: vcjob                      
        labels:
          ring-controller.atlas: ascend-{xxx}b   # Product type
      data:
        hccl.json: |
          {
              "status":"initializing"
          }
      ---
      apiVersion: batch.volcano.sh/v1alpha1   # The value cannot be changed. The Volcano API must be used.
      kind: Job                               # The type can only be job.
      metadata:
      ...                 
        labels:
          ring-controller.atlas: ascend-{xxx}b   # The value must be the same as the label in the ConfigMap and cannot be changed.
          fault-scheduling: "force"
          tor-affinity: "normal-schema"      # This label determines whether a job uses the switch affinity scheduling feature. If the value is null or the label is not specified, this feature is not used. large-model-schema indicates a foundation model job or padding job, and normal-schema indicates a common job.
      spec:
        minAvailable: 1                      # It is recommended that the value be the same as the number of nodes.
        schedulerName: volcano                # Use Volcano for scheduling.
      ...
        tasks:
        - name: "default-test"
          replicas: 1                              # Number of nodes
          template:
            metadata:
              labels:
                app: pytorch
                ring-controller.atlas: ascend-{xxx}b  # The value must be the same as the label in the ConfigMap and cannot be changed.
            spec:
              affinity:                            # This configuration indicates that pods of the distributed job are scheduled to different nodes.
                podAntiAffinity:
                  requiredDuringSchedulingIgnoredDuringExecution:
                    - labelSelector:
                        matchExpressions:
                          - key: volcano.sh/job-name
                            operator: In
                            values:
                              - mindx-dls-test
                      topologyKey: kubernetes.io/hostname
              hostNetwork: true
              containers:
              - image: torch:b030               # Training framework image. Change it as required.
                env:
                - name: XDL_IP                  # This field is fixed.
                  valueFrom:
                    fieldRef:
                      fieldPath: status.hostIP
                - name: POD_UID
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.uid
                - name: framework
                  value: "PyTorch"
      ...
                - name: ASCEND_VISIBLE_DEVICES
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['huawei.com/Ascend910']               # The value must be consistent with resources.requests below.
      ...
                resources:
                  requests:
                    huawei.com/Ascend910: 8 # Each Atlas 800T A2 training server supports a maximum of eight processors.
                  limits:
                    huawei.com/Ascend910: 8 # Each Atlas 800T A2 training server supports a maximum of eight processors.
      ...
              nodeSelector:
                host-arch: huawei-x86        # (Optional) Set it as required.
                accelerator-type: module-{xxx}b-8  # Schedule to Atlas 800T A2 training server.
      ...

      For other examples, see Table 4 and Table 5. To modify and adapt the examples, refer to the YAML parameter descriptions in Table 2. After the modification is complete, go to Step 2 to configure other fields of the YAML file.
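The tor-affinity label in the example above has three documented states: large-model-schema, normal-schema, and absent or null (feature not used). A minimal sketch of that decision follows. The function name switch_affinity_mode is hypothetical, treating the literal string "null" the same as a missing label is an assumption, and raising an error for any other value is a choice made for this sketch, not documented scheduler behavior.

```python
# Values documented in the label comment above.
KNOWN_MODES = {
    "large-model-schema": "foundation model job or padding job",
    "normal-schema": "common job",
}


def switch_affinity_mode(labels):
    """Return the job category for switch affinity scheduling, or None if unused."""
    value = labels.get("tor-affinity")
    # Assumption: a missing label, a YAML null, and the string "null" all disable the feature.
    if value is None or value == "null":
        return None
    if value not in KNOWN_MODES:
        # Sketch-only behavior: reject values the documentation does not list.
        raise ValueError(f"unexpected tor-affinity value: {value!r}")
    return KNOWN_MODES[value]


print(switch_affinity_mode({"tor-affinity": "normal-schema"}))  # prints: common job
print(switch_affinity_mode({"fault-scheduling": "force"}))      # prints: None
```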

    • Refer to this configuration when using the static vNPU scheduling feature. Static vNPU scheduling supports only single-server training jobs. The following uses a800_tensorflow_vcjob.yaml as an example to describe how to create a single-server training job on an Atlas 800 training server, assuming that two AI Cores are allocated to the job. The modification example is as follows.
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: rings-config-mindx-dls-test     # The name after rings-config- must be the same as the job name.
      ...
        labels:
          ring-controller.atlas: ascend-910   # Processor type
      ...
      ---
      apiVersion: batch.volcano.sh/v1alpha1   # The value cannot be changed. The Volcano API must be used.
      kind: Job                               # The type can only be job.
      metadata:
        name: mindx-dls-test                  # Job name, which can be customized.
      ...
      spec:
        minAvailable: 1                 # If static vNPU scheduling is used, the value must be 1.
      ...
        - name: "default-test"
          replicas: 1                  # If static vNPU scheduling is used, the value must be 1.
          template:
            metadata:
      ...
            spec:
      ...
              containers:
              - image: tensorflow-test:latest  # Training image
      ...
                env:
      ...
                # ASCEND_VISIBLE_DEVICES is not supported by static vNPU scheduling. Delete the following entry (the four lines from name to fieldPath):
                - name: ASCEND_VISIBLE_DEVICES
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['huawei.com/Ascend910']
      ...
                resources:
                  requests:
                    huawei.com/Ascend910-2c: 1          # If static vNPU scheduling is used, the value must be 1.
                  limits:
                    huawei.com/Ascend910-2c: 1          # If static vNPU scheduling is used, the value must be 1.
      ...
              nodeSelector:
                host-arch: huawei-arm              # (Optional) Set it as required.
                accelerator-type: module    # Schedule to Atlas 800 training server.
      ...

      After the modification is complete, go to Step 2 to configure other fields of the YAML file.
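The static vNPU example differs from the full NPU example in two container-level points: the ASCEND_VISIBLE_DEVICES entry is deleted, and the NPU resource is replaced by a vNPU resource with the value 1. The sketch below applies both changes to a container that has been loaded into a Python dictionary. The function name to_static_vnpu is hypothetical, and the second env entry exists only to show that other entries are kept.

```python
import copy


def to_static_vnpu(container, vnpu="huawei.com/Ascend910-2c"):
    """Return a copy of a container spec adapted for static vNPU scheduling."""
    result = copy.deepcopy(container)
    # ASCEND_VISIBLE_DEVICES is not supported by static vNPU scheduling.
    result["env"] = [
        entry for entry in result.get("env", [])
        if entry.get("name") != "ASCEND_VISIBLE_DEVICES"
    ]
    for section in ("requests", "limits"):
        entries = result.setdefault("resources", {}).setdefault(section, {})
        # Drop full-NPU resources, then request exactly one vNPU.
        for key in [k for k in entries if k.startswith("huawei.com/Ascend")]:
            del entries[key]
        entries[vnpu] = 1
    return result


container = {
    "image": "tensorflow-test:latest",
    "env": [
        {"name": "ASCEND_VISIBLE_DEVICES",
         "valueFrom": {"fieldRef": {
             "fieldPath": "metadata.annotations['huawei.com/Ascend910']"}}},
        {"name": "framework", "value": "Tensorflow"},
    ],
    "resources": {
        "requests": {"huawei.com/Ascend910": 8},
        "limits": {"huawei.com/Ascend910": 8},
    },
}
adapted = to_static_vnpu(container)
print([entry["name"] for entry in adapted["env"]])  # prints ['framework']
print(adapted["resources"]["requests"])             # prints {'huawei.com/Ascend910-2c': 1}
```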

    The operations for configuring the YAML files for full NPU scheduling and static vNPU scheduling differ only in Step 1. The subsequent steps are the same for both scheduling features.

  2. To configure CPU and memory resources, manually add the cpu and memory parameters and their values by referring to the following example. Set the specific values as required.
    ...
              resources:  
                requests:
                  huawei.com/Ascend910: 8
                  cpu: 100m            
                  memory: 100Gi      
                limits:
                  huawei.com/Ascend910: 8
                  cpu: 100m
                  memory: 100Gi
    ...
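When choosing the cpu and memory values, it helps to recall how Kubernetes reads the suffixes used in the example: m means one thousandth of a CPU core, and Gi means 2^30 bytes. The sketch below converts the two example values. It handles only the suffixes listed in the code, not the full Kubernetes quantity syntax, and the function names are hypothetical.

```python
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def cpu_cores(quantity):
    """Convert a CPU quantity such as '100m' or '2' to a number of cores."""
    text = str(quantity)
    if text.endswith("m"):
        return int(text[:-1]) / 1000
    return float(text)


def memory_bytes(quantity):
    """Convert a memory quantity such as '100Gi' to bytes (binary suffixes only)."""
    text = str(quantity)
    for suffix, factor in BINARY_SUFFIXES.items():
        if text.endswith(suffix):
            return int(text[:-len(suffix)]) * factor
    return int(text)


print(cpu_cores("100m"))      # prints 0.1
print(memory_bytes("100Gi"))  # prints 107374182400
```

In other words, cpu: 100m in the example requests one tenth of a core, so adjust it to the actual needs of the training job.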
  3. Modify the mount paths of the training script and code.

    The base image pulled from the Ascend image repository does not contain files such as the training script and code. During training, these files are usually mounted into the container.

              volumeMounts:
              - name: ascend-910-config
                mountPath: /user/serverid/devindex/config
              - name: code
                mountPath: /job/code                     # Path of the training script in the container.
              - name: data
                mountPath: /job/data                      # Path of the training dataset in the container.
              - name: output
                mountPath: /job/output                    # Path of the training output in the container.
  4. As shown below, the three parameters following the bash train_start.sh training command in the YAML file are, in order, the directory of the training code in the container, the output directory (which holds the generated log redirection file and the TensorFlow model file), and the path of the startup script relative to the code directory. The subsequent parameters starting with -- are required by the training script. For details about how to modify the single-server and distributed training scripts and script parameters, see the model description in the model script source.
    • TensorFlow command parameters
      command:
      - "/bin/bash"
      - "-c"
      - "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ tensorflow/resnet_ctl_imagenet_main.py --data_dir=/job/data/imagenet_TF --distribution_strategy=one_device --use_tf_while_loop=true --epochs_between_evals=1 --skip_eval --enable_checkpoint_and_export;"
      ...
    • PyTorch command parameters
      command:
      - "/bin/bash"
      - "-c"
      - "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --lr=1.6 --world-size=1 --dist-backend='hccl' --multiprocessing-distributed --epochs=90 --batch-size=1024;"
      ...
    • MindSpore command parameters
      command:
      - "/bin/bash"
      - "-c"
      - "cd /job/code/scripts;chmod +x train_start.sh;bash train_start.sh /job/code/ /job/output/ train.py  --config_path=/job/code/config/resnet50_imagenet2012_config.yaml --output_dir=/job/output --run_distribute=True --device_num=8 --data_path=/job/data/imagenet/train"
      ...
      The TensorFlow command parameters are used as an example.
      • /job/code/: path of the training script in the container, which is defined in Step 3.
      • /job/output/: path of the training output in the container, which is defined in Step 3.
      • tensorflow/resnet_ctl_imagenet_main.py: path of the training startup script.
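To make the parameter order concrete, the following sketch splits the TensorFlow command string from this step into the three positional parameters and the remaining script parameters. It is only an illustration: the function name split_train_command is hypothetical, and it assumes that the statements are separated by semicolons and that no parameter contains a semicolon.

```python
import shlex


def split_train_command(command):
    """Split the 'bash train_start.sh ...' call into its documented parts."""
    calls = [
        part.strip() for part in command.split(";")
        if part.strip().startswith("bash train_start.sh")
    ]
    if not calls:
        raise ValueError("no 'bash train_start.sh' call found")
    tokens = shlex.split(calls[-1])
    # tokens[0:2] are 'bash' and 'train_start.sh'; the next three are positional.
    code_dir, output_dir, startup_script = tokens[2:5]
    return {
        "code_dir": code_dir,
        "output_dir": output_dir,
        "startup_script": startup_script,
        "script_args": tokens[5:],
    }


command = (
    "cd /job/code/scripts;chmod +x train_start.sh;"
    "bash train_start.sh /job/code/ /job/output/ tensorflow/resnet_ctl_imagenet_main.py "
    "--data_dir=/job/data/imagenet_TF --distribution_strategy=one_device "
    "--use_tf_while_loop=true --epochs_between_evals=1 --skip_eval "
    "--enable_checkpoint_and_export;"
)
parts = split_train_command(command)
print(parts["code_dir"])        # prints /job/code/
print(parts["output_dir"])      # prints /job/output/
print(parts["startup_script"])  # prints tensorflow/resnet_ctl_imagenet_main.py
```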
  5. If NFS is used, specify the NFS server address, training dataset path, script path, and training output path in the YAML file as required. If NFS is not used, modify the configuration based on the Kubernetes guide.
    ...
              volumeMounts:
              - name: ascend-910-config
                mountPath: /user/serverid/devindex/config
              - name: code
                mountPath: /job/code                     # Path of the training script in the container.
              - name: data
                mountPath: /job/data                      # Path of the training dataset in the container.
              - name: output
                mountPath: /job/output                    # Path of the training output in the container.
    ...
            volumes:
    ...
            - name: code
              nfs:
                server: 127.0.0.1        # IP address of the NFS server.
                path: "xxxxxx"           # Training script path.
            - name: data
              nfs:
                server: 127.0.0.1
                path: "xxxxxx"           # Training dataset path.
            - name: output
              nfs:
                server: 127.0.0.1
                path: "xxxxxx"           # Training output path, where the model generated by the script is saved.
    ...
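Each name under volumeMounts must match the name of an entry under volumes, otherwise Kubernetes rejects the pod. The sketch below lists mounts that have no matching volume in a pod spec that has been loaded into a Python dictionary. The function name unmatched_mounts and the NFS paths in the sample data are hypothetical.

```python
def unmatched_mounts(pod_spec):
    """Return the volumeMount names that have no volume with the same name."""
    volume_names = {volume["name"] for volume in pod_spec.get("volumes", [])}
    missing = []
    for container in pod_spec.get("containers", []):
        for mount in container.get("volumeMounts", []):
            if mount["name"] not in volume_names:
                missing.append(mount["name"])
    return missing


# Sample data: the output volume is left out on purpose to show a mismatch.
pod_spec = {
    "containers": [{
        "volumeMounts": [
            {"name": "code", "mountPath": "/job/code"},
            {"name": "data", "mountPath": "/job/data"},
            {"name": "output", "mountPath": "/job/output"},
        ],
    }],
    "volumes": [
        {"name": "code", "nfs": {"server": "127.0.0.1", "path": "/hypothetical/code"}},
        {"name": "data", "nfs": {"server": "127.0.0.1", "path": "/hypothetical/data"}},
    ],
}
print(unmatched_mounts(pod_spec))  # prints ['output']
```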