NPU Training Job

NPU training jobs in "Typical Scenarios" are classified into the following two types, depending on whether Volcano is used as the scheduler:

Basic Process of NPU Training Jobs Using Volcano as the Scheduler

  1. Training jobs require an HCCL configuration file (the ranktable file, also called hccl.json). Create the following ConfigMap and wait for the configuration file to be generated. The rings-config- name prefix and the ring-controller.atlas: ascend-910 label are fixed and cannot be modified. The following is an example. Pay attention to the ConfigMap name: after the rings-config- prefix is removed, the remaining name is mindx-dls-test, which is used as the job name in this example.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: rings-config-mindx-dls-test     # the rings-config- prefix is mandatory; the remainder is the job name
      namespace: vcjob                      # must match the namespace of the training job
      labels:
        ring-controller.atlas: ascend-910   # fixed label; cannot be modified
    data:
      hccl.json: |
        {
            "status":"initializing"
        }
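A training container typically has to wait for this file to be populated before launching. As a minimal sketch (assuming the MindX DL convention that the HCCL controller rewrites the ConfigMap with "status": "completed" once ranks are assigned; verify the exact status values for your version), the container entrypoint could poll the mounted file:

```python
import json
import time

def wait_for_ranktable(path, timeout=300, interval=5):
    """Poll the mounted hccl.json until it is generated.

    Assumes the status field changes from "initializing" to "completed"
    once the HCCL controller has filled in the ranktable; verify this
    against your MindX DL version.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with open(path) as f:
                table = json.load(f)
            if table.get("status") == "completed":
                return table
        except (OSError, ValueError):
            pass  # file not mounted yet, or caught mid-rewrite
        time.sleep(interval)
    raise TimeoutError(f"ranktable at {path} not ready within {timeout}s")
```

With the volume mount shown later in this section, the file would be polled at /user/serverid/devindex/config/hccl.json.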
  2. Create a training job of the vcjob (Volcano Job) or Deployment type.
    • vcjob resource example
      apiVersion: batch.volcano.sh/v1alpha1
      kind: Job
      metadata:
        name: mindx-dls-test
        namespace: vcjob
        labels:
          ring-controller.atlas: ascend-910
      spec:
        minAvailable: 1
        schedulerName: volcano
        maxRetry: 3
        queue: default
        tasks:
        - name: "default-test"
          replicas: 1
          template:
            metadata:
              labels:
                app: tf
                ring-controller.atlas: ascend-910
            spec:
              containers:
              - image: tf_arm64:b030
                imagePullPolicy: IfNotPresent
                name: tf
                env:
                - name: mindx-dls-test
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.name
                - name: XDL_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.hostIP
                command: xxxxxxx
                resources:
                  requests:
                    huawei.com/Ascend910: 8
                  limits:
                    huawei.com/Ascend910: 8
                volumeMounts:
                - name: ascend-910-config
                  mountPath: /user/serverid/devindex/config
              nodeSelector:
                host-arch: huawei-arm
              volumes:
              - name: ascend-910-config
                configMap:
                  name: rings-config-mindx-dls-test
              restartPolicy: OnFailure
      • The value of metadata.name must be the same as the job name derived in step 1. In this example, the value is mindx-dls-test.
      • It is recommended that the values of minAvailable and replicas be the same.
      • Both the job's metadata.labels and the pod template labels under spec.tasks must contain the ring-controller.atlas: ascend-910 label.
      • The schedulerName field must be set to volcano.
      • The NPU resource type must be specified in both the resource request and limit, and the two quantities must be the same. View the node details in the Kubernetes cluster to determine the NPU resource types available on the node, such as physical NPUs or virtual NPUs created by compute splitting.
      • You must mount the ConfigMap generated in step 1 into the container as a file.
      • By default, nodeSelector supports only the key-value pairs configured in the YAML file used when Volcano is started, and the host-arch label must be used. For details about how to add a user-defined selector, see Volcano Scheduling Configuration.
      • Currently, only one container in a pod can use NPUs.
      • Mount the driver directories as required:
        • If the Ascend Device Plugin startup parameter useAscendDocker is set to true and Ascend Docker Runtime is installed and in effect, the driver directories installed in /usr/local/Ascend are mounted automatically.
        • If useAscendDocker is set to false, the driver directories installed in /usr/local/Ascend are not mounted automatically; mount them into the container manually.
      • You need to mount dataset and model code paths, and add other required content, such as environment variables.
      • You need to set the container startup command, which corresponds to the command field in the YAML file. In addition, you need to parse the mounted ConfigMap before starting the job to set necessary environment variables for the training job.
    • Deployment resource example
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: mindx-dls-test
        labels:
          app: tf
          ring-controller.atlas: ascend-910
        namespace: vcjob
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: tf
        template:
          metadata:
            labels:
              app: tf
              ring-controller.atlas: ascend-910
              deploy-name: mindx-dls-test
          spec:
            schedulerName: volcano
            nodeSelector:
              host-arch: huawei-arm
            containers:
              - image: tf_arm64:b030
                imagePullPolicy: IfNotPresent
                name: tf
                env:
                - name: mindx-dls-test
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.name
                - name: XDL_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.hostIP
                command: xxx
                resources:
                  requests:
                    huawei.com/Ascend910: 8
                  limits:
                    huawei.com/Ascend910: 8
                volumeMounts:
                - name: ascend-910-config
                  mountPath: /user/serverid/devindex/config
            volumes:
            - name: ascend-910-config
              configMap:
                name: rings-config-mindx-dls-test
      • The value of metadata.name must be the same as the job name derived in step 1. In this example, the value is mindx-dls-test.
      • replicas indicates the number of pods, one per node. For a single-node job, the value is 1. For a multi-node distributed job, the value is the number of nodes.
      • Both metadata.labels and spec.template of the Deployment must contain the label ring-controller.atlas: ascend-910. In addition, spec.template must contain the label whose key is deploy-name and value is the job name.
      • For the remaining requirements, see the fourth bullet and the content after it under vcjob resource example.
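The ConfigMap parsing mentioned above (parsing the mounted hccl.json before launching the training process) can be sketched as follows. The ranktable field names (server_list, server_id, device, device_id, rank_id) follow the commonly documented v1.0 ranktable layout, and the exported variable names (RANK_TABLE_FILE, RANK_SIZE, DEVICE_ID, RANK_ID) are those typically read by Ascend training frameworks; both are assumptions to verify against your framework's documentation.

```python
import json
import os

def export_rank_env(ranktable_path, server_ip):
    """Derive per-process environment variables from the generated hccl.json.

    Field names assume the ranktable v1.0 layout; adjust for your version.
    """
    with open(ranktable_path) as f:
        table = json.load(f)

    local_devices = []
    rank_size = 0
    for server in table["server_list"]:
        for dev in server["device"]:
            rank_size += 1
            if server["server_id"] == server_ip:
                local_devices.append(dev)

    os.environ["RANK_TABLE_FILE"] = ranktable_path
    os.environ["RANK_SIZE"] = str(rank_size)
    # One training process per local device; this sketch exports the first.
    if local_devices:
        os.environ["DEVICE_ID"] = local_devices[0]["device_id"]
        os.environ["RANK_ID"] = local_devices[0]["rank_id"]
    return local_devices
```

A launcher script would call this once per worker process, then exec the training command, so the framework finds the variables already set.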

Basic Process of NPU Training Jobs Not Using Volcano as the Scheduler

Use a resource type such as Job or Deployment to create a training job. For details about how to create Job and Deployment resources, see the official Kubernetes examples.
  • Ensure job consistency, especially in distributed scenarios. This prevents resource waste caused by job execution failures due to insufficient resources, and also prevents job failures or performance degradation caused by non-affinity among the two or four devices allocated on a single node.
  • Change the NPU resource name and quantity in the request and limit. View the node details in the Kubernetes cluster to determine the NPU resource types available on the node, such as physical NPUs or virtual NPUs created by compute splitting.
  • Currently, only one container in a pod can use NPUs.
  • Mount the driver directories as required:
    • If the Ascend Device Plugin startup parameter useAscendDocker is set to true and Ascend Docker Runtime is installed and in effect, the driver directories installed in /usr/local/Ascend are mounted automatically.
    • If useAscendDocker is set to false, the driver directories installed in /usr/local/Ascend are not mounted automatically; mount them into the container manually.
  • You need to mount dataset and model code paths, and add other required content, such as environment variables.
  • You need to set the container startup command, which corresponds to the command field in the YAML file.
  • You need to generate an HCCL configuration file for each pod of the training job. For a distributed training job, ensure that the file content is identical across all pods in the group. Before the training job runs, the file is parsed to set the environment variables the job needs.
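Because no controller generates the ranktable in this flow, each pod (or an init step shared by the group) must produce the identical file itself. A minimal sketch for a single-server ranktable, using the commonly documented v1.0 field layout (an assumption; check the HCCL documentation for your CANN version):

```python
import json

def build_ranktable(server_ip, device_ips):
    """Build a minimal single-server hccl.json (ranktable v1.0 layout).

    device_ips maps device_id -> device_ip. The field names follow the
    commonly documented v1.0 format; verify them against the HCCL
    documentation for your CANN version.
    """
    devices = [
        {"device_id": str(dev_id), "device_ip": ip, "rank_id": str(rank)}
        for rank, (dev_id, ip) in enumerate(sorted(device_ips.items()))
    ]
    return {
        "version": "1.0",
        "server_count": "1",
        "server_list": [{"server_id": server_ip, "device": devices}],
        "status": "completed",
    }
```

Each pod would serialize this dict with json.dump to the same path and point the training job at that file before startup.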