NPU Training Job

NPU training jobs in "Typical Scenarios" are classified into the following two types, depending on whether Volcano is used as the scheduler:

Basic Process of NPU Training Jobs Using Volcano as the Scheduler

  1. Training jobs require an HCCL configuration file (the ranktable file, also called hccl.json). Create the following ConfigMap and wait for the configuration file to be generated. The rings-config- name prefix and the ring-controller.atlas: ascend-910 label are fixed and cannot be modified. The following is an example. Pay attention to the ConfigMap name: after the rings-config- prefix is removed, the remaining name is mindx-dls-test, which is used as the job name in this example.
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: rings-config-mindx-dls-test     # the rings-config- prefix is mandatory; the remainder is the job name
      namespace: vcjob                      # must match the namespace of the training job
      labels:
        ring-controller.atlas: ascend-910   # fixed label; cannot be modified
    data:
      hccl.json: |
        {
            "status":"initializing"
        }
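A training container typically has to wait for this file to be populated before launching. As a minimal sketch (assuming the MindX DL convention that the HCCL controller rewrites the ConfigMap with "status": "completed" once ranks are assigned; verify the exact status values for your version), the container entrypoint could poll the mounted file:

```python
import json
import time

def wait_for_ranktable(path, timeout=300, interval=5):
    """Poll the mounted hccl.json until it is generated.

    Assumes the status field changes from "initializing" to "completed"
    once the HCCL controller has filled in the ranktable; verify this
    against your MindX DL version.
    """
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with open(path) as f:
                table = json.load(f)
            if table.get("status") == "completed":
                return table
        except (OSError, ValueError):
            pass  # file not mounted yet, or caught mid-rewrite
        time.sleep(interval)
    raise TimeoutError(f"ranktable at {path} not ready within {timeout}s")
```

With the volume mount shown later in this section, the file would be polled at /user/serverid/devindex/config/hccl.json.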
  2. Create a training job of the vcjob (Volcano Job) or Deployment type.
    • vcjob resource example
      apiVersion: batch.volcano.sh/v1alpha1
      kind: Job
      metadata:
        name: mindx-dls-test
        namespace: vcjob
        labels:
          ring-controller.atlas: ascend-910
      spec:
        minAvailable: 1
        schedulerName: volcano
        maxRetry: 3
        queue: default
        tasks:
        - name: "default-test"
          replicas: 1
          template:
            metadata:
              labels:
                app: tf
                ring-controller.atlas: ascend-910
            spec:
              containers:
              - image: tf_arm64:b030
                imagePullPolicy: IfNotPresent
                name: tf
                env:
                - name: mindx-dls-test
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.name
                - name: XDL_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.hostIP
                command: xxxxxxx
                resources:
                  requests:
                    huawei.com/Ascend910: 8
                  limits:
                    huawei.com/Ascend910: 8
                volumeMounts:
                - name: ascend-910-config
                  mountPath: /user/serverid/devindex/config
              nodeSelector:
                host-arch: huawei-arm
              volumes:
              - name: ascend-910-config
                configMap:
                  name: rings-config-mindx-dls-test
              restartPolicy: OnFailure
      • The value of metadata.name must be the same as the job name derived in step 1. In this example, the value is mindx-dls-test.
      • It is recommended that the values of minAvailable and replicas be the same.
      • Both the job's metadata.labels and the pod template labels under spec.tasks must contain the ring-controller.atlas: ascend-910 label.
      • The schedulerName field must be set to volcano.
      • The NPU resource type must be specified in both the resource request and limit, and the two quantities must be the same. View the node details in the Kubernetes cluster to determine the NPU resource types available on the node, such as physical NPUs or virtual NPUs created by compute splitting.
      • You must mount the ConfigMap generated in step 1 into the container as a file.
      • By default, nodeSelector supports only the key-value pairs configured in the YAML file used when Volcano is started, and the host-arch label must be used. For details about how to add a user-defined selector, see Volcano Scheduling Configuration.
      • Currently, only one container in a pod can use NPUs.
      • Mount the driver directories as required:
        • If the Ascend Device Plugin startup parameter useAscendDocker is set to true and Ascend Docker Runtime is installed and in effect, the driver directories installed in /usr/local/Ascend are mounted automatically.
        • If useAscendDocker is set to false, the driver directories installed in /usr/local/Ascend are not mounted automatically; mount them into the container manually.
      • You need to mount dataset and model code paths, and add other required content, such as environment variables.
      • You need to set the container startup command, which corresponds to the command field in the YAML file. In addition, you need to parse the mounted ConfigMap before starting the job to set necessary environment variables for the training job.
    • Deployment resource example
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: mindx-dls-test
        labels:
          app: tf
          ring-controller.atlas: ascend-910
        namespace: vcjob
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: tf
        template:
          metadata:
            labels:
              app: tf
              ring-controller.atlas: ascend-910
              deploy-name: mindx-dls-test
          spec:
            schedulerName: volcano
            nodeSelector:
              host-arch: huawei-arm
            containers:
              - image: tf_arm64:b030
                imagePullPolicy: IfNotPresent
                name: tf
                env:
                - name: mindx-dls-test
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.name
                - name: XDL_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.hostIP
                command: xxx
                resources:
                  requests:
                    huawei.com/Ascend910: 8
                  limits:
                    huawei.com/Ascend910: 8
                volumeMounts:
                - name: ascend-910-config
                  mountPath: /user/serverid/devindex/config
            volumes:
            - name: ascend-910-config
              configMap:
                name: rings-config-mindx-dls-test
      • The value of metadata.name must be the same as the job name derived in step 1. In this example, the value is mindx-dls-test.
      • replicas indicates the number of pods, one per node. For a single-node job, the value is 1. For a multi-node distributed job, the value is the number of nodes.
      • Both metadata.labels and spec.template of the Deployment must contain the label ring-controller.atlas: ascend-910. In addition, spec.template must contain the label whose key is deploy-name and value is the job name.
      • For the remaining requirements, see the fourth bullet and the content after it under vcjob resource example.
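The ConfigMap parsing mentioned above (parsing the mounted hccl.json before launching the training process) can be sketched as follows. The ranktable field names (server_list, server_id, device, device_id, rank_id) follow the commonly documented v1.0 ranktable layout, and the exported variable names (RANK_TABLE_FILE, RANK_SIZE, DEVICE_ID, RANK_ID) are those typically read by Ascend training frameworks; both are assumptions to verify against your framework's documentation.

```python
import json
import os

def export_rank_env(ranktable_path, server_ip):
    """Derive per-process environment variables from the generated hccl.json.

    Field names assume the ranktable v1.0 layout; adjust for your version.
    """
    with open(ranktable_path) as f:
        table = json.load(f)

    local_devices = []
    rank_size = 0
    for server in table["server_list"]:
        for dev in server["device"]:
            rank_size += 1
            if server["server_id"] == server_ip:
                local_devices.append(dev)

    os.environ["RANK_TABLE_FILE"] = ranktable_path
    os.environ["RANK_SIZE"] = str(rank_size)
    # One training process per local device; this sketch exports the first.
    if local_devices:
        os.environ["DEVICE_ID"] = local_devices[0]["device_id"]
        os.environ["RANK_ID"] = local_devices[0]["rank_id"]
    return local_devices
```

A launcher script would call this once per worker process, then exec the training command, so the framework finds the variables already set.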

Basic Process of NPU Training Jobs Not Using Volcano as the Scheduler

Use a resource type such as Job or Deployment to create a training job. For details about how to create Job and Deployment resources, see the official Kubernetes examples.
  • Ensure job consistency, especially in distributed scenarios. This prevents resource waste caused by job execution failures due to insufficient resources, and also prevents job failures or performance degradation caused by non-affinity among the two or four devices allocated on a single node.
  • Change the NPU resource name and quantity in the request and limit. View the node details in the Kubernetes cluster to determine the NPU resource types available on the node, such as physical NPUs or virtual NPUs created by compute splitting.
  • Currently, only one container in a pod can use NPUs.
  • Mount the driver directories as required:
    • If the Ascend Device Plugin startup parameter useAscendDocker is set to true and Ascend Docker Runtime is installed and in effect, the driver directories installed in /usr/local/Ascend are mounted automatically.
    • If useAscendDocker is set to false, the driver directories installed in /usr/local/Ascend are not mounted automatically; mount them into the container manually.
  • You need to mount dataset and model code paths, and add other required content, such as environment variables.
  • You need to set the container startup command, which corresponds to the command field in the YAML file.
  • You need to generate an HCCL configuration file for each pod of the training job. For a distributed training job, ensure that the file content is identical across all pods in the group. Before the training job runs, the file is parsed to set the environment variables the job needs.
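Because no controller generates the ranktable in this flow, each pod (or an init step shared by the group) must produce the identical file itself. A minimal sketch for a single-server ranktable, using the commonly documented v1.0 field layout (an assumption; check the HCCL documentation for your CANN version):

```python
import json

def build_ranktable(server_ip, device_ips):
    """Build a minimal single-server hccl.json (ranktable v1.0 layout).

    device_ips maps device_id -> device_ip. The field names follow the
    commonly documented v1.0 format; verify them against the HCCL
    documentation for your CANN version.
    """
    devices = [
        {"device_id": str(dev_id), "device_ip": ip, "rank_id": str(rank)}
        for rank, (dev_id, ip) in enumerate(sorted(device_ips.items()))
    ]
    return {
        "version": "1.0",
        "server_count": "1",
        "server_list": [{"server_id": server_ip, "device": devices}],
        "status": "completed",
    }
```

Each pod would serialize this dict with json.dump to the same path and point the training job at that file before startup.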