NPU Training Job
NPU training jobs in "Typical Scenarios" are classified into the following types:
- Using Volcano as the scheduler: See Basic Process of NPU Training Jobs Using Volcano as the Scheduler.
- Not using Volcano as the scheduler: See Basic Process of NPU Training Jobs Not Using Volcano as the Scheduler.
Basic Process of NPU Training Jobs Using Volcano as the Scheduler
- Training jobs require the HCCL configuration file (ranktable file, also called the hccl.json file). Create the following ConfigMap resource and wait for the configuration file to be generated. Do not modify the fixed fields in the example. Pay attention to the ConfigMap name: after the prefix rings-config- is removed, the remaining name is mindx-dls-test, which is used as the job name in this example.
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: rings-config-mindx-dls-test
  namespace: vcjob
  labels:
    ring-controller.atlas: ascend-910
data:
  hccl.json: |
    {
        "status":"initializing"
    }
```
- Create a job of the vcjob or Deployment type.
- vcjob resource example
```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mindx-dls-test
  namespace: vcjob
  labels:
    ring-controller.atlas: ascend-910
spec:
  minAvailable: 1
  schedulerName: volcano
  maxRetry: 3
  queue: default
  tasks:
  - name: "default-test"
    replicas: 1
    template:
      metadata:
        labels:
          app: tf
          ring-controller.atlas: ascend-910
      spec:
        containers:
        - image: tf_arm64:b030
          imagePullPolicy: IfNotPresent
          name: tf
          env:
          - name: mindx-dls-test
            valueFrom:
              fieldRef:
                fieldPath: metadata.name
          - name: XDL_IP
            valueFrom:
              fieldRef:
                fieldPath: status.hostIP
          command: xxxxxxx
          resources:
            requests:
              huawei.com/Ascend910: 8
            limits:
              huawei.com/Ascend910: 8
          volumeMounts:
          - name: ascend-910-config
            mountPath: /user/serverid/devindex/config
        nodeSelector:
          host-arch: huawei-arm
        volumes:
        - name: ascend-910-config
          configMap:
            name: rings-config-mindx-dls-test
        restartPolicy: OnFailure
```
- The value of metadata.name must be the same as the job name specified in step 1, that is, the ConfigMap name without the rings-config- prefix. In this example, the value is mindx-dls-test.
- It is recommended that the values of minAvailable and replicas be the same.
- Both metadata.labels and spec.tasks of a job must contain the ring-controller.atlas: ascend-910 label.
- The value of schedulerName must be volcano.
- The NPU resource type must be specified in both the resource request and limit, and the requested and limited quantities must be equal. You can view the node details in the Kubernetes cluster to determine the NPU resource types available on the node, for example, physical devices or the NPUs obtained after computing power allocation.
- You must mount the ConfigMap created in step 1 to the container as a file.
- By default, nodeSelector supports only the key-value pairs configured in the Volcano startup YAML file, and the host-arch label must be used. For details about how to add a user-defined selector, see Volcano Scheduling Configuration.
- Currently, only one container in a pod can use NPUs.
- Mount driver-related directories. They are mounted automatically in either of the following cases; if neither case applies, you must mount the driver-related directories manually:
- The startup parameter useAscendDocker of the Ascend Device Plugin is set to true, and the Ascend Docker Runtime has been installed and takes effect. The driver-related directories installed in /usr/local/Ascend are mounted automatically.
- The startup parameter useAscendDocker of the Ascend Device Plugin is set to false. The driver-related directories installed in /usr/local/Ascend are mounted automatically.
- You need to mount dataset and model code paths, and add other required content, such as environment variables.
- You need to set the container startup command, which corresponds to the command field in the YAML file. In addition, before the training job starts, parse the mounted ConfigMap to set the environment variables required by the training job.
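The ConfigMap-parsing step in the last note can be sketched as follows. This is a minimal illustration, not the official tooling: the mount path matches the volumeMounts entry in the example above, the "completed" status value and the RANK_TABLE_FILE/RANK_SIZE variable names follow common Ascend training conventions and should be verified against your framework's documentation.

```python
import json
import os
import time


def wait_for_ranktable(path, timeout=300, poll=5):
    """Poll the hccl.json mounted from the ConfigMap until the
    controller has filled it in (status changes from "initializing")."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            with open(path) as f:
                table = json.load(f)
            if table.get("status") == "completed":
                return table
        except (OSError, ValueError):
            pass  # file not mounted yet, or partially written
        time.sleep(poll)
    raise TimeoutError(f"ranktable at {path} never reached 'completed'")


def export_training_env(path="/user/serverid/devindex/config/hccl.json"):
    """Set environment variables for the training script.
    RANK_TABLE_FILE and RANK_SIZE are conventional names used here
    for illustration; adapt them to your training framework."""
    table = wait_for_ranktable(path)
    os.environ["RANK_TABLE_FILE"] = path
    # Derive the world size by counting devices across all servers.
    rank_size = sum(len(s.get("device", [])) for s in table.get("server_list", []))
    os.environ["RANK_SIZE"] = str(rank_size)
    return table
```

A container entrypoint would call export_training_env() first and only then exec the actual training command.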
- Deployment resource example
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mindx-dls-test
  labels:
    app: tf
    ring-controller.atlas: ascend-910
  namespace: vcjob
spec:
  replicas: 1
  selector:
    matchLabels:
      app: tf
  template:
    metadata:
      labels:
        app: tf
        ring-controller.atlas: ascend-910
        deploy-name: mindx-dls-test
    spec:
      schedulerName: volcano
      nodeSelector:
        host-arch: huawei-arm
      containers:
      - image: tf_arm64:b030
        imagePullPolicy: IfNotPresent
        name: tf
        env:
        - name: mindx-dls-test
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: XDL_IP
          valueFrom:
            fieldRef:
              fieldPath: status.hostIP
        command: xxx
        resources:
          requests:
            huawei.com/Ascend910: 8
          limits:
            huawei.com/Ascend910: 8
        volumeMounts:
        - name: ascend-910-config
          mountPath: /user/serverid/devindex/config
      volumes:
      - name: ascend-910-config
        configMap:
          name: rings-config-mindx-dls-test
```
- The value of metadata.name must be the same as the job name specified in step 1, that is, the ConfigMap name without the rings-config- prefix. In this example, the value is mindx-dls-test.
- replicas indicates the number of nodes. For a single-node job, the value is 1. For a multi-node distributed job, the value is the actual number of nodes.
- Both metadata.labels and spec.template of the Deployment must contain the label ring-controller.atlas: ascend-910. In addition, spec.template must contain the label whose key is deploy-name and value is the job name.
- For other requirements (scheduler name, NPU resources, ConfigMap mounting, driver-related directories, and the startup command), see the notes under the vcjob resource example.
Basic Process of NPU Training Jobs Not Using Volcano as the Scheduler
Use a resource type, such as Job, Deployment, or other resource types, to create a training job. For details about how to create Job and Deployment resources, see the official examples of Kubernetes.
- Ensure job consistency, especially in distributed scenarios. This prevents resource waste when a job fails to run because of insufficient resources, and prevents job failures or performance deterioration when the two or four devices allocated on a single node are not affine to each other.
- Change the NPU resource name and quantity in the request and limit. You can view the node details in the Kubernetes cluster to determine the NPU resource types that can be used by the node, such as the devices and NPUs after computing power allocation.
- Currently, only one container in a pod can use NPUs.
- Mount driver-related directories. They are mounted automatically in either of the following cases; if neither case applies, you must mount the driver-related directories manually:
- The startup parameter useAscendDocker of the Ascend Device Plugin is set to true, and the Ascend Docker Runtime has been installed and takes effect. The driver-related directories installed in /usr/local/Ascend are mounted automatically.
- The startup parameter useAscendDocker of the Ascend Device Plugin is set to false. The driver-related directories installed in /usr/local/Ascend are mounted automatically.
- You need to mount dataset and model code paths, and add other required content, such as environment variables.
- You need to set the container startup command, which corresponds to the command field in the YAML file.
- You need to generate an HCCL configuration file for each pod of a training job. For a distributed training job, ensure that the file content is identical across all pods of the job. Before the training job runs, the file is parsed to set the environment variables required by the training job.
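When Volcano does not generate the ranktable for you, the last step above means producing an hccl.json yourself. The sketch below builds a minimal single-node file; the field names follow the commonly used Ascend ranktable v1.0 layout, and the server ID and device IPs are illustrative values. Verify the exact schema against the HCCL documentation for your CANN version before relying on it.

```python
import json


def build_ranktable(server_id, device_ips, out_path):
    """Write a minimal single-node HCCL ranktable (hccl.json).

    server_id: host IP of the node (illustrative).
    device_ips: NPU device IPs in rank order (illustrative).
    """
    devices = [
        {"device_id": str(i), "device_ip": ip, "rank_id": str(i)}
        for i, ip in enumerate(device_ips)
    ]
    table = {
        "version": "1.0",
        "status": "completed",  # marks the file as ready for training
        "server_count": "1",
        "server_list": [{"server_id": server_id, "device": devices}],
    }
    with open(out_path, "w") as f:
        json.dump(table, f, indent=4)
    return table
```

For a distributed job, extend server_list with one entry per node, update server_count, and make sure every pod receives an identical copy of the file.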
Parent topic: Quick Start