Examples
Obtaining the YAML File of a Job
Visit the Gitee repository to download the YAML file of a job as required.
| YAML Name | Description |
|---|---|
| train-volcano.yaml | Volcano is used as the scheduler in the training environment. |
| train-no-volcano.yaml | Other schedulers are used in the training environment. |
| infer-volcano.yaml | Volcano is used as the scheduler in the inference environment. |
| infer-no-volcano.yaml | Other schedulers are used in the inference environment. |
| infer-310p-1usoc-volcano.yaml | Volcano is used as the scheduler in the inference environment. This file is applicable only to the Atlas 200I Soc A1 core board. |
| infer-310p-1usoc-no-volcano.yaml | Other schedulers are used in the inference environment. This file is applicable only to the Atlas 200I Soc A1 core board. |
If the device management scenario was selected when the cluster scheduling components were deployed, do not use a YAML file whose name contains "volcano". In all other installation scenarios, use only a YAML file whose name contains "volcano".
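The naming convention in the table can be expressed programmatically. The following is a small sketch (the variable names are illustrative) that derives the file name from the environment type and the scheduler in use:

```shell
# Derive the YAML file name from the naming convention in the table.
JOB_TYPE="train"      # "train" for a training environment, "infer" for inference
SCHEDULER="volcano"   # scheduler deployed with the cluster scheduling components

if [ "${SCHEDULER}" = "volcano" ]; then
  YAML_NAME="${JOB_TYPE}-volcano.yaml"
else
  YAML_NAME="${JOB_TYPE}-no-volcano.yaml"
fi
echo "${YAML_NAME}"   # prints train-volcano.yaml
```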
Modifying the YAML Configuration
- Modify nodeSelector based on the architecture of the node on which the job is to be deployed.
  - If the job is deployed on an x86 node, change nodeSelector to host-arch: huawei-x86. An example is as follows:

    ```yaml
    ...
    spec:
      template:
        spec:
          nodeSelector:
            host-arch: huawei-x86
          containers:
          - image: ascend-k8sdeviceplugin:v3.0.0
    ...
    ```

  - If the job is deployed on an AArch64 node, change nodeSelector to host-arch: huawei-arm. An example is as follows:

    ```yaml
    ...
    spec:
      template:
        spec:
          nodeSelector:
            host-arch: huawei-arm
          containers:
          - image: ascend-k8sdeviceplugin:v3.0.0
    ...
    ```
- Change the image version used by the job based on the installed cluster scheduling component version.

  ```yaml
  ...
  spec:
    template:
      spec:
        nodeSelector:
          host-arch: huawei-arm
        containers:
        - image: ascend-k8sdeviceplugin:v3.0.0
  ...
  ```

- (Optional) Skip this step if you are verifying the training environment, or if you do not need to change the processor type in the YAML file used by the Atlas 200I Soc A1 core board. If Atlas inference products or Atlas 200/300/500 inference products are used in the inference environment, change the processor type in the inference YAML file based on the node type.

  ```yaml
  ...
  containers:
  - image: ascend-k8sdeviceplugin:v3.0.0
    imagePullPolicy: IfNotPresent
    name: infer-env-quick-validation
    command: [ "/bin/bash", "-c", "npu-smi info" ]
    resources:
      requests:
        huawei.com/Ascend310: 1  # For Atlas inference products, change Ascend310 to Ascend310P.
      limits:
        huawei.com/Ascend310: 1  # For Atlas inference products, change Ascend310 to Ascend310P.
  ...
  ```

- (Optional) Skip this step if Volcano is not used as the scheduler. Otherwise, modify replicas to check the driver status of multiple nodes.
  - Refer to the following example to modify the train-volcano.yaml file of a training job:

    ```yaml
    apiVersion: batch.volcano.sh/v1alpha1
    kind: Job
    metadata:
      name: mindx-dls-test
    spec:
      minAvailable: 1          # Its value must be the same as that of replicas.
      schedulerName: volcano
      maxRetry: 1
      queue: default
      tasks:
      - name: "default-test"
        replicas: 1            # Number of nodes
        template:
          spec:
            containers:
            - image: ascend-k8sdeviceplugin:v3.0.0
              name: teswts
              imagePullPolicy: IfNotPresent
              command: ["/bin/bash", "-c", "npu-smi info"]
              resources:
                requests:
                  huawei.com/Ascend910: 1  # If replicas is greater than 1, the number of NPUs can only be 8 for the Atlas 800 training server and 2 for the server (with Atlas 300T training cards).
                limits:
                  huawei.com/Ascend910: 1  # If replicas is greater than 1, the number of NPUs can only be 8 for the Atlas 800 training server and 2 for the server (with Atlas 300T training cards).
              volumeMounts:
    ...
    ```

  - Refer to the following example to modify the infer-volcano.yaml file of an inference job:

    ```yaml
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: infer-env-quick-validation
    spec:
      replicas: 1              # Number of nodes
      selector:
        matchLabels:
          app: infers
      template:
        metadata:
          labels:
            app: infers
        spec:
          schedulerName: volcano
          nodeSelector:
            host-arch: huawei-arm
          containers:
          - image: ascend-k8sdeviceplugin:v3.0.0
            imagePullPolicy: IfNotPresent
            name: infer-env-quick-validation
            command: [ "/bin/bash", "-c", "npu-smi info" ]
            resources:
              requests:
                huawei.com/Ascend310: 1
              limits:
                huawei.com/Ascend310: 1
    ...
    ```
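The YAML edits described in this section can be sanity-checked before the job is delivered. The following is a minimal sketch; the file path /tmp/train-volcano.yaml and the fragment written into it are illustrative stand-ins for a real train-volcano.yaml, not the full file:

```shell
# Build an illustrative fragment of a train-volcano.yaml-style file.
cat > /tmp/train-volcano.yaml <<'EOF'
spec:
  minAvailable: 1
  schedulerName: volcano
  tasks:
  - name: "default-test"
    replicas: 1
    template:
      spec:
        nodeSelector:
          host-arch: huawei-arm
EOF

# Switch the job to an x86 node (the nodeSelector edit described above).
sed -i 's/host-arch: huawei-arm/host-arch: huawei-x86/' /tmp/train-volcano.yaml

# minAvailable must be the same as replicas; verify after editing.
MIN=$(awk '/minAvailable:/ {print $2}' /tmp/train-volcano.yaml)
REP=$(awk '/replicas:/ {print $2}' /tmp/train-volcano.yaml)
if [ "${MIN}" = "${REP}" ]; then
  echo "minAvailable matches replicas (${MIN})"
else
  echo "mismatch: minAvailable=${MIN}, replicas=${REP}" >&2
fi
```

Run the same checks against the real file before delivering the job so that mismatched values are caught before the scheduler rejects or stalls the job.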
Delivering a Job
Run the following command on the master node:
```shell
kubectl apply -f {job_yaml}
```
Checking the Result
Run the following command on the master node:

```shell
kubectl logs {pod_name}
```

Example:

```shell
kubectl logs infer-env-quick-validation-c8f6d6897-n9fbf
```
If information similar to the following is displayed, the driver is properly installed on the node where the job is running. By default, the YAML files obtained in Obtaining the YAML File of a Job use only one NPU. You can change the number as required.
- Example of a training job
  ```
  +-------------------------------------------------------------------------------------------+
  | npu-smi 22.0.4                   Version: 22.0.4                                          |
  +----------------------+---------------+----------------------------------------------------+
  | NPU   Name           | Health        | Power(W)  Temp(C)            Hugepages-Usage(page) |
  | Chip                 | Bus-Id        | AICore(%) Memory-Usage(MB)   HBM-Usage(MB)         |
  +======================+===============+====================================================+
  | 0     910A           | OK            | 71.3      47                 15    / 15            |
  | 0                    | 0000:61:00.0  | 0         2940 / 15071       30738 / 32768         |
  +======================+===============+====================================================+
  ```
- Example of an inference job on the Ascend 310 AI Processor
  ```
  +--------------------------------------------------------------------------------------------------------+
  | npu-smi 22.0.4                        Version: 22.0.4                                                   |
  +-------------------------------+-----------------+------------------------------------------------------+
  | NPU   Name                    | Health          | Power(W)  Temp(C)            Hugepages-Usage(page)   |
  | Chip  Device                  | Bus-Id          | AICore(%) Memory-Usage(MB)                           |
  +===============================+=================+======================================================+
  | 0     310                     | OK              | 12.8      49                 0     / 969             |
  | 0     0                       | 0000:04:00.0    | 0         622  / 7759                                |
  +===============================+=================+======================================================+
  ```
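When many pods are involved, checking the Health column by eye becomes tedious. The following is a sketch of scripting the check, assuming the pod log has been saved to a file first (the file path and the sample line below are illustrative, mirroring the inference output above):

```shell
# Save the pod output first, e.g.:  kubectl logs {pod_name} > /tmp/npu-smi.log
# Here a sample line stands in for real output.
cat > /tmp/npu-smi.log <<'EOF'
| 0     310            | OK            | 12.8      49            0    / 969             |
EOF

# The Health column should read OK for every listed NPU.
if grep -q '| OK ' /tmp/npu-smi.log; then
  echo "NPU health: OK"
else
  echo "NPU health problem detected" >&2
fi
```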
If an inference job uses a YAML file whose name contains "volcano", the pod will be restarted repeatedly, and the NPU remains occupied by the pod. In this case, delete the job promptly.
Deleting a Job
Run the following command on the master node to delete a job:
```shell
kubectl delete -f {job_yaml}
```