Examples

Obtaining the YAML File of a Job

Visit the Gitee repository to download the YAML file of a job as required.

Table 1 YAML files

YAML Name                          Description
train-volcano.yaml                 Volcano is used as the scheduler in the training environment.
train-no-volcano.yaml              A scheduler other than Volcano is used in the training environment.
infer-volcano.yaml                 Volcano is used as the scheduler in the inference environment.
infer-no-volcano.yaml              A scheduler other than Volcano is used in the inference environment.
infer-310p-1usoc-volcano.yaml      Volcano is used as the scheduler in the inference environment. Applicable only to the Atlas 200I SoC A1 core board.
infer-310p-1usoc-no-volcano.yaml   A scheduler other than Volcano is used in the inference environment. Applicable only to the Atlas 200I SoC A1 core board.

If the device management scenario was selected when the cluster scheduling components were deployed, the YAML files whose names contain "volcano" cannot be used. In all other installation scenarios, only the YAML files whose names contain "volcano" can be used.

Modifying the YAML Configuration

  1. Modify nodeSelector based on the node architecture to which the job needs to be deployed.
    • If the job is deployed on an x86 node, change nodeSelector to host-arch: huawei-x86. An example is as follows:
      ...
      spec:
        template:
          spec:
            nodeSelector:
              host-arch: huawei-x86                   
            containers:
            - image: ascend-k8sdeviceplugin:v3.0.0
      ...
    • If the job is deployed on an AArch64 node, change nodeSelector to host-arch: huawei-arm. An example is as follows:
      ...
      spec:
        template:
          spec:
            nodeSelector:
              host-arch: huawei-arm                   
            containers:
            - image: ascend-k8sdeviceplugin:v3.0.0
      ...
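Step 1 above can be sketched as a short shell snippet. This is a minimal, hypothetical example: the file path is made up for illustration, and the sed pattern assumes the "host-arch:" key formatting shown above.

```shell
# Hypothetical sample file mimicking the nodeSelector section of a job YAML.
cat > /tmp/train-sample.yaml <<'EOF'
spec:
  template:
    spec:
      nodeSelector:
        host-arch: huawei-x86
      containers:
      - image: ascend-k8sdeviceplugin:v3.0.0
EOF

# Retarget the job from x86 nodes to AArch64 nodes
# (use huawei-x86 for x86 nodes, huawei-arm for AArch64 nodes):
sed -i 's/host-arch: huawei-x86/host-arch: huawei-arm/' /tmp/train-sample.yaml
grep 'host-arch' /tmp/train-sample.yaml
```

On a live cluster, `kubectl get nodes --show-labels` lists each node's labels, which tells you which value the job needs.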
  2. Change the image version used by the job to match the installed cluster scheduling component version. An example is as follows:
    ...
    spec:
      template:
        spec:
          nodeSelector:
            host-arch: huawei-arm                  
          containers:
          - image: ascend-k8sdeviceplugin:v3.0.0
    ...
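Step 2 can also be scripted. In this sketch, the file path is hypothetical and "v5.0.0" is a placeholder, not a recommendation; substitute the version of the cluster scheduling component you actually installed.

```shell
# NEW_VERSION is a placeholder; use your installed component version.
NEW_VERSION=v5.0.0
cat > /tmp/job-sample.yaml <<'EOF'
          containers:
          - image: ascend-k8sdeviceplugin:v3.0.0
EOF

# Rewrite the image tag regardless of the old version number:
sed -i "s#ascend-k8sdeviceplugin:v[0-9.]*#ascend-k8sdeviceplugin:${NEW_VERSION}#" /tmp/job-sample.yaml
grep 'image:' /tmp/job-sample.yaml
```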
  3. (Optional) Skip this step if you are verifying the training environment, or if you are using a YAML file for the Atlas 200I SoC A1 core board and do not need to change the processor type. If Atlas inference products or Atlas 200/300/500 inference products are used in the inference environment, change the processor type in the inference YAML file based on the node type.
    ...
          containers:
          - image: ascend-k8sdeviceplugin:v3.0.0                  
            imagePullPolicy: IfNotPresent
            name: infer-env-quick-validation
            command: [ "/bin/bash", "-c", "npu-smi info" ]
            resources:
              requests:
                huawei.com/Ascend310: 1    # For the Atlas inference products, change Ascend 310 to Ascend 310P.
              limits:
                huawei.com/Ascend310: 1    # For the Atlas inference products, change Ascend 310 to Ascend 310P.
    ...    
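For step 3, the change amounts to replacing the Ascend310 resource key with Ascend310P in both requests and limits. A sketch using a hypothetical temp file:

```shell
# Sample resources section mimicking the inference YAML above.
cat > /tmp/infer-sample.yaml <<'EOF'
            resources:
              requests:
                huawei.com/Ascend310: 1
              limits:
                huawei.com/Ascend310: 1
EOF

# Switch from Ascend 310 to Ascend 310P; both requests and limits must change:
sed -i 's#huawei.com/Ascend310:#huawei.com/Ascend310P:#g' /tmp/infer-sample.yaml
grep 'huawei.com' /tmp/infer-sample.yaml
```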
  4. (Optional) Skip this step if Volcano is not used as the scheduler. Otherwise, modify replicas to check the driver status on multiple nodes.
    • Refer to the following example to modify the train-volcano.yaml file of a training job:
      apiVersion: batch.volcano.sh/v1alpha1   
      kind: Job                               
      metadata:
        name: mindx-dls-test
      spec:
        minAvailable: 1       # Its value must be the same as that of replicas.
        schedulerName: volcano                
        maxRetry: 1
        queue: default
        tasks:
        - name: "default-test"
          replicas: 1        # Number of nodes
          template:
            spec:
              containers:
              - image: ascend-k8sdeviceplugin:v3.0.0       
                name: test
                imagePullPolicy: IfNotPresent
                command: ["/bin/bash", "-c", "npu-smi info"]
                resources:
                  requests:
                    huawei.com/Ascend910: 1        # If the value of replicas is greater than 1, the number of NPUs can only be 8 for the Atlas 800 training server and 2 for the server (with Atlas 300T training cards).
                  limits:
                    huawei.com/Ascend910: 1        # If the value of replicas is greater than 1, the number of NPUs can only be 8 for the Atlas 800 training server and 2 for the server (with Atlas 300T training cards).
                volumeMounts:
      ...
    • Refer to the following example to modify the infer-volcano.yaml file of an inference job:
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: infer-env-quick-validation
      spec:
        replicas: 1     # Number of nodes
        selector:
          matchLabels:
            app: infers
        template:
          metadata: 
            labels:
               app: infers
          spec:
            schedulerName: volcano
            nodeSelector:
              host-arch: huawei-arm           
            containers:
            - image: ascend-k8sdeviceplugin:v3.0.0                  
              imagePullPolicy: IfNotPresent
              name: infer-env-quick-validation
              command: [ "/bin/bash", "-c", "npu-smi info" ]
              resources:
                requests:
                  huawei.com/Ascend310: 1   
                limits:
                  huawei.com/Ascend310: 1     
      ...    
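For step 4's training YAML, replicas and minAvailable must stay equal, as the inline comments above require. A sketch that updates both together (the file path is hypothetical; NODES is the number of nodes you want to check):

```shell
# Scale a Volcano training job to NODES nodes, keeping minAvailable in sync.
NODES=2
cat > /tmp/volcano-sample.yaml <<'EOF'
spec:
  minAvailable: 1
  tasks:
  - name: "default-test"
    replicas: 1
EOF

# Update minAvailable and replicas in one pass:
sed -i "s/minAvailable: 1/minAvailable: ${NODES}/; s/replicas: 1/replicas: ${NODES}/" /tmp/volcano-sample.yaml
grep -E 'minAvailable|replicas' /tmp/volcano-sample.yaml
```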

Delivering a Job

Run the following command on the master node:

kubectl apply -f {job_yaml}

Checking the Result

Run the following command on the master node:

kubectl logs {pod_name}

Example:

kubectl logs infer-env-quick-validation-c8f6d6897-n9fbf

If information similar to the following is displayed, the driver is properly installed on the node where the job is running. By default, only one NPU is requested in the YAML files obtained in Obtaining the YAML File of a Job. Change the number based on the actual situation.

  • Example of a training job
    +-------------------------------------------------------------------------------------------+
    | npu-smi 22.0.4                              Version: 22.0.4                               |
    +----------------------+---------------+----------------------------------------------------+
    | NPU   Name           | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
    | Chip                 | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
    +======================+===============+====================================================+
    | 0     910A           | OK            | 71.3        47                15   / 15            |
    | 0                    | 0000:61:00.0  | 0           2940 / 15071      30738/ 32768         |
    +======================+===============+====================================================+
  • Example of an inference job on the Ascend 310 AI Processor
    +--------------------------------------------------------------------------------------------------------+
    | npu-smi 22.0.4                              Version: 22.0.4                                            |
    +-------------------------------+-----------------+------------------------------------------------------+
    | NPU     Name                  | Health          | Power(W)     Temp(C)           Hugepages-Usage(page) |
    | Chip    Device                | Bus-Id          | AICore(%)    Memory-Usage(MB)                        |
    +===============================+=================+======================================================+
    | 0       310                   | OK              | 12.8         49                0    / 969            |
    | 0       0                     | 0000:04:00.0    | 0            622  / 7759                             |
    +===============================+=================+======================================================+

    If an inference job uses a YAML file whose name contains "volcano", the pod is restarted repeatedly, so the NPU remains occupied by the pod. Delete the job promptly after verification.

Deleting a Job

Run the following command on the master node to delete a job:

kubectl delete -f {job_yaml}