Examples

Obtaining the YAML File of a Job

Visit the Gitee repository to download the YAML file of a job as required.

Table 1 YAML files

YAML Name                          Description
train-volcano.yaml                 Volcano is used as the scheduler in the training environment.
train-no-volcano.yaml              A scheduler other than Volcano is used in the training environment.
infer-volcano.yaml                 Volcano is used as the scheduler in the inference environment.
infer-no-volcano.yaml              A scheduler other than Volcano is used in the inference environment.
infer-310p-1usoc-volcano.yaml      Volcano is used as the scheduler in the inference environment. Applicable only to the Atlas 200I SoC A1 core board.
infer-310p-1usoc-no-volcano.yaml   A scheduler other than Volcano is used in the inference environment. Applicable only to the Atlas 200I SoC A1 core board.

If the device management scenario was selected when the cluster scheduling components were deployed, the YAML files whose names contain "volcano" cannot be used. In all other installation scenarios, only the YAML files whose names contain "volcano" can be used.

Modifying the YAML Configuration

  1. Modify nodeSelector based on the node architecture to which the job needs to be deployed.
    • If the job is deployed on an x86 node, change nodeSelector to host-arch: huawei-x86. An example is as follows:
      ...
      spec:
        template:
          spec:
            nodeSelector:
              host-arch: huawei-x86                   
            containers:
            - image: ascend-k8sdeviceplugin:v3.0.0
      ...
    • If the job is deployed on an AArch64 node, change nodeSelector to host-arch: huawei-arm. An example is as follows:
      ...
      spec:
        template:
          spec:
            nodeSelector:
              host-arch: huawei-arm                   
            containers:
            - image: ascend-k8sdeviceplugin:v3.0.0
      ...
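Step 1 above can be sketched as a short shell snippet. This is a minimal, hypothetical example: the file path is made up for illustration, and the sed pattern assumes the "host-arch:" key formatting shown above.

```shell
# Hypothetical sample file mimicking the nodeSelector section of a job YAML.
cat > /tmp/train-sample.yaml <<'EOF'
spec:
  template:
    spec:
      nodeSelector:
        host-arch: huawei-x86
      containers:
      - image: ascend-k8sdeviceplugin:v3.0.0
EOF

# Retarget the job from x86 nodes to AArch64 nodes
# (use huawei-x86 for x86 nodes, huawei-arm for AArch64 nodes):
sed -i 's/host-arch: huawei-x86/host-arch: huawei-arm/' /tmp/train-sample.yaml
grep 'host-arch' /tmp/train-sample.yaml
```

On a live cluster, `kubectl get nodes --show-labels` lists each node's labels, which tells you which value the job needs.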
  2. Change the image version used by the job to match the installed cluster scheduling component version. An example is as follows:
    ...
    spec:
      template:
        spec:
          nodeSelector:
            host-arch: huawei-arm                  
          containers:
          - image: ascend-k8sdeviceplugin:v3.0.0
    ...
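Step 2 can also be scripted. In this sketch, the file path is hypothetical and "v5.0.0" is a placeholder, not a recommendation; substitute the version of the cluster scheduling component you actually installed.

```shell
# NEW_VERSION is a placeholder; use your installed component version.
NEW_VERSION=v5.0.0
cat > /tmp/job-sample.yaml <<'EOF'
          containers:
          - image: ascend-k8sdeviceplugin:v3.0.0
EOF

# Rewrite the image tag regardless of the old version number:
sed -i "s#ascend-k8sdeviceplugin:v[0-9.]*#ascend-k8sdeviceplugin:${NEW_VERSION}#" /tmp/job-sample.yaml
grep 'image:' /tmp/job-sample.yaml
```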
  3. (Optional) Skip this step if you are verifying the training environment, or if you are using a YAML file for the Atlas 200I SoC A1 core board and do not need to change the processor type. If Atlas inference products or Atlas 200/300/500 inference products are used in the inference environment, change the processor type in the inference YAML file based on the node type.
    ...
          containers:
          - image: ascend-k8sdeviceplugin:v3.0.0                  
            imagePullPolicy: IfNotPresent
            name: infer-env-quick-validation
            command: [ "/bin/bash", "-c", "npu-smi info" ]
            resources:
              requests:
                huawei.com/Ascend310: 1    # For the Atlas inference products, change Ascend 310 to Ascend 310P.
              limits:
                huawei.com/Ascend310: 1    # For the Atlas inference products, change Ascend 310 to Ascend 310P.
    ...    
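For step 3, the change amounts to replacing the Ascend310 resource key with Ascend310P in both requests and limits. A sketch using a hypothetical temp file:

```shell
# Sample resources section mimicking the inference YAML above.
cat > /tmp/infer-sample.yaml <<'EOF'
            resources:
              requests:
                huawei.com/Ascend310: 1
              limits:
                huawei.com/Ascend310: 1
EOF

# Switch from Ascend 310 to Ascend 310P; both requests and limits must change:
sed -i 's#huawei.com/Ascend310:#huawei.com/Ascend310P:#g' /tmp/infer-sample.yaml
grep 'huawei.com' /tmp/infer-sample.yaml
```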
  4. (Optional) Skip this step if Volcano is not used as the scheduler. Otherwise, modify replicas to check the driver status on multiple nodes.
    • Refer to the following example to modify the train-volcano.yaml file of a training job:
      apiVersion: batch.volcano.sh/v1alpha1   
      kind: Job                               
      metadata:
        name: mindx-dls-test
      spec:
        minAvailable: 1       # Its value must be the same as that of replicas.
        schedulerName: volcano                
        maxRetry: 1
        queue: default
        tasks:
        - name: "default-test"
          replicas: 1        # Number of nodes
          template:
            spec:
              containers:
              - image: ascend-k8sdeviceplugin:v3.0.0       
                name: test
                imagePullPolicy: IfNotPresent
                command: ["/bin/bash", "-c", "npu-smi info"]
                resources:
                  requests:
                    huawei.com/Ascend910: 1        # If the value of replicas is greater than 1, the number of NPUs can only be 8 for the Atlas 800 training server and 2 for the server (with Atlas 300T training cards).
                  limits:
                    huawei.com/Ascend910: 1        # If the value of replicas is greater than 1, the number of NPUs can only be 8 for the Atlas 800 training server and 2 for the server (with Atlas 300T training cards).
                volumeMounts:
      ...
    • Refer to the following example to modify the infer-volcano.yaml file of an inference job:
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: infer-env-quick-validation
      spec:
        replicas: 1     # Number of nodes
        selector:
          matchLabels:
            app: infers
        template:
          metadata: 
            labels:
               app: infers
          spec:
            schedulerName: volcano
            nodeSelector:
              host-arch: huawei-arm           
            containers:
            - image: ascend-k8sdeviceplugin:v3.0.0                  
              imagePullPolicy: IfNotPresent
              name: infer-env-quick-validation
              command: [ "/bin/bash", "-c", "npu-smi info" ]
              resources:
                requests:
                  huawei.com/Ascend310: 1   
                limits:
                  huawei.com/Ascend310: 1     
      ...    
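For step 4's training YAML, replicas and minAvailable must stay equal, as the inline comments above require. A sketch that updates both together (the file path is hypothetical; NODES is the number of nodes you want to check):

```shell
# Scale a Volcano training job to NODES nodes, keeping minAvailable in sync.
NODES=2
cat > /tmp/volcano-sample.yaml <<'EOF'
spec:
  minAvailable: 1
  tasks:
  - name: "default-test"
    replicas: 1
EOF

# Update minAvailable and replicas in one pass:
sed -i "s/minAvailable: 1/minAvailable: ${NODES}/; s/replicas: 1/replicas: ${NODES}/" /tmp/volcano-sample.yaml
grep -E 'minAvailable|replicas' /tmp/volcano-sample.yaml
```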

Delivering a Job

Run the following command on the master node:

kubectl apply -f {job_yaml}

Checking the Result

Run the following command on the master node:

kubectl logs {pod_name}

Example:

kubectl logs infer-env-quick-validation-c8f6d6897-n9fbf

If information similar to the following is displayed, the driver is properly installed on the node where the job is running. By default, only one NPU is requested in the YAML files obtained in Obtaining the YAML File of a Job. Change the number based on the actual situation.

  • Example of a training job
    +-------------------------------------------------------------------------------------------+
    | npu-smi 22.0.4                              Version: 22.0.4                               |
    +----------------------+---------------+----------------------------------------------------+
    | NPU   Name           | Health        | Power(W)    Temp(C)           Hugepages-Usage(page)|
    | Chip                 | Bus-Id        | AICore(%)   Memory-Usage(MB)  HBM-Usage(MB)        |
    +======================+===============+====================================================+
    | 0     910A           | OK            | 71.3        47                15   / 15            |
    | 0                    | 0000:61:00.0  | 0           2940 / 15071      30738/ 32768         |
    +======================+===============+====================================================+
  • Example of an inference job on the Ascend 310 AI Processor
    +--------------------------------------------------------------------------------------------------------+
    | npu-smi 22.0.4                              Version: 22.0.4                                            |
    +-------------------------------+-----------------+------------------------------------------------------+
    | NPU     Name                  | Health          | Power(W)     Temp(C)           Hugepages-Usage(page) |
    | Chip    Device                | Bus-Id          | AICore(%)    Memory-Usage(MB)                        |
    +===============================+=================+======================================================+
    | 0       310                   | OK              | 12.8         49                0    / 969            |
    | 0       0                     | 0000:04:00.0    | 0            622  / 7759                             |
    +===============================+=================+======================================================+

    If an inference job uses a YAML file whose name contains "volcano", the pod is restarted repeatedly, so the NPU remains occupied by the pod. Delete the job promptly after verification.

Deleting a Job

Run the following command on the master node to delete a job:

kubectl delete -f {job_yaml}