Preparation of Job YAML Files

  • If you do not use Ascend Docker Runtime, Ascend Device Plugin mounts only the devices in the /dev directory. To make other directories (such as /usr) available in the container, modify the YAML file to mount the corresponding driver directories and files. Each mount path in the container must be the same as the host path.
  • Ascend Docker Runtime is not supported by Atlas 200I SoC A1 core boards, so you do not need to modify the YAML file.
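
The rule that a container mount path must equal the host path can be sketched as a small helper that emits a matching volume and volumeMount pair. This is a minimal Python sketch using plain dictionaries. The helper name host_path_mount and the directory /usr/local/example-driver are hypothetical; the directories you actually need to mount depend on your driver installation.

```python
def host_path_mount(name, path):
    """Return (volume, volume_mount) dicts that mount `path` at the same path in the container."""
    volume = {"name": name, "hostPath": {"path": path}}
    volume_mount = {"name": name, "mountPath": path}
    return volume, volume_mount


# Hypothetical directory; replace it with the driver directories your host actually uses.
volume, volume_mount = host_path_mount("example-driver", "/usr/local/example-driver")
print(volume)        # entry for the pod's "volumes" list
print(volume_mount)  # entry for the container's "volumeMounts" list
```

Generating both entries from one path avoids the host path and the container path drifting apart when the YAML file is edited by hand.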

Procedure

  1. Download the corresponding YAML file.
    Table 1 YAML files of different hardware models

    | Job Type | Hardware Model | YAML File Path | How to Obtain |
    |---|---|---|---|
    | Deployment job scheduled by Volcano | Atlas 200I SoC A1 core board | infer-deploy-310p-1usoc.yaml | Click here. |
    | Deployment job scheduled by Volcano | Inference nodes of other types | infer-deploy.yaml | Click here. |
    | Volcano Job | Atlas 800I A2 inference server/A200I A2 Box heterogeneous component/Atlas 800I A3 SuperPoD Server | infer-vcjob-910.yaml | Click here. |
    | Ascend Job | Inference server (equipped with Atlas 300I Duo inference cards) | pytorch_acjob_infer_310p_with_ranktable.yaml | Click here. |
    | Ascend Job | Atlas 800I A2 inference server/A200I A2 Box heterogeneous component/Atlas 800I A3 SuperPoD Server | pytorch_multinodes_acjob_infer_{xxx}b_with_ranktable.yaml | Click here. |

    For Volcano Jobs, you need to modify the corresponding YAML file based on the example YAML file.

  2. In addition to the basic YAML configuration for full NPU scheduling or dynamic vNPU scheduling, add the fields marked with the "# Add this field." comment to enable the rescheduling function. The infer-deploy.yaml file for full NPU scheduling is used as an example.
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: resnetinfer1-1-deploy
      labels:
        app: infers
    spec:
      replicas: 1
      selector:
        matchLabels:
          app: infers
      template:
        metadata:
          labels:
    ...
            fault-scheduling: grace              # Add this field.
            ring-controller.atlas: ascend-310    # Add this field.
        spec:
          schedulerName: volcano
          nodeSelector:
            host-arch: huawei-arm           # OS architecture of the node. For x86 nodes, change this to huawei-x86.
    ...
    Table 2 fault-scheduling description

    | Parameter | Value | Description |
    |---|---|---|
    | fault-scheduling | grace | Job rescheduling is enabled. The original pod is deleted gracefully during rescheduling. |
    | fault-scheduling | force | Forcible deletion mode is enabled. The original pod is forcibly deleted during rescheduling. |
    | ring-controller.atlas | ascend-310 for an inference server (equipped with Atlas 300I inference cards); ascend-310P for an Atlas inference product; ascend-{xxx}b for an Atlas 800I A2 inference server/A200I A2 Box heterogeneous component/Atlas 800I A3 SuperPoD Server | Indicates the processor type used by the job. |
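
Before submitting the job, the two labels from Table 2 can be sanity-checked. The following is a minimal Python sketch that works on the pod template labels as a plain dictionary. The function name check_rescheduling_labels is hypothetical and not part of any Ascend or Kubernetes tool. For ring-controller.atlas it checks only the "ascend-" prefix, because the exact value depends on your hardware.

```python
def check_rescheduling_labels(labels):
    """Return a list of problems found in a pod template's labels."""
    problems = []
    mode = labels.get("fault-scheduling")
    if mode not in ("grace", "force"):
        problems.append("fault-scheduling must be 'grace' or 'force', got %r" % (mode,))
    chip = labels.get("ring-controller.atlas")
    # Only the prefix is checked; pick the exact value for your hardware from Table 2.
    if not isinstance(chip, str) or not chip.startswith("ascend-"):
        problems.append("ring-controller.atlas must be an ascend-* value, got %r" % (chip,))
    return problems


# Labels taken from the infer-deploy.yaml example in this step.
labels = {"app": "infers", "fault-scheduling": "grace", "ring-controller.atlas": "ascend-310"}
print(check_rescheduling_labels(labels))  # prints []
```

An empty list means both labels are present and well formed. A missing or misspelled label produces one message per problem.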

  3. Mount the weight file.
    ...
                  ports:     # Collective communication port for distributed training
                    - containerPort: 2222      
                      name: ascendjob-port      
                  resources:
                    limits:
                      huawei.com/Ascend310P: 1   # Number of allocated processors
                    requests:
                      huawei.com/Ascend310P: 1   # The value must be the same as that of limits.
                  volumeMounts:
    ...
                    # Mount path of the weight file
                    - name: weights
                      mountPath: /path-to-weights
    ...
              volumes:
    ...
                # Mount path of the weight file
                - name: weights
                  hostPath:
                    path: /path-to-weights  # Shared storage or local storage path. Change it as required.
    ...
    • /path-to-weights is the path to the model weights, which you must prepare yourself. You can download the MindIE image by referring to the $ATB_SPEED_HOME_PATH/examples/models/llama3/README.md file.
    • The default value of ATB_SPEED_HOME_PATH is /usr/local/Ascend/atb-models. It is already set by the set_env.sh script in the source model repository, so you do not need to set it manually.
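
Two constraints in this step are easy to break when editing by hand: requests must equal limits, and every volumeMounts entry must refer to a volume declared under volumes. The following is a minimal Python sketch of both checks on plain dictionaries that mirror the YAML excerpt above. The function name check_container is hypothetical.

```python
def check_container(container, volumes):
    """Return a list of problems in a container spec, given the pod's volumes list."""
    problems = []
    resources = container.get("resources", {})
    if resources.get("requests") != resources.get("limits"):
        problems.append("resources.requests must match resources.limits")
    declared = {volume["name"] for volume in volumes}
    for mount in container.get("volumeMounts", []):
        if mount["name"] not in declared:
            problems.append("volumeMount %r has no matching volume" % (mount["name"],))
    return problems


# Values copied from the YAML excerpt in this step.
container = {
    "resources": {
        "limits": {"huawei.com/Ascend310P": 1},
        "requests": {"huawei.com/Ascend310P": 1},
    },
    "volumeMounts": [{"name": "weights", "mountPath": "/path-to-weights"}],
}
volumes = [{"name": "weights", "hostPath": {"path": "/path-to-weights"}}]
print(check_container(container, volumes))  # prints []
```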
  4. Modify the container startup command in the example YAML file, as shown by the command field in the following example. If the command field does not exist, add it.
    ...
          containers:
          - image: ubuntu-infer:v1
    ...
            command: ["/bin/bash", "-c", "cd $ATB_SPEED_HOME_PATH; python examples/run_pa.py --model_path /path-to-weights"]
            resources:
              requests:
    ...
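
If the weight path differs between environments, the command value can be generated rather than typed by hand. The following is a minimal Python sketch. The function name build_command is hypothetical, and the script path and the --model_path option are copied from the example above. The weight path is passed through shlex.quote so that a path containing spaces still reaches run_pa.py as a single argument.

```python
import json
import shlex


def build_command(model_path):
    """Build the container command list for running run_pa.py against model_path."""
    script = (
        "cd $ATB_SPEED_HOME_PATH; "
        "python examples/run_pa.py --model_path " + shlex.quote(model_path)
    )
    return ["/bin/bash", "-c", script]


command = build_command("/path-to-weights")
# A JSON array is also a valid YAML flow sequence, so it can be pasted after "command:".
print(json.dumps(command))
```

$ATB_SPEED_HOME_PATH is deliberately left unquoted so that the shell inside the container expands it at startup.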