Working Principle

The working principle of this feature differs slightly depending on the type of inference job.

vcjob

Figure 1 shows the principle of vcjob.
Figure 1 vcjob scheduling
The steps are described as follows:
  1. Cluster scheduling components periodically report node and processor information. kubelet reports the number of processors to the node object.
    • Ascend Device Plugin periodically reports the number of AI Cores to the node.
    • When a node is faulty, NodeD periodically reports the node health status, node hardware fault information, and node DPC shared storage fault information to node-info-cm.
  2. ClusterD reads information in device-info-cm and node-info-cm and writes the information to cluster-info-device-cm and cluster-info-node-cm.
  3. The user submits a vcjob through kubectl or a deep learning platform.
  4. volcano-controller creates a PodGroup for the job. For details about PodGroup, see the official Volcano open source documentation.
  5. volcano-controller creates a pod for the job when cluster resources meet the job requirements.
  6. volcano-scheduler selects a suitable node for the job based on the node and processor topology information and writes the dynamic virtualization template information to the pod's annotation.
  7. When kubelet creates the container, it calls Ascend Device Plugin to mount the processor. Ascend Device Plugin dynamically virtualizes the NPU based on the template information in the pod's annotation. Ascend Docker Runtime assists in mounting the corresponding resource.
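
Steps 6 and 7 can be pictured with the following minimal Python sketch. It is only an illustration of the idea that the scheduler records a template on the pod and the device plugin later reads it back. The template names, the annotation key example.com/npu-template, and the node and pod fields are all hypothetical and do not reflect the real Volcano or Ascend Device Plugin interfaces.

```python
# Illustrative sketch only: every name below (template table, annotation key,
# node and pod fields) is an assumption, not a real Volcano or Ascend API.

# Hypothetical virtualization templates: template name -> AI Cores provided.
TEMPLATES = {"tpl-1c": 1, "tpl-2c": 2, "tpl-4c": 4}

# Hypothetical annotation key that carries the template to the device plugin.
TEMPLATE_ANNOTATION = "example.com/npu-template"


def pick_template(requested_cores):
    """Return the smallest template that covers the request, or None."""
    fitting = [(cores, name) for name, cores in TEMPLATES.items()
               if cores >= requested_cores]
    return min(fitting)[1] if fitting else None


def schedule(pod, nodes):
    """Step 6: pick a healthy node with enough free AI Cores, annotate the pod."""
    template = pick_template(pod["requested_cores"])
    if template is None:
        return None
    needed = TEMPLATES[template]
    for node in nodes:
        if node["healthy"] and node["free_cores"] >= needed:
            node["free_cores"] -= needed
            pod["node"] = node["name"]
            pod.setdefault("annotations", {})[TEMPLATE_ANNOTATION] = template
            return node["name"]
    return None


def mount(pod):
    """Step 7: read the template from the annotation and describe the virtual device."""
    template = pod["annotations"][TEMPLATE_ANNOTATION]
    return {"template": template, "cores": TEMPLATES[template]}


nodes = [
    {"name": "node-a", "healthy": False, "free_cores": 8},
    {"name": "node-b", "healthy": True, "free_cores": 8},
]
pod = {"name": "infer-0", "requested_cores": 2}
chosen = schedule(pod, nodes)
device = mount(pod)
print(chosen, device)
```

In this sketch the unhealthy node is skipped, which mirrors why the health information reported in step 1 must reach the scheduler before a node is selected.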

deploy job

Figure 2 shows the principle of deploy jobs.
Figure 2 deploy job scheduling
The steps are described as follows:
  1. Cluster scheduling components periodically report node and processor information.
    • Ascend Device Plugin periodically reports the number of AI Cores to the node.
    • When a node is faulty, NodeD periodically reports the node health status, node hardware fault information, and node DPC shared storage fault information to node-info-cm.
  2. ClusterD reads information in device-info-cm and node-info-cm and writes the information to cluster-info-device-cm and cluster-info-node-cm.
  3. The user submits a deploy job through kubectl or a deep learning platform.
  4. kube-controller creates a pod for the job.
  5. volcano-controller creates a PodGroup for the job. For details about PodGroup, see the official Volcano open source documentation.
  6. volcano-scheduler selects a suitable node for the job based on the node and processor topology information and writes the dynamic virtualization template information to the pod's annotation.
  7. When kubelet creates the container, it calls Ascend Device Plugin to mount the processor. Ascend Device Plugin dynamically virtualizes the NPU based on the template information in the pod's annotation. Ascend Docker Runtime assists in mounting the corresponding resource.
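
The two flows differ only in who creates the pod and when the PodGroup appears. The short Python sketch below lists both step sequences as described in this section and prints the positions where they diverge. The step labels are paraphrases chosen for this sketch, not identifiers used by any component.

```python
# Step order taken from the two lists in this section. The labels are
# paraphrases written for this sketch, not names from any real component.
COMMON_HEAD = [
    "components report node and processor information",
    "ClusterD aggregates device and node information",
    "user submits the job",
]
COMMON_TAIL = [
    "volcano-scheduler selects a node and annotates the pod",
    "kubelet creates the container and the device plugin mounts the NPU",
]

VCJOB_FLOW = COMMON_HEAD + [
    "volcano-controller creates the PodGroup",
    "volcano-controller creates the pod once resources suffice",
] + COMMON_TAIL

DEPLOY_FLOW = COMMON_HEAD + [
    "kube-controller creates the pod",
    "volcano-controller creates the PodGroup",
] + COMMON_TAIL


def differing_steps(flow_a, flow_b):
    """Return (step number, step in flow_a, step in flow_b) where the flows differ."""
    return [(i, a, b)
            for i, (a, b) in enumerate(zip(flow_a, flow_b), start=1)
            if a != b]


diffs = differing_steps(VCJOB_FLOW, DEPLOY_FLOW)
for step, vcjob_step, deploy_step in diffs:
    print(f"step {step}: vcjob: {vcjob_step} | deploy: {deploy_step}")
```

Only steps 4 and 5 are reported: for a vcjob the PodGroup exists before the pod is created, whereas for a deploy job the pod is created first and the PodGroup follows.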