Ascend Device Plugin

Application Scenario

Kubernetes must know which resources each node provides before it can schedule workloads. It tracks CPU and memory natively. Other hardware relies on the Kubernetes device plugin mechanism, which lets a vendor define new resource types and customize how those resources are discovered and reported. Ascend Device Plugin is deployed on compute nodes and provides resource discovery and reporting for Ascend devices.
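
The discovery and reporting idea can be sketched in a few lines of Go. This is only an illustration, not the plugin's actual code: the Device type, the buildDeviceList function, and the "Ascend910" ID prefix are hypothetical names chosen for the example. The "Healthy" and "Unhealthy" strings follow the health values used by the Kubernetes device plugin API.

```go
package main

import "fmt"

// Device is a hypothetical, simplified view of one advertised device:
// an ID plus a health string.
type Device struct {
	ID     string
	Health string // "Healthy" or "Unhealthy"
}

// buildDeviceList turns discovered processor IDs into the list that a
// plugin would report for scheduling. Processors marked faulty are
// reported as "Unhealthy" so that they are not handed to new workloads.
func buildDeviceList(prefix string, ids []int, faulty map[int]bool) []Device {
	devices := make([]Device, 0, len(ids))
	for _, id := range ids {
		health := "Healthy"
		if faulty[id] {
			health = "Unhealthy"
		}
		devices = append(devices, Device{
			ID:     fmt.Sprintf("%s-%d", prefix, id),
			Health: health,
		})
	}
	return devices
}

func main() {
	// Assumed input: four processors discovered, processor 2 faulty.
	devices := buildDeviceList("Ascend910", []int{0, 1, 2, 3}, map[int]bool{2: true})
	for _, d := range devices {
		fmt.Printf("%s %s\n", d.ID, d.Health)
	}
}
```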

Component Function

  • Obtain the processor type and model from the driver and report them to kubelet and ClusterD (the upper-layer service for resource scheduling).
  • Subscribe to processor fault information from the driver, report the processor status to kubelet, and report both the processor status and fault information to the upper-layer service for resource scheduling.
  • Subscribe to UnifiedBus network fault information from the UnifiedBus driver, report the network status to kubelet, and report both the UnifiedBus network status and fault information to the upper-layer service for resource scheduling.
  • Establish fault handling levels. The level can be escalated if a fault recurs frequently or persists for an extended period.
  • In the resource mounting phase, obtain information about the processors selected by cluster scheduling and pass it to Ascend Docker Runtime through environment variables so that the runtime can mount those processors.
  • Perform a hot reset on an idle, faulty processor. The processor can recover after the restart.
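
The fault level escalation described above can be sketched as follows. This is a minimal illustration under stated assumptions: the level names, the EscalationPolicy type, and the threshold values are all hypothetical and are not taken from the Ascend Device Plugin configuration.

```go
package main

import (
	"fmt"
	"time"
)

// FaultLevel is a hypothetical ordering of handling levels,
// from least to most severe.
type FaultLevel int

const (
	LevelIgnore FaultLevel = iota
	LevelRestartJob
	LevelIsolate
)

// EscalationPolicy holds illustrative thresholds (assumed values).
type EscalationPolicy struct {
	MaxOccurrences int           // escalate when the fault recurs at least this often
	MaxDuration    time.Duration // escalate when the fault persists at least this long
}

// examplePolicy returns assumed thresholds used only for this sketch.
func examplePolicy() EscalationPolicy {
	return EscalationPolicy{MaxOccurrences: 3, MaxDuration: 30 * time.Minute}
}

// Escalate raises the handling level by one step if the fault recurs
// frequently or persists for an extended period. The highest level is
// never exceeded. The caller is assumed to count recurrences within
// its own observation window.
func (p EscalationPolicy) Escalate(base FaultLevel, recurrences int, persisted time.Duration) FaultLevel {
	frequent := recurrences >= p.MaxOccurrences
	persistent := persisted >= p.MaxDuration
	if (frequent || persistent) && base < LevelIsolate {
		return base + 1
	}
	return base
}

func main() {
	p := examplePolicy()
	// One occurrence, lasting 5 minutes: the level stays unchanged.
	fmt.Println(p.Escalate(LevelIgnore, 1, 5*time.Minute) == LevelIgnore)
	// Four occurrences: the level is raised by one step.
	fmt.Println(p.Escalate(LevelIgnore, 4, 5*time.Minute) == LevelRestartJob)
}
```

Escalating one step at a time keeps the response proportional: a fault must keep recurring or persisting before the most severe handling is applied.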

Upstream and Downstream Dependencies

Figure 1 Upstream and downstream dependencies
  1. Obtain the processor type, quantity, and health status from the DCMI, or deliver a processor reset command.
  2. Report the processor type, quantity, and status to kubelet.
  3. Report the processor type, quantity, and fault information to ClusterD.
  4. Inform Ascend Docker Runtime of the processor information selected by a scheduler via environment variables.
  5. Deliver the commands for starting and stopping training jobs to the container.
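
Step 4 can be illustrated with a short Go sketch that builds the environment variable map returned for a container. It is a sketch under assumptions: buildAllocateEnvs is a hypothetical helper, and the variable name ASCEND_VISIBLE_DEVICES with a comma-separated ID list is assumed here as the format that Ascend Docker Runtime reads. Check the runtime documentation for the exact name and format in your version.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// visibleDevicesEnv is the variable name assumed for this sketch.
const visibleDevicesEnv = "ASCEND_VISIBLE_DEVICES"

// buildAllocateEnvs maps the processor IDs selected by the scheduler
// to the environment variables injected into the container.
func buildAllocateEnvs(selected []int) map[string]string {
	// Copy before sorting so the caller's slice is not reordered.
	ids := append([]int(nil), selected...)
	sort.Ints(ids)

	parts := make([]string, 0, len(ids))
	for _, id := range ids {
		parts = append(parts, strconv.Itoa(id))
	}
	return map[string]string{visibleDevicesEnv: strings.Join(parts, ",")}
}

func main() {
	envs := buildAllocateEnvs([]int{3, 0, 1})
	fmt.Println(envs[visibleDevicesEnv]) // prints 0,1,3
}
```

In the Kubernetes device plugin API, a map of this kind is carried in the per-container allocation response, which is how the selection reaches the container runtime without a separate channel.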