NPU Exporter

Application Scenario

During task execution, chip faults are not the only thing worth monitoring. The network usage and computing power usage of the chips also matter, because they help identify performance bottlenecks and show where task performance can be improved. MindCluster provides NPU Exporter, which is deployed on compute nodes, to report this chip data.

Component Function

  • Obtains chip and network data from the driver.
  • Implements the Prometheus hook functions and exposes standard APIs for the Prometheus service to call.
  • Implements the Telegraf hook functions and exposes standard APIs for the Telegraf service to call.

Upstream and Downstream Dependencies

Figure 1 Upstream and downstream dependencies
  1. NPU Exporter obtains chip and network information from the driver and saves it to the local cache.
  2. NPU Exporter obtains container information through the Kubernetes Container Runtime Interface (CRI) and saves it to the local cache.
  3. NPU Exporter implements the Prometheus or Telegraf APIs, through which the data is periodically read from the cache.
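
The three steps above can be sketched as a small cache that collectors write to and a scrape handler reads from. This is a minimal Python sketch under stated assumptions, not the NPU Exporter implementation: the class MetricCache, the functions read_chip_info, read_container_info, refresh, and scrape, and all sample values are hypothetical placeholders.

```python
import threading


class MetricCache:
    """Thread-safe in-memory cache shared by collectors and the scrape handler."""

    def __init__(self):
        self._lock = threading.Lock()
        self._entries = {}

    def put(self, key, value):
        with self._lock:
            self._entries[key] = value

    def get(self, key, default=None):
        with self._lock:
            return self._entries.get(key, default)


def read_chip_info():
    # Placeholder for step 1: a real exporter would query the driver here.
    return [{"id": 0, "utilization": 37.5}]


def read_container_info():
    # Placeholder for step 2: a real exporter would query the CRI here.
    # Maps a chip id to the container that uses it (hypothetical layout).
    return {0: "pod-a/container-1"}


def refresh(cache):
    """Steps 1 and 2: collect data and store it in the local cache."""
    cache.put("chips", read_chip_info())
    cache.put("containers", read_container_info())


def scrape(cache):
    """Step 3: answer a monitoring request from the cache only."""
    chips = cache.get("chips", [])
    containers = cache.get("containers", {})
    return [dict(chip, container=containers.get(chip["id"], "")) for chip in chips]


if __name__ == "__main__":
    cache = MetricCache()
    refresh(cache)
    print(scrape(cache))
```

Reading from the cache means a request from the monitoring service never waits on the driver or the container runtime. The cost is that the returned data is only as fresh as the most recent refresh.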