Before You Start

Resource monitoring covers two groups of metrics: (1) for vNPUs, the AI Core usage, total memory, and used memory; (2) for training or inference jobs, real-time NPU resource usage, including the usage, temperature, voltage, memory, and in-container allocation status of Ascend AI processors.

Resource monitoring is a basic feature that does not depend on the scenario (training or inference) or on the scheduler (Volcano or any other). It must be used together with Prometheus or Telegraf. With Prometheus, deploy Prometheus first and then call the related NPU Exporter APIs to monitor resources. With Telegraf, deploy and run Telegraf to monitor resources.

Prometheus is a comprehensive open-source monitoring solution that features easy management, high efficiency, scalability, and visualization. It works with NPU Exporter to achieve real-time monitoring of each Ascend AI processor's usage, temperature, voltage, memory, and allocation status in containers. It can also monitor the vNPU AI Core usage, total vNPU memory, and used vNPU memory.
Telegraf collects statistics on the system and services. It has a small memory footprint and can be extended to support other services. It works with NPU Exporter, and you can view the reported Ascend AI processor information in the command output in your environment.
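
As a rough illustration of what monitoring through an exporter involves, the following Python sketch parses metrics in the Prometheus text exposition format and picks out chips whose usage exceeds a threshold. It is a minimal sketch under stated assumptions, not the NPU Exporter API: the metric names, the `id` label, the sample values, and the URL passed to `fetch_metrics` are all hypothetical and must be replaced with what your NPU Exporter deployment actually exposes.

```python
import re
import urllib.request
from typing import Dict, List, Tuple

Sample = Tuple[str, Dict[str, str], float]

# Hypothetical exporter output. The metric names, the "id" label, and the
# values are illustrative assumptions, not taken from NPU Exporter.
SAMPLE_TEXT = """\
# HELP npu_chip_info_utilization hypothetical AI Core usage (percent)
# TYPE npu_chip_info_utilization gauge
npu_chip_info_utilization{id="0"} 37
npu_chip_info_utilization{id="1"} 82.5
npu_chip_info_temperature{id="0"} 45
"""

# metric_name, optional {labels}, then the sample value.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> List[Sample]:
    """Parse Prometheus text-format lines into (name, labels, value) tuples."""
    samples: List[Sample] = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


def busy_chips(samples: List[Sample], metric: str, threshold: float) -> List[str]:
    """Return the "id" label of every sample of `metric` above `threshold`."""
    return [
        labels.get("id", "")
        for name, labels, value in samples
        if name == metric and value > threshold
    ]


def fetch_metrics(url: str) -> str:
    """Download the raw metrics text. The URL is deployment-specific (assumption)."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")
```

For example, `busy_chips(parse_metrics(SAMPLE_TEXT), "npu_chip_info_utilization", 80.0)` selects the chips in the sample text whose usage is above 80. Against a live deployment, pass the output of `fetch_metrics` with your own exporter address instead of `SAMPLE_TEXT`. In practice Prometheus performs this scraping and parsing itself; the sketch only shows the shape of the data involved.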

Prerequisites

Before using the resource monitoring feature, ensure that NPU Exporter has been installed. If it is not installed, install it by referring to Installation and Deployment.
Before starting NPU Exporter, ensure that NPUs work properly.

Usage Instructions

Resource monitoring can be used together with all other features in training and inference scenarios.

Supported Products

Resource monitoring supports the following products:

Atlas training product
Atlas A2 training product
Atlas A3 training product
Inference server (equipped with Atlas 300I inference cards)
Atlas inference product
Atlas 800I A2 inference server
A200I A2 Box heterogeneous component
Atlas 800I A3 SuperPoD Server
