NPU-Exporter Prometheus Metrics API

Function Description

Provides the Metrics API, which Prometheus calls to collect NPU metrics.

For details about how to integrate Prometheus, see Deploying Prometheus. After Prometheus is started, it can automatically connect to NPU-Exporter.

URL

GET https://ip:port/metrics

For security purposes, NPU-Exporter listens on a container-level port (8082 by default). The request IP address is the IP address of the Kubernetes container. If the Kubernetes network plugin is Calico, the network policy is set to allow access from applications with the label app=prometheus.
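The request above can be sketched in Python using only the standard library. This is a minimal illustration, not part of NPU-Exporter: the function names build_metrics_url and fetch_metrics are hypothetical, and the TLS settings (the optional context argument) depend on how the exporter's certificate is deployed in your cluster.

```python
import ssl
import urllib.request
from typing import Optional

DEFAULT_PORT = 8082  # default container-level port stated in this section


def build_metrics_url(ip: str, port: int = DEFAULT_PORT) -> str:
    """Return the metrics URL for the given container IP address and port."""
    return f"https://{ip}:{port}/metrics"


def fetch_metrics(ip: str, port: int = DEFAULT_PORT, timeout: float = 5.0,
                  context: Optional[ssl.SSLContext] = None) -> str:
    """Send GET /metrics and return the response body as text.

    `context` is an assumption: pass an ssl.SSLContext that trusts the
    exporter's certificate if the system trust store does not.
    """
    request = urllib.request.Request(build_metrics_url(ip, port), method="GET")
    with urllib.request.urlopen(request, timeout=timeout, context=context) as response:
        return response.read().decode("utf-8")


# Usage (the IP address is a placeholder for the container IP):
#   body = fetch_metrics("192.0.2.10")
```

The caller must run somewhere the network policy permits, for example in a pod that carries the app=prometheus label when Calico is used.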

Request Parameters

None

Response Description

The data is returned in the Prometheus text format. The NPU-related metrics are listed below. Metrics provided by Prometheus itself are not described here.

...
# HELP machine_npu_nums Amount of npu installed on the machine.
# TYPE machine_npu_nums gauge
machine_npu_nums 8
# HELP npu_chip_info_error_code the npu error code
# TYPE npu_chip_info_error_code gauge
npu_chip_info_error_code{id="0"} 0 1613993498553
npu_chip_info_error_code{id="1"} 0 1613993498588
npu_chip_info_error_code{id="2"} 0 1613993498615
npu_chip_info_error_code{id="3"} 0 1613993498645
npu_chip_info_error_code{id="4"} 0 1613993498676
npu_chip_info_error_code{id="5"} 0 1613993498685
npu_chip_info_error_code{id="6"} 0 1613993498715
npu_chip_info_error_code{id="7"} 0 1613993498742
# HELP npu_chip_info_hbm_total_memory the npu hbm total memory
# TYPE npu_chip_info_hbm_total_memory gauge
npu_chip_info_hbm_total_memory{id="0"} 32255 1613993498553
npu_chip_info_hbm_total_memory{id="1"} 32255 1613993498588
npu_chip_info_hbm_total_memory{id="2"} 32255 1613993498615
...
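A response like the sample above can be read with a small parser. The following is a simplified sketch, not a full implementation of the Prometheus text format: the names Sample and parse_samples are hypothetical, comment lines (# HELP, # TYPE) are skipped, and label values containing a closing brace are not handled. The trailing integer on each line is treated as a timestamp in milliseconds since the epoch.

```python
import re
from typing import Dict, List, NamedTuple, Optional


class Sample(NamedTuple):
    name: str
    labels: Dict[str, str]
    value: float
    timestamp_ms: Optional[int]


# metric name, optional {labels}, value, optional timestamp
_SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>[^}]*)\})?'
    r'\s+(?P<value>\S+)'
    r'(?:\s+(?P<ts>-?\d+))?\s*$'
)
_LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')


def parse_samples(text: str) -> List[Sample]:
    """Extract metric samples from a metrics response body."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line or HELP/TYPE comment
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # not a sample line (for example, an elided "...")
        labels = dict(_LABEL_RE.findall(match.group("labels") or ""))
        ts = match.group("ts")
        samples.append(Sample(
            name=match.group("name"),
            labels=labels,
            value=float(match.group("value")),
            timestamp_ms=int(ts) if ts is not None else None,
        ))
    return samples
```

For production use, a maintained parser such as the one in the official Prometheus Python client is likely a better choice than a hand-written regular expression.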
Table 1 Prometheus metrics

| Metric | Description | Unit |
| --- | --- | --- |
| machine_npu_nums | Number of Ascend AI Processors | - |
| npu_chip_info_error_code | Error code of an Ascend AI Processor | - |
| npu_chip_info_name | Name and ID of an Ascend AI Processor | - |
| npu_chip_info_health_status | Health status of an Ascend AI Processor. 1: healthy; 0: unhealthy. | - |
| npu_chip_info_power | Power consumption of an Ascend AI Processor. For 910 and 310, this is the processor power consumption. For 310P, this is the board power consumption. | W |
| npu_chip_info_temperature | Temperature of an Ascend AI Processor | °C |
| npu_chip_info_used_memory | Used memory of an Ascend AI Processor | MB |
| npu_chip_info_total_memory | Total memory of an Ascend AI Processor | MB |
| npu_chip_info_hbm_used_memory | Used HBM dedicated to the Ascend AI Processor | MB |
| npu_chip_info_hbm_total_memory | Total HBM dedicated to the Ascend AI Processor | MB |
| npu_chip_info_utilization | AI Core usage of an Ascend AI Processor | % |
| npu_chip_info_voltage | Voltage of an Ascend AI Processor | V |
| npu_exporter_version_info | NPU-Exporter version information | - |
| npu_container_info | NPU container information. The output contains the following fields: containerID (container ID, string); containerName (container name, string; the output format is Pod Namespace_Pod name_Container name__); npuID (NPU ID, int). | - |
| container_npu_total_memory | Total memory of the NPU, with container information. Only the entire card is supported. The container information contains the following fields: id (NPU ID, int); pod_name (string); container_name (string); namespace (string). | MB |
| container_npu_used_memory | Used memory of the NPU, with container information. Only the entire card is supported. The container information contains the following fields: id (NPU ID, int); pod_name (string); container_name (string); namespace (string). | MB |
| container_npu_utilization | NPU usage, with container information. Only the entire card is supported. The container information contains the following fields: id (NPU ID, int); pod_name (string); container_name (string); namespace (string). | % |
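As an illustration of how the values in the table above might be combined, the sketch below derives an HBM usage percentage from npu_chip_info_hbm_used_memory and npu_chip_info_hbm_total_memory, and lists the chips whose npu_chip_info_health_status is 0. The function names and the dictionary layout (chip id mapped to value) are assumptions made for this example. NPU-Exporter does not export a derived percentage itself.

```python
from typing import Dict, List


def hbm_usage_percent(used_mb: float, total_mb: float) -> float:
    """Return used HBM as a percentage of total HBM (both in MB)."""
    if total_mb <= 0:
        return 0.0  # guard against division by zero when no total is reported
    return 100.0 * used_mb / total_mb


def unhealthy_chip_ids(health_status: Dict[str, float]) -> List[str]:
    """Return the ids of chips whose health status is 0 (unhealthy)."""
    return sorted(chip_id for chip_id, status in health_status.items() if status == 0)


# Usage with made-up readings (32255 MB matches the sample response total):
#   hbm_usage_percent(16127.5, 32255)
#   unhealthy_chip_ids({"0": 1, "1": 0, "2": 1})
```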

Status Code

Table 2 Status codes

| Status Code | Description |
| --- | --- |
| 200 | Normal |
| 307 | Temporary redirection |
| 500 | Internal server error |
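A client can map the returned status code to the descriptions listed above. The following is a minimal sketch. The names STATUS_DESCRIPTIONS, describe_status, and is_success are hypothetical, and the handling of codes not listed in the table is an assumption.

```python
# Status codes documented in Table 2.
STATUS_DESCRIPTIONS = {
    200: "Normal",
    307: "Temporary redirection",
    500: "Internal server error",
}


def describe_status(code: int) -> str:
    """Return the documented description, or a fallback for other codes."""
    return STATUS_DESCRIPTIONS.get(code, "Undocumented status code")


def is_success(code: int) -> bool:
    """Only 200 indicates that the metrics were returned normally."""
    return code == 200
```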