Working with Prometheus
This section describes how to install and deploy Prometheus and how to view resource monitoring data in Prometheus. For details about the data, see Prometheus Metrics API. NPU Exporter can interconnect with Prometheus in either of the following ways:
- Directly interconnecting with Prometheus: Prometheus collects NPU data directly from NPU Exporter, with no additional middleware or agents, which simplifies the architecture.
- Interconnecting with Prometheus through Prometheus Operator: NPU Exporter connects to Prometheus through Prometheus Operator, which helps you quickly set up a platform-based Prometheus service and improves the reliability and maintainability of the monitoring system.
Directly Interconnecting with Prometheus
- Go to the mindcluster-deploy repository, access the corresponding branch based on mindcluster-deploy Version Description, and obtain the prometheus.yaml file in the samples/utils/prometheus/base directory.
- Run the following command on the management node to obtain the image:
docker pull prom/prometheus:v2.10.0
- Before obtaining the image, ensure that you can access the Internet.
- If you do not use the prometheus.yaml file provided by the cluster scheduling components, add the app: prometheus field to your own file at the same position as in the provided YAML file. Otherwise, the NPU Exporter connection may time out.
- Modify the default configuration in prometheus.yaml for obtaining the NPU Exporter metrics as required. The kubernetes-npu-exporter job in the following example is the configuration for obtaining NPU Exporter metrics:
...
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-config
  namespace: kube-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
      evaluation_interval: 15s
    scrape_configs:
    ...
    - job_name: 'kubernetes-npu-exporter'
      kubernetes_sd_configs:
      - role: pod
      scheme: http
      relabel_configs:
      - action: keep
        source_labels: [__meta_kubernetes_namespace]
        regex: npu-exporter
      - source_labels: [__meta_kubernetes_pod_node_name]
        target_label: job
        replacement: ${1}
...
- Run the following command to add a label to the management node:
kubectl label nodes <Host name of the management node> masterselector=dls-master-node --overwrite=true
- Upload prometheus.yaml to any directory on the management node.
- Run the following command in the directory where the prometheus.yaml file is stored to start the Prometheus service:
kubectl apply -f prometheus.yaml
If the following information is displayed, the installation is successful:
[root@centos check_env]# kubectl apply -f prometheus.yaml
clusterrole.rbac.authorization.k8s.io/prometheus created
serviceaccount/prometheus created
clusterrolebinding.rbac.authorization.k8s.io/prometheus created
service/prometheus created
deployment.apps/prometheus created
configmap/prometheus-config created
- Run the following command to check whether Prometheus is started successfully:
kubectl get pods --all-namespaces | grep prometheus
The following is a startup example. If Running is displayed, Prometheus is started successfully.
kube-system   prometheus-58c69548b4-rhxsc   1/1   Running   0   6d14h
- Log in to Prometheus and view the monitoring data.
- Open the browser.
- Enter http://<IP address of the management node>:<Port number> in the browser address box and press Enter.
Find the nodePort field in the prometheus.yaml file. The value of this field is the port number of the Prometheus service, which is 30003 by default.
- Choose NPU-related labels to view corresponding data.
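The browser check can also be scripted. The following is a minimal sketch, assuming curl is installed on a machine that can reach the management node and that the deployed Prometheus serves its standard HTTP API under /api/v1. The helper names prom_url and prom_query are hypothetical and exist only for illustration, and 192.0.2.10 is a placeholder address.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of NPU Exporter or Prometheus.

# Build the Prometheus base URL. The port falls back to 30003,
# the default nodePort mentioned in this section.
prom_url() {
    printf 'http://%s:%s' "$1" "${2:-30003}"
}

# Run an instant PromQL query through the Prometheus HTTP API.
# Usage: prom_query <management-node-ip> <port> <promql-expression>
prom_query() {
    curl -s -G "$(prom_url "$1" "$2")/api/v1/query" --data-urlencode "query=$3"
}

# Example (requires network access to the management node):
#   prom_query 192.0.2.10 30003 'up'
```

In the JSON response to the up query, a value of 1 indicates that Prometheus scraped the corresponding target successfully.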
Interconnecting with Prometheus via Prometheus Operator
- Run the following command to obtain the source code of Prometheus Operator:
git clone https://github.com/prometheus-operator/kube-prometheus.git
- Obtain the Prometheus Operator source code branch that matches Kubernetes from the compatibility list in the official document.
- If Prometheus Operator and Prometheus have been installed, proceed to Step 4.
- Install Prometheus Operator.
- Run the following command to install Prometheus Operator:
kubectl create -f manifests/setup/
If the following information is displayed, Prometheus Operator is successfully installed:
namespace/monitoring created
...
deployment.apps/prometheus-operator created
service/prometheus-operator created
serviceaccount/prometheus-operator created
- Run the following command to check whether Prometheus Operator is started successfully:
kubectl get pod -A -o wide|grep prometheus-operator
The following is a startup example. If Running is displayed, Prometheus Operator is started successfully.
monitoring   prometheus-operator-7649c7454f-wp84n   2/2   Running   0   58s   192.168.xx.xx   node133   <none>   <none>
- Install Prometheus.
- Go to the mindcluster-deploy repository, access the corresponding branch based on mindcluster-deploy Version Description, and obtain the prometheus.yaml file in the samples/utils/prometheus/base directory.
- Upload the prometheus.yaml file obtained in the previous step to any directory in the environment.
- In the directory where prometheus.yaml is stored, run the following command to install Prometheus:
kubectl apply -f prometheus.yaml
If the following information is displayed, the installation is successful:
service/prometheus created
prometheus.monitoring.coreos.com/prometheus created
serviceaccount/prometheus-service-account created
clusterrole.rbac.authorization.k8s.io/prometheus-cluster-role created
clusterrolebinding.rbac.authorization.k8s.io/prometheus-cluster-role-binding created
- Run the following command to check whether Prometheus is started successfully:
kubectl get pods --all-namespaces | grep prometheus
Command output:
kube-system   prometheus-prometheus-0                2/2   Running   1   3m47s   192.168.xx.xx   node133   <none>   <none>
monitoring    prometheus-operator-7649c7454f-wp84n   2/2   Running   0   5m52s   192.168.xx.xx   node133   <none>   <none>
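The status check on this output can be automated instead of read by eye. The following is a minimal sketch, assuming awk is available. The helper name not_running is hypothetical, and it relies on STATUS being the fourth column of the kubectl get pods --all-namespaces output.

```shell
#!/bin/sh
# Hypothetical helper for illustration: reads `kubectl get pods --all-namespaces`
# output on stdin and prints every Prometheus-related pod that is not Running.
# Column layout assumed: NAMESPACE NAME READY STATUS ...
# Exit status is 0 when all matching pods are Running, 1 otherwise.
not_running() {
    awk '$2 ~ /prometheus/ && $4 != "Running" { print $1 "/" $2 ": " $4; bad = 1 }
         END { exit bad + 0 }'
}

# Example (requires access to the cluster):
#   kubectl get pods --all-namespaces | not_running && echo "all Prometheus pods are Running"
```

The function only inspects the STATUS column, so a pod that is Running but not fully ready (for example, 1/2 in the READY column) still passes this check.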
- Interconnect NPU Exporter with Prometheus via Prometheus Operator.
- Obtain npu-exporter-svc.yaml and servicemonitor.yaml.
If Prometheus has been installed, the following field in the servicemonitor.yaml file must be the same as matchLabels configured by serviceMonitorSelector in Prometheus.
...
labels:
  serviceMonitorSelector: prometheus
...
You can run the following command to query matchLabels:
kubectl describe pod <pod-name>
- (Optional) Modify the NPU Exporter label as required. Otherwise, skip this step.
- In the npu-exporter-svc.yaml file, modify the label as required.
apiVersion: v1
kind: Service
metadata:
  namespace: npu-exporter        # The namespace is npu-exporter.
  name: npu-exporter
  labels:
    app: npu-exporter-svc        # Label of NPU Exporter service
spec:
  type: ClusterIP
  ports:
  - port: 8082                   # Service port number of NPU Exporter
    targetPort: 8082
...
- In the servicemonitor.yaml file, modify the NPU Exporter label as required, which must be the same as that in the npu-exporter-svc.yaml file.
...
spec:
  endpoints:
  - interval: 10s
    targetPort: 8082             # Service port number of NPU Exporter
    path: /metrics
  namespaceSelector:
    matchNames:
    - npu-exporter               # The namespace is npu-exporter.
  selector:
    matchLabels:
      app: npu-exporter-svc      # Label of the NPU Exporter service
- Run the following commands in sequence to interconnect NPU Exporter with Prometheus using Prometheus Operator:
kubectl apply -f servicemonitor.yaml
kubectl apply -f npu-exporter-svc.yaml
- Run the following command to check whether NPU Exporter is successfully interconnected with Prometheus Operator:
kubectl get svc -A|grep npu-exporter
If the following information is displayed, NPU Exporter is successfully interconnected with Prometheus Operator.
npu-exporter   npu-exporter   ClusterIP   10.98.xx.xx   <none>   8082/TCP   31s
- Run the following command to check whether Prometheus Operator is successfully interconnected with Prometheus:
kubectl get servicemonitor -A|grep npu-exporter
If the following information is displayed, Prometheus Operator is successfully interconnected with Prometheus.
kube-system   npu-exporter   55s
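A ServiceMonitor selects the NPU Exporter Service by label, so if the labels in the two files differ, both checks above can still pass while no metrics are collected. The following is a minimal sketch for comparing the labels before applying the files. The helper names app_label and labels_match are hypothetical, and the sketch assumes that the first app: entry in each file is the label of interest. It uses a naive text match, not a YAML parser.

```shell
#!/bin/sh
# Hypothetical helpers for illustration only.

# Print the value of the first "app:" entry in a YAML file,
# ignoring leading indentation and any trailing comment.
app_label() {
    sed -n 's/^[[:space:]]*app:[[:space:]]*\([^[:space:]#]*\).*$/\1/p' "$1" | head -n 1
}

# Succeed only if both files carry the same first "app:" value.
labels_match() {
    [ "$(app_label "$1")" = "$(app_label "$2")" ]
}

# Example, run in the directory that holds both files:
#   labels_match npu-exporter-svc.yaml servicemonitor.yaml || echo "label mismatch"
```

Once the Service is created, running kubectl get svc -n npu-exporter -l app=npu-exporter-svc lists it only if it carries that label, which gives the same confirmation from the cluster side.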
- Log in to Prometheus and view the monitoring data.
- Open the browser.
- Enter http://<IP address of the management node>:<Port number> in the browser address box and press Enter.
Find the nodePort field in the prometheus.yaml file. The value of this field is the port number of the Prometheus service, which is 30003 by default.
- Choose NPU-related labels to view corresponding data.