Working with Telegraf

This section describes how to install and deploy Telegraf and how to view resource monitoring data in Telegraf. For details about the data, see Telegraf Data Description.

Binary Integration Using Telegraf

Besides the binary integration described here, cluster scheduling also supports source-code integration with Telegraf, which is done by modifying the NPU Exporter open-source code.

  1. (Optional) If the log directory for NPU Exporter does not exist, run the following commands in sequence to create it:
    mkdir -m 750 /var/log/mindx-dl/npu-exporter
    chown hwMindX:hwMindX /var/log/mindx-dl/npu-exporter
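The mkdir command above fails if the directory already exists. The following is a minimal sketch of an idempotent wrapper around the same two commands, assuming a Bash shell. The name ensure_log_dir is a hypothetical helper, not part of NPU Exporter.

```shell
#!/usr/bin/env bash
# Sketch: create the NPU Exporter log directory only if it is missing.
# ensure_log_dir is a hypothetical helper name, not part of NPU Exporter.
# Changing the owner to hwMindX normally requires root privileges.
ensure_log_dir() {
    local dir="${1:-/var/log/mindx-dl/npu-exporter}"
    local owner="${2:-hwMindX:hwMindX}"

    if [ -d "$dir" ]; then
        echo "exists: $dir"
        return 0
    fi
    # Same commands as in the step above; the parent directory must already exist.
    mkdir -m 750 "$dir" || return 1
    chown "$owner" "$dir" || return 1
    echo "created: $dir"
}
```

Calling ensure_log_dir with no arguments uses the path and owner from this step. Both can be overridden, for example ensure_log_dir /tmp/npu-exporter-log "$(id -u):$(id -g)" for a dry run as a non-root user.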
  2. Obtain the NPU Exporter software package from the Ascend Community, decompress it to obtain its binary file npu-exporter, and upload the file to any path (for example, /home/npu_plugin) in the environment.
  3. Run the following command to create the npu_plugin.conf file:
    vi npu_plugin.conf

    Add the NPU Exporter binary file path to the file. The following is an example:

    [[inputs.execd]]
      command = ["/home/npu_plugin/npu-exporter", "-platform=Telegraf", "-poll_interval=10s", "-hccsBWProfilingTime=200"] 
      signal = "none"  
    [[outputs.file]] 
      files=["stdout"]

    Table 1 describes the input parameters of the command field.

    Table 1 Parameters

    | Parameter | Type | Default Value | Value Description | Required (Yes/No) |
    |---|---|---|---|---|
    | -platform | String | Prometheus | Platform to be connected. The options are Prometheus and Telegraf. | Yes |
    | -poll_interval | Duration (integer) | 1s | Interval for reporting Telegraf data. This parameter takes effect only when -platform is set to Telegraf. | No |
    | -hccsBWProfilingTime | Integer | 200 | Duration for sampling the HCCS link bandwidth. The value ranges from 1 to 1000, in ms. | No |
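As an alternative to editing the file in vi, the configuration can be generated from a script. The following is a minimal sketch, assuming a Bash shell. The name write_npu_plugin_conf is a hypothetical helper; the flags it writes and the 1 to 1000 ms range check are taken from Table 1, and the exporter path is assumed not to contain double quotes.

```shell
#!/usr/bin/env bash
# Sketch: generate npu_plugin.conf non-interactively.
# write_npu_plugin_conf is a hypothetical helper, not part of NPU Exporter or Telegraf.
write_npu_plugin_conf() {
    local exporter="$1"            # path to the npu-exporter binary
    local poll="${2:-1s}"          # -poll_interval (Table 1 default: 1s)
    local hccs_ms="${3:-200}"      # -hccsBWProfilingTime in ms (Table 1 default: 200)
    local out="${4:-npu_plugin.conf}"

    # Reject anything that is not a plain non-negative integer.
    case "$hccs_ms" in
        ''|*[!0-9]*) echo "hccsBWProfilingTime must be an integer" >&2; return 1 ;;
    esac
    # Enforce the documented range of 1 to 1000 ms.
    if [ "$hccs_ms" -lt 1 ] || [ "$hccs_ms" -gt 1000 ]; then
        echo "hccsBWProfilingTime must be in 1..1000 ms" >&2
        return 1
    fi

    cat > "$out" <<EOF
[[inputs.execd]]
  command = ["$exporter", "-platform=Telegraf", "-poll_interval=$poll", "-hccsBWProfilingTime=$hccs_ms"]
  signal = "none"
[[outputs.file]]
  files = ["stdout"]
EOF
}
```

For example, write_npu_plugin_conf /home/npu_plugin/npu-exporter 10s 200 writes a file equivalent to the example in this step to npu_plugin.conf in the current directory.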

  4. (Optional) If Telegraf is not installed, perform the following steps to install Telegraf.
    • (Recommended) Offline installation
      1. Go to the Telegraf download page.
      2. Select the version to be installed and download it, for example, telegraf-1.34.3_linux_arm64.tar.gz.
      3. Upload the installation package to any directory on the server.
      4. Decompress the package in the directory where it is stored. Example:
        tar -zxvf telegraf-1.34.3_linux_arm64.tar.gz
      5. Go to the decompressed directory, locate the telegraf binary file in the ./usr/bin directory, and copy the file to any directory (for example, /home/npu_plugin).
    • Online installation
      1. Go to the Telegraf download page.
      2. Select the OS and Telegraf version from the drop-down list.
        Figure 1 Downloading Telegraf
      3. Copy the installation command from the dialog box to the target device and execute it to complete the installation.
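The decompress-and-copy part of the offline installation can also be scripted. The following is a minimal sketch, assuming a Bash shell with tar and find available and an archive that has already been uploaded to the server. The name install_telegraf_offline is a hypothetical helper. It searches for usr/bin/telegraf under the extracted tree instead of assuming the name of the top-level directory.

```shell
#!/usr/bin/env bash
# Sketch of the offline steps: unpack the archive and copy the telegraf
# binary to a target directory. install_telegraf_offline is a hypothetical
# helper name, not part of Telegraf.
install_telegraf_offline() {
    local archive="$1"                     # e.g. telegraf-1.34.3_linux_arm64.tar.gz
    local dest="${2:-/home/npu_plugin}"    # directory that will hold the binary
    local workdir bin rc

    workdir="$(mktemp -d)" || return 1
    tar -zxf "$archive" -C "$workdir" || { rm -rf "$workdir"; return 1; }

    # Locate usr/bin/telegraf wherever the archive's top-level directory puts it.
    bin="$(find "$workdir" -type f -path '*/usr/bin/telegraf' | head -n 1)"
    if [ -z "$bin" ]; then
        echo "telegraf binary not found in $archive" >&2
        rm -rf "$workdir"
        return 1
    fi

    mkdir -p "$dest" && cp "$bin" "$dest/telegraf"
    rc=$?
    rm -rf "$workdir"    # remove the temporary extraction directory
    return "$rc"
}
```

For example, install_telegraf_offline telegraf-1.34.3_linux_arm64.tar.gz /home/npu_plugin leaves the binary at /home/npu_plugin/telegraf.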
  5. Run Telegraf.
    • If offline installation is used, run the following command to start Telegraf:
      ./telegraf --config npu_plugin.conf
    • If online installation is used, run the following command to start Telegraf:
      telegraf --config npu_plugin.conf
    After Telegraf starts successfully, output similar to the following is displayed. The last line, which begins with Ascend910-0, is the data of the Ascend AI processor.
      2023-09-15T10:11:31Z I! Loading config file: ../npu_plugin.conf
      2023-09-15T10:11:31Z I! Starting Telegraf 1.26.0
      2023-09-15T10:11:31Z I! Available plugins: 236 inputs, 9 aggregators, 27 processors, 22 parsers, 57 outputs, 2 secret-stores
      2023-09-15T10:11:31Z I! Loaded inputs: execd
      2023-09-15T10:11:31Z I! Loaded aggregators: 
      2023-09-15T10:11:31Z I! Loaded processors: 
      2023-09-15T10:11:31Z I! Loaded secretstores: 
      2023-09-15T10:11:31Z I! Loaded outputs: file
      2023-09-15T10:11:31Z I! Tags enabled: host=xxx
      2023-09-15T10:11:31Z I! [agent] Config: Interval:10s, Quiet:false, Hostname:"xxx", Flush Interval:10s
      2023-09-15T10:11:31Z I! [inputs.execd] Starting process: /xxx/npu-exporter [-platform=Telegraf -poll_interval=1m]
      Ascend910-0,host=xxx npu_chip_link_speed=104857600000i,npu_chip_roce_rx_cnp_pkt_num=0i,npu_chip_roce_unexpected_ack_num=0i,npu_chip_optical_vcc=3245.1,npu_chip_optical_rx_power_1=0.8585,npu_chip_info_hbm_used_memory=0i,npu_chip_mac_rx_pause_num=0i,npu_chip_roce_tx_all_pkt_num=0i,npu_chip_roce_tx_cnp_pkt_num=0i,npu_chip_info_temperature=46,npu_chip_mac_rx_bad_pkt_num=0i,npu_chip_roce_tx_err_pkt_num=0i,npu_chip_optical_rx_power_3=0.8466,npu_chip_optical_rx_power_0=0.7933,npu_chip_info_network_status=0i,npu_chip_mac_rx_pfc_pkt_num=0i,npu_chip_mac_tx_bad_pkt_num=0i,npu_chip_roce_rx_all_pkt_num=0i,npu_chip_mac_rx_bad_oct_num=0i,npu_chip_optical_tx_power_1=0.9162,npu_chip_info_utilization=0,npu_chip_info_power=73.9000015258789,npu_chip_info_link_status=1i,npu_chip_info_bandwidth_rx=0,npu_chip_mac_tx_pfc_pkt_num=0i,npu_chip_roce_rx_err_pkt_num=0i,npu_chip_roce_verification_err_num=0i,npu_chip_optical_state=1i,npu_chip_info_bandwidth_tx=0,npu_chip_mac_tx_bad_oct_num=0i,npu_chip_roce_out_of_order_num=0i,npu_chip_roce_qp_status_err_num=0i,npu_chip_optical_rx_power_2=0.855,npu_chip_optical_tx_power_0=0.9095,npu_chip_info_hbm_utilization=0,npu_chip_link_up_num=2i,npu_chip_info_health_status=1i,npu_chip_mac_tx_pause_num=0i,npu_chip_roce_new_pkt_rty_num=0i,npu_chip_optical_temp=53,npu_chip_optical_tx_power_2=1.0342,npu_chip_optical_tx_power_3=0.9715 1694772754612200641
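The data line above is in line protocol form: a measurement name with tags, a comma-separated set of field=value pairs, and a timestamp, separated by spaces. The following is a minimal sketch for pulling a single field out of that output, assuming a Bash shell with awk. The name npu_field is a hypothetical helper. It assumes that NPU measurement names begin with Ascend and that field values are numeric with no embedded spaces or commas, as in the sample.

```shell
#!/usr/bin/env bash
# Sketch: print "<measurement> <value>" for one field of each NPU data line
# read on stdin. npu_field is a hypothetical helper name.
npu_field() {
    local field="$1"
    awk -v want="$field" '
        # Assumption: only lines whose measurement starts with "Ascend" carry NPU data.
        /^Ascend/ {
            split($1, head, ",")        # head[1] is the measurement, e.g. Ascend910-0
            n = split($2, kv, ",")      # $2 is the comma-separated field set
            for (i = 1; i <= n; i++) {
                eq = index(kv[i], "=")
                if (substr(kv[i], 1, eq - 1) == want) {
                    val = substr(kv[i], eq + 1)
                    sub(/i$/, "", val)  # drop the integer-type suffix "i"
                    print head[1], val
                }
            }
        }'
}
```

For example, telegraf --config npu_plugin.conf | npu_field npu_chip_info_temperature prints one line per reported sample; for the sample output above, that line is Ascend910-0 46.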