Real-Time Device Status

Function

Queries the real-time status of a running device.

Parameters

Run either of the following commands to view the parameters of the real-time device status query:

ascend-dmi -i -h

ascend-dmi -i --help

Table 1 lists only the parameters specific to this command. For details about the common parameters, see Common Parameters.

Table 1 Parameter description

| Parameter | Description | Mandatory |
|---|---|---|
| [-i, --info] | Displays the real-time status of a device. | Yes |
| [-b, --brief] | Displays basic information about a chip. | No |
| [-dt, --dt, --detail] | Displays detailed information about a chip. | No |
| Leave --dt and -b unspecified | Displays basic information about a chip by default. | No |
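
The flag combinations in Table 1 can be captured in a small wrapper. The following is a minimal sketch rather than part of the tool: the function name dmi_status_cmd and the mode names brief, detail, and default are hypothetical, and only the flags -i, -b, and --dt are taken from the table.

```shell
#!/bin/sh
# Hypothetical helper (not part of ascend-dmi): maps a mode name to the
# flags listed in Table 1 and prints the resulting command line.
dmi_status_cmd() {
    case "${1:-default}" in
        brief)   echo "ascend-dmi -i -b" ;;
        detail)  echo "ascend-dmi -i --dt" ;;
        default) echo "ascend-dmi -i" ;;   # basic information by default, per Table 1
        *)       echo "unknown mode: $1" >&2; return 1 ;;
    esac
}
```

For example, dmi_status_cmd detail prints ascend-dmi -i --dt, which can then be run on a host where the tool is installed.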

Example

  • Query detailed information about a chip.

    ascend-dmi -i --dt

    The following examples show the chip details returned for each product type. If the corresponding information is returned, the tool is running properly.

    1. Inference server
      Figure 1 Example of querying real-time device status (inference server)
      • When you run the ascend-dmi -i --dt command, the Memory Information field shows DDR or on-chip memory information.
      • When you run the ascend-dmi -i command, the Used Memory field shows DDR or on-chip memory information.
    2. Training server
      Figure 2 Example of querying real-time device status (training server)
    3. Training card
      Figure 3 Example of querying real-time device status (Atlas 300T training card (model 9000))
    4. Atlas 200I A2 accelerator module
      Figure 4 Example of querying real-time device status (Atlas 200I DK A2 developer kit)
    5. Atlas 200 AI accelerator module
      Figure 5 Example of querying real-time device status (Atlas 200 AI accelerator module (RC))
      Figure 6 Example of querying real-time device status (Atlas 200 AI accelerator module (EP))

    Table 2 describes the server parameters in the preceding figures.

    Table 2 Parameter description

    Parameter

    Description

    Product

    Type

    Indicates the chip model.

    Training server

    NPU Count

    Indicates the number of NPUs.

    Card Quantity

    Indicates the number of cards.

    Standard card

    Type

    Indicates the standard card model.

    Card Manufacturer

    Indicates the card manufacturer.

    Card Serial Number

    Indicates the serial number of the card.

    Card ID

    Indicates the ID of the card.

    Real-time Card Power (W)

    Indicates the real-time power consumption in W.

    Device Count

    Indicates the number of devices (NPUs).

    Chip Name

    Indicates the chip name.

    Standard card and training server

    Device ID

    Indicates the logical ID of the chip.

    Chip ID

    Indicates the chip ID.

    DIE ID

    Indicates the DIE ID of a chip.

    AI Core Information

    Indicates the AI Core information, including:

    • AI Core Count: number of AI Cores
    • AI Core Usage (%): AI Core usage
    • Cube Count: number of cubes
    • Vector Count: number of vectors

    CPU Information

    Indicates the CPU information, including:

    • AI CPU Count: number of AI CPUs
    • AI CPU Usage (%): AI CPU usage
    • Control CPU Count: number of control CPUs
    • Control CPU Usage (%): control CPU usage
    • Control CPU Frequency (MHz): control CPU frequency in MHz

    Memory Information

    Indicates the memory information, including:

    • Total (MB): total memory capacity in MB
    • Used (MB): used memory in MB
    • Bandwidth Usage (%): memory bandwidth usage
    • Frequency (MHz): memory frequency in MHz

    Power Information

    Indicates the power consumption information, including:

    • Real-time Power (W): real-time power consumption (available only when the command is executed on a training server)
    • Rated Power (W): rated power of the processor
      NOTE:

      Atlas A3 training products and Atlas A3 inference products contain multiple NPUs. The power consumption shown in the JSON file is reported at the device level, but the value actually represents the power consumption of the entire NPU.

    Temperature (°C)

    Indicates the chip temperature.

    Voltage (V)

    Indicates the voltage in V.

    Health

    Indicates the health status of the chip.

    PCIe Information

    Indicates the PCIe information, including:

    • Domain: PCIe domain
    • Bus: PCIe bus number
    • Device: PCIe device ID
    • Bus ID: PCIe bus address
    • Subvendor ID: subsystem vendor ID
    • Subdevice ID: subsystem device ID
    • LnkCap Speed: maximum link speed
    • LnkCap Width: maximum link width
    • LnkSta Speed: current link speed
    • LnkSta Width: current link width
    • CPU Affinity: CPU affinity

    Error Information

    Displays error information.

    Error Count

    Indicates the number of errors.

    ECC Information

    Displays ECC information.

    DDR/SRAM/HBM/NPU

    Indicates the memory type of the card. The options are as follows:

    • DDR
    • SRAM
    • HBM
    • NPU

    The following information is also contained:

    • Single-Bit Error Count: number of single-bit errors
    • Double-Bit Error Count: number of double-bit errors

    Standard card and training server

    When you run the ascend-dmi -i --dt command, the following situations may occur:

    • If you run this command as a non-root user, "<Access denied. Please switch to root and try again.>" is displayed for some check items.
    • If you run this command in a container, "Unknown" is displayed for some check items. To obtain the information, exit the container and run the command again.
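
Both situations can be detected before running the detailed query. The sketch below is an illustration only: dmi_preflight is a hypothetical helper, and treating the presence of /.dockerenv as a sign of running in a container is an assumption that holds for Docker but not necessarily for other container runtimes.

```shell
#!/bin/sh
# Hypothetical pre-flight check (not part of ascend-dmi). Prints one warning
# per condition that may hide fields in the output of `ascend-dmi -i --dt`.
# $1: numeric user ID (defaults to the current user's ID).
# $2: marker file whose presence is taken to mean "inside a container".
#     /.dockerenv is a Docker convention and only a heuristic (assumption).
dmi_preflight() {
    user_id="${1:-$(id -u)}"
    marker="${2:-/.dockerenv}"
    rc=0
    if [ "$user_id" -ne 0 ]; then
        echo "warning: not root; some items may show 'Access denied'" >&2
        rc=1
    fi
    if [ -e "$marker" ]; then
        echo "warning: container detected; some items may show 'Unknown'" >&2
        rc=1
    fi
    return "$rc"
}
```

Calling dmi_preflight with no arguments checks the current user and the default marker file. A zero exit status means that neither condition was detected.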
  • Query basic information about a chip.

    ascend-dmi -i -b

    The following examples show the basic chip information returned for each product type. If the corresponding information is returned, the tool is running properly.

    1. Inference server
      Figure 7 Example of querying real-time device status (inference server)
    2. Training server
      Figure 8 Example of querying real-time device status (training server)
    3. Atlas 300T training card
      Figure 9 Example of querying real-time device status (Atlas 300T Pro training card (model 9000))
    4. Atlas 200I A2 accelerator module
      Figure 10 Example of querying real-time device status (Atlas 200I DK A2 developer kit)
    5. Atlas 200 AI accelerator module
      Figure 11 Example of querying real-time device status (Atlas 200 AI accelerator module)

    Table 3 describes the server parameters in the preceding figures.

    Table 3 Parameter description

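    | Parameter | Description | Product |
    |---|---|---|
    | Type | Indicates the standard card model. | Standard card |
    | Card | Indicates the card ID. | Standard card |
    | NPU Count | Indicates the number of NPUs. | Standard card |
    | Real-time Card Power | Indicates the actual power consumption of the card. | Standard card |
    | Chip | Indicates the chip number. | Standard card |
    | Name | Indicates the chip name. | Standard card |
    | Type | Indicates the chip model. | Training server |
    | NPU Count | Indicates the number of NPUs. | Training server |
    | Chip Name | Indicates the chip name. | Training server |
    | Power | Indicates the power consumption. | Training server |
    | Health | Indicates the chip health status. | Standard card and training server |
    | Used Memory | Indicates the memory used. | Standard card and training server |
    | Temperature | Indicates the current temperature of the chip. | Standard card and training server |
    | Voltage | Indicates the current voltage of the chip. | Standard card and training server |
    | Device ID | Indicates the logical ID of the chip. | Standard card and training server |
    | Bus ID | Indicates the PCIe bus address. | Standard card and training server |
    | AI Core Usage | Indicates the AI Core usage of the chip. | Standard card and training server |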
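
To observe how the values in Table 3 change over time, the brief query can be repeated at a fixed interval. The following is a minimal sketch under stated assumptions: poll_status is a hypothetical helper, not an ascend-dmi feature, and the count and interval values in the commented example are arbitrary.

```shell
#!/bin/sh
# Hypothetical helper (not part of ascend-dmi): runs a command COUNT times,
# pausing INTERVAL seconds between runs, and stops on the first failure.
# Usage: poll_status COUNT INTERVAL COMMAND [ARGS...]
poll_status() {
    count="$1"
    interval="$2"
    shift 2
    i=0
    while [ "$i" -lt "$count" ]; do
        "$@" || return 1
        i=$((i + 1))
        # Sleep only between runs, not after the last one.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
    return 0
}

# Example (commented out; requires ascend-dmi on the host):
# poll_status 5 2 ascend-dmi -i -b
```

Passing the command as arguments keeps the helper independent of the tool, so it can be tried with any command before it is used with ascend-dmi.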