Querying Real-Time Device Status

Function

Check the running status of the device in real time.

Commands for Querying Parameters

Run either of the following commands to list the parameters available for the real-time device status query:

ascend-dmi -i -h

ascend-dmi -i --help

Table 1 describes the parameters.

Table 1 Parameter description

Parameter | Description | Mandatory
[-i, --info] | Displays the real-time status of a device. | Yes
[-b, --brief] | Displays basic information about a processor. | No
[-dt, --dt, --detail] | Displays detailed information about a processor. | No
Leave --dt and -b unspecified | Displays basic information about the processor by default. | No
[-fmt, --fmt, --format] | Specifies the output format. The value can be normal or json. If this parameter is not specified, the default value normal is used. | No
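
As a minimal sketch of how the parameters in Table 1 combine, the following shell function assembles a status query from a view (brief or detail) and an output format. The function name build_status_cmd and its argument names are illustrative and not part of ascend-dmi. The sketch also assumes that the format value follows --fmt as a separate argument, which the table does not state explicitly. It only prints the command line and does not run it.

```shell
#!/bin/sh
# Hypothetical helper (not part of ascend-dmi): prints a status query
# command line built from a view and an output format.
build_status_cmd() {
    view="${1:-brief}"      # brief | detail (brief mirrors the tool default)
    format="${2:-normal}"   # normal | json (normal mirrors the tool default)

    case "$view" in
        brief)  view_flag="-b" ;;
        detail) view_flag="--dt" ;;
        *) echo "unknown view: $view" >&2; return 1 ;;
    esac

    case "$format" in
        normal|json) ;;
        *) echo "unknown format: $format" >&2; return 1 ;;
    esac

    # Assumption: the format value is passed as a separate argument after --fmt.
    echo "ascend-dmi -i $view_flag --fmt $format"
}
```

For example, build_status_cmd detail json prints ascend-dmi -i --dt --fmt json, and an unsupported view or format returns a non-zero status without printing a command.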

Example

  • Query detailed information about a processor.

    ascend-dmi -i --dt

    The following figures show examples of the processor details returned on each type of server. If comparable information is returned, the tool is running properly.

    1. Inference server (The Ascend 310 AI Processor is used as an example.)
      Figure 1 Example of querying real-time device status (inference server)
    2. Training server
      Figure 2 Example of querying real-time device status (training server)
    3. Atlas 300T training card (model 9000)
      Figure 3 Example of querying real-time device status (Atlas 300T training card (model 9000))
    4. Atlas 200 AI accelerator module
      Figure 4 Example of querying real-time device status (Atlas 200 AI accelerator module (RC))
      Figure 5 Example of querying real-time device status (Atlas 200 AI accelerator module (EP))

    Table 2 describes the server parameters in the preceding figures.

    Table 2 Parameters

    Parameter | Description | Product
    Type | Indicates the processor model. | Training server
    NPU Count | Indicates the number of NPUs. | Training server
    Card Quantity | Indicates the number of cards. | Standard card
    Type | Indicates the standard card model. | Standard card
    Card Manufacturer | Indicates the card manufacturer. | Standard card
    Card Serial Number | Indicates the serial number of the card. | Standard card
    Card ID | Indicates the ID of the card. | Standard card
    Real-time Card Power (W) | Indicates the real-time power consumption of the card in W. | Standard card
    Device Count | Indicates the number of devices (NPUs). | Standard card
    Chip Name | Indicates the processor name. | Standard card and training server
    Device ID | Indicates the ID of the device. | Standard card and training server
    Chip ID | Indicates the processor ID. | Standard card and training server
    DIE ID | Indicates the die ID of the processor. | Standard card and training server
    AI Core Information | Displays the AI Core information, which includes the following: AI Core Count: number of AI Cores; AI Core Usage (%): AI Core usage; Cube Count: number of Cube units; Vector Count: number of Vector units. | Standard card and training server
    CPU Information | Displays the CPU information, which includes the following: AI CPU Count: number of AI CPUs; AI CPU Usage (%): AI CPU usage; Control CPU Count: number of Control CPUs; Control CPU Usage (%): Control CPU usage; Control CPU Frequency (MHz): frequency of the Control CPU in MHz. | Standard card and training server
    Memory Information | Displays the memory information, which includes the following: Total (MB): total memory capacity in MB; Used (MB): used memory in MB; Bandwidth Usage (%): memory bandwidth usage; Frequency (MHz): memory frequency in MHz. | Standard card and training server
    Power Information | Displays the power consumption information, which includes the following: Real-time Power (W): real-time power consumption in W (available only when the command is executed on a training server). | Standard card and training server
    Temperature (°C) | Indicates the processor temperature in °C. | Standard card and training server
    PCIe Information | Displays the PCIe information, which includes the following: Domain: PCIe domain; Bus: PCIe bus number; Device: PCIe device ID; Bus ID: PCIe bus address; Subvendor ID: subsystem vendor ID; Subdevice ID: subsystem device ID; LnkCap Speed: maximum link speed; LnkCap Width: maximum link width; LnkSta Speed: current link speed; LnkSta Width: current link width; CPU Affinity: CPU affinity. | Standard card and training server
    Error Information | Displays error information. | Standard card and training server
    Error Count | Indicates the number of errors. | Standard card and training server
    ECC Information | Displays ECC information. | Standard card and training server
    DDR | Indicates the memory type of the card. The options are DDR, SRAM, HBM, and NPU. The following information is also contained: Single-Bit Error Count: number of single-bit errors; Double-Bit Error Count: number of double-bit errors. | Standard card and training server

    When you run the ascend-dmi -i --dt command, the following situations may occur:

    • If you run this command as a non-root user, "<Access denied. Please switch to root and try again.>" is displayed for some check items. To obtain the information, switch to the root user and run the command again.
    • If you run this command in a container, "Unknown" is displayed for some check items. To obtain the information, exit the container and run the command again.
  • Query basic information about a processor.

    ascend-dmi -i -b

    The following figures show examples of the basic processor information returned on each type of server. If comparable information is returned, the tool is running properly.

    1. Inference server (The Ascend 310 AI Processor is used as an example.)
      Figure 6 Example of querying real-time device status (inference server)
    2. Training server
      Figure 7 Example of querying real-time device status (training server)
    3. Atlas 300T Pro training card (model 9000)
      Figure 8 Example of querying real-time device status (Atlas 300T Pro training card (model 9000))
    4. Atlas 200 AI accelerator module
      Figure 9 Example of querying real-time device status (Atlas 200 AI accelerator module)

    Table 3 describes the server parameters in the preceding figures.

    Table 3 Parameter description

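    Parameter | Description | Product
    Type | Indicates the standard card model. | Standard card
    Card | Indicates the card ID. | Standard card
    NPU Count | Indicates the number of NPUs. | Standard card
    Real-time Card Power | Indicates the actual power consumption of the card. | Standard card
    Chip | Indicates the processor number. | Standard card
    Name | Indicates the processor name. | Standard card
    Type | Indicates the processor model. | Training server
    NPU Count | Indicates the number of NPUs. | Training server
    Chip Name | Indicates the processor name. | Training server
    Power | Indicates the power consumption. | Training server
    Health | Indicates the processor health status. | Standard card and training server
    Used Memory | Indicates the memory used. | Standard card and training server
    Temperature | Indicates the current temperature of the processor. | Standard card and training server
    Voltage | Indicates the current voltage of the processor. | Standard card and training server
    Device ID | Indicates the processor device number. | Standard card and training server
    Bus ID | Indicates the PCIe bus address. | Standard card and training server
    AI Core Usage | Indicates the AI Core usage of the processor. | Standard card and training server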
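
The notes on running ascend-dmi -i --dt as a non-root user or in a container mention two placeholder strings that replace some check items. As a rough sketch, the shell function below counts the lines of captured output that contain either string. The function name count_restricted_items is hypothetical, and because the exact output layout is not specified here, the check for Unknown is a plain substring match that may also count unrelated lines.

```shell
#!/bin/sh
# Hypothetical helper (not part of ascend-dmi): reads captured command output
# from stdin and counts lines containing the two placeholder strings.
count_restricted_items() {
    report="$(cat)"

    # -F treats the pattern as a fixed string; -c prints the matching line count.
    # "|| true" keeps a zero-match result from aborting scripts run with set -e.
    denied=$(printf '%s\n' "$report" \
        | grep -c -F 'Access denied. Please switch to root and try again.' || true)
    unknown=$(printf '%s\n' "$report" | grep -c -F 'Unknown' || true)

    echo "access_denied=$denied unknown=$unknown"
}
```

A possible use is ascend-dmi -i --dt | count_restricted_items. A non-zero access_denied count suggests rerunning the command as the root user, and a non-zero unknown count suggests rerunning it outside the container.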