How Do I Obtain Fault IDs from Device Logs and Rectify RAS Hardware Faults?
Figure 1 Troubleshooting process
- On the host server, use the msnpureport tool to export device logs, including slog, syslog, and black box logs. Run the tool in a directory on which you have read, write, and execute permissions (for example, /var/log/npu/report). The following is an example command. /usr/local/Ascend is the default installation path of the driver package; replace it with the actual path if it differs.
/usr/local/Ascend/driver/tools/msnpureport -f
The exported device logs are stored in the /var/log/npu/report directory by default.
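The export step can be wrapped in a small shell function. This is a minimal sketch, assuming the default driver path and the example working directory given above; the function name export_device_logs and its two parameters are hypothetical and not part of the driver package, and only the -f option shown above is used.

```shell
# Sketch only: export_device_logs is a hypothetical helper, not a driver tool.
# $1: working directory (defaults to the example directory from this topic)
# $2: path to msnpureport (defaults to the default driver installation path)
export_device_logs() {
    work_dir="${1:-/var/log/npu/report}"
    tool="${2:-/usr/local/Ascend/driver/tools/msnpureport}"

    # Create the working directory if it does not exist yet.
    mkdir -p "$work_dir" || return 1

    # The tool must run in a directory with read, write, and execute permissions.
    if [ ! -r "$work_dir" ] || [ ! -w "$work_dir" ] || [ ! -x "$work_dir" ]; then
        echo "insufficient permissions on $work_dir" >&2
        return 1
    fi

    # Run the tool from inside the working directory. The subshell keeps the
    # caller's current directory unchanged.
    ( cd "$work_dir" && "$tool" -f )
}
```

Calling export_device_logs with no arguments uses the defaults; pass the actual driver path as the second argument if the package is installed elsewhere.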
- From the slog files collected in step 1, go to the report/*/slog/dev-os-id/[run|debug]/device-os/ directory and open the device-os_*.log files to find the system logs for the corresponding device around the time when the error occurred. Check whether the keyword event_id exists in the logs. If it does not, proceed to step 3 to continue troubleshooting. If it does, search for the Health Management Fault Definition of the corresponding product and follow the troubleshooting methods provided there.
If the timestamps in the slog files do not cover the time when the error occurred, the older logs may have been overwritten or deleted. In this case, the event_id related to the error cannot be found.
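The event_id search described above can be sketched as a short shell function. This is an illustration under stated assumptions: the function name find_event_ids is hypothetical, the logs are assumed to be plain text, and the default directory is the example export directory from this topic.

```shell
# Sketch only: find_event_ids is a hypothetical helper.
# $1: export directory that contains the collected logs
find_event_ids() {
    report_dir="${1:-/var/log/npu/report}"

    # Select device-os_*.log files under any slog/<dev-os-id>/run or debug
    # directory, then print every line that contains the keyword event_id,
    # prefixed with the file name and line number.
    find "$report_dir" -type f \
        -path '*/slog/*/device-os/device-os_*.log' \
        -exec grep -Hn 'event_id' {} \;
}
```

Empty output means that no event_id was found in the exported system logs. The function does not filter by time, so compare the timestamps of any matching lines with the time when the error occurred.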
You can run the npu-smi command to query the health status of a specified chip. If an RAS fault occurs, you can query the event IDs of the last eight faults, which can be used as a reference for fault locating.
The following is an example of the npu-smi command, where id indicates the device ID and chip_id indicates the chip ID. Run the npu-smi info command first to obtain both IDs:
npu-smi info -t health -i id -c chip_id
The following is an example of the query result:

- From the black box logs collected in step 1, find the logs in the report/*/hisi_logs directory that correspond to the device and the time period when the error occurred. Check whether the keyword Hardware Error exists in the logs. If it does not, no hardware fault has been identified. If it does, an unknown hardware fault has occurred; obtain the logs and contact technical support for further fault locating.
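The black box log check can be sketched in the same way. This is a minimal example, assuming the files under hisi_logs that matter are plain text; the function name find_hardware_errors is hypothetical, and the default directory is the example export directory from this topic.

```shell
# Sketch only: find_hardware_errors is a hypothetical helper.
# $1: export directory that contains the collected logs
find_hardware_errors() {
    report_dir="${1:-/var/log/npu/report}"

    # Search every file below any hisi_logs directory for the keyword
    # "Hardware Error" and print matches with file name and line number.
    find "$report_dir" -type f -path '*/hisi_logs/*' \
        -exec grep -Hn 'Hardware Error' {} \;
}
```

Empty output corresponds to the case where no hardware fault has been identified; any matching lines should be collected together with the exported logs before contacting technical support.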