HBM Multi-Bit ECC Fault

Symptom

The device slog log (report/*/slog/dev-os-id/[run|debug]/device-os/device-os_*.log) contains the keyword [fault_manager] event_id: [0x80E01809].
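The keyword check above can be scripted. The following is a minimal sketch, not an official tool: the function names find_event_lines and scan_report are hypothetical, the directory layout is inferred from the path pattern above, and the regular expression assumes the keyword appears on a single line in the form shown.

```python
import re
from pathlib import Path

EVENT_ID = "0x80E01809"

# Assumption: the keyword appears on one line as
# "[fault_manager] event_id: [0x...]", as quoted in the symptom above.
PATTERN = re.compile(r"\[fault_manager\].*?event_id:\s*\[(0x[0-9A-Fa-f]+)\]")


def find_event_lines(lines, event_id=EVENT_ID):
    """Return (line_number, text) pairs whose event_id matches event_id."""
    hits = []
    for number, text in enumerate(lines, start=1):
        match = PATTERN.search(text)
        if match and match.group(1).lower() == event_id.lower():
            hits.append((number, text.rstrip("\n")))
    return hits


def scan_report(report_root):
    """Scan device-os logs under a report directory.

    Assumption: report_root is the "report" directory from the path
    pattern above, and both the run and debug subdirectories may exist.
    """
    results = {}
    root = Path(report_root)
    for sub in ("run", "debug"):
        for log in root.glob(f"*/slog/*/{sub}/device-os/device-os_*.log"):
            with log.open(errors="replace") as handle:
                hits = find_event_lines(handle)
            if hits:
                results[str(log)] = hits
    return results
```

Calling scan_report("report") returns a dictionary that maps each matching log file to the lines that contain the event ID; an empty dictionary means the keyword was not found in any device-os log.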

The device black box logs (in the report/*/hisi_logs directory) contain the keyword Hardware Error, for example:

/hisi_logs/device-2/20240420162417-271713000/bbox/kbox.txt:2014[2454.830692] {6}[Hardware Error] Hardware error from APEI Generic Hardware Error Source: 0
/hisi_logs/device-2/20240420162417-271713000/bbox/kbox.txt:2015[2454.830693] {6}[Hardware Error]event severity: recoverable
/hisi_logs/device-2/20240420162417-271713000/bbox/kbox.txt:2016[2454.830694] {6}[Hardware Error] Error 0, type: recoverable
/hisi_logs/device-2/20240420162417-271713000/bbox/kbox.txt:2017[2454.830696] {6}[Hardware Error]  section_type: memory error
/hisi_logs/device-2/20240420162417-271713000/bbox/kbox.txt:2018[2454.830697] {6}[Hardware Error]  physical_address: 0x0000101efe36d3c0
/hisi_logs/device-2/20240420162417-271713000/bbox/kbox.txt:2019[2454.830699] {6}[Hardware Error]  node: 2 card: 259 module: 51 rank: 1 bank: 10 row: 30691  column: 56
/hisi_logs/device-2/20240420162417-271713000/bbox/kbox.txt:2020[2454.830701] {6}[Hardware Error]  error_type: 3, multi-bit ECC
/hisi_logs/device-2/20240420162417-271713000/bbox/kbox.txt:2024[2454.830702] {6}[Hardware Error]  DIMM location: not present. DMI handle: 0x0000
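The Hardware Error lines above can be turned into structured fields to confirm that the record is a multi-bit ECC memory error. This is a rough sketch under stated assumptions: the helper names parse_hardware_error and is_multi_bit_ecc are hypothetical, and the parsing relies only on the line layout visible in the excerpt above, so other kbox.txt layouts may need adjustments.

```python
import re

# Captures the text that follows the "[Hardware Error]" tag on a log line.
HW_ERROR = re.compile(r"\[Hardware Error\]:?\s*(.*)$")
# Captures "name: number" pairs on the location line (node, card, row, ...).
LOCATION = re.compile(r"(\w+):\s*(\d+)")


def parse_hardware_error(lines):
    """Collect "key: value" fields from [Hardware Error] lines into a dict."""
    fields = {}
    for line in lines:
        match = HW_ERROR.search(line)
        if not match:
            continue
        body = match.group(1).strip()
        if body.startswith("node:"):
            # Location line: several numeric fields on one line.
            fields["location"] = {k: int(v) for k, v in LOCATION.findall(body)}
        elif ":" in body:
            # Split at the first colon only; the rest is kept as the value.
            key, _, value = body.partition(":")
            fields[key.strip()] = value.strip()
    return fields


def is_multi_bit_ecc(fields):
    """True if the parsed record is a memory error with a multi-bit ECC type."""
    return (
        fields.get("section_type") == "memory error"
        and "multi-bit ECC" in fields.get("error_type", "")
    )
```

Passing the lines of one Hardware Error record to parse_hardware_error yields entries such as physical_address and a location dictionary (node, card, module, rank, bank, row, column), which can be recorded alongside the event ID when reporting the fault.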

Fault Root Causes

The error information in the logs indicates that the AI Core error is caused by an HBM multi-bit ECC fault.

Solution

Look up the event ID in the Health Management Error Definition of the corresponding version. The HBM multi-bit ECC fault is described as follows (only some key fields are listed):

Event ID

0x80E01809

Event Name

Multi-bit ECC errors during the HBM patrol

Fault Description/Possible Cause

Multi-bit ECC errors are detected during HBMC patrol scrubbing or demand scrubbing. Possible causes are that some HBM chips have failed or that the HBM cannot store data reliably.

Impact

1. If a faulty address is accessed during startup, the startup may fail.

2. If a service accesses a faulty address, erroneous data is returned, which may cause the service to fail.

3. If the service does not access a faulty address, the current service is not affected.

System Action

1. Reports a notification event to the device management module, records the error address, attempts online isolation, and performs offline isolation after the device is restarted.

2. Records the error log.

System Handling Suggestion

No operation is required.