iCache Data Verification Fault

Symptom

The device slog file (report/*/slog/dev-os-id/[run|debug]/device-os/device-os_*.log) contains the keyword [fault_manager] event_id: [0x80C98000]. Error entries similar to the following are also recorded:

2024-04-22-09-06-17/hisi_logs/device-2/20240422090623-533810000/log/ts.log:5177:[ERROR] TSCH(-1,null):2024-04-20-17:02:52.772.875 35906 (dieid:0,cpuid:0) aicore.c:767 stars_print_error_pc_icache_and_hbm_info: stat for dump pc start, aiv_id=47, icache_miss_num=8161, hbm_miss_num=0, compare_num=32, compare_fail_num=
[ERROR] TSCH(-1,null):2024-09-04-00:12:52.986.322 438 (dieid:0,cpuid:0) aicore_icache_plat.c:848 check_error_pc_icache_and_hbm_info: stat for dump pc start, aic_id=1, icache_miss_num=8176, hbm_miss_num=0, compare_num=17, compare_fail_num=0  
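
The keyword search described in the symptom can be sketched in Python. This is a minimal illustration, not a vendor tool: the helper names find_fault_lines and scan_files are hypothetical, and tolerating variable whitespace and letter case in the keyword is an assumption about how the log is formatted.

```python
import re

# Keyword taken from the Symptom section. Allowing flexible whitespace and
# ignoring case are assumptions, not documented log-format guarantees.
EVENT_RE = re.compile(
    r"\[fault_manager\]\s*event_id:\s*\[0x80C98000\]", re.IGNORECASE
)


def find_fault_lines(lines):
    """Return (line_number, text) pairs for lines that carry the fault keyword."""
    return [
        (number, line.rstrip("\n"))
        for number, line in enumerate(lines, start=1)
        if EVENT_RE.search(line)
    ]


def scan_files(paths):
    """Map each given log path to its matching lines; files without hits are omitted."""
    hits = {}
    for path in paths:
        # errors="replace" keeps the scan going if a log holds undecodable bytes.
        with open(path, encoding="utf-8", errors="replace") as handle:
            found = find_fault_lines(handle)
        if found:
            hits[path] = found
    return hits
```

Pass scan_files the device-os_*.log paths collected from the report directory; each hit gives the line number at which the event was logged.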

Fault Root Cause

Locate the error entry containing "stat for dump pc start" and check the value of compare_fail_num. A nonzero value indicates that an iCache memory bit error has occurred.
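
The compare_fail_num check can be sketched as follows. This is an illustration only: the function names parse_icache_stats and has_icache_bit_error are hypothetical, and the field names and the "stat for dump pc start" marker are taken from the sample log entries in this article rather than from a documented log format.

```python
import re

# Counter names as they appear in the sample log entries in this article.
FIELD_RE = re.compile(
    r"(aic_id|aiv_id|icache_miss_num|hbm_miss_num|compare_num|compare_fail_num)=(\d+)"
)


def parse_icache_stats(line):
    """Return the counters on a 'stat for dump pc start' line as a dict.

    Returns None when the line is not such an entry.
    """
    if "stat for dump pc start" not in line:
        return None
    return {name: int(value) for name, value in FIELD_RE.findall(line)}


def has_icache_bit_error(line):
    """Return True if compare_fail_num is nonzero and False if it is 0.

    Returns None when the line is not a stats entry or the value is
    missing (for example, when the log line was truncated).
    """
    stats = parse_icache_stats(line)
    if stats is None or "compare_fail_num" not in stats:
        return None
    return stats["compare_fail_num"] != 0
```

A result of None means the entry cannot be judged from that line alone, so inspect the original log rather than treating it as healthy.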

Solution

Look up the event in the Health Management Fault Definition of the corresponding version. The iCache memory bit error is described as follows (only key fields are listed).

Event ID

0x80C98000

Fault Name

The AI Core instruction data fails to be verified.

Fault Description/Possible Cause

The iCache data is inconsistent with the GM data. The possible causes are as follows:

  1. The iCache data is modified.
  2. The GM data is modified.

Impact

The current AI task fails. If the AI Core is not restored, subsequent AI tasks also fail.

Automatic Fault Resolution Mode

  1. TSFW reports faults to the fault management module through TSDrv.
  2. TSFW records error logs.
  3. TSFW returns a task failure message through the service plane.
  4. TSFW resets the AIC. If the reset succeeds, TSFW reports through TSDrv to clear the fault event. If the reset fails, the core is isolated (an isolated core is no longer used for service scheduling) and an error log is recorded.

System Handling Suggestion

  1. Exit the AI training job and run it again, or send the inference request again.
  2. If the AI task fails again, reset the SoC. If the fault persists after the reset, return the device to the factory for repair.