AI Core Hardware Fault

Symptom

The host application log (log/[run|debug]/plog/plog-pid_*.log) contains error messages similar to the following:

plog/plog-377228_20240402154414713.log:70:[ERROR] RUNTIME(377228,ascend-dmi):2024-04-02-15:49:01.883.834 [device_error_proc.cc:1164] 377265 ProcessStarsCoreErrorInfo:The error from device(chipId:2, dield:0), serial number is 4, there is an fftsplus aivector error exception, core id is 34, error code = 0, dump info: pc start: 0x12406ce78f54, current: 0x12406ce7C430, vec error info: 0xef1d17c368, mte error info: 0xf3dff1cf9b, ifu error info: 0x63f70dc00e200, ccu error info: 0x6f6d518f0004d700, cube error info: 0, biu error info: 0, aic error mask: 0x6500020bd000288, para base: 0x1240403e5000.
plog/plog-377228_20240402154414713.log:71:[ERROR] RUNTIME(377228,ascend-dmi):2024-04-02-15:49:01.883.875 [device_error_proc.cc:1176] 377265 ProcessStarsCoreErrorInfo: report error module_type=5, module_name=EZ9999
plog/plog-377228_20240402154414713.log:72:[ERROR] RUNTIME(377228,ascend-dmi):2024-04-02-15:49:01.883.876 [device_error_proc.cc:1176] 377265 ProcessStarsCoreErrorInfo: The extend info: errcode:(0, 0x80000000, 0) errorStr: A 2-bit ECC error occurs in the data-cache data-ram. fixp_error0 info: 0xff1cf9b, fixp_error1 info: 0xf3 fsmId:0, tslot:0, ctxid:0, blk:16, subblk:0, subErrType:4.
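
When many plog files have to be checked, the search for this symptom can be scripted. The following is a minimal sketch, not an official tool: the function names scan_plog_lines and scan_plog_dir are hypothetical, and the regular expressions are derived only from the sample lines above, so other runtime versions may format these messages differently.

```python
import re
from pathlib import Path
from typing import Iterable, Iterator

# Patterns inferred from the sample log lines in this section (an assumption,
# not a documented log format).
CORE_ERROR_RE = re.compile(
    r"device\(chipId:(?P<chip_id>\d+),.*?"
    r"serial number is (?P<serial>\d+),.*?"
    r"core id is (?P<core_id>\d+)"
)
ERROR_STR_RE = re.compile(r"errorStr:\s*(?P<error_str>.+?)\.(?:\s|$)")


def scan_plog_lines(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per ProcessStarsCoreErrorInfo line that carries core or ECC details."""
    for line in lines:
        if "ProcessStarsCoreErrorInfo" not in line:
            continue
        core = CORE_ERROR_RE.search(line)
        if core:
            yield {
                "kind": "core_error",
                "chip_id": int(core.group("chip_id")),
                "serial": int(core.group("serial")),
                "core_id": int(core.group("core_id")),
            }
        ext = ERROR_STR_RE.search(line)
        if ext:
            yield {"kind": "extend_info", "error_str": ext.group("error_str")}


def scan_plog_dir(plog_dir: str) -> Iterator[dict]:
    """Scan every plog-*.log file in plog_dir and tag each hit with its file name."""
    for path in sorted(Path(plog_dir).glob("plog-*.log")):
        with path.open(errors="replace") as fh:
            for hit in scan_plog_lines(fh):
                hit["file"] = path.name
                yield hit
```

Calling scan_plog_dir on the plog directory lists the chip ID and core ID of each reported core error, together with the accompanying errorStr text, so that the affected device can be identified before the stress test is run.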

Fault Root Causes

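The log reports a 2-bit ECC error in the data-cache data RAM, which points to a possible hardware fault in the AI Core. To confirm, use the ascend-dmi tool to run a stress test on the AI Core. If the stress test result is abnormal and EMERGENCY_WARN is displayed, the AI Core is faulty.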

The ascend-dmi tool must be installed separately. The following example command runs the stress test on the AI Core:

ascend-dmi --dg -i aicore
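
For unattended checks, the command can be wrapped in a small script that looks for the EMERGENCY_WARN marker. This is only a sketch under stated assumptions: the helper names are hypothetical, and the exact layout of the ascend-dmi output is not described here, so the check is a plain substring match on the combined output.

```python
import subprocess
from typing import Optional


def output_indicates_fault(output: str) -> bool:
    """Return True if the stress-test output contains the EMERGENCY_WARN marker."""
    # Assumption: the marker appears verbatim somewhere in the tool output.
    return "EMERGENCY_WARN" in output


def run_aicore_stress_test(timeout_s: Optional[float] = None) -> bool:
    """Run the AI Core stress test and report whether a fault marker was printed."""
    result = subprocess.run(
        ["ascend-dmi", "--dg", "-i", "aicore"],  # same command as shown above
        capture_output=True,
        text=True,
        timeout=timeout_s,
        check=False,  # inspect the output ourselves instead of raising on a non-zero exit
    )
    return output_indicates_fault(result.stdout + result.stderr)
```

A return value of True from run_aicore_stress_test() corresponds to the faulty-AI-Core case described above. The function requires ascend-dmi to be installed and available on the PATH.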

The ascend-dmi tool is included in the MindX DL software package. For the version mapping between the software and CANN, and for details about how to install and use the ascend-dmi tool, see the related documentation.

Solution

Collect the logs and contact technical support to replace the faulty hardware.