Symptoms of AI Core Errors

During task execution, an error is reported when an AI Core or AI Vector task fails to execute. The error code is EZ9999, and the log contains one of the following messages: "there is an xx aivec error exception" or "there is an xx aicore error exception". The plog may also contain the message "Aicore kernel execute failed".

The following is an example of the printed error message:

-----------------------------------------
   Ascend Error Message:
-----------------------------------------
EZ9999: Inner Error!
EZ9999: The error from device(chipid:4, dieId:0), serial number is 2, there is an aivec error exception, core id is 11, error code = 0x10, dump info: pc start: 0x1240c46650b8, vec error info: 0xd019ddc1a, mte error info: 0x2ffeba07af, ifu error info: 0x4e5c097530000, ccu error info: 0x30c255954a000023, cude error info: 0,0, aic error mask: 0x65000020bd000288, para base: 0x1240c51b1dd0.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1100]
        TraceBack (most recent call last):
        The extend info: errcode:(0x10, 0, 0) errorStr: Illegal instruction, which is usually caused by unaligned UUB addresses, fixp_error0 info: 0xeba07af, fixp_error, mId:0, tslot:0, threadId:0, ctxid:0, blk:0, sublk:0, subErrType:4.[FUNC:ProcessStarsCoreErrorInfo][FILE:device_error_proc.cc][LINE:1112]
        Aicore kernel execute failed, device_id=4, stream_id=450, report_stream_id=2, task_id=442, flip_num=0, fault kernel_name=00_131_Grandients/Default/AddN.op56419/program id=2089, hash=16296079633597215637.[FUNC:GetError][FILE:stream.cc][LINE:1467]
        [AIC_INFO] after execute:args print end[FUNC:GetError][FILE:stream.cc][LINE:1467]
        rtStreamSynchronize execute failed, reason=[The model stream execute failed][FUNC:FuncErrorReason][FILE:error_message_manage.cc][LINE:50]
(Please search "Ascend Error Message" at https://www.mindspore.cn for error code description)
-----------------------------------------
   C++ Call Stack: (For framework developers)
-----------------------------------------

Field descriptions:

  • chipid and dieId: the chip ID and die ID on which the error was reported. Use them to determine whether the error always occurs on the same chip;
  • core id: the ID of the core on which the error was reported. Use it to determine whether the error always occurs on the same core;
  • errcode and errorStr: the error code and error description of the AI Core error;
  • fault kernel_name/fault kernel info ext: the name of the kernel that caused the error. Use it to identify the operator that caused the error.

    When tasks are executed asynchronously, for example when multiple operator execution tasks are issued in succession, several operators may report errors, so the error message may contain errors from multiple operators. Start troubleshooting from the first operator that reported an error.
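
The fields described above can be pulled out of a captured error message with a few regular expressions. The following is a minimal sketch, not a MindSpore or CANN tool: the helper name first_fault, the pattern names, and the abbreviated SAMPLE text are assumptions made for illustration, and the patterns are inferred only from the single example message shown above, so they may need adjusting for other log versions. Because each pattern takes the earliest match in the text, the result describes the first reported error, assuming the log lists errors in the order they were reported.

```python
import re

# Hypothetical helper for illustration; not part of MindSpore or CANN.
# The patterns are inferred from the example message above only.
CORE_RE = re.compile(
    r"device\(chipid:(?P<chip_id>\d+),\s*dieId:(?P<die_id>\d+)\)"
    r".*?(?P<unit>aivec|aicore) error exception,\s*core id is (?P<core_id>\d+)"
)
ERRSTR_RE = re.compile(
    r"errcode:\((?P<errcode>[^)]*)\)\s*errorStr:\s*(?P<error_str>[^,]+)"
)
# Captures everything up to the next comma after "fault kernel_name=".
KERNEL_RE = re.compile(r"fault kernel_name=(?P<kernel_name>[^,]+)")


def first_fault(log_text):
    """Return the fields of the first reported AI Core error, or None."""
    fields = {}
    for pattern in (CORE_RE, ERRSTR_RE, KERNEL_RE):
        match = pattern.search(log_text)  # search() returns the earliest match
        if match:
            fields.update(match.groupdict())
    return fields or None


# Abbreviated copy of the example message above (dump info omitted).
SAMPLE = (
    "EZ9999: The error from device(chipid:4, dieId:0), serial number is 2, "
    "there is an aivec error exception, core id is 11, error code = 0x10\n"
    "The extend info: errcode:(0x10, 0, 0) errorStr: Illegal instruction, "
    "which is usually caused by unaligned UUB addresses\n"
    "Aicore kernel execute failed, device_id=4, stream_id=450, "
    "fault kernel_name=00_131_Grandients/Default/AddN.op56419/program id=2089, "
    "hash=16296079633597215637.\n"
)

if __name__ == "__main__":
    for key, value in first_fault(SAMPLE).items():
        print(f"{key}: {value}")
```

Comparing chip_id, die_id, and core_id across several failed runs is one way to check whether the error stays on the same chip or core, as suggested in the field descriptions.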