Analyzing the Dump File of an Exception Operator

Reproducing a hardware issue that occurs on site usually requires repeated stress tests, which slows down troubleshooting. To avoid this, the system triggers a dump as soon as it detects a potential hardware issue and captures the current status information. The msDebug tool then parses the dump file of the exception operator, so you can collect enough data for fault analysis without running a stress test. Together, these functions improve hardware exception detection and reduce repetitive stress testing.

Procedure

  1. Prepare the acl.json configuration file.

    After the acl.json file is configured, other msDebug functions cannot be used.

  2. Enable the function of generating dump files for exception operators. For details, see the configuration file example (dump configuration of exception operators) in aclInit of Application Development Guide (C&C++).
    1. In the acl.json file, set dump_scene to aic_err_detail_dump.
    2. In the acl.json file, set dump_path to the path for storing dump files of exception operators.
  3. If the program crashes (for example, because of a memory overflow or a segmentation fault), a core file for the exception operator is generated. The file name ends with .core.
  4. Run the following command with the msDebug tool to load the dump file of the exception operator:
    msdebug --core output2/extra-info/data-dump/0/xxx.core add.fatbin
    msdebug(MindStudio Debugger) is part of MindStudio Operator-dev Tools.
    The tool provides developers with a mechanism for debugging Ascend kernels running on actual hardware.
    This enables developers to debug Ascend kernels without being affected by potential changes brought by simulation and emulation environments.
    (msdebug) target create "add.fatbin" --core "output2/extra-info/data-dump/0/xxx.core"
    Core file '/home/xxx/coredump_test/output2/extra-info/data-dump/0/xxx.core' (aarch64) was loaded.
    [Switching to focus on CoreId 26, Type aiv]
    

    To view the call stack, compile the kernel with -O2 or -O3 together with -g, so that the generated kernel.o file (or the ELF file in fatbin format) contains debugging information.

    Reason: When an instruction triggers a hardware exception during operator execution, the hardware usually executes several more instructions before it reports the exception and generates the core file. The memory and register data in the core file may therefore be inaccurate, although the value of the PC register is usually correct. At the -O2 or -O3 optimization level, functions are inlined by default, so the call stack can still be traced accurately without relying on stack memory data. At the -O0 optimization level, inlining is forcibly disabled, so the call stack depends on stack memory data, which is inaccurate. In that case, generally only stack frame 0 provides accurate data.

  5. View the dump file information of the exception operator.
    msdebug --core output2/extra-info/data-dump/0/xxx.core /home/xxxxx/Ascend/cann/opp/vendors/customize/op_impl/ai_core/tbe/kernel/ascend910b/add_custom/AddCustom_xxxx.o
    
    msdebug(MindStudio Debugger) is part of MindStudio Operator-dev Tools.
    The tool provides developers with a mechanism for debugging Ascend kernels running on actual hardware.
    This enables developers to debug Ascend kernels without being affected by potential changes brought by simulation and emulation environments.
    (msdebug) target create "/home/xxx/Ascend/cann/opp/vendors/customize/op_impl/ai_core/tbe/kernel/ascend910b/add_custom/AddCustom_xxx.o" --core "output2/extra-info/data-dump/0/xxx.core"
    Core file '/home/xxx/output2/extra-info/data-dump/0/xxx.core' (hiipu64) was loaded.
    [Switching to focus on CoreId 34, Type aiv]
    
    (msdebug) ascend info summary
      CoreId  CoreType        PC         DeviceId    ChipType
        33       AIV    0x12c0412004c8       0        A2/A3
     *  34       AIV    0x12c0412007c0       0        A2/A3
        35       AIV    0x12c0412007c0       0        A2/A3
        36       AIV    0x12c0412007c0       0        A2/A3
        37       AIV    0x12c0412007c0       0        A2/A3
        38       AIV    0x12c0412007c0       0        A2/A3
        39       AIV    0x12c0412007c0       0        A2/A3
        40       AIV    0x12c0412007c0       0        A2/A3
    
      Id           DataType                   MemType                     Addr                       Size             CoreId    CoreType    Dim
       0    DEVICE_KERNEL_OBJECT                GM                   0x12c041200000                 167872             NA         AIV        NA
       1            STACK                    GM/DCACHE           0xff000108000(invalid)              32768             33         AIV        NA
       2            STACK                    GM/DCACHE           0xff000110000(invalid)              32768             34         AIV        NA
       3            STACK                    GM/DCACHE           0xff000118000(invalid)              32768             35         AIV        NA
       4            STACK                    GM/DCACHE           0xff000120000(invalid)              32768             36         AIV        NA
       5            STACK                    GM/DCACHE           0xff000128000(invalid)              32768             37         AIV        NA
       6            STACK                    GM/DCACHE           0xff000130000(invalid)              32768             38         AIV        NA
       7            STACK                    GM/DCACHE           0xff000138000(invalid)              32768             39         AIV        NA
       8            STACK                    GM/DCACHE           0xff000140000(invalid)              32768             40         AIV        NA
       9      WORKSPACE_TENSOR                  GM                         0x0                         0               NA          NA        NA
      10         TILING_DATA                 GM/DCACHE               0x12c100000038                   16               NA          NA        NA
      11        OUTPUT_TENSOR                   GM                   0x12c0c0024000                  32768             NA          NA        [8, 2048]
      12        INPUT_TENSOR                    GM                   0x12c0c0012000                  32768             NA          NA        [8, 2048]
      13        INPUT_TENSOR                    GM                   0x12c0c001b000                  32768             NA          NA        [8, 2048]
      14            ARGS                     GM/DCACHE               0x12c100000000                   96               NA          NA        NA
    
    (msdebug) bt
       * thread #1, stop reason = VEC_ERROR
         * frame #0: 0x000012c0412004c8 AddCustom_xxx.o`::AddCustom_xxx_0(uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__) [inlined] void AscendC::TPipe::ReleaseEventID<(AscendC::HardEvent)5>(this=<unavailable>, id=<unavailable>) at kernel_tpipe_impl.h:454:24
           frame #1: 0x000012c0412004c8 AddCustom_xxx.o`::AddCustom_xxx_0(uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__) [inlined] AscendC::TQueBind<(AscendC::TPosition)0, (AscendC::TPosition)9, 2, 0>::AllocBuffer(this=<unavailable>) at kernel_tquebind_impl.h:512:36
           frame #2: 0x000012c041200474 AddCustom_xxx.o`::AddCustom_xxx_0(uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__) [inlined] AscendC::LocalTensor<half> AscendC::TQueBind<(this=<unavailable>)0, (AscendC::TPosition)9, 2, 0>::AllocTensor<half>() at kernel_tquebind_impl.h:78:16
           frame #3: 0x000012c041200474 AddCustom_xxx.o`::AddCustom_xxx_0(uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__) [inlined] KernelAdd::CopyIn(this=<unavailable>, progress=<unavailable>) at add_custom.cpp:42:57
           frame #4: 0x000012c041200474 AddCustom_xxx.o`::AddCustom_xxx_0(uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__) at add_custom.cpp:33:13
           frame #5: 0x000012c04120039c AddCustom_xxx.o`::AddCustom_xxx_0(uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__) [inlined] add_custom_0_tilingkey(x=<unavailable>, y=<unavailable>, z=<unavailable>, workspace=<unavailable>, tiling=<unavailable>) at add_custom.cpp:83:8
           frame #6: 0x000012c041200064 AddCustom_xxx.o`::AddCustom_xxx_0(uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__) [inlined] ascendc_auto_gen_add_custom_kernel(x_in__=<unavailable>, y_in__=<unavailable>, z_out_=<unavailable>, workspace=<unavailable>, tiling=<unavailable>) at AddCustom_xxx_3800102_kernel.cpp:43:5
           frame #7: 0x000012c04120004c AddCustom_xxx.o`::AddCustom_xxx_0(x_in__=<unavailable>, y_in__=<unavailable>, z_out_=<unavailable>, workspace=<unavailable>, tiling=<unavailable>) at AddCustom_xxx_3800102_kernel.cpp:48:5
    
  6. Locate the hardware fault by referring to the memory printing operations in Switching Cores, Reading Register Values, and Printing Memory and Variables.
  7. When debugging is complete, run the q command and enter Y or y to exit the debugger.
    (msdebug) q
    Quitting LLDB will kill one or more processes. Do you really want to proceed: [Y/n] y
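
Steps 2 to 4 of the procedure lend themselves to a small helper script. The Python sketch below is illustrative only and is not part of msDebug. It writes an acl.json containing the two fields named in step 2, looks for the most recently modified .core file under the dump path, and assembles the msdebug command line shown in step 4. The function names are hypothetical, and the top-level "dump" key that wraps the two fields is an assumption; check the configuration file example in aclInit for the exact layout before relying on it.

```python
import json
from pathlib import Path


def write_acl_json(config_path, dump_path):
    """Write an acl.json that enables dump files for exception operators (step 2)."""
    config = {
        # Assumption: both settings sit under a top-level "dump" object.
        "dump": {
            "dump_scene": "aic_err_detail_dump",
            "dump_path": str(dump_path),
        }
    }
    Path(config_path).write_text(json.dumps(config, indent=4) + "\n", encoding="utf-8")
    return config


def newest_core_file(dump_path):
    """Return the most recently modified *.core file under dump_path, or None."""
    cores = sorted(Path(dump_path).rglob("*.core"), key=lambda p: p.stat().st_mtime)
    return cores[-1] if cores else None


def msdebug_command(core_file, kernel_file):
    """Build the step 4 command line: msdebug --core <core file> <kernel file>."""
    return ["msdebug", "--core", str(core_file), str(kernel_file)]


if __name__ == "__main__":
    write_acl_json("acl.json", "output2")
    core = newest_core_file("output2")
    if core is None:
        print("No .core file found under output2 yet.")
    else:
        # add.fatbin is the kernel file name used in the step 4 example.
        print(" ".join(msdebug_command(core, "add.fatbin")))
```

The script only prints the command instead of running it, because msdebug is interactive and the kernel file must match the operator that produced the core file.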