Analyzing the Dump File of an Exception Operator
When a hardware issue occurs on site, reproducing it usually requires repeated stress tests, which slows down troubleshooting. To avoid this, the system triggers a dump as soon as it detects a potential hardware issue and captures the current state. The msDebug tool then parses the dump file of the exception operator, so you can collect enough data for fault analysis without rerunning a stress test. Together, these functions improve hardware exception detection and reduce repeated stress testing.
Procedure
- Prepare the acl.json configuration file.
  - Project-based operator development (single-operator API calling): Create the acl.json file by referring to "Initialization and Deinitialization" in the Application Development Guide (C&C++), and load the file by calling the aclInit API.
  - AI framework operator adaptation (PyTorch framework): Locate the acl.json file in the installation directory of torch_npu.
After the acl.json file is configured, other msDebug functions cannot be used.
- Enable the generation of dump files for exception operators. For details, see the configuration file example (dump configuration of exception operators) in the aclInit section of the Application Development Guide (C&C++).
  - In the acl.json file, set dump_scene to aic_err_detail_dump.
  - In the acl.json file, set dump_path to the directory in which dump files of exception operators are stored.
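The two settings above can be sketched as a small Python helper that produces the acl.json content. This is an illustration only: the key names dump_scene and dump_path and the value aic_err_detail_dump come from this section, but nesting them under a top-level "dump" object is an assumption that should be checked against the aclInit configuration example for your CANN version. The function names and the ./output2 path are hypothetical.

```python
import json
from pathlib import Path


def build_acl_config(dump_path: str) -> dict:
    """Return an acl.json payload that enables exception-operator dumps.

    Assumption: both keys live inside a top-level "dump" object. Confirm the
    layout against the aclInit configuration example before relying on it.
    """
    return {
        "dump": {
            "dump_scene": "aic_err_detail_dump",
            "dump_path": dump_path,
        }
    }


def write_acl_config(target: Path, dump_path: str) -> Path:
    """Serialize the configuration to `target` and return that path."""
    target.write_text(json.dumps(build_acl_config(dump_path), indent=4) + "\n")
    return target


if __name__ == "__main__":
    # "./output2" is a placeholder dump directory, not a required value.
    print(json.dumps(build_acl_config("./output2"), indent=4))
```

Running the script only prints the JSON; call write_acl_config explicitly when you want the file written, so that an existing acl.json is not overwritten by accident.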
- If the program crashes (for example, memory overflow or segmentation fault), a core file of the exception operator is generated, and the file name ends with .core.
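Because the generated file name ends with .core, the dump directory can be searched for candidates after a crash. The following is a minimal sketch, assuming the core files are written somewhere below the directory configured as dump_path; the function name and the ./output2 path are hypothetical.

```python
from pathlib import Path


def find_core_files(dump_path: str) -> list[Path]:
    """Return every file ending in .core below dump_path, newest first."""
    root = Path(dump_path)
    if not root.is_dir():
        return []
    cores = [p for p in root.rglob("*.core") if p.is_file()]
    # Sort by modification time so the most recent crash is listed first.
    return sorted(cores, key=lambda p: p.stat().st_mtime, reverse=True)


if __name__ == "__main__":
    # "./output2" stands in for the dump_path value set in acl.json.
    for core in find_core_files("./output2"):
        print(core)
```

The first path printed is the most recently modified core file, which is usually the one to pass to msdebug --core.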
- Run the following command with the msDebug tool to load the dump file of the exception operator:
    msdebug --core output2/extra-info/data-dump/0/xxx.core add.fatbin
    msdebug(MindStudio Debugger) is part of MindStudio Operator-dev Tools.
    The tool provides developers with a mechanism for debugging Ascend kernels running on actual hardware.
    This enables developers to debug Ascend kernels without being affected by potential changes brought by simulation and emulation environments.
    (msdebug) target create "add.fatbin" --core "output2/extra-info/data-dump/0/xxx.core"
    Core file '/home/xxx/coredump_test/output2/extra-info/data-dump/0/xxx.core' (aarch64) was loaded.
    [Switching to focus on CoreId 26, Type aiv]
To view the call stack, compile the kernel with -O2 or -O3 together with -g, so that the generated kernel.o file (or the ELF file in fatbin format) contains debugging information.
Reason: When an instruction triggers a hardware exception during operator execution, the hardware usually executes several more instructions before it reports the exception and generates the core file. The memory and register data in the core file may therefore be inaccurate, although the value of the PC register is usually correct. At the -O2/-O3 optimization level, functions are inlined by default, so the call stack can still be traced accurately from the PC and the debugging information, without relying on stack memory data. At the -O0 optimization level, functions are not inlined, so tracing the call stack depends on stack memory data that may be inaccurate. In that case, generally only stack frame 0 is accurate.
- View the dump file information of the exception operator.
    msdebug --core output2/extra-info/data-dump/0/xxx.core /home/xxxxx/Ascend/cann/opp/vendors/customize/op_impl/ai_core/tbe/kernel/ascend910b/add_custom/AddCustom_xxxx.o
    msdebug(MindStudio Debugger) is part of MindStudio Operator-dev Tools.
    The tool provides developers with a mechanism for debugging Ascend kernels running on actual hardware.
    This enables developers to debug Ascend kernels without being affected by potential changes brought by simulation and emulation environments.
    (msdebug) target create "/home/xxx/Ascend/cann/opp/vendors/customize/op_impl/ai_core/tbe/kernel/ascend910b/add_custom/AddCustom_xxx.o" --core "output2/extra-info/data-dump/0/xxx.core"
    Core file '/home/xxx/output2/extra-info/data-dump/0/xxx.core' (hiipu64) was loaded.
    [Switching to focus on CoreId 34, Type aiv]
    (msdebug) ascend info summary
      CoreId  CoreType  PC              DeviceId  ChipType
      33      AIV       0x12c0412004c8  0         A2/A3
    * 34      AIV       0x12c0412007c0  0         A2/A3
      35      AIV       0x12c0412007c0  0         A2/A3
      36      AIV       0x12c0412007c0  0         A2/A3
      37      AIV       0x12c0412007c0  0         A2/A3
      38      AIV       0x12c0412007c0  0         A2/A3
      39      AIV       0x12c0412007c0  0         A2/A3
      40      AIV       0x12c0412007c0  0         A2/A3

      Id  DataType              MemType    Addr                    Size    CoreId  CoreType  Dim
      0   DEVICE_KERNEL_OBJECT  GM         0x12c041200000          167872  NA      AIV       NA
      1   STACK                 GM/DCACHE  0xff000108000(invalid)  32768   33      AIV       NA
      2   STACK                 GM/DCACHE  0xff000110000(invalid)  32768   34      AIV       NA
      3   STACK                 GM/DCACHE  0xff000118000(invalid)  32768   35      AIV       NA
      4   STACK                 GM/DCACHE  0xff000120000(invalid)  32768   36      AIV       NA
      5   STACK                 GM/DCACHE  0xff000128000(invalid)  32768   37      AIV       NA
      6   STACK                 GM/DCACHE  0xff000130000(invalid)  32768   38      AIV       NA
      7   STACK                 GM/DCACHE  0xff000138000(invalid)  32768   39      AIV       NA
      8   STACK                 GM/DCACHE  0xff000140000(invalid)  32768   40      AIV       NA
      9   WORKSPACE_TENSOR      GM         0x0                     0       NA      NA        NA
      10  TILING_DATA           GM/DCACHE  0x12c100000038          16      NA      NA        NA
      11  OUTPUT_TENSOR         GM         0x12c0c0024000          32768   NA      NA        [8, 2048]
      12  INPUT_TENSOR          GM         0x12c0c0012000          32768   NA      NA        [8, 2048]
      13  INPUT_TENSOR          GM         0x12c0c001b000          32768   NA      NA        [8, 2048]
      14  ARGS                  GM/DCACHE  0x12c100000000          96      NA      NA        NA
    (msdebug) bt
    * thread #1, stop reason = VEC_ERROR
      * frame #0: 0x000012c0412004c8 AddCustom_xxx.o`::AddCustom_xxx_0(uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__) [inlined] void AscendC::TPipe::ReleaseEventID<(AscendC::HardEvent)5>(this=<unavailable>, id=<unavailable>) at kernel_tpipe_impl.h:454:24
        frame #1: 0x000012c0412004c8 AddCustom_xxx.o`::AddCustom_xxx_0(uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__) [inlined] AscendC::TQueBind<(AscendC::TPosition)0, (AscendC::TPosition)9, 2, 0>::AllocBuffer(this=<unavailable>) at kernel_tquebind_impl.h:512:36
        frame #2: 0x000012c041200474 AddCustom_xxx.o`::AddCustom_xxx_0(uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__) [inlined] AscendC::LocalTensor<half> AscendC::TQueBind<(this=<unavailable>)0, (AscendC::TPosition)9, 2, 0>::AllocTensor<half>() at kernel_tquebind_impl.h:78:16
        frame #3: 0x000012c041200474 AddCustom_xxx.o`::AddCustom_xxx_0(uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__) [inlined] KernelAdd::CopyIn(this=<unavailable>, progress=<unavailable>) at add_custom.cpp:42:57
        frame #4: 0x000012c041200474 AddCustom_xxx.o`::AddCustom_xxx_0(uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__) at add_custom.cpp:33:13
        frame #5: 0x000012c04120039c AddCustom_xxx.o`::AddCustom_xxx_0(uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__) [inlined] add_custom_0_tilingkey(x=<unavailable>, y=<unavailable>, z=<unavailable>, workspace=<unavailable>, tiling=<unavailable>) at add_custom.cpp:83:8
        frame #6: 0x000012c041200064 AddCustom_xxx.o`::AddCustom_xxx_0(uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__, uint8_t *__gm__) [inlined] ascendc_auto_gen_add_custom_kernel(x_in__=<unavailable>, y_in__=<unavailable>, z_out_=<unavailable>, workspace=<unavailable>, tiling=<unavailable>) at AddCustom_xxx_3800102_kernel.cpp:43:5
        frame #7: 0x000012c04120004c AddCustom_xxx.o`::AddCustom_xxx_0(x_in__=<unavailable>, y_in__=<unavailable>, z_out_=<unavailable>, workspace=<unavailable>, tiling=<unavailable>) at AddCustom_xxx_3800102_kernel.cpp:48:5
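When many cores appear in the summary, it can help to see at a glance which cores stopped at a different PC from the rest. The sketch below parses the per-core table from saved ascend info summary text and lists the cores whose PC differs from the most common value. It assumes the column layout shown in the sample output above (CoreId, CoreType, PC, DeviceId, ChipType, with an optional leading * on the focused core). The "differs from the majority" rule is only a triage heuristic chosen for this example, not behavior documented for msDebug, and the function names are hypothetical.

```python
from collections import Counter


def parse_core_rows(summary: str) -> list[dict]:
    """Extract the per-core rows from saved `ascend info summary` text."""
    rows = []
    in_table = False
    for line in summary.splitlines():
        fields = line.split()
        if fields[:3] == ["CoreId", "CoreType", "PC"]:
            in_table = True  # header of the per-core table
            continue
        if not in_table:
            continue
        focused = bool(fields) and fields[0] == "*"
        if focused:
            fields = fields[1:]  # drop the focus marker
        if len(fields) != 5 or not fields[0].isdigit():
            break  # the per-core table ends at the first non-matching line
        rows.append({
            "core_id": int(fields[0]),
            "core_type": fields[1],
            "pc": int(fields[2], 16),
            "focused": focused,
        })
    return rows


def cores_off_majority_pc(rows: list[dict]) -> list[int]:
    """Return the IDs of cores whose PC differs from the most common PC."""
    if not rows:
        return []
    majority_pc, _ = Counter(r["pc"] for r in rows).most_common(1)[0]
    return [r["core_id"] for r in rows if r["pc"] != majority_pc]


if __name__ == "__main__":
    # Shortened copy of the sample output above.
    sample = """\
  CoreId  CoreType  PC              DeviceId  ChipType
  33      AIV       0x12c0412004c8  0         A2/A3
* 34      AIV       0x12c0412007c0  0         A2/A3
  35      AIV       0x12c0412007c0  0         A2/A3
"""
    print(cores_off_majority_pc(parse_core_rows(sample)))  # prints [33]
```

For the sample output, core 33 is reported, which matches the PC 0x12c0412004c8 shown in frame #0 of the backtrace.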
- Locate the hardware fault by referring to the memory printing operations in Switching Cores, Reading Register Values, and Printing Memory and Variables.
- After the debugging is complete, run the q command and enter Y or y to end the debugging.
    (msdebug) q
    Quitting LLDB will kill one or more processes. Do you really want to proceed: [Y/n] y