Inconsistent Operator Input Args Before and After Dispatch
Analysis result
If the info.txt file contains the following root cause conclusion, an args error has occurred.
"**********************Root cause conclusion******************" The args of op is different before and after execute, args may be overwritten by other op. Please use oom to continue debug.
In the info.txt file, the following information is displayed in section 4, "Input and output of node":
****************4. Input and output of node*******************
input[0] addr: 0x124080042000 end_addr:0x124080042100 size: 0x100
input[1] addr: 0x124080022000 end_addr:0x124080022008 size: 0x8
input[2] addr: 0x0 end_addr:0x4 size: 0x4
output[0] addr: 0x0 end_addr:0x8 size: 0x8
workspace_bytes:0
args before execute: [[0x124080042000, 0x124080022000, 0x124080032000, 0x124080052000, 0x1240003e5070, 0x124080010000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x124080010000, 0x100000001, 0x100000040, 0x100000002]]
args after execute: [[0x124080042000, 0x124080022000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x124080010000, 0x100000001, 0x100000040, 0x100000002]]
Fault root causes
Compare the values of args before execute and args after execute in the preceding information. They differ, which means the args changed between dispatch and completion of the operator. args is the input parameter of the operator kernel. The first several values are the addresses of the input, output, workspace, and tiling_gm (memory for storing tiling data). If args is incorrect, an AI Core error may occur.
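The comparison described above can be sketched in Python. This is a minimal sketch, not part of the asys tooling: the helper names parse_args and diff_args are hypothetical, and the two input lines are copied from the sample info.txt output in this section.

```python
import re


def parse_args(line: str) -> list[int]:
    """Extract every hexadecimal value from one 'args ... execute' line."""
    return [int(token, 16) for token in re.findall(r"0x[0-9a-fA-F]+", line)]


def diff_args(before: list[int], after: list[int]) -> list[tuple[int, int, int]]:
    """Return (index, before, after) for each position whose value changed.

    Only positions present in both lists are compared.
    """
    return [(i, b, a) for i, (b, a) in enumerate(zip(before, after)) if b != a]


# Lines copied from the sample info.txt output in this section.
before_line = (
    "args before execute: [[0x124080042000, 0x124080022000, 0x124080032000, "
    "0x124080052000, 0x1240003e5070, 0x124080010000, 0x0, 0x0, 0x0, 0x0, 0x0, "
    "0x0, 0x0, 0x124080010000, 0x100000001, 0x100000040, 0x100000002]]"
)
after_line = (
    "args after execute: [[0x124080042000, 0x124080022000, 0x0, 0x0, 0x0, 0x0, "
    "0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x124080010000, 0x100000001, "
    "0x100000040, 0x100000002]]"
)

before = parse_args(before_line)
after = parse_args(after_line)
changed = diff_args(before, after)

for index, old, new in changed:
    print(f"args[{index}]: {old:#x} -> {new:#x}")
```

For the sample above, positions 2 through 5 held non-zero addresses before execution and are zero afterward, which matches the "args may be overwritten by other op" conclusion.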
Solution
If the preceding information is displayed, perform the following operations:
Enable the operator memory corruption detection function (described below), use the asys tool to rerun the service and collect fault information, and then submit an issue at https://gitee.com/ascend for help.
Inference scenario: Perform ATC model conversion and enable the memory detection function by using the --op_debug_config debugging option.
Assume that the configuration file for enabling global memory detection is named gm_debug.cfg. An example of the file content is as follows:
op_debug_config=ccec_O0,ccec_g,oom
Upload the file to any directory (for example, $HOME/module) on the server where ATC is located, and pass its path to ATC. For example:
--op_debug_config=$HOME/module/gm_debug.cfg
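As an illustration of the two steps above, the following Python sketch writes gm_debug.cfg and builds the matching --op_debug_config option string. The helper name write_debug_cfg is hypothetical, and the demo writes to a temporary directory as a stand-in for $HOME/module. It does not run ATC; the remaining ATC options depend on your model and are not shown.

```python
import tempfile
from pathlib import Path


def write_debug_cfg(directory: str, options=("ccec_O0", "ccec_g", "oom")) -> str:
    """Write gm_debug.cfg into `directory` and return the ATC option string."""
    cfg_path = Path(directory) / "gm_debug.cfg"
    cfg_path.parent.mkdir(parents=True, exist_ok=True)
    # Same single-line format as the example file content shown above.
    cfg_path.write_text("op_debug_config=" + ",".join(options) + "\n")
    return f"--op_debug_config={cfg_path}"


# Demo in a temporary directory (assumption: stands in for $HOME/module).
with tempfile.TemporaryDirectory() as demo_dir:
    option = write_debug_cfg(demo_dir)
    print(option)
    print((Path(demo_dir) / "gm_debug.cfg").read_text(), end="")
```

Append the returned option to your existing ATC command line.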
Training scenario: Modify the NPU's default configuration item npu.global_options().op_debug_config to enable memory detection.
You need to modify the default configuration items and set the global configuration items before initializing the NPU device. The following is an example:
import npu_device as npu
npu.global_options().op_debug_config="/root/gm_debug.cfg"
npu.open().as_default()
The gm_debug.cfg file contains the following information:
op_debug_config = ccec_O0,ccec_g,oom
