Operator Input Args Differ Before And After Execution

Analysis result

If the info.txt file contains the following root cause conclusion, an args error has occurred.

"**********************Root cause conclusion******************"
The args of op is different before and after execute, args may be overwritten by other op. Please use oom to continue debug.
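The check for this conclusion can be scripted. The following is a minimal sketch, assuming the content of info.txt has already been read into a string; the helper name has_args_overwrite_conclusion and the constant ROOT_CAUSE_MARKER are hypothetical and are not part of any Ascend tool.

```python
# Hypothetical helper: detect the args-overwrite conclusion in info.txt text.
# The marker is the opening phrase of the conclusion message quoted above.
ROOT_CAUSE_MARKER = "The args of op is different before and after execute"


def has_args_overwrite_conclusion(info_text):
    """Return True if the report text contains the args-overwrite conclusion."""
    return ROOT_CAUSE_MARKER in info_text


# Sample text copied from the conclusion shown in this section.
sample = (
    '"**********************Root cause conclusion******************"\n'
    "The args of op is different before and after execute, "
    "args may be overwritten by other op. Please use oom to continue debug.\n"
)

found = has_args_overwrite_conclusion(sample)
print(found)
```

To apply it to a real report, read the file first (for example, with open("info.txt").read()) and pass the resulting string to the helper.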

In the info.txt file, section "4. Input and output of node" shows the following information:

****************4. Input and output of node*******************
input[0] addr: 0x124080042000 end_addr:0x124080042100 size: 0x100
input[1] addr: 0x124080022000 end_addr:0x124080022008 size: 0x8
input[2] addr: 0x0 end_addr:0x4 size: 0x4
output[0] addr: 0x0 end_addr:0x8 size: 0x8
workspace_bytes:0

args before execute: [[0x124080042000, 0x124080022000, 0x124080032000, 0x124080052000, 0x1240003e5070, 0x124080010000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x124080010000, 0x100000001, 0x100000040, 0x100000002]]
args after  execute: [[0x124080042000, 0x124080022000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x124080010000, 0x100000001, 0x100000040, 0x100000002]]

Fault root causes

Compare the values of args before execute and args after execute in the preceding information. The values differ, which means that args changed while the operator was being executed. In this example, the third through sixth entries change from valid addresses to 0x0. args holds the input parameters of the operator kernel. Its leading entries are the addresses of the inputs, outputs, workspace, and tiling_gm (the memory that stores tiling data). If args is incorrect, an AI Core error may occur.
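The comparison can also be done programmatically. The following is a minimal sketch that parses the two args lines from the example and lists the positions whose values changed. The function names parse_args and diff_args are hypothetical, and the sketch assumes the log format shown in this section.

```python
import re


def parse_args(line):
    """Return the hex values on an 'args before/after execute' line as ints."""
    # Ignore the label before the first colon; only the list payload matters.
    _, _, payload = line.partition(":")
    return [int(tok, 16) for tok in re.findall(r"0x[0-9a-fA-F]+", payload)]


def diff_args(before_line, after_line):
    """Return (index, before, after) for every position whose value changed."""
    before = parse_args(before_line)
    after = parse_args(after_line)
    if len(before) != len(after):
        raise ValueError("args lists have different lengths")
    return [(i, b, a) for i, (b, a) in enumerate(zip(before, after)) if b != a]


# The two lines below are copied from the info.txt example in this section.
before_line = (
    "args before execute: [[0x124080042000, 0x124080022000, 0x124080032000, "
    "0x124080052000, 0x1240003e5070, 0x124080010000, 0x0, 0x0, 0x0, 0x0, 0x0, "
    "0x0, 0x0, 0x124080010000, 0x100000001, 0x100000040, 0x100000002]]"
)
after_line = (
    "args after  execute: [[0x124080042000, 0x124080022000, 0x0, 0x0, 0x0, "
    "0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x124080010000, 0x100000001, "
    "0x100000040, 0x100000002]]"
)

changed = diff_args(before_line, after_line)
for index, old, new in changed:
    # Indexes are zero-based positions within the args list.
    print(f"args[{index}]: {old:#x} -> {new:#x}")
```

For the example data, the sketch reports zero-based positions 2 through 5, which were overwritten with 0x0.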

Solution

If the preceding information is displayed, perform the following operations:

Enable the operator's memory corruption detection function as described below, rerun the service with the asys tool to collect fault information, and then submit an issue at https://gitee.com/ascend for help.

Inference scenario: Perform ATC model conversion and enable the memory detection function by using the --op_debug_config debugging option.

Assume that the configuration file for enabling global memory detection is gm_debug.cfg. An example of the file content is as follows:

op_debug_config=ccec_O0,ccec_g,oom

Upload the file to any directory on the server where ATC is located (for example, $HOME/module), and pass its path to ATC as follows:

--op_debug_config=$HOME/module/gm_debug.cfg

Training scenario: Modify the NPU's default configuration item npu.global_options().op_debug_config to enable memory detection.

The global configuration items must be set before the NPU device is initialized. The following is an example:

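import npu_device as npu
# Point op_debug_config at the configuration file before the device is opened.
npu.global_options().op_debug_config="/root/gm_debug.cfg"
# Initialize the NPU device with the global options set above.
npu.open().as_default()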

The gm_debug.cfg file contains the following information:

op_debug_config = ccec_O0,ccec_g,oom