DumpAccChkPoint

Function Usage

Dumps the content of a specified tensor in operators developed in operator projects. It can also print one user-defined value (uint32_t only), such as the current line number. Unlike DumpTensor, this API can start the dump at a specified element offset within the tensor.

Call DumpAccChkPoint at the target location in the operator kernel implementation code to print tensor data starting at an offset. For example:

AscendC::DumpAccChkPoint(srcLocal, 5, 32, dataLen);

Printing with DumpAccChkPoint degrades the runtime performance of the operator, so it is normally used only during debugging. You can disable printing in either of the following ways:

Custom operator project

Modify the CMakeLists.txt file in the op_kernel directory of the operator project. Add the compile option -DASCENDC_DUMP=0 on the first line to disable ASCENDC_DUMP. For example:

# Disable dump printing (ASCENDC_DUMP) for all operators.
add_ops_compile_options(ALL OPTIONS -DASCENDC_DUMP=0)

Kernel launch project

Modify the npu_lib.cmake file in the cmake directory. Add the -DASCENDC_DUMP=0 macro definition to the ascendc_compile_definitions command to disable ASCENDC_DUMP. For example:

# Disable dump printing (ASCENDC_DUMP) for all operators.
ascendc_compile_definitions(ascendc_kernels_${RUN_MODE} PRIVATE
    -DASCENDC_DUMP=0
)

During a dump, a 32-byte DumpHead header is written before the dump data of each core to record the core ID and resource usage. A 32-byte DumpTensorHead header is also written before the tensor data of each dump to record tensor information. The information structure in the multi-core printing scenario is illustrated in the figure below.

DumpHead contains the following fields:

block_id: ID of the running core.
total_block_num: number of cores to be dumped.
block_remain_len: available dump space in the current core.
block_initial_space: initial dump space allocated in the current core.
magic: magic number for memory verification.

DumpTensorHead contains the following fields:

desc: user-defined additional information.
addr: tensor address.
data_type: tensor data type.
position: physical storage position of the tensor. Only Unified Buffer, L1 Buffer, L0C Buffer, and Global Memory are supported.

An example of the printing result is as follows:

DumpHead: block_id=0, total_block_num=16, block_remain_len=1048448, block_initial_space=1048576, magic=5aa5bccd
DumpTensor: desc=5, addr=0, data_type=DT_FLOAT16, position=UB
[40, 82, 60, 11, 24, 55, 52, 60, 31, 86, 53, 61, 47, 54, 34, 62, 84, 29, 48, 95, 16, 0, 20, 77, 3, 55, 69, 73, 75, 40, 35, 13]
DumpHead: block_id=1, total_block_num=16, block_remain_len=1048448, block_initial_space=1048576, magic=5aa5bccd
DumpTensor: desc=5, addr=0, data_type=DT_FLOAT16, position=UB
[58, 84, 22, 54, 41, 93, 1, 45, 50, 9, 72, 81, 23, 96, 86, 45, 36, 9, 36, 34, 78, 7, 2, 29, 47, 26, 13, 24, 27, 55, 90, 5]
...
DumpHead: block_id=7, total_block_num=16, block_remain_len=1048448, block_initial_space=1048576, magic=5aa5bccd
DumpTensor: desc=5, addr=0, data_type=DT_FLOAT16, position=UB
[28, 27, 79, 39, 86, 5, 23, 97, 89, 5, 65, 69, 59, 13, 49, 2, 34, 6, 52, 38, 4, 90, 11, 11, 61, 50, 71, 98, 19, 54, 54, 99]
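
The figures in this sample are consistent with a simple space accounting, sketched below in host-side C++. It assumes that each core is charged 32 bytes for DumpHead and that each dump is charged 32 bytes for DumpTensorHead plus the raw tensor data, and nothing else. That assumption is inferred from the sample output only. The names `kDumpHeadBytes`, `kDumpTensorHeadBytes`, and `DumpBytesUsed` are hypothetical and are not part of the Ascend C API.

```cpp
#include <cstddef>

// Header sizes stated in this document.
constexpr std::size_t kDumpHeadBytes = 32;        // written once per core
constexpr std::size_t kDumpTensorHeadBytes = 32;  // written once per dump call

// Hypothetical helper (not an Ascend C API): bytes consumed on one core after
// `calls` dumps of `dumpSize` elements, each `elemBytes` bytes wide.
// Assumption: only the two headers and the raw payload are charged.
constexpr std::size_t DumpBytesUsed(std::size_t calls, std::size_t dumpSize,
                                    std::size_t elemBytes)
{
    return kDumpHeadBytes + calls * (kDumpTensorHeadBytes + dumpSize * elemBytes);
}

// Sample above: one dump of 32 DT_FLOAT16 (2-byte) elements per core.
static_assert(DumpBytesUsed(1, 32, 2) == 128, "32 + 32 + 64 bytes");
static_assert(1048576 - DumpBytesUsed(1, 32, 2) == 1048448,
              "block_initial_space minus usage equals block_remain_len");
```

Under the same assumption, the helper gives a rough estimate of how many dumps of a given size fit in the per-core dump space before output is lost.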

Prototype

template <typename T>
void DumpAccChkPoint(const GlobalTensor<T>& tensor, uint32_t index, uint32_t countOff, uint32_t dumpSize)
template <typename T>
void DumpAccChkPoint(const LocalTensor<T>& tensor, uint32_t index, uint32_t countOff, uint32_t dumpSize)

Parameters

Parameter	Input/Output	Description
tensor	Input	Tensor to be dumped. If the tensor is stored in Unified Buffer, L1 Buffer, or L0C Buffer, pass a LocalTensor. If the tensor is stored in Global Memory, pass a GlobalTensor.
index	Input	User-defined additional information, such as a line number or another user-defined number.
countOff	Input	Offset, in number of elements, at which the dump starts.
dumpSize	Input	Number of elements to be dumped.
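
As a reading aid for countOff and dumpSize, the following host-side C++ model returns the elements that a call is assumed to print, namely the half-open element range [countOff, countOff + dumpSize). It is a sketch of that interpretation only: `ModelDumpRange` is a hypothetical name, it operates on a std::vector rather than a LocalTensor or GlobalTensor, and the empty result for an out-of-range request is a choice made for this model, not documented API behavior.

```cpp
#include <cstdint>
#include <vector>

// Host-side model only (not Ascend C). Returns the elements assumed to be
// dumped: `dumpSize` elements starting at element offset `countOff`.
// Returns an empty vector if the requested range does not fit in `tensor`.
template <typename T>
std::vector<T> ModelDumpRange(const std::vector<T>& tensor,
                              std::uint32_t countOff, std::uint32_t dumpSize)
{
    if (countOff > tensor.size() || dumpSize > tensor.size() - countOff) {
        return {};
    }
    return std::vector<T>(tensor.begin() + countOff,
                          tensor.begin() + countOff + dumpSize);
}
```

For a 256-element tensor, `ModelDumpRange(tensor, 32, 128)` selects elements 32 through 159, which mirrors the arguments of a call such as `DumpAccChkPoint(srcLocal, 7, 32, 128)`.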

Returns

None

Availability

Constraints

This function is used only for NPU on-board debugging and is supported only in the following scenarios:
- Calling operators in kernel launch mode.
- Calling operators through single-operator APIs.
- Calling a single-operator API (aclnnxxx) indirectly through single-operator calls in the PyTorch framework.
Currently, only information about tensors stored in Unified Buffer/L1 Buffer/L0C Buffer/Global Memory can be printed.
For details about the alignment requirements of the operand address offset, see General Restrictions.
The total length of the elements to be dumped, that is, dumpSize multiplied by sizeof(T), must be 32-byte aligned.
The offset must be 32-byte aligned. That is, countOff multiplied by sizeof(T) must be a multiple of 32 bytes.
The total space used by printf calls, assert calls, DumpTensor and DumpAccChkPoint calls, and the framework dump function cannot exceed 1 MB on each core. Developers must control the amount of data printed. If the limit is exceeded, nothing is printed.
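
The two 32-byte alignment constraints can be checked ahead of time with a small helper. The sketch below is plain C++ that runs on the host. `IsDumpArgsAligned` and `kAlignBytes` are hypothetical names introduced for illustration and are not part of the Ascend C API. The element type is represented by an ordinary C++ type of the same width (for example, a 2-byte type standing in for half precision).

```cpp
#include <cstddef>
#include <cstdint>

constexpr std::size_t kAlignBytes = 32;  // alignment required by the constraints

// Hypothetical helper (not an Ascend C API): true if both the offset
// (countOff * sizeof(T)) and the dumped length (dumpSize * sizeof(T))
// are multiples of 32 bytes.
template <typename T>
constexpr bool IsDumpArgsAligned(std::uint32_t countOff, std::uint32_t dumpSize)
{
    return (countOff * sizeof(T)) % kAlignBytes == 0 &&
           (dumpSize * sizeof(T)) % kAlignBytes == 0;
}

// 2-byte elements: offset 32 -> 64 bytes, size 128 -> 256 bytes. Both aligned.
static_assert(IsDumpArgsAligned<std::uint16_t>(32, 128), "aligned");
// 4-byte elements: offset 4 -> 16 bytes, which is not a multiple of 32.
static_assert(!IsDumpArgsAligned<float>(4, 128), "offset not aligned");
```

Evaluating the check with static_assert, where the arguments are compile-time constants, catches a misaligned call before the operator is built and run on the device.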

Example

// Dump 128 elements of srcLocal starting at element offset 32; 7 is the user-defined index.
AscendC::DumpAccChkPoint(srcLocal, 7, 32, 128);
