DumpTensor

Function Usage

Dumps the contents of a specified tensor in operators developed in an operator project. Each dump can also print one user-defined value of type uint32_t, for example, the current line number.

Call the DumpTensor API at the position in the operator kernel code where the tensor data should be printed. An example is as follows.

AscendC::DumpTensor(srcLocal, 5, dataLen);

DumpTensor printing degrades the runtime performance of the operator, so it is normally used only during debugging. Developers can disable printing in either of the following ways:

Custom operator project

Modify the CMakeLists.txt file in the op_kernel directory of the operator project. Add the compilation option -DASCENDC_DUMP=0 to the first line to disable ASCENDC_DUMP. The following is an example.

# Disable dump printing (including printf and DumpTensor) for all operators.
add_ops_compile_options(ALL OPTIONS -DASCENDC_DUMP=0)

Kernel launch project

Modify the npu_lib.cmake file in the cmake directory. Add the -DASCENDC_DUMP=0 macro definition to the ascendc_compile_definitions command to disable the ASCENDC_DUMP function. The following is an example.

# Disable dump printing (including printf and DumpTensor) for all operators.
ascendc_compile_definitions(ascendc_kernels_${RUN_MODE} PRIVATE
    -DASCENDC_DUMP=0
)

During dump, the corresponding information header DumpHead (32 bytes) is added before the dump information of each block core to record the core ID and resource usage. The information header DumpTensorHead (32 bytes) is also added before the tensor data to be dumped each time to record tensor information. The information structure in the multi-core printing scenario is illustrated in the figure below.

The specific DumpHead information is as follows:

block_id: ID of the running core.
total_block_num: number of cores to be dumped.
block_remain_len: available dump space in the current core.
block_initial_space: initial dump space allocated in the current core.
magic: magic number for memory verification.

The specific DumpTensorHead information is as follows:

desc: user-defined additional information.
addr: tensor address.
data_type: tensor data type.
position: physical storage position of the tensor, which can only be Unified Buffer/L1 Buffer/L0C Buffer/Global Memory.

The values of CANN_VERSION_STR and CANN_TIMESTAMP are printed automatically at the beginning of the DumpTensor result. Both are macro definitions: CANN_VERSION_STR is the version number of the CANN package as a string, and CANN_TIMESTAMP is the release timestamp of the CANN package as a uint64_t value. You can use the two macros directly in your code.

The following is an example:

CANN Version: XXX.XX, TimeStamp: 20240807XXXXXXXXX
DumpHead: block_id=0, total_block_num=16, block_remain_len=1048448, block_initial_space=1048576, magic=5aa5bccd
DumpTensor: desc=5, addr=0, data_type=DT_FLOAT16, position=UB
[40, 82, 60, 11, 24, 55, 52, 60, 31, 86, 53, 61, 47, 54, 34, 62, 84, 29, 48, 95, 16, 0, 20, 77, 3, 55, 69, 73, 75, 40, 35, 13]
CANN Version: XXX.XX, TimeStamp: 20240807XXXXXXXXX
DumpHead: block_id=1, total_block_num=16, block_remain_len=1048448, block_initial_space=1048576, magic=5aa5bccd
DumpTensor: desc=5, addr=0, data_type=DT_FLOAT16, position=UB
[58, 84, 22, 54, 41, 93, 1, 45, 50, 9, 72, 81, 23, 96, 86, 45, 36, 9, 36, 34, 78, 7, 2, 29, 47, 26, 13, 24, 27, 55, 90, 5]
...
CANN Version: XXX.XX, TimeStamp: 20240807XXXXXXXXX
DumpHead: block_id=7, total_block_num=16, block_remain_len=1048448, block_initial_space=1048576, magic=5aa5bccd
DumpTensor: desc=5, addr=0, data_type=DT_FLOAT16, position=UB
[28, 27, 79, 39, 86, 5, 23, 97, 89, 5, 65, 69, 59, 13, 49, 2, 34, 6, 52, 38, 4, 90, 11, 11, 61, 50, 71, 98, 19, 54, 54, 99]
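The numbers in the sample output are consistent with a simple accounting of the dump space: one 32-byte DumpHead per core, plus one 32-byte DumpTensorHead and the raw tensor payload per DumpTensor call. The sketch below reproduces block_remain_len from block_initial_space under that assumption. The helper names are hypothetical and not part of the Ascend C API, and the unpadded payload layout is inferred from the sample rather than confirmed by this document.

```cpp
#include <cassert>
#include <cstdint>

// Header sizes stated in this document.
constexpr uint32_t kDumpHeadBytes = 32;        // DumpHead, one per block core
constexpr uint32_t kDumpTensorHeadBytes = 32;  // DumpTensorHead, one per DumpTensor call

// Hypothetical helper: bytes used by one DumpTensor call, ASSUMING the payload
// is stored unpadded directly after its DumpTensorHead.
constexpr uint32_t DumpCallBytes(uint32_t elemCount, uint32_t elemBytes) {
    return kDumpTensorHeadBytes + elemCount * elemBytes;
}

// Hypothetical helper: space left on a core after the DumpHead and `calls`
// identical DumpTensor calls.
constexpr uint32_t RemainingBytes(uint32_t initialSpace, uint32_t calls,
                                  uint32_t elemCount, uint32_t elemBytes) {
    return initialSpace - kDumpHeadBytes - calls * DumpCallBytes(elemCount, elemBytes);
}

// Sample output above: 32 DT_FLOAT16 elements (2 bytes each), dumped once per core.
static_assert(DumpCallBytes(32, 2) == 96, "32-byte header plus 64-byte payload");
static_assert(RemainingBytes(1048576, 1, 32, 2) == 1048448,
              "matches block_remain_len in the sample output");
```

Under these assumptions, each core in the sample has consumed 128 bytes: 32 for DumpHead, 32 for DumpTensorHead, and 64 for the data.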

Prototype

Printing without tensor shape

void DumpTensor(const LocalTensor<T>& tensor, uint32_t desc, uint32_t dumpSize)
void DumpTensor(const GlobalTensor<T>& tensor, uint32_t desc, uint32_t dumpSize)

Printing with tensor shape

void DumpTensor(const LocalTensor<T>& tensor, uint32_t desc, uint32_t dumpNum, const ShapeInfo& shapeInfo)
void DumpTensor(const GlobalTensor<T> &tensor, uint32_t desc, uint32_t dumpNum, const ShapeInfo& shapeInfo)

Parameters

Parameter	Input/Output	Description
tensor	Input	Tensor to be dumped. For a tensor stored in Unified Buffer, L1 Buffer, or L0C Buffer, pass a LocalTensor. For a tensor stored in Global Memory, pass a GlobalTensor. Supported data types: uint8_t, int8_t, int16_t, uint16_t, int32_t, uint32_t, int64_t, uint64_t, float, half, and bfloat16_t.
desc	Input	User-defined additional information (line numbers or other user-defined numbers).
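dumpSize	Input	Number of elements to be dumped (named dumpNum in the prototypes that take shapeInfo). The total length of the dumped elements, in bytes, must be 32-byte aligned.
shapeInfo	Input	Shape information of the tensor. When provided, the dumped elements are parsed and arranged according to this shape in the printed output.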
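The 32-byte alignment requirement on the dumped length can be checked on the host before choosing a dump size. The following is a minimal sketch, not part of the Ascend C API: IsDumpSizeAligned and AlignDumpSizeUp are hypothetical helper names, and uint16_t stands in for the 2-byte half and bfloat16_t types, which are not standard C++ types.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kDumpAlignBytes = 32;  // alignment stated in the parameter table

// Hypothetical helper: number of elements of type T in one 32-byte unit.
template <typename T>
constexpr uint32_t ElemsPerAlignUnit() {
    static_assert(kDumpAlignBytes % sizeof(T) == 0,
                  "element size must divide the 32-byte alignment unit");
    return static_cast<uint32_t>(kDumpAlignBytes / sizeof(T));
}

// Hypothetical helper: true if dumpSize elements of T occupy a multiple of 32 bytes.
template <typename T>
constexpr bool IsDumpSizeAligned(uint32_t dumpSize) {
    return dumpSize % ElemsPerAlignUnit<T>() == 0;
}

// Hypothetical helper: smallest aligned element count that is >= dumpSize.
template <typename T>
constexpr uint32_t AlignDumpSizeUp(uint32_t dumpSize) {
    return (dumpSize + ElemsPerAlignUnit<T>() - 1) / ElemsPerAlignUnit<T>() *
           ElemsPerAlignUnit<T>();
}

static_assert(IsDumpSizeAligned<float>(8), "8 floats occupy exactly 32 bytes");
static_assert(!IsDumpSizeAligned<float>(5), "5 floats occupy 20 bytes");
static_assert(AlignDumpSizeUp<uint16_t>(20) == 32, "20 two-byte elements round up to 32");
```

Rounding up is only safe when the tensor actually holds at least that many elements; otherwise, round down or dump a smaller aligned count.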

Returns

None

Availability

Constraints

This function is used only for NPU on-board debugging and is supported only in the following scenarios:
- Calling operators in kernel launch mode.
- Calling operators through single-operator APIs.
- Calling single-operator APIs (aclnnxxx) indirectly, through single-operator calls in the PyTorch framework.
Currently, only information about tensors stored in Unified Buffer/L1 Buffer/L0C Buffer/Global Memory can be printed.
For details about the alignment requirements of the operand address offset, see General Restrictions.
The total space used by printf, assert, DumpAccChkPoint, DumpTensor, and the framework dump function cannot exceed 1 MB on each core. Developers must control the amount of data to be printed. If the limit is exceeded, no content is printed.
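To stay within the 1 MB per-core limit, it can help to estimate up front how many DumpTensor calls fit. The sketch below is a rough planning aid with a hypothetical helper name. It assumes that DumpTensor is the only consumer of the dump space, that each call costs a 32-byte DumpTensorHead plus the unpadded payload, and that no shapeInfo is passed. Output from printf, assert, or the framework dump function reduces the budget further.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint64_t kPerCoreDumpBudgetBytes = 1024 * 1024;  // 1 MB limit per core
constexpr uint64_t kDumpHeadBytes = 32;                    // DumpHead, once per core
constexpr uint64_t kDumpTensorHeadBytes = 32;              // DumpTensorHead, once per call

// Hypothetical helper: upper bound on the number of identical DumpTensor calls
// that fit on one core, ASSUMING nothing else writes to the dump space.
constexpr uint64_t MaxDumpTensorCalls(uint64_t elemCount, uint64_t elemBytes) {
    return (kPerCoreDumpBudgetBytes - kDumpHeadBytes) /
           (kDumpTensorHeadBytes + elemCount * elemBytes);
}

// 32 half elements per call: 96 bytes per call.
static_assert(MaxDumpTensorCalls(32, 2) == 10922, "estimate for 32 x 2-byte elements");
// 64 float elements per call: 288 bytes per call.
static_assert(MaxDumpTensorCalls(64, 4) == 3640, "estimate for 64 x 4-byte elements");
```

Because exceeding the limit suppresses all output, it is safer to treat these figures as upper bounds and leave headroom, for example when DumpTensor is called inside a loop.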

Example

Printing without tensor shape

AscendC::DumpTensor(srcLocal, 5, dataLen);

Printing with tensor shape

uint32_t array[] = {static_cast<uint32_t>(8), static_cast<uint32_t>(8)};
AscendC::ShapeInfo shapeInfo(2, array);       // Set dim to 2 and shape to (8,8).
AscendC::DumpTensor(x, 2, 64, shapeInfo);     // Dump 64 elements of x, which are parsed and arranged based on (8,8) of shapeInfo.

Information similar to the following is displayed:

[[150.000000,83.000000,109.000000,166.000000,129.000000,50.000000,150.000000,74.000000],
[135.000000,79.000000,98.000000,134.000000,146.000000,166.000000,112.000000,70.000000],
[122.000000,51.000000,116.000000,68.000000,172.000000,72.000000,102.000000,69.000000],
[136.000000,83.000000,88.000000,88.000000,112.000000,148.000000,79.000000,136.000000],
[133.000000,104.000000,83.000000,71.000000,83.000000,99.000000,103.000000,151.000000],
[98.000000,118.000000,128.000000,83.000000,25.000000,105.000000,179.000000,34.000000],
[104.000000,169.000000,115.000000,113.000000,134.000000,121.000000,88.000000,96.000000],
[29.000000,139.000000,70.000000,40.000000,158.000000,138.000000,72.000000,171.000000]]
