Board Debugging on the NPU

Debugging with DumpTensor and printf

Two APIs are available for printing data during on-board debugging on the NPU: DumpTensor and printf. DumpTensor prints the contents of a specified tensor; printf prints scalar values and strings.

These APIs are supported only in the following scenarios:

  • Calling operators directly through single-operator APIs.
  • Calling a single-operator API (aclnnxxx) indirectly, that is, through single-operator calls in the PyTorch framework.

Usage:

In the operator kernel implementation, call DumpTensor or printf at each point where you want output.
  • In the following DumpTensor example, srcLocal is the tensor to print, 5 is a custom value attached to the output (for example, the current code line number; it appears as desc in the printed result), and dataLen is the number of elements to dump. For details about the usage and restrictions of the DumpTensor API, see DumpTensor.
    DumpTensor(srcLocal, 5, dataLen);
    

    During a dump, a 32-byte DumpHead is written before the dump data of each core to record the core ID and resource usage. A 32-byte DumpTensorHead is also written before the tensor data of each DumpTensor call to record the tensor's information. An example of the printed result is as follows:

    DumpHead: block_id=0, total_block_num=16, block_remain_len=1048448, block_initial_space=1048576, magic=5aa5bccd
    DumpTensor: desc=5, addr=0, data_type=DT_FLOAT16, position=UB
    [40, 82, 60, 11, 24, 55, 52, 60, 31, 86, 53, 61, 47, 54, 34, 62, 84, 29, 48, 95, 16, 0, 20, 77, 3, 55, 69, 73, 75, 40, 35, 13]
    DumpHead: block_id=1, total_block_num=16, block_remain_len=1048448, block_initial_space=1048576, magic=5aa5bccd
    DumpTensor: desc=5, addr=0, data_type=DT_FLOAT16, position=UB
    [58, 84, 22, 54, 41, 93, 1, 45, 50, 9, 72, 81, 23, 96, 86, 45, 36, 9, 36, 34, 78, 7, 2, 29, 47, 26, 13, 24, 27, 55, 90, 5]
    ...
    DumpHead: block_id=7, total_block_num=16, block_remain_len=1048448, block_initial_space=1048576, magic=5aa5bccd
    DumpTensor: desc=5, addr=0, data_type=DT_FLOAT16, position=UB
    [28, 27, 79, 39, 86, 5, 23, 97, 89, 5, 65, 69, 59, 13, 49, 2, 34, 6, 52, 38, 4, 90, 11, 11, 61, 50, 71, 98, 19, 54, 54, 99]
    
  • The following is a printf example. For details about the usage and restrictions of the printf API, see printf.
    printf("fmt string %d", 0x123);
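
The DumpHead fields in the DumpTensor output above can be cross-checked with a little arithmetic. The sketch below is plain host-side C++, not kernel code. It assumes that block_remain_len is the space left in a core's dump buffer after the dump and that tensor data is stored without padding; the text states neither, but both are consistent with the numbers shown. DumpBytesUsed and RemainingDumpBytes are hypothetical helper names.

```cpp
#include <cstddef>

// Header sizes stated in the text: each is 32 bytes.
constexpr std::size_t kDumpHeadBytes = 32;        // written once per core
constexpr std::size_t kDumpTensorHeadBytes = 32;  // written once per DumpTensor call

// Bytes one core's dump consumes for `calls` DumpTensor calls of `elems`
// elements each, at `elemBytes` bytes per element.
// Assumption: tensor data follows its header without padding.
constexpr std::size_t DumpBytesUsed(std::size_t calls, std::size_t elems,
                                    std::size_t elemBytes) {
    return kDumpHeadBytes + calls * (kDumpTensorHeadBytes + elems * elemBytes);
}

// Space left in a core's dump buffer (the assumed meaning of block_remain_len).
constexpr std::size_t RemainingDumpBytes(std::size_t initialSpace,
                                         std::size_t used) {
    return initialSpace - used;
}

// Values from the example output: one DumpTensor call per core,
// 32 DT_FLOAT16 elements at 2 bytes each.
static_assert(DumpBytesUsed(1, 32, 2) == 128, "32 + 32 + 32 * 2");
static_assert(RemainingDumpBytes(1048576, DumpBytesUsed(1, 32, 2)) == 1048448,
              "matches block_remain_len in the example output");
```

Under these assumptions, each additional DumpTensor call of the same size would consume another 96 bytes of the core's 1048576-byte space, which gives a rough way to estimate how much can be dumped before the space runs out.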
    

DumpTensor and printf degrade the runtime performance of the operator, so they are normally used only during debugging. You can disable printing when it is not needed. For details, see DumpTensor and printf.
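
One common way to keep debug output out of performance-sensitive builds is to route print calls through a project-level macro that compiles to nothing when switched off. The sketch below is plain host-side C++ that uses the standard printf as a stand-in. OP_DEBUG_PRINT, OP_DEBUG_PRINTF, and ProcessStep are hypothetical names, not part of the toolkit; the toolkit's own switch for disabling printing is described in the DumpTensor and printf references.

```cpp
#include <cstdio>

// Hypothetical project-level switch: build with -DOP_DEBUG_PRINT=0 to remove
// all debug printing. This is not a toolkit-defined macro.
#ifndef OP_DEBUG_PRINT
#define OP_DEBUG_PRINT 1
#endif

#if OP_DEBUG_PRINT
// Debug build: forward the call to printf unchanged.
#define OP_DEBUG_PRINTF(...) std::printf(__VA_ARGS__)
#else
// Release build: the call and its arguments are compiled out.
#define OP_DEBUG_PRINTF(...) ((void)0)
#endif

// True when debug printing is compiled in.
constexpr bool kDebugPrintEnabled = (OP_DEBUG_PRINT != 0);

// Stand-in for a step of an operator implementation that logs a scalar.
inline void ProcessStep(int value) {
    OP_DEBUG_PRINTF("fmt string %d\n", value);
}
```

Because the macro expands to an empty expression when disabled, the print arguments are not evaluated at all, so they should not carry side effects that the operator depends on.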

Profiling and Analysis on the NPU

The methods in this section apply to operator programs built with Kernel Launch. Before you proceed, make sure you are familiar with the runtime verification of kernel functions described in the related sections.

An operator program that invokes its kernel with the kernel launch symbol <<<>>> is compiled by the BiSheng Compiler into an executable program for the NPU. Run this executable to verify that the operator runs correctly on the NPU. Then run the same NPU-mode executable under the profiling tool to collect profile data for the Ascend C operators executed on the AI processor, and use that data for fine-grained performance tuning. For details about how to use the profiling tool, see the Performance Tuning Tool User Guide.