Operator Precision Problem Occurs During Kernel Function Running and Verification

Symptom

When an operator is run and verified on the NPU, its precision is checked with a comparison method such as md5sum. If the actual output data differs from the golden data, the operator has a precision problem. In this example, md5sum is used for the comparison, and the MD5 value of the golden data differs from that of the actual output data. The details are as follows:

md5sum:
45e17ee4c068a655be2af4d8c3a1f191  output/golden.bin
6a99e41a84b14dd04f32730ceb9a3988  output/output_y.bin
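
An MD5 mismatch only shows that the two files differ, not where. The following is a minimal sketch, not part of the Ascend C toolchain, of a byte-wise comparison that reports the offset of the first differing byte. The helper names ReadBinaryFile and FirstMismatch are hypothetical and were introduced for illustration only.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: reads a whole binary file into memory.
// Returns an empty vector if the file cannot be opened.
std::vector<char> ReadBinaryFile(const std::string& path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) {
        return {};
    }
    return std::vector<char>(std::istreambuf_iterator<char>(in),
                             std::istreambuf_iterator<char>());
}

// Hypothetical helper: returns the offset of the first differing byte,
// or -1 if both buffers are identical. A length difference counts as a
// mismatch at the end of the shorter buffer.
long long FirstMismatch(const std::vector<char>& golden,
                        const std::vector<char>& actual) {
    const std::size_t n = std::min(golden.size(), actual.size());
    for (std::size_t i = 0; i < n; ++i) {
        if (golden[i] != actual[i]) {
            return static_cast<long long>(i);
        }
    }
    return golden.size() == actual.size() ? -1 : static_cast<long long>(n);
}
```

For the files in this example, the call would be FirstMismatch(ReadBinaryFile("output/golden.bin"), ReadBinaryFile("output/output_y.bin")). Because ReadBinaryFile returns an empty vector for a file that cannot be opened, check that both files exist before trusting a result of -1.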

Cause Analysis

Generally, an operator produces incorrect results because its implementation logic is incorrect.

Troubleshooting Procedure

Ascend C provides the twin debugging function, which lets you locate logic errors in an operator implementation through functional verification on the CPU, GDB step-by-step debugging, and printf output. This example shows only one possible scenario to demonstrate the fault locating procedure. In actual use, debug your code based on the actual situation.

  1. Verify the function on the CPU and check whether any error is reported in the logs.

    Build and run the CPU-side verification program as described in Kernel Launch. The precision comparison result on the CPU is as follows:

    md5sum:
    45e17ee4c068a655be2af4d8c3a1f191  output/golden.bin
    5d6e1aec686b28bd3839dbcd5caaa8b2  output/output_y.bin
    

    The CPU output is also inconsistent with the golden data.

    Check whether an error is reported in the log. You can search for the keyword failed. For example, the following log indicates that the error occurs where the LeakyRelu API is called in the code.

    leakyrelu_custom_cpu: /usr/local/Ascend/CANN-7.0/x86_64-linux/tikcpp/tikcfw/interface/kernel_operator_vec_binary_scalar_intf.h:447: void AscendC::LeakyRelu(const AscendC::LocalTensor<T>&, const AscendC::LocalTensor<T>&, const T&, const int32_t&) [with T = float16::Fp16T; int32_t = int]: Assertion `false && "check vlrelu instr failed"' failed
    

    Generally, such error logs identify only the line of code where the error is reported, not the specific cause. Use GDB debugging or printf output to locate the fault further.

  2. Perform GDB debugging. The following example starts the LeakyRelu operator program on the CPU side under GDB. In this sample, the program stops at the exception as soon as it is run under GDB, so you can view the call stack to analyze and locate the fault. In other scenarios, you can debug using basic GDB operations such as setting breakpoints. For details about how to use GDB to debug an Ascend C program, see Debugging on the CPU.
    1. Use the GDB to start the program to be debugged and access the GDB interface for debugging.
      gdb leakyrelu_custom_cpu
    2. Debug the child process separately.
      (gdb) set follow-fork-mode child
    3. Run the program.
      (gdb) r
    4. Run the bt command to view the program call stack.
      (gdb) bt
    5. View the stack frame of a specific layer and print the values of specific variables. In this example, tileLength is 1024, meaning that 1024 half-type elements need to be processed, which takes 1024 x sizeof(half) = 2048 bytes. Printing the input Tensor xLocal shows dataLen = 1024, meaning that the LocalTensor holds only 1024 bytes, so only 1024 bytes of data can be computed. The two lengths do not match, which locates the fault.
      (gdb) f 5
      #5  0x000055555555d364 in KernelLeakyRelu::Compute (this=0x7fffffffd7d0, progress=0) at /root/AscendC_DemoCode-master/precision-error/vector/leakyrelu_custom.cpp:59
      59              LeakyRelu(yLocal, xLocal, scalar, tileLength);
      (gdb) p tileLength
      $1 = 1024
      (gdb) p xLocal
      $2 = {<AscendC::BaseTensor<float16::Fp16T>> = {<No data fields>}, address_ = {logicPos = 9 '\t', bufferHandle = 0x7fffffffd930 "\003\005\377\377", dataLen = 1024, bufferAddr = 0, absAddr = ...}
  3. Use printf to print logs. Add print statements for the relevant variables at appropriate positions. The sample code is as follows:
    printf("xLocal size: %d\n", xLocal.GetSize());
    printf("tileLength: %d\n", tileLength);

    In the following log, tileLength is 1024, meaning that 1024 half-type elements need to be processed. The size of Tensor xLocal is 512, meaning that it can hold only 512 half-type elements. The two lengths do not match, which locates the fault.

    xLocal size: 512
    tileLength: 1024
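
Both the GDB and printf results come down to the same arithmetic: a buffer of 1024 bytes holds 1024 / sizeof(half) = 512 half-type elements, which is fewer than the 1024 elements requested by tileLength. The following is a minimal sketch of that check in plain C++. It is not an Ascend C API: HalfStorage, ElementCapacity, and FitsInBuffer are hypothetical names, and HalfStorage is only a 2-byte stand-in for the half type.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: a 2-byte stand-in for the half type. Only its size matters here.
using HalfStorage = std::uint16_t;

// Hypothetical helper: number of elements of type T that fit in bufferBytes bytes.
template <typename T>
constexpr std::size_t ElementCapacity(std::size_t bufferBytes) {
    return bufferBytes / sizeof(T);
}

// Hypothetical helper: true if calcCount elements of type T fit in the buffer.
template <typename T>
constexpr bool FitsInBuffer(std::size_t calcCount, std::size_t bufferBytes) {
    return calcCount <= ElementCapacity<T>(bufferBytes);
}

// The values from this example: 1024 elements requested, 1024-byte buffer.
static_assert(ElementCapacity<HalfStorage>(1024) == 512,
              "a 1024-byte buffer holds 512 two-byte elements");
static_assert(!FitsInBuffer<HalfStorage>(1024, 1024),
              "1024 two-byte elements do not fit in 1024 bytes");
```

With a 2048-byte buffer, FitsInBuffer<HalfStorage>(1024, 2048) evaluates to true, which matches the 1024 x sizeof(half) = 2048 bytes computed in the GDB step.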