AllocTensor or FreeTensor Fails During Runtime Verification

Symptom

The system hangs during runtime verification of the kernel function on the NPU, or AllocTensor or FreeTensor fails during runtime verification of the kernel function on the CPU. The error log and call stack are as follows:

[ERROR][Core_0][/usr/local/Ascend/latest/x86_64-linux/tikcpp/tikcfw/interface/kernel_tpipe.h:730][AllocEventID][321678] current size is 4, max buffer number in same queue position is 4
[ERROR][CORE_0][pid 321674] error happened! =========
SIGABRT Signal (Abort Signal from abort) catched, backtrace info:
[#0] 0x000000000001e7c0: handler(int) at /usr/local/Ascend/latest/tools/tikicpulib/lib/include/kern_fwk.h:105
[#1] 0x0000000000017c4f: signed char AscendC::TPipe::AllocEventID<(AscendC::HardEvent)5>() at /usr/local/Ascend/latest/x86_64-linux/tikcpp/tikcfw/interface/kernel_tpipe.h:733
[#2] 0x000000000001426d: AscendC::TQueBind<(AscendC::TPosition)0, (AscendC::TPosition)9, 4, 0>::FreeBuffer(unsigned char*) at /usr/local/Ascend/latest/x86_64-linux/tikcpp/tikcfw/interface/kernel_tpipe.h:1217
[#3] 0x0000000000011058: void AscendC::TQueBind<(AscendC::TPosition)0, (AscendC::TPosition)9, 4, 0>::FreeTensor<float16::Fp16T>(AscendC::LocalTensor<float16::Fp16T>&) at /usr/local/Ascend/latest/x86_64-linux/tikcpp/tikcfw/interface/kernel_tpipe.h:1237
[#4] 0x000000000000dfde: KernelAdd::Compute(int) at /home/xxxx/xxxx.cpp:59
[#5] 0x000000000000dd1c: KernelAdd::Process() at /home/xxxx/xxxx.cpp:37 (discriminator 2)
...

Cause Analysis

The log message "current size is 4, max buffer number in same queue position is 4" indicates that the number of QUE buffers at the same TPosition has exceeded the upper limit.

The maximum number of QUE buffers at the same TPosition depends on the AI processor model. Buffer allocation must satisfy the following constraint:

Atlas Training Series Product: The maximum number is 4.

If this constraint is not met, resource allocation may fail when AllocTensor or FreeTensor is called. Example:

TQue<TPosition::VECIN, 1> que0;
TQue<TPosition::VECIN, 1> que1;
// Not recommended:
// For example, the operator has six inputs, so six buffers are allocated.
// Three buffers are allocated for each of que0 and que1, giving six buffers at the VECIN position.
// Atlas Training Series Product: The maximum number of QUE buffers at the same TPosition is 4. If the number exceeds 4, resource allocation may fail when AllocTensor or FreeTensor is called.
pipe.InitBuffer(que0, 3, len);
pipe.InitBuffer(que1, 3, len);
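
The constraint above can be checked on paper before the kernel is written. The following is a minimal host-side sketch in plain C++, not Ascend C API code: `kMaxBuffersPerPosition`, `TotalBuffers`, and `FitsInPosition` are hypothetical names introduced only for illustration, and the limit of 4 is the value quoted in this topic for Atlas Training Series Product. It assumes, as in the example above, that the buffer counts of all queues bound to the same TPosition add up.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical constant: the per-TPosition limit quoted in this topic
// for Atlas Training Series Product.
const int32_t kMaxBuffersPerPosition = 4;

// Sums the buffer counts (the second argument of each InitBuffer call)
// for all queues bound to one TPosition.
int32_t TotalBuffers(const std::vector<int32_t>& bufferNums)
{
    int32_t total = 0;
    for (int32_t n : bufferNums) {
        assert(n >= 0);  // a negative buffer count is meaningless
        total += n;
    }
    return total;
}

// Returns true if the combined buffer count stays within the limit.
bool FitsInPosition(const std::vector<int32_t>& bufferNums)
{
    return TotalBuffers(bufferNums) <= kMaxBuffersPerPosition;
}
```

With the counts from the example, `FitsInPosition({3, 3})` is false because 3 + 3 = 6 exceeds 4, whereas `FitsInPosition({1, 1})` is true.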

Solution

If multiple buffers are needed, merge them into a single buffer and access each part through an offset. For example:

// Recommended:
// Merge the buffers of each queue into one buffer and access each part through an offset.
pipe.InitBuffer(que0, 1, len * 3);
pipe.InitBuffer(que1, 1, len * 3);
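/*
* Three local tensors share the single buffer in que0. The address of local1 is the start address of the buffer.
* The address of local2 is the address of local1 with offset len, and the address of local3 is the address of local1 with offset len * 2.
* Note: the offsets passed to operator[] are counted in elements of T, while the InitBuffer length is in bytes,
* so len is assumed to be chosen such that all three tensors fit in the merged buffer.
 */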
int32_t offset1 = len;
int32_t offset2 = len * 2;
LocalTensor<T> local1 = que0.AllocTensor<T>();
LocalTensor<T> local2 = local1[offset1];
LocalTensor<T> local3 = local1[offset2];
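
The offset scheme can be illustrated off-device with a minimal plain C++ sketch. This is not Ascend C code: `View` and `SplitBuffer` are hypothetical stand-ins for LocalTensor and the offset indexing shown above, and a `std::vector` plays the role of the merged queue buffer. It only shows that the parts are non-overlapping windows into one allocation, each starting `len` elements after the previous one.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a LocalTensor view: a pointer into a shared
// buffer plus the number of elements the view covers.
template <typename T>
struct View {
    T* data;
    std::size_t size;
};

// Splits one merged buffer into `parts` consecutive views of `len`
// elements each. View i starts at element offset i * len.
template <typename T>
std::vector<View<T>> SplitBuffer(std::vector<T>& merged, std::size_t len, std::size_t parts)
{
    // The merged buffer must be large enough to hold every part.
    assert(merged.size() >= len * parts);
    std::vector<View<T>> views;
    for (std::size_t i = 0; i < parts; ++i) {
        views.push_back(View<T>{merged.data() + i * len, len});
    }
    return views;
}
```

For a merged buffer of `3 * len` elements, `SplitBuffer(merged, len, 3)` yields views that correspond to local1, local2, and local3 in the example above. Only one buffer is allocated, so the queue contributes a single buffer toward the per-TPosition limit.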
