AllocTensor or FreeTensor Fails During Runtime Verification

Symptom

The kernel function hangs during runtime verification on the NPU, or AllocTensor or FreeTensor fails during runtime verification on the CPU. The error log and call stack are as follows:

[ERROR][Core_0][/usr/local/Ascend/cann/x86_64-linux/tikcpp/tikcfw/interface/kernel_tpipe.h:730][AllocEventID][321678] current size is 4, max buffer number in same queue position is 4
[ERROR][CORE_0][pid 321674] error happened! =========
SIGABRT Signal (Abort Signal from abort) catched, backtrace info:
[#0] 0x000000000001e7c0: handler(int) at /usr/local/Ascend/cann/tools/tikicpulib/lib/include/kern_fwk.h:105
[#1] 0x0000000000017c4f: signed char AscendC::TPipe::AllocEventID<(AscendC::HardEvent)5>() at /usr/local/Ascend/cann/x86_64-linux/tikcpp/tikcfw/interface/kernel_tpipe.h:733
[#2] 0x000000000001426d: AscendC::TQueBind<(AscendC::TPosition)0, (AscendC::TPosition)9, 4, 0>::FreeBuffer(unsigned char*) at /usr/local/Ascend/cann/x86_64-linux/tikcpp/tikcfw/interface/kernel_tpipe.h:1217
[#3] 0x0000000000011058: void AscendC::TQueBind<(AscendC::TPosition)0, (AscendC::TPosition)9, 4, 0>::FreeTensor<float16::Fp16T>(AscendC::LocalTensor<float16::Fp16T>&) at /usr/local/Ascend/cann/x86_64-linux/tikcpp/tikcfw/interface/kernel_tpipe.h:1237
[#4] 0x000000000000dfde: KernelAdd::Compute(int) at /home/xxxx/xxxx.cpp:59
[#5] 0x000000000000dd1c: KernelAdd::Process() at /home/xxxx/xxxx.cpp:37 (discriminator 2)
...

Cause Analysis

The log message "current size is 4, max buffer number in same queue position is 4" indicates that the number of queue buffers in use on the same TPosition has exceeded the upper limit.

The number of tensors that can be allocated consecutively by calling the AllocTensor API is limited, and the limit applies across all queues on the same TPosition. The limit depends on the AI processor model, and buffer allocation must satisfy the following constraints:

Atlas training products: The maximum number is 4.

Atlas inference product's AI Core: The maximum number is 8.

Atlas inference product's Vector Core: The maximum number is 8.

Atlas A2 training products/Atlas A2 inference products: The maximum number is 8.

Atlas A3 training products/Atlas A3 inference products: The maximum number is 8.

Atlas 200I/500 A2 inference products: The maximum number is 8.

If these constraints are not met, resource allocation may fail when AllocTensor or FreeTensor is used. For example:

AscendC::TQue<AscendC::TPosition::VECIN, 1> que0;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que1;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que2;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que3;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que4;
AscendC::TQue<AscendC::TPosition::VECIN, 1> que5;
// For example, if the operator has six inputs, six buffers are allocated.
// Allocate one buffer for each of the six queues que0 to que5. The total number of buffers allocated on VECIN TPosition is 6.
// Assume that the maximum number of buffers that can be allocated consecutively on the same position is 4. If the number exceeds 4, resource allocation may fail when AllocTensor or FreeTensor is used.
// Abnormal behaviors such as suspension may occur on the NPU. In the CPU debugging scenario, an error is reported.
pipe.InitBuffer(que0, 1, len);
pipe.InitBuffer(que1, 1, len);
pipe.InitBuffer(que2, 1, len);
pipe.InitBuffer(que3, 1, len);
pipe.InitBuffer(que4, 1, len);
pipe.InitBuffer(que5, 1, len);

AscendC::LocalTensor<T> local1 = que0.AllocTensor<T>();
AscendC::LocalTensor<T> local2 = que1.AllocTensor<T>();
AscendC::LocalTensor<T> local3 = que2.AllocTensor<T>();
AscendC::LocalTensor<T> local4 = que3.AllocTensor<T>();
// The fifth AllocTensor call fails to allocate resources. The number of tensors allocated on the same TPosition at the same time exceeds 4.
AscendC::LocalTensor<T> local5 = que4.AllocTensor<T>();
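
The shared limit can be pictured with a small host-side model. The sketch below is plain C++, not Ascend C: the class PositionBudget and the function CountSuccessfulAllocs are hypothetical names introduced only for illustration, and the sketch assumes that freeing a tensor returns one slot to the budget of its TPosition. It does not reproduce the event-ID mechanism visible in the call stack.

```cpp
#include <cassert>
#include <cstdint>

// Toy model only: this is NOT the Ascend C implementation. It mimics the
// documented rule that all queues on one TPosition share a single budget of
// tensors that may be allocated at the same time.
class PositionBudget {
public:
    explicit PositionBudget(int32_t maxOutstanding) : max_(maxOutstanding) {
        assert(maxOutstanding > 0);
    }

    // Stands in for an AllocTensor call on any queue of this position.
    // Returns false where the real API would fail as shown in the log.
    bool Alloc() {
        if (outstanding_ >= max_) {
            return false;
        }
        ++outstanding_;
        return true;
    }

    // Stands in for a FreeTensor call (assumption: one slot is returned).
    void Free() {
        if (outstanding_ > 0) {
            --outstanding_;
        }
    }

    int32_t Outstanding() const { return outstanding_; }

private:
    int32_t max_;
    int32_t outstanding_ = 0;
};

// Counts how many of `requested` back-to-back allocations succeed.
inline int32_t CountSuccessfulAllocs(PositionBudget& budget, int32_t requested) {
    int32_t ok = 0;
    for (int32_t i = 0; i < requested; ++i) {
        if (budget.Alloc()) {
            ++ok;
        }
    }
    return ok;
}
```

With a budget of 4, calling CountSuccessfulAllocs(budget, 5) reports four successful allocations, which corresponds to the fifth AllocTensor call failing in the example above.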

Solution

If the operator needs multiple buffers on the same TPosition, combine them into one larger buffer and access each part through an offset. An example is as follows.

// Recommended fix: initialize one combined buffer per queue and address each tensor within it by offset.
pipe.InitBuffer(que0, 1, len * 3);
pipe.InitBuffer(que1, 1, len * 3);
/*
 * Three local tensors are obtained from a single allocation. The address of local1 is the start address of the buffer in que0.
 * The address of local2 is the address of local1 plus an offset of len,
 * and the address of local3 is the address of local1 plus an offset of len * 2.
 */
int32_t offset1 = len;
int32_t offset2 = len * 2;
AscendC::LocalTensor<T> local1 = que0.AllocTensor<T>();
AscendC::LocalTensor<T> local2 = local1[offset1];
AscendC::LocalTensor<T> local3 = local1[offset2];
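
When combining buffers, it helps to compute the total size and the per-tensor offsets in one place. The following plain C++ sketch uses a hypothetical helper, PlanCombinedBuffer, and a hypothetical struct, CombinedLayout, neither of which is part of Ascend C. It assumes that the length passed to InitBuffer is expressed in bytes and that the offset passed to the LocalTensor subscript operator is expressed in elements; check the API reference for your CANN version before relying on either unit.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper, not part of Ascend C: plans how `count` equally sized
// tensors share one combined buffer.
struct CombinedLayout {
    uint32_t totalBytes;                   // size to request for the single buffer (assumed bytes)
    std::vector<uint32_t> elementOffsets;  // start of each sub-tensor (assumed element units)
};

inline CombinedLayout PlanCombinedBuffer(uint32_t count,
                                         uint32_t elemsPerTensor,
                                         uint32_t elemSize) {
    assert(count > 0 && elemsPerTensor > 0 && elemSize > 0);
    CombinedLayout layout{count * elemsPerTensor * elemSize, {}};
    layout.elementOffsets.reserve(count);
    for (uint32_t i = 0; i < count; ++i) {
        // Sub-tensor i starts right after the i tensors that precede it.
        layout.elementOffsets.push_back(i * elemsPerTensor);
    }
    return layout;
}
```

For three tensors of 128 two-byte elements each, the plan requests 768 bytes in total and places the sub-tensors at element offsets 0, 128, and 256. Under the stated assumptions, totalBytes would be the value passed to InitBuffer and each entry of elementOffsets would be used to subscript the first local tensor.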
