AllocTensor

Supported Products

Product                                                     Supported/Unsupported
Atlas A3 training products / Atlas A3 inference products
Atlas A2 training products / Atlas A2 inference products
Atlas 200I/500 A2 inference products
Atlas inference product's AI Core
Atlas inference product's Vector Core                       x
Atlas training products

Function Usage

Allocates a tensor from the queue. The size of the allocated tensor equals the length of each buffer configured in the InitBuffer call.
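
Because the buffer length passed to InitBuffer is given in bytes (see the examples below), the number of elements an allocated tensor can hold depends on the data type T. The following host-side sketch only illustrates that arithmetic. ElementsPerTensor is a hypothetical helper, not part of the Ascend C API, and uint16_t is used as a 2-byte stand-in for half.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of Ascend C): number of elements of type T
// that fit into one buffer of bufferBytes bytes.
template <typename T>
constexpr std::size_t ElementsPerTensor(std::size_t bufferBytes)
{
    return bufferBytes / sizeof(T);
}

// uint16_t is a 2-byte stand-in for half in this host-side sketch.
static_assert(ElementsPerTensor<std::uint16_t>(1024) == 512,
              "a 1024-byte buffer holds 512 two-byte elements");
```

Under these assumptions, a 1024-byte buffer corresponds to a tensor of 512 half elements.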

Prototype

  • Non-inplace API: Constructs a new tensor as the object for memory management.
    template <typename T>
    __aicore__ inline LocalTensor<T> AllocTensor()
    
  • Inplace API: Uses the input tensor directly as the object for memory management, which reduces the overhead of repeatedly creating tensors. For usage guidance, see "How Do I Use Tensor In-place Operations to Improve Operator Performance?"
    template <typename T>
    __aicore__ inline void AllocTensor(LocalTensor<T>& tensor)
    

Parameters

Table 1 Parameters in the template

Parameter    Description
T            Tensor data type.

Table 2 Parameters

Parameter    Input/Output    Meaning
tensor       Input           For the inplace API, the LocalTensor to be used as the object for memory management.

Restrictions

  • The number of tensors that can be allocated consecutively with AllocTensor (that is, held at the same time) across all queues of the same TPosition is limited. The limit depends on the AI processor model and must be respected during buffer allocation:

    Atlas training products: The maximum number is 4.

    Atlas inference product's AI Core: The maximum number is 8.

    Atlas inference product's Vector Core: The maximum number is 8.

    Atlas A2 training products / Atlas A2 inference products: The maximum number is 8.

    Atlas A3 training products / Atlas A3 inference products: The maximum number is 8.

    Atlas 200I/500 A2 inference products: The maximum number is 8.

  • The tensor allocated by the non-inplace API may contain random values.
  • For the non-inplace API, the depth template parameter of TQueBind must be set to a non-zero value. For the inplace API, the depth template parameter of TQueBind must be set to 0.
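
The per-TPosition limit can be checked on the host before an operator is written. The sketch below is a conservative model, not the Ascend C API: Position, QueueConfig, and FitsLimit are hypothetical names. It sums the buffers configured for all queues at one position and compares the total with the limit for the target product. If the total stays within the limit, the number of tensors held at the same time cannot exceed it.

```cpp
#include <initializer_list>

// Host-side model only: these types are hypothetical and are not Ascend C types.
enum class Position { VECIN, VECOUT };

struct QueueConfig {
    Position position;  // position the queue is bound to
    int bufferNum;      // number of buffers requested for the queue
};

// Returns true if the buffers configured at `position`, summed over all
// queues, stay within `maxBuffers` (for example 4 or 8, depending on the product).
inline bool FitsLimit(std::initializer_list<QueueConfig> queues, Position position, int maxBuffers)
{
    int total = 0;
    for (const QueueConfig& q : queues) {
        if (q.position == position) {
            total += q.bufferNum;
        }
    }
    return total <= maxBuffers;
}
```

With a limit of 4, six single-buffer queues at the same position fail this check, whereas two such queues pass it.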

Returns

The non-inplace API returns a LocalTensor object, and the inplace API returns no value.

Example

  • Example 1
    // Use AllocTensor to allocate tensors.
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECOUT, 2> que;
    int num = 2;
    int len = 1024;
    pipe.InitBuffer(que, num, len); // Two buffers are allocated by InitBuffer, and the size of each buffer is 1024 bytes.
    AscendC::LocalTensor<half> tensor1 = que.AllocTensor<half>(); // The length of the tensor allocated by AllocTensor is 1024 bytes.
    
  • Example 2
    // Use AllocTensor consecutively (pipe and len are defined as in Example 1; T is the tensor data type).
    AscendC::TQue<AscendC::TPosition::VECIN, 1> que0;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> que1;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> que2;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> que3;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> que4;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> que5;
    // Not recommended:
    // Suppose the operator has six inputs, so six buffers are needed.
    // One buffer is allocated for each of the six queues que0 to que5, so the total number of buffers allocated at the VECIN TPosition is 6.
    // Assume that at most 4 buffers can be allocated consecutively at the same TPosition. If this number is exceeded, resource allocation may fail when AllocTensor or FreeTensor is called.
    // On the NPU, abnormal behavior such as a hang may occur. In the CPU debugging scenario, an error is reported.
    pipe.InitBuffer(que0, 1, len);
    pipe.InitBuffer(que1, 1, len);
    pipe.InitBuffer(que2, 1, len);
    pipe.InitBuffer(que3, 1, len);
    pipe.InitBuffer(que4, 1, len);
    pipe.InitBuffer(que5, 1, len);
    
    AscendC::LocalTensor<T> local1 = que0.AllocTensor<T>();
    AscendC::LocalTensor<T> local2 = que1.AllocTensor<T>();
    AscendC::LocalTensor<T> local3 = que2.AllocTensor<T>();
    AscendC::LocalTensor<T> local4 = que3.AllocTensor<T>();
    // The fifth AllocTensor call fails to allocate resources because more than 4 tensors would be allocated at the same TPosition at the same time.
    AscendC::LocalTensor<T> local5 = que4.AllocTensor<T>();
    
    // Recommended solution (an alternative to the code above, not a continuation of it):
    // When multiple buffers are needed, combine them into one larger buffer and access each part through an offset.
    pipe.InitBuffer(que0, 1, len * 3);
    pipe.InitBuffer(que1, 1, len * 3);
    /*
     * Three local tensors are obtained from a single buffer. The address of local1 is the start address of the buffer in que0.
     * The address of local2 is the address of local1 offset by len, and the address of local3 is the address of local1
     * offset by len * 2.
     */
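    // len is a size in bytes, whereas the index passed to operator[] is assumed to be counted in elements, so convert bytes to elements.
    int32_t offset1 = len / static_cast<int32_t>(sizeof(T));
    int32_t offset2 = len * 2 / static_cast<int32_t>(sizeof(T));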
    AscendC::LocalTensor<T> local1 = que0.AllocTensor<T>();
    AscendC::LocalTensor<T> local2 = local1[offset1];
    AscendC::LocalTensor<T> local3 = local1[offset2];
    
  • Example 3: inplace interface.
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 0> que; // depth is 0 for the inplace API
    int num = 2;
    int len = 1024;
    pipe.InitBuffer(que, num, len); // Two buffers are allocated by InitBuffer, and the size of each buffer is 1024 bytes.
    AscendC::LocalTensor<half> tensor1;
    que.AllocTensor<half>(tensor1); // AllocTensor allocates a tensor of 1024 bytes.
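
For the combined-buffer approach in Example 2, the slice offsets depend on the unit in which a LocalTensor is indexed. The sketch below assumes that the index passed to operator[] is counted in elements, while buffer lengths are given in bytes; verify this against the LocalTensor reference for your version. SliceOffsets is a hypothetical host-side helper, not part of the Ascend C API, and uint16_t stands in for the 2-byte half type.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Ascend C): start offsets, in elements of T,
// of `parts` equally sized slices of partBytes bytes inside one combined buffer.
template <typename T>
std::vector<std::size_t> SliceOffsets(std::size_t partBytes, std::size_t parts)
{
    std::vector<std::size_t> offsets;
    offsets.reserve(parts);
    for (std::size_t i = 0; i < parts; ++i) {
        // Convert the byte offset of slice i into an element offset.
        offsets.push_back(i * partBytes / sizeof(T));
    }
    return offsets;
}
```

Under these assumptions, three 1024-byte slices of 2-byte elements start at element offsets 0, 512, and 1024.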