Gather

Applicability

Product	Supported
Atlas A3 training products / Atlas A3 inference products	√
Atlas A2 training products / Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	√
Atlas inference product's AI Core	√
Atlas inference product's Vector Core	x
Atlas training products	x

Function

Gathers elements from the input tensor into the result tensor based on the provided address offset tensor.
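
As a rough illustration of the behavior described above, the following host-side C++ sketch models the gather on ordinary memory. It is not the Ascend C API: GatherRef and its use of std::vector are hypothetical, and it assumes that srcBaseAddr is simply added to each byte offset in srcOffset.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Host-side reference model (NOT the Ascend C API). The name GatherRef is
// hypothetical. Assumption: the source address of element i is
// srcBaseAddr + srcOffset[i], both in bytes, relative to the start of src.
template <typename T>
std::vector<T> GatherRef(const std::vector<T>& src,
                         const std::vector<uint32_t>& srcOffset,
                         uint32_t srcBaseAddr,
                         uint32_t count)
{
    const unsigned char* bytes =
        reinterpret_cast<const unsigned char*>(src.data());
    std::vector<T> dst(count);
    for (uint32_t i = 0; i < count; ++i) {
        // Offsets are byte offsets, not element indices.
        const std::size_t addr =
            static_cast<std::size_t>(srcBaseAddr) + srcOffset[i];
        // memcpy avoids alignment and aliasing problems on the host.
        std::memcpy(&dst[i], bytes + addr, sizeof(T));
    }
    return dst;
}
```

With a four-element int32_t source and offsets {12, 0, 8}, this model picks the fourth, first, and third elements, because each int32_t occupies 4 bytes.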

Prototype

Computation of the first n data elements of a tensor

template <typename T>
__aicore__ inline void Gather(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LocalTensor<uint32_t>& srcOffset, const uint32_t srcBaseAddr, const uint32_t count)

High-dimensional tensor sharding computation

Bitwise mask mode

template <typename T>
__aicore__ inline void Gather(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LocalTensor<uint32_t>& srcOffset, const uint32_t srcBaseAddr, const uint64_t mask[], const uint8_t repeatTime, const uint16_t dstRepStride)

Contiguous mask mode

template <typename T>
__aicore__ inline void Gather(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LocalTensor<uint32_t>& srcOffset, const uint32_t srcBaseAddr, const uint64_t mask, const uint8_t repeatTime, const uint16_t dstRepStride)

Parameters

**Table 1** Template parameters
Parameter	Description
T	Operand data type. For the Atlas A3 training products / Atlas A3 inference products, the supported data types are int16_t, uint16_t, int32_t, uint32_t, float, half, and bfloat16_t. For the Atlas A2 training products / Atlas A2 inference products, the supported data types are int16_t, uint16_t, int32_t, uint32_t, float, half, and bfloat16_t. For the Atlas 200I/500 A2 inference products, the supported data types are uint8_t, int8_t, uint16_t, int16_t, half, uint32_t, int32_t, and float. For the Atlas inference product's AI Core, the supported data types are int16_t, uint16_t, int32_t, uint32_t, float, and half.

**Table 2** Parameters
Parameter	Input/Output	Meaning
dst	Output	Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned.
src	Input	Source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. Its data type must match that of dst.
srcOffset	Input	Address offset of each element in src. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. Each offset is in bytes and is relative to the base address of src. The values must meet the following requirements: (1) Each offset must be aligned to the bit width of the src element type. (2) The final address must not exceed the UB range. (3) For the Atlas inference product's AI Core, the Atlas A2 training products / Atlas A2 inference products, and the Atlas A3 training products / Atlas A3 inference products, the offset must be within the range of uint32_t. (4) For the Atlas 200I/500 A2 inference products, the offset must be within [0, 2¹⁶ – 1] when the operand is 8-bit, within [0, 2¹⁷ – 1] when the operand is 16-bit, and within the range of uint32_t when the operand is 32-bit or 64-bit.
srcBaseAddr	Input	Base address of src, which specifies the start position of the source operand in the gather operation. The unit is byte. The value must be aligned to the bit width of the src element type; otherwise, unexpected behavior occurs.
count	Input	Number of data elements to be processed.
mask/mask[]	Input	Controls which elements participate in computation in each iteration. Contiguous mode: mask is the number of contiguous elements that participate in computation. The value range depends on the operand data type, because the maximum number of elements processed in each repeat varies with it: mask ∈ [1, 128] when the operand is 8-bit or 16-bit, mask ∈ [1, 64] when the operand is 32-bit, and mask ∈ [1, 32] when the operand is 64-bit. Bitwise mode: mask[] is a uint64_t array of length 2 that selects elements bit by bit. A bit value of 1 means the corresponding element participates in computation, and 0 means it does not. For example, if mask = [8, 0], then because 8 = 0b1000, only the fourth element participates in computation. The value range again depends on the operand data type: when the operand is 8-bit or 16-bit, mask[0] and mask[1] ∈ [0, 2⁶⁴ – 1] and cannot both be 0; when the operand is 32-bit, mask[1] is 0 and mask[0] ∈ (0, 2⁶⁴ – 1]; when the operand is 64-bit, mask[1] is 0 and mask[0] ∈ (0, 2³² – 1].
repeatTime	Input	Number of instruction repeats. The value range is [0, 255]. Eight data blocks (32 bytes each) are collected in each repeat. For the Atlas 200I/500 A2 inference products, when the operand is 8-bit, four data blocks (32 bytes each) are collected in each repeat.
dstRepStride	Input	Address stride of the destination operand between adjacent iterations (repeats). The unit is DataBlock (32 bytes).
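
The mask and repeatTime rules in Table 2 can be sketched with a few host-side helpers. This is only an illustration under stated assumptions: the helper names are hypothetical, bit i of mask[i / 64] is assumed to correspond to element i of a repeat, contiguous mode is assumed to select the leading elements, and the 8-bit operand case (which uses four data blocks per repeat on the Atlas 200I/500 A2 inference products) is not covered.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side helpers; none of these are Ascend C APIs.

// Elements covered by one repeat: eight data blocks of 32 bytes each.
// Assumption: elemBytes is 2, 4, or 8 (the 8-bit case is not handled).
constexpr uint32_t ElementsPerRepeat(std::size_t elemBytes)
{
    return static_cast<uint32_t>(8 * 32 / elemBytes);
}

// Bitwise mask that selects the first n elements of a repeat, which this
// sketch treats as equivalent to contiguous mask n (assumption).
inline std::array<uint64_t, 2> ContiguousToBitwise(uint32_t n)
{
    std::array<uint64_t, 2> mask = {0, 0};
    for (uint32_t i = 0; i < n && i < 128; ++i) {
        mask[i / 64] |= (uint64_t{1} << (i % 64));
    }
    return mask;
}

// True if element i (0-based) of a repeat is selected by the bitwise mask.
inline bool Participates(const std::array<uint64_t, 2>& mask, uint32_t i)
{
    return i < 128 && ((mask[i / 64] >> (i % 64)) & 1u) != 0;
}

// Repeats needed to cover count elements when every element participates.
constexpr uint32_t RepeatsFor(uint32_t count, std::size_t elemBytes)
{
    return (count + ElementsPerRepeat(elemBytes) - 1) /
           ElementsPerRepeat(elemBytes);
}
```

For a half operand, ElementsPerRepeat(2) gives 128, which matches the upper bound of the contiguous mask for 16-bit operands. Because repeatTime is limited to [0, 255], a result from RepeatsFor that exceeds 255 would have to be split across several calls.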

Restrictions

For details about the operand address alignment requirements, see General Address Alignment Restrictions.
For details about the constraints on operand address overlapping, see General Address Overlapping Restrictions.
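
The srcOffset requirements in Table 2 can be checked on the host before a kernel is launched. The sketch below is a hypothetical helper, not part of Ascend C. It assumes that addresses are measured in bytes from the start of src and that the caller supplies limitBytes, the number of bytes that may legally be addressed from that start (the UB size is not specified in this section).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side validation; not an Ascend C API.
// limitBytes is an assumption: the number of bytes addressable from the
// start of src without leaving the UB range.
inline bool OffsetsAreValid(const std::vector<uint32_t>& srcOffset,
                            uint32_t srcBaseAddr,
                            std::size_t elemBytes,
                            uint64_t limitBytes)
{
    // The base address must be aligned to the element size.
    if (elemBytes == 0 || srcBaseAddr % elemBytes != 0) {
        return false;
    }
    for (uint32_t off : srcOffset) {
        // Each offset must be aligned to the element size.
        if (off % elemBytes != 0) {
            return false;
        }
        // The last byte read must stay inside the addressable range.
        const uint64_t end = uint64_t{srcBaseAddr} + off + elemBytes;
        if (end > limitBytes) {
            return false;
        }
    }
    return true;
}

// Largest offset allowed on the Atlas 200I/500 A2 inference products,
// following the per-bit-width ranges listed in Table 2.
constexpr uint64_t MaxOffset200I500A2(std::size_t elemBytes)
{
    return elemBytes == 1 ? (uint64_t{1} << 16) - 1
         : elemBytes == 2 ? (uint64_t{1} << 17) - 1
                          : 0xFFFFFFFFull;
}
```

For 128 half elements (256 bytes), an offset of 254 passes the check, while an odd offset or an offset of 256 is rejected.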

Example

#include "kernel_operator.h"
template <typename T>
class GatherTest {
public:
    __aicore__ inline GatherTest() {}
    __aicore__ inline void Init(__gm__ uint8_t* dstGm, __gm__ uint8_t* srcGm,
        __gm__ uint8_t* srcOffsetGm, const uint32_t count)
    {
        m_elementCount = count;
        m_dstGlobal.SetGlobalBuffer((__gm__ T*)dstGm);
        m_srcGlobal.SetGlobalBuffer((__gm__ T*)srcGm);
        m_srcOffsetGlobal.SetGlobalBuffer((__gm__ uint32_t*)srcOffsetGm);
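        // m_queIn carries both srcLocal (type T) and srcOffsetLocal (uint32_t), so each
        // buffer is sized for uint32_t elements, which is large enough for either tensor.
        m_pipe.InitBuffer(m_queIn, 2, m_elementCount * sizeof(uint32_t));
        m_pipe.InitBuffer(m_queOut, 2, m_elementCount * sizeof(uint32_t));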
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<T> srcLocal = m_queIn.AllocTensor<T>();
        AscendC::DataCopy(srcLocal, m_srcGlobal, m_elementCount);
        m_queIn.EnQue(srcLocal);
        AscendC::LocalTensor<uint32_t> srcOffsetLocal = m_queIn.AllocTensor<uint32_t>();
        AscendC::DataCopy(srcOffsetLocal, m_srcOffsetGlobal, m_elementCount);
        m_queIn.EnQue(srcOffsetLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<T> srcLocal = m_queIn.DeQue<T>();
        AscendC::LocalTensor<uint32_t> srcOffsetLocal = m_queIn.DeQue<uint32_t>();
        AscendC::LocalTensor<T> dstLocal = m_queOut.AllocTensor<T>();
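        srcLocal.SetSize(m_elementCount);
        // Gather m_elementCount elements; srcOffsetLocal holds byte offsets
        // relative to a base address of 0 in srcLocal.
        AscendC::Gather(dstLocal, srcLocal, srcOffsetLocal, (uint32_t)0, m_elementCount);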
        m_queIn.FreeTensor(srcLocal);
        m_queIn.FreeTensor(srcOffsetLocal);
        m_queOut.EnQue(dstLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<T> dstLocal = m_queOut.DeQue<T>();
        AscendC::DataCopy(m_dstGlobal, dstLocal, m_elementCount);
        m_queOut.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe m_pipe;
    uint32_t m_elementCount; // number of elements to gather
    AscendC::GlobalTensor<uint32_t> m_srcOffsetGlobal;
    AscendC::GlobalTensor<T> m_srcGlobal;
    AscendC::GlobalTensor<T> m_dstGlobal;
    AscendC::TQue<AscendC::TPosition::VECIN, 2> m_queIn;
    AscendC::TQue<AscendC::TPosition::VECOUT, 2> m_queOut;
}; // class GatherTest

extern "C" __global__ __aicore__ void kernel_gather(GM_ADDR dstGm, GM_ADDR srcGm, GM_ADDR srcOffsetGm)
{
    GatherTest<half> op; 
    op.Init(dstGm, srcGm, srcOffsetGm, 128);
    op.Process();
}

Result example:

Input srcOffsetLocal:
[254 252 250 ... 4 2 0]
Input srcLocal (128 data elements of the half type):
[0 1 2 ... 125 126 127]
Output dstGlobal:
[127 126 125 ... 2 1 0]
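
The offsets in this result can be reproduced on the host. The sketch below is illustrative only and uses hypothetical helper names: it builds byte offsets that reverse a tensor and converts each offset back to the element index it selects, assuming a base address of 0.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical host-side helpers; not part of Ascend C.

// Byte offsets that read `count` elements of `elemBytes` bytes in reverse.
inline std::vector<uint32_t> ReversedByteOffsets(uint32_t count,
                                                 uint32_t elemBytes)
{
    std::vector<uint32_t> offsets(count);
    for (uint32_t i = 0; i < count; ++i) {
        offsets[i] = (count - 1 - i) * elemBytes;
    }
    return offsets;
}

// Element index selected by each byte offset (base address assumed to be 0).
inline std::vector<uint32_t> SelectedIndices(
    const std::vector<uint32_t>& offsets, uint32_t elemBytes)
{
    std::vector<uint32_t> indices(offsets.size());
    for (std::size_t i = 0; i < offsets.size(); ++i) {
        indices[i] = offsets[i] / elemBytes;
    }
    return indices;
}
```

Calling ReversedByteOffsets(128, 2) yields the srcOffsetLocal values listed above for the 2-byte half type. Because srcLocal holds the values 0 to 127 in order, the indices returned by SelectedIndices are also the expected output values.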
