GatherMask

Function Usage

Selects elements from the source operand and writes them to the destination operand based on a gather mask (for data collection) that corresponds to either the binary of the built-in fixed mode or the binary of the user-defined input tensor values.

Prototype

Custom mode

        
             template <typename T, typename U, GatherMaskMode mode = defaultGahterMaskMode>
__aicore__ inline void GatherMask(const LocalTensor<T>& dstLocal, const LocalTensor<T>& src0Local, const LocalTensor<U>& src1Pattern, const bool reduceMode, const uint32_t mask, const GatherMaskParams& gatherMaskParams, uint64_t& rsvdCnt)

Built-in fixed mode

        
             template <typename T, GatherMaskMode mode = defaultGahterMaskMode>
__aicore__ inline void GatherMask(const LocalTensor<T>& dstLocal, const LocalTensor<T>& src0Local, const uint8_t src1Pattern, const bool reduceMode, const uint32_t mask, const GatherMaskParams& gatherMaskParams, uint64_t& rsvdCnt)

Parameters

**Table 1** Parameters in the template
Parameter	Meaning
T	Data type of the source operand src0Local and destination operand dstLocal.
U	Data type of src1Pattern in user-defined mode. The supported data types are , uint16_t, and uint32_t. If the data type of the destination operand is half, uint16_t, int16_t, or , src1Pattern must be of the uint16_t type. If the data type of the destination operand is float, uint32_t, or int32_t, src1Pattern must be of the uint32_t data type.
mode	Reserved. A default value is provided. You do not need to set this parameter.

**Table 2** Parameters
Parameter	Input/Output	Meaning
dstLocal	Output	Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned.
src0Local	Input	Source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. Its data type must be the same as that of the destination operand.
src1Pattern	Input	There are two modes of gather mask (for data collection): built-in fixed mode and custom mode. Elements are from the source operand and written to the destination operand based on either the binary of the built-in fixed mode or the binary of the custom input tensor values. 1: selected; 0: not selected. Built-in fixed mode: The data type of src1Pattern is uint8_t, and the value range is [1,7]. All repeat iterations use the same gather mask. src1RepeatStride cannot be configured. 1: 01010101…0101 # Obtain an even number of index elements for each repeat. 2: 10101010…1010 # Obtain an odd number of index elements for each repeat. 3: 00010001…0001 # Obtain the first element among every four elements in each repeat. 4: 00100010…0010 # Obtain the second element among every four elements in each repeat. 5: 01000100…0100 # Obtain the third element among every four elements in each repeat. 6: 10001000…1000 # Obtain the fourth element among every four elements in each repeat. 7: 11111111...1111 # Obtain all elements in each repeat. User-defined mode: The data type of src1Pattern is LocalTensor. The iteration interval is determined by src1RepeatStride. src1Pattern is continuously consumed in an iteration.
reduceMode	Input	This parameter is used to select the mode of the mask parameter, and it is of the bool type. The options are as follows: false: normal mode. In this mode, 256-byte data is operated in each repeat operation. The total data computation amount is as follows: repeatTimes x 256 bytes. The mask parameter is invalid and must be set to 0. Set repeatTimes, src0BlockStride, and src0RepeatStride as required. src1Pattern can be set to the built-in fixed mode or user-defined mode. In user-defined mode, src1RepeatStride can be set as required. true: counter mode. Based on the meanings of parameters such as mask, this mode can be configured in either of the following methods: Configuration method 1: mask elements are operated in each repeat operation. The total data computation amount is as follows: repeatTimes x Number of mask elements. The mask value is set to the number of elements computed in each repeat. Set repeatTimes, src0BlockStride, and src0RepeatStride as required. src1Pattern can be set to the built-in fixed mode or user-defined mode. In user-defined mode, src1RepeatStride can be set as required. Configuration method 2: The total data computation amount is the number of mask elements. The mask value is configured as the total data computation amount. The repeatTimes value does not take effect. The number of iterations of an instruction is determined by the source operand and mask. Set src0BlockStride and src0RepeatStride as required. src1Pattern can be set to the built-in fixed mode or user-defined mode. In user-defined mode, src1RepeatStride can be set as required.
mask	Input	This parameter is used to control the elements that participate in computation in each iteration. Based on reduceMode, there are two mask modes: Normal mode: mask is invalid and needs to be set to 0. Counter mode: The value range is [1, 2³² – 1]. The meaning of the mask parameter in counter mode varies according to the product model. For details about the configuration methods, see the description of the reduceMode parameter.
gatherMaskParams	Input	Data structure that controls the operand address strides. It contains such parameters as those that specify the address stride of the operand for the same data block between adjacent iterations and address stride of the operand between different data blocks in a single iteration. The data structure is defined as follows: struct GatherMaskParams{ uint8_t src0BlockStride; uint16_t repeatTimes; uint16_t src0RepeatStride; uint8_t src1RepeatStride; }; For details about the address stride of the operand between adjacent iterations, see repeatStride. For details about the address stride of the operand between different data blocks in a single iteration, see dataBlockStride.
rsvdCnt	Output	Number of elements retained by the instruction, corresponding to the number of valid elements in dstLocal. The data type is uint64_t.

Returns

None

Availability

Precautions

To save memory space, you can define a tensor shared by the source and destination operands (by address overlapping). The general instruction restrictions are as follows.
- In a single iteration, the source operand must completely overlap the destination operand. Partial overlapping is not supported.
- During multiple iterations, if the Nth destination operand is the (N + 1)th source operand, address overlapping is not supported.
For details about the alignment requirements of the operand address offset, see General Restrictions.
If the counter mode is used before this API is called, the counter mode needs to be set again after this API is called. (The reason is that the counter mode will switch to the normal mode after this API is called.)

Example

Example of a custom tensor

#include "kernel_operator.h"
class KernelGatherMask {
public:
    __aicore__ inline KernelGatherMask () {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
    {
        src0Global.SetGlobalBuffer((__gm__ uint32_t*)src0Gm);
        src1Global.SetGlobalBuffer((__gm__ uint32_t*)src1Gm);
        dstGlobal.SetGlobalBuffer((__gm__ uint32_t*)dstGm);
        pipe.InitBuffer(inQueueSrc0, 1, 256 * sizeof(uint32_t));
        pipe.InitBuffer(inQueueSrc1, 1, 32 * sizeof(uint32_t));
        pipe.InitBuffer(outQueueDst, 1, 256 * sizeof(uint32_t));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<uint32_t> src0Local = inQueueSrc0.AllocTensor<uint32_t>();
        AscendC::LocalTensor<uint32_t> src1Local = inQueueSrc1.AllocTensor<uint32_t>();
        AscendC::DataCopy(src0Local, src0Global, 256);
        AscendC::DataCopy(src1Local, src1Global, 32);
        inQueueSrc0.EnQue(src0Local);
        inQueueSrc1.EnQue(src1Local);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<uint32_t> src0Local = inQueueSrc0.DeQue<uint32_t>();
        AscendC::LocalTensor<uint32_t> src1Local = inQueueSrc1.DeQue<uint32_t>();
        AscendC::LocalTensor<uint32_t> dstLocal = outQueueDst.AllocTensor<uint32_t>();
        uint32_t mask = 70;
       uint64_t rsvdCnt = 0;
        // reduceMode = true;    Use the counter mode.
        // src0BlockStride = 1. Data is continuously read and written at an interval of one data block in a single iteration.
        // repeatTimes = 2;      In counter mode, this parameter takes effect only on some product models.
        // src0RepeatStride = 4; The data interval between iterations of the source operand is four data blocks.
        // src1RepeatStride = 0; The data interval between iterations of src1 is 0 data blocks, that is, the data is read from the original position.
        AscendC::GatherMask (dstLocal, src0Local, src1Local, true, mask, { 1, 2, 4, 0 }, rsvdCnt);
        outQueueDst.EnQue<uint32_t>(dstLocal);
        inQueueSrc0.FreeTensor(src0Local);
        inQueueSrc1.FreeTensor(src1Local);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<uint32_t> dstLocal = outQueueDst.DeQue<uint32_t>();
        AscendC::DataCopy(dstGlobal, dstLocal, 256);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc0, inQueueSrc1;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<uint32_t> src0Global, src1Global, dstGlobal;
};
extern "C" __global__ __aicore__ void gather_mask_simple_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
{
    KernelGatherMask op;
    op.Init(src0Gm, src1Gm, dstGm);
    op.Process();
}

The following figure shows configuration method 1 in counter mode.

mask = 70. 70 elements are computed in each repeat.
repeatTimes = 2. There are two repeats in total.
src0BlockStride = 1. There is no interval between data blocks in a single iteration of the source operand src0Local.
src0RepeatStride = 4. The interval between adjacent iterations of the source operand src0Local is four data blocks. Therefore, the second repeat starts from the 33rd element.
src1Pattern is set to the user-defined mode. src1RepeatStride = 0. The interval between adjacent iterations of src1Pattern is 0 data blocks. Therefore, the second repeat still starts from the start address of src1Pattern.

Figure 1 Configuration method 1 in counter mode

The following figure shows configuration method 2 in counter mode.

mask = 70. A total of 70 elements are computed.
The configuration of repeatTimes does not take effect. It is automatically inferred based on the source operand and mask. The data type of the source operand is uint32_t. Each iteration processes 256 bytes of data (64 elements), so two repeats are required.
src0BlockStride = 1. There is no interval between data blocks in a single iteration of the source operand src0Local.
src0RepeatStride = 4. The interval between adjacent iterations of the source operand src0Local is four data blocks. Therefore, the second repeat starts from the 33rd element.
src1Pattern is set to the user-defined mode. src1RepeatStride = 0. The interval between adjacent iterations of src1Pattern is 0 data blocks. Therefore, the second repeat still starts from the start address of src1Pattern.

Figure 2 Configuration method 2 in counter mode

Built-in fixed mode

#include "kernel_operator.h"
class KernelGatherMask {
public:
    __aicore__ inline KernelGatherMask () {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* dstGm)
    {
        src0Global.SetGlobalBuffer((__gm__ uint16_t*)src0Gm);
        dstGlobal.SetGlobalBuffer((__gm__ uint16_t*)dstGm);
        pipe.InitBuffer(inQueueSrc0, 1, 128 * sizeof(uint16_t));
        pipe.InitBuffer(outQueueDst, 1, 128 * sizeof(uint16_t));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<uint16_t> src0Local = inQueueSrc0.AllocTensor<uint16_t>();
        AscendC::DataCopy(src0Local, src0Global, 128);
        inQueueSrc0.EnQue(src0Local);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<uint16_t> src0Local = inQueueSrc0.DeQue<uint16_t>();
        AscendC::LocalTensor<uint16_t> dstLocal = outQueueDst.AllocTensor<uint16_t>();
 
        uint32_t mask = 0; // Set mask to 0 in normal mode.
        uint64_t rsvdCnt = 0; // Save the number of elements retained after filtering.
        uint8_t src1Pattern = 2; // Built-in fixed mode
        // reduceMode = false. Use the normal mode.
        // src0BlockStride = 1. Data is continuously read and written at an interval of one block in a single iteration.
        // repeatTimes = 1. Repeat iteration once.
        // src0RepeatStride = 0. Repeat once. Therefore, set src0RepeatStride to 0.
        // src1RepeatStride = 0. Repeat once. Therefore, set src1RepeatStride to 0.
        AscendC::GatherMask(dstLocal, src0Local, src1Pattern, false, mask, { 1, 1, 0, 0 }, rsvdCnt);
 
        outQueueDst.EnQue<uint16_t>(dstLocal);
        inQueueSrc0.FreeTensor(src0Local);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<uint16_t> dstLocal = outQueueDst.DeQue<uint16_t>();
        
        AscendC::DataCopy(dstGlobal, dstLocal, 128);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc0;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<uint16_t> src0Global, dstGlobal;
};
 
extern "C" __global__ __aicore__ void gather_mask_simple_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* dstGm)
{
    KernelGatherMask op;
    op.Init(src0Gm, dstGm);
    op.Process();
}

Result example:

Input (src0Local): [1 2 3 ... 128]
Input (src1Pattern): src1Pattern = 2;
Output (dstLocal): [2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 undefine ..undefined]
Output (rsvdCnt): 64

Parent topic: Selection Instructions