GatherMask

Applicability

Product	Supported
Atlas A3 training products/Atlas A3 inference products	√
Atlas A2 training products/Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	√
Atlas inference product's AI Core	√
Atlas inference product's Vector Core	x
Atlas training products	x

Function

Selects elements from the source operand and writes them to the destination operand based on a gather mask (for data collection) that corresponds to either the binary of the built-in fixed mode or the binary of the user-defined input tensor values.

Prototype

Custom mode

template <typename T, typename U, GatherMaskMode mode = defaultGatherMaskMode>
__aicore__ inline void GatherMask(const LocalTensor<T>& dst, const LocalTensor<T>& src0, const LocalTensor<U>& src1Pattern, const bool reduceMode, const uint32_t mask, const GatherMaskParams& gatherMaskParams, uint64_t& rsvdCnt)

Built-in fixed mode

template <typename T, GatherMaskMode mode = defaultGatherMaskMode>
__aicore__ inline void GatherMask(const LocalTensor<T>& dst, const LocalTensor<T>& src0, const uint8_t src1Pattern, const bool reduceMode, const uint32_t mask, const GatherMaskParams& gatherMaskParams, uint64_t& rsvdCnt)

Parameters

**Table 1** Template parameters
Parameter	Description
T	Data type of the source operand src0 and destination operand dst. For Atlas A3 training products/Atlas A3 inference products, the supported data types are half, bfloat16_t, uint16_t, int16_t, float, uint32_t, and int32_t. For Atlas A2 training products/Atlas A2 inference products, the supported data types are half, bfloat16_t, uint16_t, int16_t, float, uint32_t, and int32_t. For Atlas 200I/500 A2 inference products, the supported data types are half, uint16_t, int16_t, float, uint32_t, and int32_t. For the Atlas inference product's AI Core, the supported data types are half, uint16_t, int16_t, float, uint32_t, and int32_t.
U	Data type of src1Pattern in user-defined mode. The supported data types are , uint16_t, and uint32_t. If the data type of the destination operand is half, uint16_t, int16_t, or , src1Pattern must be of the uint16_t type. If the data type of the destination operand is float, uint32_t, or int32_t, src1Pattern must be of the uint32_t data type.
mode	Reserved. A default value is provided. You do not need to set this parameter.

**Table 2** Parameters
Parameter	Input/Output	Description
dst	Output	Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned.
src0	Input	Source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. Its data type must be the same as that of the destination operand.
src1Pattern	Input	There are two modes of gather mask (for data collection): built-in fixed mode and custom mode. Elements are from the source operand and written to the destination operand based on either the binary of the built-in fixed mode or the binary of the custom input tensor values. 1: selected; 0: not selected. Built-in fixed mode: The data type of src1Pattern is uint8_t, and the value range is [1,7]. All repeat iterations use the same gather mask. src1RepeatStride cannot be configured. 1: 01010101…0101 # Obtain an even number of index elements for each repeat. 2: 10101010…1010 # Obtain an odd number of index elements for each repeat. 3: 00010001…0001 # Obtain the first element among every four elements in each repeat. 4: 00100010…0010 # Obtain the second element among every four elements in each repeat. 5: 01000100…0100 # Obtain the third element among every four elements in each repeat. 6: 10001000…1000 # Obtain the fourth element among every four elements in each repeat. 7: 11111111...1111 # Obtain all elements in each repeat. Atlas A3 training products/Atlas A3 inference products support modes 1 to 7. Atlas A2 training products/Atlas A2 inference products support modes 1 to 7. Atlas 200I/500 A2 inference products support modes 1 to 7. The Atlas inference product's AI Core supports modes 1 to 6. User-defined mode: The data type of src1Pattern is LocalTensor. The iteration interval is determined by src1RepeatStride. src1Pattern is continuously consumed in an iteration.
reduceMode	Input	This parameter is used to select the mode of the mask parameter, and it is of the bool type. The options are as follows: false: normal mode. In this mode, 256-byte data is operated in each repeat operation. The total data computation amount is as follows: repeatTimes x 256 bytes. The mask parameter is invalid. You are advised to set it to 0. Set repeatTimes, src0BlockStride, and src0RepeatStride as required. src1Pattern can be set to the built-in fixed mode or user-defined mode. In user-defined mode, src1RepeatStride can be set as required. true: counter mode. Based on the meanings of parameters such as mask, this mode can be configured in either of the following methods: Configuration method 1: mask elements are operated in each repeat operation. The total data computation amount is as follows: repeatTimes x Number of mask elements. The mask value is set to the number of elements computed in each repeat. Set repeatTimes, src0BlockStride, and src0RepeatStride as required. src1Pattern can be set to the built-in fixed mode or user-defined mode. In user-defined mode, src1RepeatStride can be set as required. Configuration method 2: The total data computation amount is the number of mask elements. The mask value is configured as the total data computation amount. The repeatTimes value does not take effect. The number of iterations of an instruction is determined by the source operand and mask. Set src0BlockStride and src0RepeatStride as required. src1Pattern can be set to the built-in fixed mode or user-defined mode. In user-defined mode, src1RepeatStride can be set as required. For Atlas A3 training products/Atlas A3 inference products, configuration method 1 is supported. For Atlas A2 training products/Atlas A2 inference products, configuration method 1 is supported. For Atlas 200I/500 A2 inference products, configuration method 1 is supported. For the Atlas inference product's AI Core, configuration method 2 is supported.
mask	Input	This parameter is used to control the elements that participate in computation in each iteration. Based on reduceMode, there are two mask modes: Normal mode: mask is invalid. You are advised to set this parameter to 0. Counter mode: The value range is [1, 2³² – 1]. The meaning of the mask parameter in counter mode varies according to the product model. For details about the configuration methods, see the description of the reduceMode parameter.
gatherMaskParams	Input	Data structure that controls the operand address strides, of the GatherMaskParams type. For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_gather.h. Replace ${INSTALL_DIR} with the actual CANN component directory. For details about the parameters, see Table 3.
rsvdCnt	Output	Number of elements retained by the instruction, corresponding to the number of valid elements in dstLocal. The data type is uint64_t.

**Table 3** Parameters in the **GatherMaskParams** structure
Parameter	Description
src0BlockStride	Address stride between different data blocks in an iteration of src0 (interval between start addresses). The unit is data block.
repeatTimes	Number of iteration repeats.
src0RepeatStride	Address stride between adjacent iterations of src0 (interval between start addresses). The unit is data block.
src1RepeatStride	Address stride between adjacent iterations of src1 (interval between start addresses). The unit is data block.

Returns

None

Restrictions

For details about the operand address alignment requirements, see General Address Alignment Restrictions.
For details about the constraints on operand address overlapping, see General Address Overlapping Restrictions.

If the counter mode is used before this API is called, the counter mode needs to be set again after this API is called. (The reason is that the counter mode will switch to the normal mode after this API is called.)

Example

Example of a custom tensor

#include "kernel_operator.h"
class KernelGatherMask {
public:
    __aicore__ inline KernelGatherMask () {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
    {
        src0Global.SetGlobalBuffer((__gm__ uint32_t*)src0Gm);
        src1Global.SetGlobalBuffer((__gm__ uint32_t*)src1Gm);
        dstGlobal.SetGlobalBuffer((__gm__ uint32_t*)dstGm);
        pipe.InitBuffer(inQueueSrc0, 1, 256 * sizeof(uint32_t));
        pipe.InitBuffer(inQueueSrc1, 1, 32 * sizeof(uint32_t));
        pipe.InitBuffer(outQueueDst, 1, 256 * sizeof(uint32_t));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<uint32_t> src0Local = inQueueSrc0.AllocTensor<uint32_t>();
        AscendC::LocalTensor<uint32_t> src1Local = inQueueSrc1.AllocTensor<uint32_t>();
        AscendC::DataCopy(src0Local, src0Global, 256);
        AscendC::DataCopy(src1Local, src1Global, 32);
        inQueueSrc0.EnQue(src0Local);
        inQueueSrc1.EnQue(src1Local);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<uint32_t> src0Local = inQueueSrc0.DeQue<uint32_t>();
        AscendC::LocalTensor<uint32_t> src1Local = inQueueSrc1.DeQue<uint32_t>();
        AscendC::LocalTensor<uint32_t> dstLocal = outQueueDst.AllocTensor<uint32_t>();
        uint32_t mask = 70;
       uint64_t rsvdCnt = 0;
        // reduceMode = true;    Use the counter mode.
        // src0BlockStride = 1. Data is continuously read and written at an interval of one data block in a single iteration.
        // repeatTimes = 2;      In counter mode, this parameter takes effect only on some product models.
        // src0RepeatStride = 4; The data interval between iterations of the source operand is four data blocks.
        // src1RepeatStride = 0; The data interval between iterations of src1 is 0 data blocks, that is, the data is read from the original position.
        AscendC::GatherMask (dstLocal, src0Local, src1Local, true, mask, { 1, 2, 4, 0 }, rsvdCnt);
        outQueueDst.EnQue<uint32_t>(dstLocal);
        inQueueSrc0.FreeTensor(src0Local);
        inQueueSrc1.FreeTensor(src1Local);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<uint32_t> dstLocal = outQueueDst.DeQue<uint32_t>();
        AscendC::DataCopy(dstGlobal, dstLocal, 256);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc0, inQueueSrc1;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<uint32_t> src0Global, src1Global, dstGlobal;
};
extern "C" __global__ __aicore__ void gather_mask_simple_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
{
    KernelGatherMask op;
    op.Init(src0Gm, src1Gm, dstGm);
    op.Process();
}

The following figure shows configuration method 1 in counter mode.

mask = 70. 70 elements are computed in each repeat.
repeatTimes = 2. There are two repeats in total.
src0BlockStride = 1. There is no interval between data blocks in a single iteration of the source operand src0Local.
src0RepeatStride = 4. The interval between adjacent iterations of the source operand src0Local is four data blocks. Therefore, the second repeat starts from the 33rd element.
src1Pattern is set to the user-defined mode. src1RepeatStride = 0. The interval between adjacent iterations of src1Pattern is 0 data blocks. Therefore, the second repeat still starts from the start address of src1Pattern.

Figure 1 Configuration method 1 in counter mode

The following figure shows configuration method 2 in counter mode.

mask = 70. A total of 70 elements are computed.
The configuration of repeatTimes does not take effect. It is automatically inferred based on the source operand and mask. The data type of the source operand is uint32_t. Each iteration processes 256 bytes of data (64 elements), so two repeats are required.
src0BlockStride = 1. There is no interval between data blocks in a single iteration of the source operand src0Local.
src0RepeatStride = 4. The interval between adjacent iterations of the source operand src0Local is four data blocks. Therefore, the second repeat starts from the 33rd element.
src1Pattern is set to the user-defined mode. src1RepeatStride = 0. The interval between adjacent iterations of src1Pattern is 0 data blocks. Therefore, the second repeat still starts from the start address of src1Pattern.

Figure 2 Configuration method 2 in counter mode

Built-in fixed mode

#include "kernel_operator.h"
class KernelGatherMask {
public:
    __aicore__ inline KernelGatherMask () {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* dstGm)
    {
        src0Global.SetGlobalBuffer((__gm__ uint16_t*)src0Gm);
        dstGlobal.SetGlobalBuffer((__gm__ uint16_t*)dstGm);
        pipe.InitBuffer(inQueueSrc0, 1, 128 * sizeof(uint16_t));
        pipe.InitBuffer(outQueueDst, 1, 128 * sizeof(uint16_t));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<uint16_t> src0Local = inQueueSrc0.AllocTensor<uint16_t>();
        AscendC::DataCopy(src0Local, src0Global, 128);
        inQueueSrc0.EnQue(src0Local);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<uint16_t> src0Local = inQueueSrc0.DeQue<uint16_t>();
        AscendC::LocalTensor<uint16_t> dstLocal = outQueueDst.AllocTensor<uint16_t>();
 
        uint32_t mask = 0; // You are advised to set mask to 0 in normal mode.
        uint64_t rsvdCnt = 0; // Save the number of elements retained after filtering.
        uint8_t src1Pattern = 2; // Built-in fixed mode.
        // reduceMode = false. Use the normal mode.
        // src0BlockStride = 1. Data is continuously read and written at an interval of one block in a single iteration.
        // repeatTimes = 1. Repeat iteration once.
        // src0RepeatStride = 0. Repeat once. Therefore, set src0RepeatStride to 0.
        // src1RepeatStride = 0. Repeat once. Therefore, set src1RepeatStride to 0.
        AscendC::GatherMask(dstLocal, src0Local, src1Pattern, false, mask, { 1, 1, 0, 0 }, rsvdCnt);
 
        outQueueDst.EnQue<uint16_t>(dstLocal);
        inQueueSrc0.FreeTensor(src0Local);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<uint16_t> dstLocal = outQueueDst.DeQue<uint16_t>();
        
        AscendC::DataCopy(dstGlobal, dstLocal, 128);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc0;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<uint16_t> src0Global, dstGlobal;
};
 
extern "C" __global__ __aicore__ void gather_mask_simple_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* dstGm)
{
    KernelGatherMask op;
    op.Init(src0Gm, dstGm);
    op.Process();
}

Result example:

Input (src0Local): [1 2 3 ... 128]
Input (src1Pattern): src1Pattern = 2
Output (dstLocal): [2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 undefined ..undefined]
Output (rsvdCnt): 64

Parent topic: Comparison and Selection