GatherMask
Applicability
Product |
Supported/Unsupported |
|---|---|
√ |
|
√ |
|
√ |
|
√ |
|
x |
|
x |
Functions
Selects elements from the source operand and writes them to the destination operand based on a gather mask (for data collection) that corresponds to either the binary of the built-in fixed mode or the binary of the user-defined input tensor values.
Prototype
- Custom mode
1 2
template <typename T, typename U, GatherMaskMode mode = defaultGatherMaskMode> __aicore__ inline void GatherMask(const LocalTensor<T>& dst, const LocalTensor<T>& src0, const LocalTensor<U>& src1Pattern, const bool reduceMode, const uint32_t mask, const GatherMaskParams& gatherMaskParams, uint64_t& rsvdCnt)
- Built-in fixed mode
1 2
template <typename T, GatherMaskMode mode = defaultGatherMaskMode> __aicore__ inline void GatherMask(const LocalTensor<T>& dst, const LocalTensor<T>& src0, const uint8_t src1Pattern, const bool reduceMode, const uint32_t mask, const GatherMaskParams& gatherMaskParams, uint64_t& rsvdCnt)
Parameters
Parameter |
Meaning |
|---|---|
T |
Data types of the source operand src0 and destination operand dst. For the For the For the For the |
U |
Data type of src1Pattern in user-defined mode. The supported data types are , uint16_t, and uint32_t.
|
mode |
Reserved. A default value is provided. You do not need to set this parameter. |
Parameter |
Input/Output |
Meaning |
|---|---|---|
dst |
Output |
Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. |
src0 |
Input |
Source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. Its data type must be the same as that of the destination operand. |
src1Pattern |
Input |
There are two modes of gather mask (for data collection): built-in fixed mode and custom mode. Elements are from the source operand and written to the destination operand based on either the binary of the built-in fixed mode or the binary of the custom input tensor values. 1: selected; 0: not selected.
|
reduceMode |
Input |
This parameter is used to select the mode of the mask parameter, and it is of the bool type. The options are as follows:
|
mask |
Input |
This parameter is used to control the elements that participate in computation in each iteration. Based on reduceMode, there are two mask modes:
|
gatherMaskParams |
Input |
Data structure that controls the operand address strides, of the GatherMaskParams type. For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_gather.h. Replace ${INSTALL_DIR} with the actual CANN component directory. For details about the parameters, see Table 3. |
rsvdCnt |
Output |
Number of elements retained by the instruction, corresponding to the number of valid elements in dstLocal. The data type is uint64_t. |
Parameter |
Meaning |
|---|---|
src0BlockStride |
Address stride between different DataBlocks of the same iteration of src0 (interval between start addresses). The unit is DataBlock. |
repeatTimes |
Number of iteration repeats. |
src0RepeatStride |
Address stride between adjacent iterations of src0 (interval between start addresses). The unit is DataBlock. |
src1RepeatStride |
Address stride between adjacent iterations of src1 (interval between start addresses). The unit is DataBlock. |
Returns
None
Restrictions
- For details about the operand address alignment requirements, see General Address Alignment Restrictions.
- For details about the operand address overlapping restrictions, see General Address Overlap Restrictions.
- If the counter mode is used before this API is called, the counter mode needs to be set again after this API is called. (The reason is that the counter mode will switch to the normal mode after this API is called.)
Examples
- Example of a custom tensor
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64
#include "kernel_operator.h" class KernelGatherMask { public: __aicore__ inline KernelGatherMask () {} __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm) { src0Global.SetGlobalBuffer((__gm__ uint32_t*)src0Gm); src1Global.SetGlobalBuffer((__gm__ uint32_t*)src1Gm); dstGlobal.SetGlobalBuffer((__gm__ uint32_t*)dstGm); pipe.InitBuffer(inQueueSrc0, 1, 256 * sizeof(uint32_t)); pipe.InitBuffer(inQueueSrc1, 1, 32 * sizeof(uint32_t)); pipe.InitBuffer(outQueueDst, 1, 256 * sizeof(uint32_t)); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<uint32_t> src0Local = inQueueSrc0.AllocTensor<uint32_t>(); AscendC::LocalTensor<uint32_t> src1Local = inQueueSrc1.AllocTensor<uint32_t>(); AscendC::DataCopy(src0Local, src0Global, 256); AscendC::DataCopy(src1Local, src1Global, 32); inQueueSrc0.EnQue(src0Local); inQueueSrc1.EnQue(src1Local); } __aicore__ inline void Compute() { AscendC::LocalTensor<uint32_t> src0Local = inQueueSrc0.DeQue<uint32_t>(); AscendC::LocalTensor<uint32_t> src1Local = inQueueSrc1.DeQue<uint32_t>(); AscendC::LocalTensor<uint32_t> dstLocal = outQueueDst.AllocTensor<uint32_t>(); uint32_t mask = 70; uint64_t rsvdCnt = 0; // reduceMode = true; Use the counter mode. // src0BlockStride = 1. Data is continuously read and written at an interval of one data block in a single iteration. // repeatTimes = 2; In counter mode, this parameter takes effect only on some product models. // src0RepeatStride = 4; The data interval between iterations of the source operand is four data blocks. // src1RepeatStride = 0; The data interval between iterations of src1 is 0 data blocks, that is, the data is read from the original position. AscendC::GatherMask (dstLocal, src0Local, src1Local, true, mask, { 1, 2, 4, 0 }, rsvdCnt); outQueueDst.EnQue<uint32_t>(dstLocal); inQueueSrc0.FreeTensor(src0Local); inQueueSrc1.FreeTensor(src1Local); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<uint32_t> dstLocal = outQueueDst.DeQue<uint32_t>(); AscendC::DataCopy(dstGlobal, dstLocal, 256); outQueueDst.FreeTensor(dstLocal); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc0, inQueueSrc1; AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst; AscendC::GlobalTensor<uint32_t> src0Global, src1Global, dstGlobal; }; extern "C" __global__ __aicore__ void gather_mask_simple_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm) { KernelGatherMask op; op.Init(src0Gm, src1Gm, dstGm); op.Process(); }
The following figure shows configuration method 1 in counter mode.
- mask = 70. 70 elements are computed in each repeat.
- repeatTimes = 2. There are two repeats in total.
- src0BlockStride = 1. There is no interval between data blocks in a single iteration of the source operand src0Local.
- src0RepeatStride = 4. The interval between adjacent iterations of the source operand src0Local is four data blocks. Therefore, the second repeat starts from the 33rd element.
- src1Pattern is set to the user-defined mode. src1RepeatStride = 0. The interval between adjacent iterations of src1Pattern is 0 data blocks. Therefore, the second repeat still starts from the start address of src1Pattern.
The following figure shows configuration method 2 in counter mode.
- mask = 70. A total of 70 elements are computed.
- The configuration of repeatTimes does not take effect. It is automatically inferred based on the source operand and mask. The data type of the source operand is uint32_t. Each iteration processes 256 bytes of data (64 elements), so two repeats are required.
- src0BlockStride = 1. There is no interval between data blocks in a single iteration of the source operand src0Local.
- src0RepeatStride = 4. The interval between adjacent iterations of the source operand src0Local is four data blocks. Therefore, the second repeat starts from the 33rd element.
- src1Pattern is set to the user-defined mode. src1RepeatStride = 0. The interval between adjacent iterations of src1Pattern is 0 data blocks. Therefore, the second repeat still starts from the start address of src1Pattern.
- Built-in fixed mode
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62
#include "kernel_operator.h" class KernelGatherMask { public: __aicore__ inline KernelGatherMask () {} __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* dstGm) { src0Global.SetGlobalBuffer((__gm__ uint16_t*)src0Gm); dstGlobal.SetGlobalBuffer((__gm__ uint16_t*)dstGm); pipe.InitBuffer(inQueueSrc0, 1, 128 * sizeof(uint16_t)); pipe.InitBuffer(outQueueDst, 1, 128 * sizeof(uint16_t)); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<uint16_t> src0Local = inQueueSrc0.AllocTensor<uint16_t>(); AscendC::DataCopy(src0Local, src0Global, 128); inQueueSrc0.EnQue(src0Local); } __aicore__ inline void Compute() { AscendC::LocalTensor<uint16_t> src0Local = inQueueSrc0.DeQue<uint16_t>(); AscendC::LocalTensor<uint16_t> dstLocal = outQueueDst.AllocTensor<uint16_t>(); uint32_t mask = 0; // It is recommended that mask be set to 0 in normal mode. uint64_t rsvdCnt = 0; // Save the number of elements retained after filtering. uint8_t src1Pattern = 2; // Built-in fixed mode. // reduceMode = false. Use the normal mode. // src0BlockStride = 1. Data is continuously read and written at an interval of one block in a single iteration. // repeatTimes = 1. Repeat iteration once. // src0RepeatStride = 0. Repeat once. Therefore, set src0RepeatStride to 0. // src1RepeatStride = 0. Repeat once. Therefore, set src1RepeatStride to 0. AscendC::GatherMask(dstLocal, src0Local, src1Pattern, false, mask, { 1, 1, 0, 0 }, rsvdCnt); outQueueDst.EnQue<uint16_t>(dstLocal); inQueueSrc0.FreeTensor(src0Local); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<uint16_t> dstLocal = outQueueDst.DeQue<uint16_t>(); AscendC::DataCopy(dstGlobal, dstLocal, 128); outQueueDst.FreeTensor(dstLocal); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc0; AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst; AscendC::GlobalTensor<uint16_t> src0Global, dstGlobal; }; extern "C" __global__ __aicore__ void gather_mask_simple_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* dstGm) { KernelGatherMask op; op.Init(src0Gm, dstGm); op.Process(); }
Result example:
Input (src0Local): [1 2 3 ... 128] Input (src1Pattern): src1Pattern = 2; Output data dstLocal: [2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 undefined ..undefined] Output (rsvdCnt): 64

