GatherMask
Function Usage
Selects elements from the source operand and writes them to the destination operand based on a gather mask (for data collection) that corresponds to either the binary of the built-in fixed mode or the binary of the user-defined input tensor values.
Prototype
- Custom mode
1 2
template <typename T, typename U, GatherMaskMode mode = defaultGahterMaskMode> __aicore__ inline void GatherMask(const LocalTensor<T>& dstLocal, const LocalTensor<T>& src0Local, const LocalTensor<U>& src1Pattern, const bool reduceMode, const uint32_t mask, const GatherMaskParams& gatherMaskParams, uint64_t& rsvdCnt)
- Built-in fixed mode
1 2
template <typename T, GatherMaskMode mode = defaultGahterMaskMode> __aicore__ inline void GatherMask(const LocalTensor<T>& dstLocal, const LocalTensor<T>& src0Local, const uint8_t src1Pattern, const bool reduceMode, const uint32_t mask, const GatherMaskParams& gatherMaskParams, uint64_t& rsvdCnt)
Parameters
|
Parameter |
Meaning |
|---|---|
|
T |
Data type of the source operand src0Local and destination operand dstLocal. |
|
U |
Data type of src1Pattern in user-defined mode. The supported data types are , uint16_t, and uint32_t.
|
|
mode |
Reserved. A default value is provided. You do not need to set this parameter. |
|
Parameter |
Input/Output |
Meaning |
|---|---|---|
|
dstLocal |
Output |
Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. |
|
src0Local |
Input |
Source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. Its data type must be the same as that of the destination operand. |
|
src1Pattern |
Input |
There are two modes of gather mask (for data collection): built-in fixed mode and custom mode. Elements are from the source operand and written to the destination operand based on either the binary of the built-in fixed mode or the binary of the custom input tensor values. 1: selected; 0: not selected.
|
|
reduceMode |
Input |
This parameter is used to select the mode of the mask parameter, and it is of the bool type. The options are as follows:
|
|
mask |
Input |
This parameter is used to control the elements that participate in computation in each iteration. Based on reduceMode, there are two mask modes:
|
|
gatherMaskParams |
Input |
Data structure that controls the operand address strides. It contains such parameters as those that specify the address stride of the operand for the same data block between adjacent iterations and address stride of the operand between different data blocks in a single iteration. The data structure is defined as follows: struct GatherMaskParams{
uint8_t src0BlockStride;
uint16_t repeatTimes;
uint16_t src0RepeatStride;
uint8_t src1RepeatStride;
};
For details about the address stride of the operand between adjacent iterations, see repeatStride. For details about the address stride of the operand between different data blocks in a single iteration, see dataBlockStride. |
|
rsvdCnt |
Output |
Number of elements retained by the instruction, corresponding to the number of valid elements in dstLocal. The data type is uint64_t. |
Returns
None
Availability
Precautions
- To save memory space, you can define a tensor shared by the source and destination operands (by address overlapping). The general instruction restrictions are as follows.
- In a single iteration, the source operand must completely overlap the destination operand. Partial overlapping is not supported.
- During multiple iterations, if the Nth destination operand is the (N + 1)th source operand, address overlapping is not supported.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
- If the counter mode is used before this API is called, the counter mode needs to be set again after this API is called. (The reason is that the counter mode will switch to the normal mode after this API is called.)
Example
- Example of a custom tensor
#include "kernel_operator.h" class KernelGatherMask { public: __aicore__ inline KernelGatherMask () {} __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm) { src0Global.SetGlobalBuffer((__gm__ uint32_t*)src0Gm); src1Global.SetGlobalBuffer((__gm__ uint32_t*)src1Gm); dstGlobal.SetGlobalBuffer((__gm__ uint32_t*)dstGm); pipe.InitBuffer(inQueueSrc0, 1, 256 * sizeof(uint32_t)); pipe.InitBuffer(inQueueSrc1, 1, 32 * sizeof(uint32_t)); pipe.InitBuffer(outQueueDst, 1, 256 * sizeof(uint32_t)); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<uint32_t> src0Local = inQueueSrc0.AllocTensor<uint32_t>(); AscendC::LocalTensor<uint32_t> src1Local = inQueueSrc1.AllocTensor<uint32_t>(); AscendC::DataCopy(src0Local, src0Global, 256); AscendC::DataCopy(src1Local, src1Global, 32); inQueueSrc0.EnQue(src0Local); inQueueSrc1.EnQue(src1Local); } __aicore__ inline void Compute() { AscendC::LocalTensor<uint32_t> src0Local = inQueueSrc0.DeQue<uint32_t>(); AscendC::LocalTensor<uint32_t> src1Local = inQueueSrc1.DeQue<uint32_t>(); AscendC::LocalTensor<uint32_t> dstLocal = outQueueDst.AllocTensor<uint32_t>(); uint32_t mask = 70; uint64_t rsvdCnt = 0; // reduceMode = true; Use the counter mode. // src0BlockStride = 1. Data is continuously read and written at an interval of one data block in a single iteration. // repeatTimes = 2; In counter mode, this parameter takes effect only on some product models. // src0RepeatStride = 4; The data interval between iterations of the source operand is four data blocks. // src1RepeatStride = 0; The data interval between iterations of src1 is 0 data blocks, that is, the data is read from the original position. AscendC::GatherMask (dstLocal, src0Local, src1Local, true, mask, { 1, 2, 4, 0 }, rsvdCnt); outQueueDst.EnQue<uint32_t>(dstLocal); inQueueSrc0.FreeTensor(src0Local); inQueueSrc1.FreeTensor(src1Local); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<uint32_t> dstLocal = outQueueDst.DeQue<uint32_t>(); AscendC::DataCopy(dstGlobal, dstLocal, 256); outQueueDst.FreeTensor(dstLocal); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc0, inQueueSrc1; AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst; AscendC::GlobalTensor<uint32_t> src0Global, src1Global, dstGlobal; }; extern "C" __global__ __aicore__ void gather_mask_simple_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm) { KernelGatherMask op; op.Init(src0Gm, src1Gm, dstGm); op.Process(); }The following figure shows configuration method 1 in counter mode.
- mask = 70. 70 elements are computed in each repeat.
- repeatTimes = 2. There are two repeats in total.
- src0BlockStride = 1. There is no interval between data blocks in a single iteration of the source operand src0Local.
- src0RepeatStride = 4. The interval between adjacent iterations of the source operand src0Local is four data blocks. Therefore, the second repeat starts from the 33rd element.
- src1Pattern is set to the user-defined mode. src1RepeatStride = 0. The interval between adjacent iterations of src1Pattern is 0 data blocks. Therefore, the second repeat still starts from the start address of src1Pattern.
The following figure shows configuration method 2 in counter mode.
- mask = 70. A total of 70 elements are computed.
- The configuration of repeatTimes does not take effect. It is automatically inferred based on the source operand and mask. The data type of the source operand is uint32_t. Each iteration processes 256 bytes of data (64 elements), so two repeats are required.
- src0BlockStride = 1. There is no interval between data blocks in a single iteration of the source operand src0Local.
- src0RepeatStride = 4. The interval between adjacent iterations of the source operand src0Local is four data blocks. Therefore, the second repeat starts from the 33rd element.
- src1Pattern is set to the user-defined mode. src1RepeatStride = 0. The interval between adjacent iterations of src1Pattern is 0 data blocks. Therefore, the second repeat still starts from the start address of src1Pattern.
- Built-in fixed mode
#include "kernel_operator.h" class KernelGatherMask { public: __aicore__ inline KernelGatherMask () {} __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* dstGm) { src0Global.SetGlobalBuffer((__gm__ uint16_t*)src0Gm); dstGlobal.SetGlobalBuffer((__gm__ uint16_t*)dstGm); pipe.InitBuffer(inQueueSrc0, 1, 128 * sizeof(uint16_t)); pipe.InitBuffer(outQueueDst, 1, 128 * sizeof(uint16_t)); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<uint16_t> src0Local = inQueueSrc0.AllocTensor<uint16_t>(); AscendC::DataCopy(src0Local, src0Global, 128); inQueueSrc0.EnQue(src0Local); } __aicore__ inline void Compute() { AscendC::LocalTensor<uint16_t> src0Local = inQueueSrc0.DeQue<uint16_t>(); AscendC::LocalTensor<uint16_t> dstLocal = outQueueDst.AllocTensor<uint16_t>(); uint32_t mask = 0; // Set mask to 0 in normal mode. uint64_t rsvdCnt = 0; // Save the number of elements retained after filtering. uint8_t src1Pattern = 2; // Built-in fixed mode // reduceMode = false. Use the normal mode. // src0BlockStride = 1. Data is continuously read and written at an interval of one block in a single iteration. // repeatTimes = 1. Repeat iteration once. // src0RepeatStride = 0. Repeat once. Therefore, set src0RepeatStride to 0. // src1RepeatStride = 0. Repeat once. Therefore, set src1RepeatStride to 0. AscendC::GatherMask(dstLocal, src0Local, src1Pattern, false, mask, { 1, 1, 0, 0 }, rsvdCnt); outQueueDst.EnQue<uint16_t>(dstLocal); inQueueSrc0.FreeTensor(src0Local); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<uint16_t> dstLocal = outQueueDst.DeQue<uint16_t>(); AscendC::DataCopy(dstGlobal, dstLocal, 128); outQueueDst.FreeTensor(dstLocal); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc0; AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst; AscendC::GlobalTensor<uint16_t> src0Global, dstGlobal; }; extern "C" __global__ __aicore__ void gather_mask_simple_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* dstGm) { KernelGatherMask op; op.Init(src0Gm, dstGm); op.Process(); }Result example:
Input (src0Local): [1 2 3 ... 128] Input (src1Pattern): src1Pattern = 2; Output (dstLocal): [2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 undefine ..undefined] Output (rsvdCnt): 64

