GatherMask

Applicability

Product

Supported/Unsupported

Atlas A3 training products/Atlas A3 inference products

Atlas A2 training products/Atlas A2 inference products

Atlas 200I/500 A2 inference products

Atlas inference product's AI Core

Atlas inference product's Vector Core

x

Atlas training products

x

Functions

Selects elements from the source operand and writes them to the destination operand based on a gather mask (for data collection) that corresponds to either the binary of the built-in fixed mode or the binary of the user-defined input tensor values.

Prototype

  • Custom mode
    1
    2
    template <typename T, typename U, GatherMaskMode mode = defaultGatherMaskMode>
    __aicore__ inline void GatherMask(const LocalTensor<T>& dst, const LocalTensor<T>& src0, const LocalTensor<U>& src1Pattern, const bool reduceMode, const uint32_t mask, const GatherMaskParams& gatherMaskParams, uint64_t& rsvdCnt)
    
  • Built-in fixed mode
    1
    2
    template <typename T, GatherMaskMode mode = defaultGatherMaskMode>
    __aicore__ inline void GatherMask(const LocalTensor<T>& dst, const LocalTensor<T>& src0, const uint8_t src1Pattern, const bool reduceMode, const uint32_t mask, const GatherMaskParams& gatherMaskParams, uint64_t& rsvdCnt)
    

Parameters

Table 1 Parameters in the template

Parameter

Meaning

T

Data types of the source operand src0 and destination operand dst.

For the Atlas A3 training products/Atlas A3 inference products, the supported data types are half, bfloat16_t, uint16_t, int16_t, float, uint32_t, and int32_t.

For the Atlas A2 training products/Atlas A2 inference products, the supported data types are half, bfloat16_t, uint16_t, int16_t, float, uint32_t, and int32_t.

For the Atlas 200I/500 A2 inference products, the supported data types are half, uint16_t, int16_t, float, uint32_t, and int32_t.

For the Atlas inference product's AI Core, the supported data types are half, uint16_t, int16_t, float, uint32_t, and int32_t.

U

Data type of src1Pattern in user-defined mode. The supported data types are , uint16_t, and uint32_t.

  • If the data type of the destination operand is half, uint16_t, int16_t, or , src1Pattern must be of the uint16_t type.
  • If the data type of the destination operand is float, uint32_t, or int32_t, src1Pattern must be of the uint32_t data type.

mode

Reserved. A default value is provided. You do not need to set this parameter.

Table 2 Parameters

Parameter

Input/Output

Meaning

dst

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The start address of the LocalTensor must be 32-byte aligned.

src0

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The start address of the LocalTensor must be 32-byte aligned.

Its data type must be the same as that of the destination operand.

src1Pattern

Input

There are two modes of gather mask (for data collection): built-in fixed mode and custom mode. Elements are from the source operand and written to the destination operand based on either the binary of the built-in fixed mode or the binary of the custom input tensor values. 1: selected; 0: not selected.

  • Built-in fixed mode: The data type of src1Pattern is uint8_t, and the value range is [1,7]. All repeat iterations use the same gather mask. src1RepeatStride cannot be configured.
    • 1: 01010101…0101 # Obtain an even number of index elements for each repeat.
    • 2: 10101010…1010 # Obtain an odd number of index elements for each repeat.
    • 3: 00010001…0001 # Obtain the first element among every four elements in each repeat.
    • 4: 00100010…0010 # Obtain the second element among every four elements in each repeat.
    • 5: 01000100…0100 # Obtain the third element among every four elements in each repeat.
    • 6: 10001000…1000 # Obtain the fourth element among every four elements in each repeat.
    • 7: 11111111...1111 # Obtain all elements in each repeat.

    The Atlas A3 training products/Atlas A3 inference products supports modes 1 to 7.

    The Atlas A2 training products/Atlas A2 inference products supports modes 1 to 7.

    The Atlas 200I/500 A2 inference products supports modes 1 to 7.

    The Atlas inference product's AI Core supports modes 1 to 6.

  • User-defined mode: The data type of src1Pattern is LocalTensor. The iteration interval is determined by src1RepeatStride. src1Pattern is continuously consumed in an iteration.

reduceMode

Input

This parameter is used to select the mode of the mask parameter, and it is of the bool type. The options are as follows:

  • false: normal mode. In this mode, 256-byte data is operated in each repeat operation. The total data computation amount is as follows: repeatTimes x 256 bytes.
    • The mask parameter is invalid. You are advised to set it to 0.
    • Set repeatTimes, src0BlockStride, and src0RepeatStride as required.
    • src1Pattern can be set to the built-in fixed mode or user-defined mode. In user-defined mode, src1RepeatStride can be set as required.
  • true: counter mode. Based on the meanings of parameters such as mask, this mode can be configured in either of the following methods:
    • Configuration method 1: mask elements are operated in each repeat operation. The total data computation amount is as follows: repeatTimes x Number of mask elements.
      • The mask value is set to the number of elements computed in each repeat.
      • Set repeatTimes, src0BlockStride, and src0RepeatStride as required.
      • src1Pattern can be set to the built-in fixed mode or user-defined mode. In user-defined mode, src1RepeatStride can be set as required.
    • Configuration method 2: The total data computation amount is the number of mask elements.
      • The mask value is configured as the total data computation amount.
      • The repeatTimes value does not take effect. The number of iterations of an instruction is determined by the source operand and mask.
      • Set src0BlockStride and src0RepeatStride as required.
      • src1Pattern can be set to the built-in fixed mode or user-defined mode. In user-defined mode, src1RepeatStride can be set as required.

    For the Atlas A3 training products/Atlas A3 inference products, configuration method 1 is supported.

    For the Atlas A2 training products/Atlas A2 inference products, configuration method 1 is supported.

    For the Atlas 200I/500 A2 inference products, configuration method 1 is supported.

    For the Atlas inference product's AI Core, configuration method 2 is supported.

mask

Input

This parameter is used to control the elements that participate in computation in each iteration. Based on reduceMode, there are two mask modes:

  • Normal mode: The mask is invalid. You are advised to set this parameter to 0.
  • Counter mode: The value range is [1, 232 – 1]. The meaning of the mask parameter in counter mode varies according to the product model. For details about the configuration methods, see the description of the reduceMode parameter.

gatherMaskParams

Input

Data structure that controls the operand address strides, of the GatherMaskParams type.

For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_gather.h. Replace ${INSTALL_DIR} with the actual CANN component directory.

For details about the parameters, see Table 3.

rsvdCnt

Output

Number of elements retained by the instruction, corresponding to the number of valid elements in dstLocal. The data type is uint64_t.

Table 3 Parameters in the GatherMaskParams structure

Parameter

Meaning

src0BlockStride

Address stride between different DataBlocks of the same iteration of src0 (interval between start addresses). The unit is DataBlock.

repeatTimes

Number of iteration repeats.

src0RepeatStride

Address stride between adjacent iterations of src0 (interval between start addresses). The unit is DataBlock.

src1RepeatStride

Address stride between adjacent iterations of src1 (interval between start addresses). The unit is DataBlock.

Returns

None

Restrictions

  • If the counter mode is used before this API is called, the counter mode needs to be set again after this API is called. (The reason is that the counter mode will switch to the normal mode after this API is called.)

Examples

  • Example of a custom tensor
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    #include "kernel_operator.h"
    class KernelGatherMask {
    public:
        __aicore__ inline KernelGatherMask () {}
        __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
        {
            src0Global.SetGlobalBuffer((__gm__ uint32_t*)src0Gm);
            src1Global.SetGlobalBuffer((__gm__ uint32_t*)src1Gm);
            dstGlobal.SetGlobalBuffer((__gm__ uint32_t*)dstGm);
            pipe.InitBuffer(inQueueSrc0, 1, 256 * sizeof(uint32_t));
            pipe.InitBuffer(inQueueSrc1, 1, 32 * sizeof(uint32_t));
            pipe.InitBuffer(outQueueDst, 1, 256 * sizeof(uint32_t));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<uint32_t> src0Local = inQueueSrc0.AllocTensor<uint32_t>();
            AscendC::LocalTensor<uint32_t> src1Local = inQueueSrc1.AllocTensor<uint32_t>();
            AscendC::DataCopy(src0Local, src0Global, 256);
            AscendC::DataCopy(src1Local, src1Global, 32);
            inQueueSrc0.EnQue(src0Local);
            inQueueSrc1.EnQue(src1Local);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<uint32_t> src0Local = inQueueSrc0.DeQue<uint32_t>();
            AscendC::LocalTensor<uint32_t> src1Local = inQueueSrc1.DeQue<uint32_t>();
            AscendC::LocalTensor<uint32_t> dstLocal = outQueueDst.AllocTensor<uint32_t>();
            uint32_t mask = 70;
           uint64_t rsvdCnt = 0;
            // reduceMode = true;    Use the counter mode.
            // src0BlockStride = 1. Data is continuously read and written at an interval of one data block in a single iteration.
            // repeatTimes = 2;      In counter mode, this parameter takes effect only on some product models.
            // src0RepeatStride = 4; The data interval between iterations of the source operand is four data blocks.
            // src1RepeatStride = 0; The data interval between iterations of src1 is 0 data blocks, that is, the data is read from the original position.
            AscendC::GatherMask (dstLocal, src0Local, src1Local, true, mask, { 1, 2, 4, 0 }, rsvdCnt);
            outQueueDst.EnQue<uint32_t>(dstLocal);
            inQueueSrc0.FreeTensor(src0Local);
            inQueueSrc1.FreeTensor(src1Local);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<uint32_t> dstLocal = outQueueDst.DeQue<uint32_t>();
            AscendC::DataCopy(dstGlobal, dstLocal, 256);
            outQueueDst.FreeTensor(dstLocal);
        }
    private:
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc0, inQueueSrc1;
        AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
        AscendC::GlobalTensor<uint32_t> src0Global, src1Global, dstGlobal;
    };
    extern "C" __global__ __aicore__ void gather_mask_simple_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
    {
        KernelGatherMask op;
        op.Init(src0Gm, src1Gm, dstGm);
        op.Process();
    }
    

    The following figure shows configuration method 1 in counter mode.

    • mask = 70. 70 elements are computed in each repeat.
    • repeatTimes = 2. There are two repeats in total.
    • src0BlockStride = 1. There is no interval between data blocks in a single iteration of the source operand src0Local.
    • src0RepeatStride = 4. The interval between adjacent iterations of the source operand src0Local is four data blocks. Therefore, the second repeat starts from the 33rd element.
    • src1Pattern is set to the user-defined mode. src1RepeatStride = 0. The interval between adjacent iterations of src1Pattern is 0 data blocks. Therefore, the second repeat still starts from the start address of src1Pattern.
    Figure 1 Configuration method 1 in counter mode

    The following figure shows configuration method 2 in counter mode.

    • mask = 70. A total of 70 elements are computed.
    • The configuration of repeatTimes does not take effect. It is automatically inferred based on the source operand and mask. The data type of the source operand is uint32_t. Each iteration processes 256 bytes of data (64 elements), so two repeats are required.
    • src0BlockStride = 1. There is no interval between data blocks in a single iteration of the source operand src0Local.
    • src0RepeatStride = 4. The interval between adjacent iterations of the source operand src0Local is four data blocks. Therefore, the second repeat starts from the 33rd element.
    • src1Pattern is set to the user-defined mode. src1RepeatStride = 0. The interval between adjacent iterations of src1Pattern is 0 data blocks. Therefore, the second repeat still starts from the start address of src1Pattern.
    Figure 2 Configuration method 2 in counter mode
  • Built-in fixed mode
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    #include "kernel_operator.h"
    class KernelGatherMask {
    public:
        __aicore__ inline KernelGatherMask () {}
        __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* dstGm)
        {
            src0Global.SetGlobalBuffer((__gm__ uint16_t*)src0Gm);
            dstGlobal.SetGlobalBuffer((__gm__ uint16_t*)dstGm);
            pipe.InitBuffer(inQueueSrc0, 1, 128 * sizeof(uint16_t));
            pipe.InitBuffer(outQueueDst, 1, 128 * sizeof(uint16_t));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<uint16_t> src0Local = inQueueSrc0.AllocTensor<uint16_t>();
            AscendC::DataCopy(src0Local, src0Global, 128);
            inQueueSrc0.EnQue(src0Local);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<uint16_t> src0Local = inQueueSrc0.DeQue<uint16_t>();
            AscendC::LocalTensor<uint16_t> dstLocal = outQueueDst.AllocTensor<uint16_t>();
     
            uint32_t mask = 0; // It is recommended that mask be set to 0 in normal mode.
            uint64_t rsvdCnt = 0; // Save the number of elements retained after filtering.
            uint8_t src1Pattern = 2; // Built-in fixed mode.
            // reduceMode = false. Use the normal mode.
            // src0BlockStride = 1. Data is continuously read and written at an interval of one block in a single iteration.
            // repeatTimes = 1. Repeat iteration once.
            // src0RepeatStride = 0. Repeat once. Therefore, set src0RepeatStride to 0.
            // src1RepeatStride = 0. Repeat once. Therefore, set src1RepeatStride to 0.
            AscendC::GatherMask(dstLocal, src0Local, src1Pattern, false, mask, { 1, 1, 0, 0 }, rsvdCnt);
     
            outQueueDst.EnQue<uint16_t>(dstLocal);
            inQueueSrc0.FreeTensor(src0Local);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<uint16_t> dstLocal = outQueueDst.DeQue<uint16_t>();
            
            AscendC::DataCopy(dstGlobal, dstLocal, 128);
            outQueueDst.FreeTensor(dstLocal);
        }
    private:
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc0;
        AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
        AscendC::GlobalTensor<uint16_t> src0Global, dstGlobal;
    };
     
    extern "C" __global__ __aicore__ void gather_mask_simple_kernel(__gm__ uint8_t* src0Gm, __gm__ uint8_t* dstGm)
    {
        KernelGatherMask op;
        op.Init(src0Gm, dstGm);
        op.Process();
    }
    

    Result example:

    Input (src0Local): [1 2 3 ... 128]
    Input (src1Pattern): src1Pattern = 2;
    Output data dstLocal: [2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50 52 54 56 58 60 62 64 66 68 70 72 74 76 78 80 82 84 86 88 90 92 94 96 98 100 102 104 106 108 110 112 114 116 118 120 122 124 126 128 undefined ..undefined]
    Output (rsvdCnt): 64