Scatter (ISASI)

Function Usage

Generates a new result tensor based on a given continuous input tensor, a destination address offset tensor, and the offset address, and distributes the input tensor to the result tensor.

Scatters elements in the source operand src to positions (specified by using dst_offset and base_addr) in the destination operand dst.

Prototype

  • Computation of the first n data elements of a tensor
    1
    2
    template <typename T>
    __aicore__ inline void Scatter(const LocalTensor<T>& dstLocal, const LocalTensor<T>& srcLocal, const LocalTensor<uint32_t>& dstOffsetLocal, const uint32_t dstBaseAddr, const uint32_t count)
    
  • High-dimensional tensor sharding computation
    • Bitwise mask mode
      1
      2
      template <typename T>
      __aicore__ inline void Scatter(const LocalTensor<T>& dstLocal, const LocalTensor<T>& srcLocal, const LocalTensor<uint32_t>& dstOffsetLocal, const uint32_t dstBaseAddr, const uint64_t mask[], const uint8_t repeatTimes, const uint8_t srcRepStride)
      
    • Contiguous mask mode
      1
      2
      template <typename T>
      __aicore__ inline void Scatter(const LocalTensor<T>& dstLocal, const LocalTensor<T>& srcLocal, const LocalTensor<uint32_t>& dstOffsetLocal, const uint32_t dstBaseAddr, const uint64_t mask, const uint8_t repeatTimes, const uint8_t srcRepStride)
      

Parameters

Table 1 Parameters in the template

Parameter

Description

T

Operand data type.

For the Atlas inference product 's AI Core, the supported data types are uint16_t, uint32_t, float, and half.

For the Atlas 200I/500 A2 inference products , the supported data types are uint8_t, int8_t, uint16_t, int16_t, half, uint32_t, int32_t, and float.

Table 2 Parameters

Parameter

Input/Output

Meaning

dstLocal

Output

Destination operand of the LocalTensor type. The address needs to be 32-byte aligned.

srcLocal

Input

Source operand of the LocalTensor type. The data type is the same as that of dstLocal.

dstOffsetLocal

Input

Address offset of each element in dst. The address offset must be greater than or equal to 0. The offset is relative to the dst base address and supports the data type uint32_t. The value unit is byte.

For the following models, the value range of the address offset must be within the range of uint32_t.

Atlas inference product 's AI Core

For the following models, the value range of the address offset is as follows: When the operand is 8-bit, the value range is [0, 216 – 1]. When the operand is 16-bit, the value range is [0, 217 – 1]. When the operand is 32-bit or 64-bit, the value range must be within the range of uint32_t; otherwise, unexpected output may occur.

Atlas 200I/500 A2 inference products

dstBaseAddr

Input

Start offset of dstLocal, in bytes.

count

Input

How many data elements to be processed. The value cannot exceed the number of srcLocal and dstOffsetLocal elements.

mask

Input

The mask parameter is used to control the elements involved in computation in each iteration.

  • Contiguous mode: indicates the number of contiguous elements that participate in computation. The value range is related to the operand data type. The maximum number of elements that can be processed in each repeat varies according to the data type. When the operand is 8-bit or 16-bit, mask ∈ [1, 128]. When the operand is 32-bit, mask ∈ [1, 64]. When the operand is 64-bit, mask ∈ [1, 32].
  • Bitwise mode: controls the elements that participate in computation by bit. If a bit is set to 1, the corresponding element participates in the computation. If a bit is set to 0, the corresponding element is masked in the computation. The parameter type is a uint64_t array with a length of 2.

    For example, if mask = [0, 8] and 8 = 0b1000, only the fourth element participates in computation.

    The parameter value range is related to the operand data type. The maximum number of elements that can be processed in each repeat varies according to the data type. When the operand is 8-bit or 16-bit, mask[0] and mask[1] ∈ [0, 264 – 1] and cannot be 0 at the same time. When the operand is 32-bit, mask[1] is 0 and mask[0] ∈ (0, 264 – 1]. When the operand is 64-bit, mask[1] is 0 and mask[0] ∈ (0, 232 – 1].

repeatTimes

Input

Number of instruction iterations. Data of eight data blocks is collected in each iteration. Data range: repeatTimes ∈ [0,255]

srcRepStride

Input

Address stride of the operand between adjacent iterations. The unit is data block.

Availability

Atlas inference product 's AI Core

Atlas 200I/500 A2 inference products

Constraints

Example

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
#include "kernel_operator.h"

template <typename T>
class ScatterTest {
public:
    __aicore__ inline ScatterTest() {}
    __aicore__ inline void Init(__gm__ uint8_t* dstGm, __gm__ uint8_t* srcGm,
        __gm__ uint8_t* dstOffsetGm, const uint32_t count)
    {
        m_elementCount = count;
        m_dstGlobal.SetGlobalBuffer((__gm__ T*)dstGm);
        m_srcGlobal.SetGlobalBuffer((__gm__ T*)srcGm);
        m_dstOffsetGlobal.SetGlobalBuffer((__gm__ uint32_t*)dstOffsetGm);
        m_pipe.InitBuffer(m_queIn, 2, m_elementCount * sizeof(uint32_t));
        m_pipe.InitBuffer(m_queOut, 1, m_elementCount * sizeof(uint32_t));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<T> srcLocal = m_queIn.AllocTensor<T>();
        AscendC::DataCopy(srcLocal, m_srcGlobal, m_elementCount);
        m_queIn.EnQue(srcLocal);
        AscendC::LocalTensor<uint32_t> dstOffsetLocal = m_queIn.AllocTensor<uint32_t>();
        AscendC::DataCopy(dstOffsetLocal, m_dstOffsetGlobal, m_elementCount);
        m_queIn.EnQue(dstOffsetLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<T> srcLocal = m_queIn.DeQue<T>();
        AscendC::LocalTensor<uint32_t> dstOffsetLocal = m_queIn.DeQue<uint32_t>();
        AscendC::LocalTensor<T> dstLocal = m_queOut.AllocTensor<T>();
        dstLocal.SetSize(m_elementCount);
        AscendC::Scatter(dstLocal, srcLocal, dstOffsetLocal, (uint32_t)0, m_elementCount);
        m_queIn.FreeTensor(srcLocal);
        m_queIn.FreeTensor(dstOffsetLocal);
        m_queOut.EnQue(dstLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<T> dstLocal = m_queOut.DeQue<T>();
        AscendC::DataCopy(m_dstGlobal, dstLocal, m_elementCount);
        m_queOut.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe m_pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> m_queCalc;
    AscendC::GlobalTensor<T> m_valueGlobal;
    uint32_t m_concatRepeatTimes;
    uint32_t m_sortRepeatTimes;
    uint32_t m_extractRepeatTimes;
    uint32_t m_elementCount;
    AscendC::GlobalTensor<uint32_t> m_dstOffsetGlobal;
    AscendC::GlobalTensor<T> m_srcGlobal;
    AscendC::GlobalTensor<T> m_dstGlobal;
    AscendC::TQue<AscendC::QuePosition::VECIN, 2> m_queIn;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> m_queOut;
}; // class ScatterTest

#define KERNEL_SCATTER(T, count)                                                                    \
    extern "C" __global__ __aicore__ void kernel_scatter_##T##_##count(GM_ADDR dstGm, GM_ADDR srcGm,\
        GM_ADDR dstOffsetGm)                                                                        \
    {                                                                                               \
        ScatterTest<T> op;                                                                          \
        op.Init(dstGm, srcGm, dstOffsetGm, count);                                                  \
        op.Process();                                                                               \
    }
Result example:
Input dstOffsetLocal:
[254 252 250 ... 4 2 0]
Input srcLocal (128 data elements of the half type):
[0 1 2 ... 125 126 127]
Output dstGlobal:
[127 126 125 ... 2 1 0]