Slice Data Movement

Applicability

Product

Supported/Unsupported

Atlas A3 training products/Atlas A3 inference products

Atlas A2 training products/Atlas A2 inference products

Atlas 200I/500 A2 inference products

x

Atlas inference product's AI Core

Atlas inference product's Vector Core

x

Atlas training products

x

Functions

Supports slice-based data movement, extracting a subset of multi-dimensional tensor data for movement.

Prototype

  • Global Memory -> Local Memory
    1
    2
    template <typename T>
    __aicore__ inline void DataCopy(const LocalTensor<T>& dst, const GlobalTensor<T>& src, const SliceInfo dstSliceInfo[], const SliceInfo srcSliceInfo[], const uint32_t dimValue = 1)
    
  • Local Memory -> Global Memory
    1
    2
    template <typename T>
    __aicore__ inline void DataCopy(const GlobalTensor<T> &dst, const LocalTensor<T> &src, const SliceInfo dstSliceInfo[], const SliceInfo srcSliceInfo[], const uint32_t dimValue = 1)
    

For details about the supported data paths and data types of each prototype, see Supported Channels and Data Types.

Parameters

Table 1 Parameters in the template

Parameter

Description

T

Data type of the source operand and destination operand. For details about the supported data types, see Supported Channels and Data Types.

Table 2 Parameters of the slice data movement APIs

Parameter

Input/Output

Meaning

dst

Output

Destination operand of the LocalTensor or GlobalTensor type.

src

Input

Source operand of the LocalTensor or GlobalTensor type.

srcSliceInfo/dstSliceInfo

Input

Slice information of the destination or source operand, which is of the SliceInfo type.

For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_data_copy.h. Replace ${INSTALL_DIR} with the actual CANN component directory.

dimValue

Input

Operand dimension information. The default value is 1.

Table 3 Parameters in the SliceInfo structure

Field

Meaning

startIndex

Location of the start element of a slice.

endIndex

Location of the end element of a slice.

stride

Number of interval elements between slices.

burstLen

Length of each horizontal data slice. It takes effect only when the dimension is one. If the dimension is greater than one, it must be set to 1. The unit is data block (32 bytes). For example, if the list of srcSliceInfo is {{16, 70, 7, 3, 87}, {0, 2, 1, 1, 3}}, then {16, 70, 7, 3, 87} indicates the slice information in the first dimension, burstLen is set to 3 (indicating that the size of a slice data segment is three data blocks); {0, 2, 1, 1, 3} indicates the slice information in the second dimension, and burstLen can only be set to 1.

shapeValue

Original length of the current dimension. The unit is element.

The preceding parameters are parsed by using specific examples. The following figure shows the details.

Figure 1 Parameter parsing
  • dimValue is 2, indicating that the dimension is two for the operand.
  • srcSliceInfo is {{16, 70, 7, 3, 87}, {0, 2, 1, 1, 3}}.
    • {16, 70, 7, 3, 87} is configured for a single row, that is, from a one-dimensional perspective. Each element represents a number.

      startIndex = 16 indicates that the valid data segment starts from the 16th number.

      endIndex = 70 indicates that the valid data segment ends at the 70th number.

      stride = 7 and the unit is element, indicating the number of elements between two adjacent slice data segments. In the figure, the stride is seven zeros.

      burstLen = 3 and the unit is 32 bytes, indicating that the size of a slice data segment is three data blocks in this valid data segment.

      shapeValue = 87 indicates the length of a single row (unit: element), that is, 8 × 10 + 7 = 87 elements.

    • {0, 2, 1, 1, 3} is configured for multiple rows, that is, from a two-dimensional perspective. Each element represents one row.

      startIndex = 0 indicates that the valid data segment starts from row 0.

      endIndex = 2 indicates that the valid data segment ends at row 2.

      stride = 1 indicates that the interval element between two adjacent slice data segments is one row.

      burstLen must be set to 1 when dimValue is greater than 1.

      shapeValue = 3 indicates that there are three rows in total.

  • dstSliceInfo is {{0, 47, 0, 3, 48}, {0, 1, 0, 1, 2}}.
    • {0, 47, 0, 3, 48} is configured for a single row, that is, from a one-dimensional perspective. Each element represents a number.

      startIndex = 0 indicates that the valid data segment starts from the 0th number.

      endIndex = 47 indicates that the valid data segment ends at the 47th number.

      stride = 0 and the unit is element, indicating the number of elements between two adjacent slice data segments. That is, there is no distance between the two slice data segments.

      burstLen = 3 and the unit is 32 bytes, indicating that the size of a slice data segment is three data blocks in this valid data segment.

      shapeValue = 48 indicates the length of a single row (unit: element), that is, 8 × 6 = 48 elements.

    • {0, 1, 0, 1, 2} is configured for multiple rows, that is, from a two-dimensional perspective. Each element represents one row.

      startIndex = 0 indicates that the valid data segment starts from row 0.

      endIndex = 1 indicates that the valid data segment ends at row 1.

      stride = 0 indicates that there is no distance between two adjacent slice data segments.

      burstLen must be set to 1 when dimValue is greater than 1.

      shapeValue = 2 indicates that there are two rows in total.

Returns

None

Restrictions

  • You need to compute the size of the horizontal burstLen during slice data movements using the following formula: Number of horizontal slice elements x sizeof(T)/32 bytes. The number of horizontal slice elements multiplied by sizeof(T) must be a multiple of 32 bytes.
  • During slice data movements, the size of the SliceInfo structure array must be the same as that of dimValue and cannot exceed 8.
  • During slice data movements, the size of the srcSliceInfo array must be the same as that of the dstSliceInfo array, and the burstLen in the two structures must be the same (srcSliceInfo[i].burstLen = dstSliceInfo[i].burstLen).
  • Slice data movements have certain requirements on parameters. You are advised to move slice data on the NPU side after the simulation result on the CPU is correct by referring to the calling examples.

Supported Channels and Data Types

The following data channels are expressed by using the logical position TPosition and the corresponding physical channels are also specified. For details about the mapping between TPosition and physical memory, see Table 1.

Table 4 Global memory to local memory data paths and supported data types

Internal Model

Datapath

Data Types of the Source and Destination Operands (Same)

Atlas inference product's AI Core

GM -> VECIN (GM -> UB)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, float

Atlas A2 training products/Atlas A2 inference products

GM -> VECIN (GM -> UB)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, float

Atlas A3 training products/Atlas A3 inference products

GM -> VECIN (GM -> UB)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, float

Table 5 Local memory to global memory data paths and supported data types

Internal Model

Datapath

Data Types of the Source and Destination Operands (Same)

Atlas inference product's AI Core

VECOUT, CO2 -> GM (UB -> GM)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, float

Atlas A2 training products/Atlas A2 inference products

VECOUT -> GM (UB -> GM)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, float

Atlas A3 training products/Atlas A3 inference products

VECOUT -> GM (UB -> GM)

int8_t, uint8_t, int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, float

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
#include "kernel_operator.h"
// In this sample, the tensor data type is float.
template <typename T>
class KernelDataCopySliceGM2UB {
public:
    __aicore__ inline KernelDataCopySliceGM2UB()
    {}
    __aicore__ inline void Init(__gm__ uint8_t* dstGm, __gm__ uint8_t* srcGm)
    {
        AscendC::SliceInfo srcSliceInfoIn[] = {{16, 70, 7, 3, 87}, {0, 2, 1, 1, 3}};// As shown by the example input data, startIndex is 16, endIndex is 70, burstLen is 3, stride is 7, and shapeValue is 87.
        AscendC::SliceInfo dstSliceInfoIn[] = {{0, 47, 0, 3, 48}, {0, 1, 0, 1, 2}};// UB space is insufficient. You are advised to set the stride to 0.
        uint32_t dimValueIn = 2;
        uint32_t dstDataSize = 96;
        uint32_t srcDataSize = 261;
        dimValue = dimValueIn;
        for (uint32_t i = 0; i < dimValueIn; i++) {
            srcSliceInfo[i].startIndex = srcSliceInfoIn[i].startIndex;
            srcSliceInfo[i].endIndex = srcSliceInfoIn[i].endIndex;
            srcSliceInfo[i].stride = srcSliceInfoIn[i].stride;
            srcSliceInfo[i].burstLen = srcSliceInfoIn[i].burstLen;
            srcSliceInfo[i].shapeValue = srcSliceInfoIn[i].shapeValue;
            dstSliceInfo[i].startIndex = dstSliceInfoIn[i].startIndex;
            dstSliceInfo[i].endIndex = dstSliceInfoIn[i].endIndex;
            dstSliceInfo[i].stride = dstSliceInfoIn[i].stride;
            dstSliceInfo[i].burstLen = dstSliceInfoIn[i].burstLen;
            dstSliceInfo[i].shapeValue = dstSliceInfoIn[i].shapeValue;
        }
        srcGlobal.SetGlobalBuffer((__gm__ T *)srcGm);
        dstGlobal.SetGlobalBuffer((__gm__ T *)dstGm);
        pipe.InitBuffer(inQueueSrcVecIn, 1, dstDataSize * sizeof(T));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<T> srcLocal = inQueueSrcVecIn.AllocTensor<T>();
        AscendC::DataCopy(srcLocal, srcGlobal,  dstSliceInfo, srcSliceInfo, dimValue);
        inQueueSrcVecIn.EnQue(srcLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<T> srcOutLocal = inQueueSrcVecIn.DeQue<T>();
        AscendC::DataCopy(dstGlobal, srcOutLocal, dstSliceInfo, dstSliceInfo, dimValue);
        inQueueSrcVecIn.FreeTensor(srcOutLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrcVecIn;
    AscendC::GlobalTensor<T> srcGlobal;
    AscendC::GlobalTensor<T> dstGlobal;
    AscendC::SliceInfo dstSliceInfo[K_MAX_DIM];
    AscendC::SliceInfo srcSliceInfo[K_MAX_DIM]; // K_MAX_DIM = 8
    uint32_t dimValue;
};
extern "C" __global__ __aicore__ void kernel_data_copy_slice_out2ub(__gm__ uint8_t* src_gm, __gm__ uint8_t* dst_gm)
{
    KernelDataCopySliceGM2UB<TYPE> op;
    op.Init(dst_gm, src_gm);
    op.Process();
}

For details about the result example, see Figure 1.