DataCopyPad(ISASI)

Function Usage

Enables data non-aligned movement. The supported channels are as follows:

GM->VECIN/VECOUT

VECIN/VECOUT->GM

VECIN/VECOUT->TSCM

During data movement from GM to VECIN/VECOUT, you can pad the data as required.

Prototype

  • dataCopyParams is of the DataCopyExtParams type. It supports a larger value range for parameters such as the operand stride compared to the DataCopyParams type.
    • Path: GM -> VECIN/VECOUT
      1
      2
      template <typename T>
      __aicore__ inline void DataCopyPad(const LocalTensor<T> &dstLocal, const GlobalTensor<T> &srcGlobal, const DataCopyExtParams &dataCopyParams, const DataCopyPadExtParams<T> &padParams)
      
    • Path: VECIN/VECOUT -> GM
      1
      2
      template <typename T>
      __aicore__ inline void DataCopyPad(const GlobalTensor<T> &dstGlobal, const LocalTensor<T> &srcLocal, const DataCopyExtParams &dataCopyParams)
      
    • Channel: VECIN/VECOUT-> TSCM. The actual channel is VECIN/VECOUT->GM->TSCM.

      The Atlas 200I/500 A2 inference products does not support this API prototype.

      1
      2
      template <typename T>
      __aicore__ inline void DataCopyPad(const LocalTensor<T> &dstLocal, const LocalTensor<T> &srcLocal, const DataCopyExtParams &dataCopyParams, const Nd2NzParams &nd2nzParams)
      
  • dataCopyParams is of the DataCopyParams type.
    • Path: GM -> VECIN/VECOUT
      1
      2
      template<typename T>
      __aicore__ inline void DataCopyPad(const LocalTensor<T>& dstLocal, const GlobalTensor<T>& srcGlobal, const DataCopyParams& dataCopyParams, const DataCopyPadParams& padParams)
      
    • Path: VECIN/VECOUT -> GM
      1
      2
      template<typename T>
      __aicore__ inline void DataCopyPad(const GlobalTensor<T>& dstGlobal, const LocalTensor<T>& srcLocal,const DataCopyParams& dataCopyParams)
      
    • Channel: VECIN/VECOUT-> TSCM. The actual channel is VECIN/VECOUT->GM->TSCM.

      The Atlas 200I/500 A2 inference products does not support this API prototype.

      1
      2
      template<typename T>
      __aicore__ inline void DataCopyPad(const LocalTensor<T>& dstLocal, const LocalTensor<T>& srcLocal, const DataCopyParams& dataCopyParams, const Nd2NzParams& nd2nzParams)
      

Parameters

Table 1 Parameters in the template

Parameter

Description

T

Data type of the operand and paddingValue (data value to be padded).

For the Atlas A2 training products / Atlas A2 inference products , the supported data type is half/bfloat16_t/int16_t/uint16_t/float/int32_t/uint32_t/int8_t/uint8_t/int64_t/uint64_t/double.

For the Atlas A3 training products / Atlas A3 inference products , the supported data type is half/bfloat16_t/int16_t/uint16_t/float/int32_t/uint32_t/int8_t/uint8_t/int64_t/uint64_t/double.

For the Atlas 200I/500 A2 inference products , the supported data type is int8_t/uint8_t/half/bfloat16_t/int16_t/uint16_t/float/int32_t/uint32_t.

Table 2 Parameters

Parameter

Input/Output

Description

dstLocal/dstGlobal

Output

Destination operand of type LocalTensor or GlobalTensor.

The start address of LocalTensor must be 32-byte aligned.

The start address of GlobalTensor does not require alignment.

srcLocal/srcGlobal

Input

Source operand of type LocalTensor or GlobalTensor.

The start address of LocalTensor must be 32-byte aligned.

The start address of GlobalTensor does not require alignment.

dataCopyParams

Input

Movement parameter.

  • DataCopyExtParams type. The definition is as follows. For details about the parameters, see Table 3.
    1
    2
    3
    4
    5
    6
    7
    struct DataCopyExtParams {
        uint16_t blockCount = 0;
        uint32_t blockLen = 0;
        uint32_t srcStride = 0;
        uint32_t dstStride = 0;
        uint32_t rsv = 0; // reserved information
    };
    
  • DataCopyParams type. The definition is as follows. For details about the parameters, see Table 4.
    1
    2
    3
    4
    5
    6
    struct DataCopyParams {
        uint16_t blockCount = 0;
        uint16_t blockLen = 0;
        uint16_t srcStride = 0;
        uint16_t dstStride = 0;
    };
    

padParams

Input

When moving data from GM to VECIN/VECOUT, pad data on the left or right of the data to be moved as required. padParams is used to control data padding. It is of the DataCopyPadExtParams type and is defined as follows. For details about the parameters, see Table 5.

1
2
3
4
5
6
template <typename T> struct DataCopyPadExtParams {
    bool isPad = false;
    uint8_t leftPadding = 0;
    uint8_t rightPadding = 0;
    T paddingValue = 0;
};

nd2nzParams

Input

When moving data from VECIN/VECOUT to TSCM, you can convert the data format from ND to NZ. nd2nzParams is used to control data format conversion. It is of the Nd2NzParams type and is defined as follows. For details about the parameters, see Table 7.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
struct Nd2NzParams {
    uint16_t ndNum = 0;
    uint16_t nValue = 0;
    uint16_t dValue = 0;
    uint16_t srcNdMatrixStride = 0;
    uint16_t srcDValue = 0;
    uint16_t dstNzC0Stride = 0;
    uint16_t dstNzNStride = 0;
    uint16_t dstNzMatrixStride = 0;
};

Note: ndNum of Nd2NzParams can only be set to 1.

Table 3 Parameters in the DataCopyExtParams structure

Parameter

Meaning

blockCount

Number of data chunks. The data type is uint16_t, and blockCount ∈ [1, 4095].

blockLen

Length of each data chunk, in byte, which supports non-aligned movement. The length unit of each data chunk is byte. The data type is uint32_t, and blockLen ∈ [1, 2097151].

srcStride

Source operand, that is, the interval between adjacent data chunks (the interval between the tail of the previous data chunk and the head of the next data chunk). If the logical position of the source operand is VECIN/VECOUT, the unit is data block (32 bytes). If the logical position of the source operand is GM, the unit is byte. The data type is uint32_t. The value of srcStride cannot exceed the value range of this data type.

dstStride

Destination operand, that is, the interval between adjacent data chunks (the interval between the tail of the previous data chunk and the head of the next data chunk). If the logical position of the destination operand is VECIN/VECOUT, the unit is data block (32 bytes). If the logical position of the destination operand is GM, the unit is byte. The data type is uint32_t. The value of dstStride cannot exceed the value range of this data type.

rsv

Reserved

Table 4 Parameters in the DataCopyParams structure

Parameter

Meaning

blockCount

Number of data chunks. The data type is uint16_t, and blockCount ∈ [1, 4095].

blockLen

Length of each data chunk, in byte, which supports non-aligned movement. The length unit of each data chunk is byte. The data type is uint16_t. The value of blockLen cannot exceed the value range of this data type.

srcStride

Source operand, that is, the interval between adjacent data chunks (the interval between the tail of the previous data chunk and the head of the next data chunk). If the logical position of the source operand is VECIN/VECOUT, the unit is data block (32 bytes). If the logical position of the source operand is GM, the unit is byte. The data type is uint16_t. The value of srcStride cannot exceed the value range of this data type.

dstStride

Destination operand, that is, the interval between adjacent data chunks (the interval between the tail of the previous data chunk and the head of the next data chunk). If the logical position of the destination operand is VECIN/VECOUT, the unit is data block (32 bytes). If the logical position of the destination operand is GM, the unit is byte. The data type is uint16_t. The value of dstStride cannot exceed the value range of this data type.

Table 5 Parameters in the DataCopyPadExtParams structure

Parameter

Meaning

isPad

Whether to pad custom data. The options are as follows:

true: Set a padding value.

false: No padding value is specified, and a random value is used.

leftPadding

Range of the data to be padded on the left of the data chunk. The unit is the number of elements.

The values of leftPadding and rightPadding cannot exceed 32 bytes.

rightPadding

Range of the data to be padded on the right of the data chunk. The unit is the number of elements.

The values of leftPadding and rightPadding cannot exceed 32 bytes.

paddingValue

Values of the data to be padded on the left and right. The values must be within the byte range occupied by the data.

The data type must be the same as that of the source operand. The data type is T.

When the data type length is 64 bits, this parameter can only be set to 0.

  • GM -> VECIN/VECOUT
    • Description
      • When blockLen + leftPadding + rightPadding meets the 32-byte alignment requirement, if isPad is false, the values padded on the left and right sides are random numbers by default; if not, the values padded on the left and right sides are configured padding values.
      • When blockLen + leftPadding + rightPadding does not meet the 32-byte alignment requirement, the framework pads some dummy data to ensure the 32-byte alignment. When both leftPadding and rightPadding are 0, the first element value of the data chunk to be transferred is padded by default. When leftPadding or rightPadding is not 0, if isPad is set to false, the data values padded on the left and right sides and dummy values are random numbers; if not, the data values padded on the left and right sides and dummy values are configured padding values.
    • Configuration example 1
      • The value of blockLen is 64, meaning each data chunk contains 64 bytes. The value of srcStride is 1. Because the logical position of the source operand is GM and the unit of srcStride is byte, there is 1-byte interval between adjacent data chunks of the source operand. The value of dstStride is 1. Because the logical position of the destination operand is VECIN/VECOUT and the unit of dstStride is a 32-byte data block, there is an interval of one data block between adjacent data chunks of the destination operand.
      • When blockLen + leftPadding + rightPadding meets the 32-byte alignment requirement, if isPad is set to false, the data values padded on the left and right sides are random numbers by default; if not, the data values padded on the left and right sides are configured padding values. In this example, both leftPadding and rightPadding are 0, indicating that padding is not performed.
      • When blockLen + leftPadding + rightPadding does not meet the 32-byte alignment requirement, the framework pads some dummy data to ensure the 32-byte alignment. When leftPadding or rightPadding is not 0, if isPad is set to false, the data values padded on the left and right sides and dummy values are random numbers; if not, the data values padded on the left and right sides are configured padding values.

    • Configuration example 2
      • The value of blockLen is 47, meaning each data chunk contains 47 bytes. The value of srcStride is 1, indicating that there is 1-byte interval between adjacent data chunks of the source operand. The value of dstStride is 1, indicating that there is an interval of one data block between adjacent data chunks of the destination operand.
      • When blockLen + leftPadding + rightPadding does not meet the 32-byte alignment requirement, and both leftPadding and rightPadding are 0, the first element value of the data chunk to be transferred is padded with dummy values by default.
      • When blockLen + leftPadding + rightPadding does not meet the 32-byte alignment requirement, and leftPadding or rightPadding is not 0, if isPad is false, the data values padded on the left and right sides and dummy values are random numbers; if not, the data values padded on the left and right sides and dummy values are configured padding values.

  • VECIN/VECOUT -> GM
    • When blockLen of each data chunk is 32-byte aligned, an example of DataCopyParams to be input is shown in the following figure. The value of blockLen is 64, meaning each data chunk contains 64 bytes. The value of srcStride is 1. Because the logical position of the source operand is VECIN/VECOUT and the unit of srcStride is a 32-byte data block, there is one data block between adjacent data blocks of the source operand. The value of dstStride is 1. Because the logical position of the destination operand is GM and the unit of dstStride is byte, there is 1-byte interval between adjacent data blocks of the destination operand.

    • requires 32-byte alignment, but blockLen of each data chunk does not meet the requirement. In this case, when the data chunk is moved out, the framework automatically pads some dummy data to ensure alignment. When the data chunk is moved to GM, the dummy data is automatically deprecated. The following figure shows an example of DataCopyParams to be input in this scenario and the principle of padding dummy data. The value of blockLen is 47, meaning each data chunk contains 47 bytes, which is not 32-byte aligned. The value of srcStride is 1, indicating that there is an interval of one data block between adjacent data blocks of the source operand. The value of dstStride is 1, indicating that there is a 1-byte interval between adjacent data blocks of the destination operand. When data is moved out, the framework automatically pads 17-byte dummy data to ensure alignment. When data is moved to GM, the padded dummy data is automatically deprecated.

  • VECIN/VECOUT->TSCM

    Note: The internal implementation involves the communication between the AIC and AIV, and the actual transfer path is VECIN/VECOUT -> GM -> TSCM. Sending communication messages causes overheads and affects the performance.

    Figure 1 VECIN/VECOUT -> TSCM data transfer illustrates the process of transferring data from VECIN/VECOUT to GM and then to TSCM. In the example, the data type is half, with a single data block (32 bytes) containing 16 half elements, and the source operands A1 to A6, B1 to B6, and C1 to C6 indicate the data to be transferred.

    • During the transfer from VECIN/VECOUT to GM, the data storage format remains unchanged as ND.
      • blockCount indicates the number of data chunks. Set it to 3.
      • blockLen indicates the size (in bytes) of a data chunk. Set it to 192 (which is 6 x 32).
      • srcStride indicates the interval between adjacent data chunks of the source operand (the interval between the tail of the previous data chunk and the head of the next data chunk). The logical position of the source operand is VECIN/VECOUT, the unit is data block, and two data chunks (A1 to A6 and B1 to B6) are separated by one A7. Therefore, set srtStride to 1.
      • dstStride indicates the interval between adjacent data chunks of the destination operand (the interval between the tail of the previous data chunk and the head of the next data chunk). The logical position of the destination operand is GM, the unit is byte, and two data chunks (A1 to A6 and B1 to B6) are separated by two empty data blocks. Therefore, set dstStride to 64Byte.
    • During the transfer from GM to TSCM, the data storage format is converted from ND to NZ.
      • ndNum is fixed to 1. That is, A1 to A6, B1 to B6, and C1 to C6 are considered as an entire ND matrix.
      • nValue indicates the number of rows in the ND matrix, that is, 3.
      • dValue indicates the number of elements in a row of the ND matrix, that is, 96 (which is 6 x 16).
      • srcNdMatrixStride indicates the distance between adjacent ND matrices. Because only one ND matrix is involved, srcNdMatrixStride can be set to 0.
      • srcDValue indicates a quantity of elements between the xth row and the (x+1)th row of the ND matrix. For example, a distance between A1 and B1 is eight data blocks, and there are 128 (which is 8 x 16) elements.
      • dstNzC0Stride indicates the number of data blocks between adjacent data blocks in a same row of src in the NZ matrix, for example, a distance between A1 and A2, that is, seven data blocks (A1 + blank + B1 + blank + C1 + blank x 2).
      • dstNzNStride indicates the number of data blocks between adjacent rows of src in the NZ matrix, for example, the distance between A1 and B1, that is, two data blocks (A1 + blank).
      • dstNzMatrixStride indicates the number of elements between adjacent NZ matrices. Because only one NZ matrix is involved, dstNzMatrixStride can be set to 1.
    Figure 1 VECIN/VECOUT -> TSCM data transfer

Returns

None

Availability

Atlas A2 training products / Atlas A2 inference products

Atlas A3 training products / Atlas A3 inference products

Atlas 200I/500 A2 inference products

Constraints

The values of leftPadding and rightPadding cannot exceed 32 bytes.

Example

This example implements the non-aligned movement process of GM -> VECIN -> GM.
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
#include "kernel_operator.h"

class TestDataCopyPad {
public:
    __aicore__ inline TestDataCopyPad() {}
    __aicore__ inline void Init(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm)
    {
        srcGlobal.SetGlobalBuffer((__gm__ half *)srcGm);
        dstGlobal.SetGlobalBuffer((__gm__ half *)dstGm);
        pipe.InitBuffer(inQueueSrc, 1, 32 * sizeof(half));
        pipe.InitBuffer(outQueueDst, 1, 32 * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
        AscendC::DataCopyExtParams copyParams{1, 20 * sizeof(half), 0, 0, 0}; // The last parameter of DataCopyExtParams is reserved.
        AscendC::DataCopyPadExtParams<half> padParams{true, 0, 2, 0};
        AscendC::DataCopyPad(srcLocal, srcGlobal, copyParams, padParams); // Move 40 bytes from GM to VECIN.
        inQueueSrc.EnQue<half>(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
        AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
        AscendC::Adds(dstLocal, srcLocal, scalar, 20);
        outQueueDst.EnQue(dstLocal);
        inQueueSrc.FreeTensor(srcLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
        AscendC::DataCopyExtParams copyParams{1, 20 * sizeof(half), 0, 0, 0};
        AscendC::DataCopyPad(dstGlobal, dstLocal, copyParams); // Move 40 bytes from VECIN to GM.
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<half> srcGlobal;
    AscendC::GlobalTensor<half> dstGlobal;
    AscendC::DataCopyPadExtParams<half> padParams;
    AscendC::DataCopyExtParams copyParams;
    half scalar = 0;
};

extern "C" __global__ __aicore__ void kernel_data_copy_pad_kernel(__gm__ uint8_t* src_gm, __gm__ uint8_t* dst_gm)
{
    TestDataCopyPad op;
    op.Init(src_gm, dst_gm);
    op.Process();
}
Result example:
Input data (src0Global): [1 2 3 ... 32]
Output data (dstGlobal):[1 2 3 ... 20]