Copy
Applicability
|
Product |
Supported/Unsupported |
|---|---|
|
|
√ |
|
|
√ |
|
|
√ |
|
|
x |
|
|
x |
|
|
x |
Functions
The transfer instruction between VECIN, VECCALC, and VECOUT supports the mask operation and DataBlock interval operation.
Prototype
- High-dimensional tensor sharding computation
- Bitwise mask mode
1 2
template <typename T, bool isSetMask = true> __aicore__ inline void Copy(const LocalTensor<T>& dst, const LocalTensor<T>& src, const uint64_t mask[], const uint8_t repeatTime, const CopyRepeatParams& repeatParams)
- Contiguous mask mode
1 2
template <typename T, bool isSetMask = true> __aicore__ inline void Copy(const LocalTensor<T>& dst, const LocalTensor<T>& src, const uint64_t mask, const uint8_t repeatTime, const CopyRepeatParams& repeatParams)
- Bitwise mask mode
Parameters
|
Parameter |
Description |
|---|---|
|
T |
Data type of the operand. For the For the For the |
|
isSetMask |
Indicates whether to set mask inside the API.
|
|
Parameter |
Input/Output |
Description |
|---|---|---|
|
dst |
Output |
Destination operand. The type is LocalTensor, and the supported TPosition is VECIN/VECCALC/VECOUT. The start address must be 32-byte aligned. |
|
src |
Input |
Source operand. The type is LocalTensor, and the supported TPosition is VECIN/VECCALC/VECOUT. The start address must be 32-byte aligned. The source operand must have the same data type as the destination operand. |
|
mask/mask[] |
Input |
The mask parameter is used to control the elements involved in computation in each iteration.
|
|
repeatTime |
Input |
Number of iteration repeats. The Vector Unit reads 256 bytes of contiguous data for computation each time. To read the complete data for processing, the unit needs to read the input data in multiple repeats. repeatTime indicates the number of repeats. For details about this parameter, see High-dimensional Sharding APIs. |
|
repeatParams |
Input |
Data structure that controls the operand address strides. It is of the CopyRepeatParams type. For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_data_copy.h. Replace ${INSTALL_DIR} with the actual CANN component directory. For details about the parameter description, see Table 3. |
Returns
None
Constraints
- The start addresses of the source and destination operands must be 32-byte aligned.
- Similar to the vector compute API, the copy API can be used together with the mask operation API. However, when the counter mode is used with the high-dimensional tiling compute API, it is different from the general counter mode. Specifically:
- General counter mode: The mask indicates the number of elements involved in the entire vector computation. The number of iterations does not take effect.
- In counter mode, the high-dimensional tiling API is used together with the copy API. The mask indicates the number of elements processed in each repeat, and the number of iterations takes effect. The following figure shows the details.
Example
This example shows only part of the code involved in the computation process. For the complete code, see Template Sample.
In this example, the operand data type is int16_t.
- Contiguous mask mode
1 2 3 4 5
uint64_t mask = 128; // repeatTime = 4, 128 elements one repeat, 512 elements total // dstStride, srcStride = 1, no gap between blocks in one repeat // dstRepStride, srcRepStride = 8, no gap between repeats AscendC::Copy(dstLocal, srcLocal, mask, 4, { 1, 1, 8, 8 });
Result example:
Input (srcLocal): [9 -2 8 ... 9] Output (dstLocal): [9 -2 8 ... 9]
- Bitwise mask mode
1 2 3 4 5
uint64_t mask[2] = { UINT64_MAX, UINT64_MAX }; // repeatTime = 4, 128 elements one repeat, 512 elements total // dstStride, srcStride = 1, no gap between blocks in one repeat // dstRepStride, srcRepStride = 8, no gap between repeats AscendC::Copy(dstLocal, srcLocal, mask, 4, { 1, 1, 8, 8 });
Result example:
Input (srcLocal): [9 -2 8 ... 9] Output (dstLocal): [9 -2 8 ... 9]
Template Sample
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 |
#include "kernel_operator.h" class KernelCopy { public: __aicore__ inline KernelCopy() {} __aicore__ inline void Init(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm) { srcGlobal.SetGlobalBuffer((__gm__ int32_t*)srcGm); dstGlobal.SetGlobalBuffer((__gm__ int32_t*)dstGm); pipe.InitBuffer(inQueueSrc, 1, 512 * sizeof(int32_t)); pipe.InitBuffer(outQueueDst, 1, 512 * sizeof(int32_t)); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<int32_t> srcLocal = inQueueSrc.AllocTensor<int32_t>(); AscendC::DataCopy(srcLocal, srcGlobal, 512); inQueueSrc.EnQue(srcLocal); } __aicore__ inline void Compute() { AscendC::LocalTensor<int32_t> srcLocal = inQueueSrc.DeQue<int32_t>(); AscendC::LocalTensor<int32_t> dstLocal = outQueueDst.AllocTensor<int32_t>(); uint64_t mask = 64; AscendC::Copy(dstLocal, srcLocal, mask, 4, { 1, 1, 8, 8 }); outQueueDst.EnQue<int32_t>(dstLocal); inQueueSrc.FreeTensor(srcLocal); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<int32_t> dstLocal = outQueueDst.DeQue<int32_t>(); AscendC::DataCopy(dstGlobal, dstLocal, 512); outQueueDst.FreeTensor(dstLocal); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc; AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst; AscendC::GlobalTensor<int32_t> srcGlobal, dstGlobal; }; extern "C" __global__ __aicore__ void copy_simple_kernel(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm) { KernelCopy op; op.Init(srcGm, dstGm); op.Process(); } |
