Duplicate
Function Usage
Copies a variable or an immediate multiple times to fill a vector. PAR indicates the number of elements that a vector unit can process in one iteration:

Prototype
- Computation of the first n data elements of a tensor

      template <typename T>
      void Duplicate(const LocalTensor<T>& dstLocal, const T& scalarValue, const int32_t& calCount)

- High-dimensional tensor sharding computation
  - Bitwise mask mode

        template <typename T, bool isSetMask = true>
        void Duplicate(const LocalTensor<T>& dstLocal, const T& scalarValue, uint64_t mask[], const uint8_t repeatTimes, const uint16_t dstBlockStride, const uint8_t dstRepeatStride)

  - Contiguous mask mode

        template <typename T, bool isSetMask = true>
        void Duplicate(const LocalTensor<T>& dstLocal, const T& scalarValue, uint64_t mask, const uint8_t repeatTimes, const uint16_t dstBlockStride, const uint8_t dstRepeatStride)

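The two mask modes select elements in different ways: the contiguous form takes a count, while the bitwise form takes an array of bit flags. The sketch below relates them on the host side. `ContiguousToBitwiseMask` is a hypothetical helper, not part of Ascend C, and it assumes that bit i of the mask (lowest bit of `mask[0]` first) corresponds to element i of an iteration. The examples on this page are consistent with that assumption at the word level (`{UINT64_MAX, 0}` selects the first 64 elements), but the bit order within a word is not stated here.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical host-side helper (not an Ascend C API): builds the bitwise
// mask that selects the same leading elements as a contiguous mask `count`.
// Assumption: bit i (LSB of mask[0] first) maps to element i of an iteration.
std::array<uint64_t, 2> ContiguousToBitwiseMask(uint64_t count) {
    assert(count <= 128);  // two 64-bit words describe at most 128 elements
    std::array<uint64_t, 2> mask = {0, 0};
    for (uint64_t word = 0; word < 2; ++word) {
        uint64_t bitsInWord = 0;
        if (count > 64 * word) {
            bitsInWord = count - 64 * word;
            if (bitsInWord > 64) {
                bitsInWord = 64;
            }
        }
        // A shift by 64 is undefined for a 64-bit value, so treat it separately.
        mask[word] = (bitsInWord == 64) ? UINT64_MAX : ((1ULL << bitsInWord) - 1);
    }
    return mask;
}
```

Under this assumption, a count of 128 yields `{UINT64_MAX, UINT64_MAX}` and a count of 64 yields `{UINT64_MAX, 0}`, which are the mask values used in the examples on this page.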
Parameters
| Parameter | Description |
|---|---|
| T | Operand data type. For the |
| isSetMask | Indicates whether to set mask inside the API. |

| Parameter | Input/Output | Meaning |
|---|---|---|
| dstLocal | Output | Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. |
| scalarValue | Input | Source operand to be copied, either a variable or an immediate. Its data type must be the same as that of the elements in dstLocal. |
| calCount | Input | Number of elements of the input data. |
| mask | Input | Controls which elements participate in computation in each iteration. |
| repeatTimes | Input | Number of repeats (iterations). The Vector Unit reads 8 contiguous data blocks (32 bytes each, 256 bytes in total) per iteration, so several iterations are needed before all data is read and computed. |
| dstBlockStride | Input | Address stride of the vector destination operand between different data blocks in a single iteration. |
| dstRepeatStride | Input | Address stride of the vector destination operand for the same data block between adjacent iterations. |
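To make the interaction of mask, repeatTimes, dstBlockStride, and dstRepeatStride concrete, here is a minimal host-side reference model in plain C++. It is a sketch under stated assumptions, not the Ascend C implementation: `DuplicateModel` and `RepeatsNeeded` are hypothetical names, both strides are assumed to be counted in 32-byte data blocks (as the comments in the examples on this page suggest), mask bit i is assumed to select element i of an iteration, and `int16_t` stands in for `half` because standard C++ has no 2-byte floating-point type.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kBlockBytes = 32;      // size of one data block
constexpr int kBlocksPerRepeat = 8;  // 8 blocks (256 bytes) per iteration

// Hypothetical reference model of the strided form (not an Ascend C API).
// Assumptions: strides are in units of data blocks; mask bit i selects
// element i of an iteration.
template <typename T>
void DuplicateModel(std::vector<T>& dst, T value,
                    const std::array<uint64_t, 2>& mask, int repeatTimes,
                    int dstBlockStride, int dstRepeatStride) {
    static_assert(sizeof(T) == 2 || sizeof(T) == 4, "sketch covers 16/32-bit types");
    const int elemsPerBlock = kBlockBytes / static_cast<int>(sizeof(T));
    const int elemsPerRepeat = elemsPerBlock * kBlocksPerRepeat;
    for (int r = 0; r < repeatTimes; ++r) {
        for (int e = 0; e < elemsPerRepeat; ++e) {
            const bool selected = ((mask[e / 64] >> (e % 64)) & 1ULL) != 0;
            if (!selected) {
                continue;  // masked-out elements are left untouched
            }
            const int block = e / elemsPerBlock;   // block inside the iteration
            const int offset = e % elemsPerBlock;  // element inside the block
            const std::size_t index =
                static_cast<std::size_t>(r * dstRepeatStride + block * dstBlockStride) *
                    elemsPerBlock + offset;
            assert(index < dst.size());
            dst[index] = value;
        }
    }
}

// Iterations needed to cover `count` elements when every element of an
// iteration participates and there are no gaps.
template <typename T>
int RepeatsNeeded(int count) {
    const int elemsPerRepeat = kBlockBytes * kBlocksPerRepeat / static_cast<int>(sizeof(T));
    return (count + elemsPerRepeat - 1) / elemsPerRepeat;
}
```

With a 2-byte element type, one iteration covers 128 elements, so 256 elements need two iterations. Because repeatTimes is a `uint8_t` in the prototypes, a single call can express at most 255 iterations.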
Availability
Precautions
- For details about the alignment requirements of the operand address offset, see General Restrictions.
- Ensure that the immediate value does not exceed the value range of the element data type of dstLocal.
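Both precautions can be checked on the host before a kernel is launched. The following is a minimal sketch. `FitsElementRange` and `IsBlockAligned` are hypothetical helpers, not Ascend C APIs. The range check is limited to integer element types of up to 32 bits, and the alignment check covers only the 32-byte start-address rule stated for dstLocal, assuming the base address of the tensor is already aligned. The full rules are in General Restrictions.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <type_traits>

// Hypothetical helper: true if `value` is representable in element type T.
// Restricted to integer types of at most 32 bits to keep the sketch simple.
template <typename T>
bool FitsElementRange(long long value) {
    static_assert(std::is_integral<T>::value && sizeof(T) <= 4,
                  "sketch covers integer element types up to 32 bits");
    return value >= static_cast<long long>(std::numeric_limits<T>::lowest()) &&
           value <= static_cast<long long>(std::numeric_limits<T>::max());
}

// Hypothetical helper: true if an offset of `elementOffset` elements of type T
// from a 32-byte-aligned base address is itself 32-byte aligned.
template <typename T>
bool IsBlockAligned(std::size_t elementOffset) {
    return (elementOffset * sizeof(T)) % 32 == 0;
}
```

For example, 32768 does not fit an `int16_t` destination, and an offset of 8 two-byte elements (16 bytes) is not 32-byte aligned, while an offset of 16 is.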
Returns
None
Example
This example shows only part of the code used in the computation process. To run the sample code, copy the code snippet and replace the Compute function in Template Sample.
- Example of high-dimensional tensor sharding computation (contiguous mask mode)

      uint64_t mask = 128;
      half scalar = 18.0;
      // repeatTimes = 2, 128 elements one repeat, 256 elements total
      // dstBlkStride = 1, no gap between blocks in one repeat
      // dstRepStride = 8, no gap between repeats
      AscendC::Duplicate(dstLocal, scalar, mask, 2, 1, 8);

- Example of high-dimensional tensor sharding computation (bitwise mask mode)

      uint64_t mask[2] = { UINT64_MAX, UINT64_MAX };
      half scalar = 18.0;
      // repeatTimes = 2, 128 elements one repeat, 256 elements total
      // dstBlkStride = 1, no gap between blocks in one repeat
      // dstRepStride = 8, no gap between repeats
      AscendC::Duplicate(dstLocal, scalar, mask, 2, 1, 8);

- Example of computing the first n data elements of a tensor

      half inputVal(18.0);
      AscendC::Duplicate<half>(dstLocal, inputVal, srcDataSize);

Result example:

      Input: [0 1.0 2.0... 254.0 255.0] // The input data is irrelevant and will be overwritten by Duplicate.
      Output: [18.0 18.0 18.0... 18.0 18.0]

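When testing a kernel like the one above, it can help to generate the expected output on the host and compare it with what the device returns. The sketch below models the first-n form with `std::fill_n`. `DuplicateFirstN` and `CountMismatches` are hypothetical host-side helpers, not Ascend C APIs. The model leaves elements beyond calCount unchanged, which is an assumption of the sketch and not a documented guarantee of the device API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side model of Duplicate(dstLocal, scalarValue, calCount):
// fills the first `calCount` elements of `dst` with `value`.
template <typename T>
void DuplicateFirstN(std::vector<T>& dst, T value, std::size_t calCount) {
    assert(calCount <= dst.size());
    std::fill_n(dst.begin(), calCount, value);
}

// Counts positions where the actual output differs from the expected output.
template <typename T>
std::size_t CountMismatches(const std::vector<T>& actual,
                            const std::vector<T>& expected) {
    assert(actual.size() == expected.size());
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < actual.size(); ++i) {
        if (actual[i] != expected[i]) {
            ++mismatches;
        }
    }
    return mismatches;
}
```

A test harness could build a 256-element expected vector filled with 18.0, copy the kernel output back to the host, and require `CountMismatches` to return 0.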
More Examples
The following examples show how to use the high-dimensional tensor sharding computation APIs of the Duplicate instruction for more flexible operations and more advanced functions. Each example shows only part of the code used in the computation process. To run an example, copy the code snippet and replace the corresponding content of the Compute function in Template Sample. (Pay attention to the data type.)
- Use the contiguous mask mode of a high-dimensional tensor sharding computation API to implement discontinuous data computation.

      uint64_t mask = 64; // Only the first 64 elements are computed in each iteration.
      half scalar = 18.0;
      // repeatTimes = 2, 128 elements one repeat, 256 elements total
      // dstBlkStride = 1, dstRepStride = 8
      AscendC::Duplicate(dstLocal, scalar, mask, 2, 1, 8);

  Result example:

      [18.0 18.0 18.0 ... 18.0 undefined ... undefined 18.0 18.0 18.0 ... 18.0 undefined ... undefined]

  (The length of each segment of the computed result or undefined data is 64.)

- Use the bitwise mask mode of a high-dimensional tensor sharding computation API to implement discontinuous data computation.

      // mask[0] is set to max and mask[1] is set to empty, so only the first 64 elements are computed in each iteration.
      uint64_t mask[2] = { UINT64_MAX, 0 };
      half scalar = 18.0;
      // repeatTimes = 2, 128 elements one repeat, 256 elements total
      // dstBlkStride = 1, no gap between blocks in one repeat
      // dstRepStride = 8, no gap between repeats
      AscendC::Duplicate(dstLocal, scalar, mask, 2, 1, 8);

  Result example:

      Input (src0Local): [1.0 2.0 3.0... 256.0]
      Input (src1Local): half scalar = 18.0;
      Output (dstLocal): [18.0 18.0 18.0 ... 18.0 undefined ... undefined 18.0 18.0 18.0 ... 18.0 undefined ... undefined]

  (The length of each segment of the computed result or undefined data is 64.)

- Set the dstBlockStride parameter of a high-dimensional tensor sharding computation API to implement discontinuous data computation.

      uint64_t mask = 128;
      half scalar = 18.0;
      // repeatTimes = 1, 128 elements one repeat, 256 elements total
      // dstBlkStride = 2, 1 block gap between blocks in one repeat
      // dstRepStride = 0, repeatTimes = 1
      AscendC::Duplicate(dstLocal, scalar, mask, 1, 2, 0);

  Result example:

      Input (src0Local): [1.0 2.0 3.0... 256.0]
      Input (src1Local): half scalar = 18.0;
      Output (dstLocal): [18.0 18.0 18.0 ... 18.0 undefined ... undefined 18.0 18.0 18.0 ... 18.0 undefined ... undefined 18.0 18.0 18.0 ... 18.0 undefined ... undefined 18.0 18.0 18.0 ... 18.0 undefined ... undefined 18.0 18.0 18.0 ... 18.0 undefined ... undefined 18.0 18.0 18.0 ... 18.0 undefined ... undefined 18.0 18.0 18.0 ... 18.0 undefined ... undefined 18.0 18.0 18.0 ... 18.0 undefined ... undefined]

  (The length of each segment of the computed result is 16.)

- Set the dstRepeatStride parameter of a high-dimensional tensor sharding computation API to implement discontinuous data computation.

      uint64_t mask = 64;
      half scalar = 18.0;
      // repeatTimes = 2, 128 elements one repeat, 256 elements total
      // dstBlkStride = 1, no gap between blocks in one repeat
      // dstRepStride = 12, 4 blocks gap between repeats
      AscendC::Duplicate(dstLocal, scalar, mask, 2, 1, 12);

  Result example:

      Input (src0Local): [1.0 2.0 3.0... 256.0]
      Input (src1Local): half scalar = 18.0;
      Output (dstLocal): [18.0 18.0 18.0 ... 18.0 undefined ... undefined 18.0 18.0 18.0 ... 18.0]

  (The length of each segment of the computed result is 64, and that of undefined data is 128.)

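Because strides introduce gaps, the destination tensor may need to be larger than the number of elements actually written. The sketch below estimates the minimum number of destination elements a strided call touches. `RequiredDstElements` is a hypothetical host-side helper, not an Ascend C API. It assumes that strides are counted in data blocks, as in the examples above, and that dstBlockStride is at least 1 so that element order within an iteration is preserved.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: number of destination elements spanned by a strided
// call, i.e. the index of the last written element plus one.
//   lastMaskedElem : highest element index selected by the mask in one iteration
//   elemsPerBlock  : elements per 32-byte block (16 for half, 8 for float)
// Assumptions: strides are in data blocks and dstBlockStride >= 1.
std::size_t RequiredDstElements(int lastMaskedElem, int repeatTimes,
                                int dstBlockStride, int dstRepeatStride,
                                int elemsPerBlock) {
    assert(repeatTimes >= 1 && dstBlockStride >= 1 && dstRepeatStride >= 0);
    assert(lastMaskedElem >= 0 && lastMaskedElem < 8 * elemsPerBlock);
    const int block = lastMaskedElem / elemsPerBlock;   // block inside the iteration
    const int offset = lastMaskedElem % elemsPerBlock;  // element inside the block
    // The last iteration starts (repeatTimes - 1) * dstRepeatStride blocks in.
    const std::size_t lastBlockIndex =
        static_cast<std::size_t>(repeatTimes - 1) * dstRepeatStride +
        static_cast<std::size_t>(block) * dstBlockStride;
    return lastBlockIndex * elemsPerBlock + offset + 1;
}
```

Applied to the dstRepeatStride example above (mask of 64, so the last selected element is 63, with two repeats, block stride 1, repeat stride 12, and 16 half elements per block), the helper returns 256, matching the 64 + 128 + 64 layout of the result. For the dstBlockStride example it returns 240, because the final gap block is never written.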
Template Sample

    #include "kernel_operator.h"
    class KernelDuplicate {
    public:
        __aicore__ inline KernelDuplicate() {}
        __aicore__ inline void Init(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
        {
            srcGlobal.SetGlobalBuffer((__gm__ half*)src);
            dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
            pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half));
            pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
            AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
            inQueueSrc.EnQue(srcLocal);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
            AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
            half inputVal(18.0);
            AscendC::Duplicate<half>(dstLocal, inputVal, srcDataSize);
            outQueueDst.EnQue<half>(dstLocal);
            inQueueSrc.FreeTensor(srcLocal);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
            AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
            outQueueDst.FreeTensor(dstLocal);
        }
    private:
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc;
        AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
        AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
        int srcDataSize = 256;
        int dstDataSize = 256;
    };
    extern "C" __global__ __aicore__ void duplicate_kernel(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
    {
        KernelDuplicate op;
        op.Init(src, dstGm);
        op.Process();
    }