Duplicate
Applicability
| Product | Supported/Unsupported |
|---|---|
| | √ |
| | √ |
| | √ |
| | √ |
| | x |
| | √ |
Function Usage
Copies a variable or an immediate value multiple times to fill a vector.
Prototype
- Computation of the first n data elements of a tensor
  - Scalar as the source operand

        template <typename T>
        __aicore__ inline void Duplicate(const LocalTensor<T>& dst, const T& scalarValue, const int32_t& count)

- High-dimensional tensor sharding computation
  - Bitwise mask mode

        template <typename T, bool isSetMask = true>
        __aicore__ inline void Duplicate(const LocalTensor<T>& dst, const T& scalarValue, uint64_t mask[], const uint8_t repeatTime, const uint16_t dstBlockStride, const uint8_t dstRepeatStride)

  - Contiguous mask mode

        template <typename T, bool isSetMask = true>
        __aicore__ inline void Duplicate(const LocalTensor<T>& dst, const T& scalarValue, uint64_t mask, const uint8_t repeatTime, const uint16_t dstBlockStride, const uint8_t dstRepeatStride)
Parameters
| Parameter | Description |
|---|---|
| T | Operand data type. |
| isSetMask | Indicates whether to set mask inside the API. |

| Parameter | Input/Output | Meaning |
|---|---|---|
| dst | Output | Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. |
| scalarValue | Input | Source operand to be copied. The data type must be the same as that of the elements in dst. |
| count | Input | Number of elements involved in the computation. |
| mask/mask[] | Input | The mask parameter is used to control the elements involved in computation in each iteration. |
| repeatTime | Input | The Vector Unit reads 8 data blocks (32 bytes each and 256 bytes in total) of contiguous data each time, and has to go through several repeats before all data can be read and computed. repeatTime indicates the number of repeats. |
| dstBlockStride | Input | Address stride of the vector destination operand between different data blocks in a single repeat. |
| dstRepeatStride | Input | Address stride of the vector destination operand for the same data block between adjacent repeats. |
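The interaction of mask, repeatTime, dstBlockStride, and dstRepeatStride is easier to see in a small host-side model. The following is a sketch in plain C++, not Ascend C kernel code: `ModelDuplicateWrites` is a hypothetical helper, and it assumes that both strides are counted in 32-byte data blocks and that bit i of the bitwise mask enables element i within a repeat. It marks which dst elements a call would write.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical host-side model, NOT part of the Ascend C API.
// Returns one flag per dst element: true if the modeled call writes it.
// elemsPerBlock is 32 / sizeof(T), for example 16 for half.
// Assumption: strides are in units of 32-byte blocks, and mask bit i
// enables element i of each repeat (8 blocks per repeat).
std::vector<bool> ModelDuplicateWrites(std::size_t dstLen, std::size_t elemsPerBlock,
                                       const std::uint64_t mask[2], unsigned repeatTime,
                                       unsigned dstBlockStride, unsigned dstRepeatStride)
{
    constexpr std::size_t kBlocksPerRepeat = 8;
    // Two 64-bit mask words cover at most 128 elements per repeat.
    assert(elemsPerBlock >= 1 && elemsPerBlock <= 16);

    std::vector<bool> written(dstLen, false);
    for (unsigned r = 0; r < repeatTime; ++r) {
        for (std::size_t b = 0; b < kBlocksPerRepeat; ++b) {
            // First element of block b in repeat r.
            const std::size_t blockStart =
                (static_cast<std::size_t>(r) * dstRepeatStride + b * dstBlockStride) * elemsPerBlock;
            for (std::size_t e = 0; e < elemsPerBlock; ++e) {
                const std::size_t lane = b * elemsPerBlock + e;  // position within the repeat
                const bool enabled = ((mask[lane / 64] >> (lane % 64)) & 1u) != 0;
                if (!enabled) {
                    continue;
                }
                if (blockStart + e >= dstLen) {
                    throw std::out_of_range("modeled write falls outside dst");
                }
                written[blockStart + e] = true;
            }
        }
    }
    return written;
}
```

Under these assumptions, calling the model with a 256-element half buffer, mask `{UINT64_MAX, 0}`, repeatTime 2, dstBlockStride 1, and dstRepeatStride 8 marks elements 0 to 63 and 128 to 191 as written and leaves the rest untouched.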
Restrictions
- For details about the operand address alignment requirements, see General Address Alignment Restrictions.
Returns
None
Examples
This example shows only part of the code involved in the computation process (Compute). To run the code, copy the code block and paste it to the corresponding position in the Compute function in Template Samples.
- Example of high-dimensional tensor sharding computation (contiguous mask mode)

        uint64_t mask = 128;
        half scalar = 18.0;
        // repeatTime = 2, 128 elements one repeat, 256 elements total
        // dstBlockStride = 1, no gap between blocks in one repeat
        // dstRepeatStride = 8, no gap between repeats
        AscendC::Duplicate(dstLocal, scalar, mask, 2, 1, 8);

- Example of high-dimensional tensor sharding computation (bitwise mask mode)

        uint64_t mask[2] = { UINT64_MAX, UINT64_MAX };
        half scalar = 18.0;
        // repeatTime = 2, 128 elements one repeat, 256 elements total
        // dstBlockStride = 1, no gap between blocks in one repeat
        // dstRepeatStride = 8, no gap between repeats
        AscendC::Duplicate(dstLocal, scalar, mask, 2, 1, 8);

- Example of computing the first n data elements of a tensor, where the source operand is a scalar

        half inputVal(18.0);
        AscendC::Duplicate<half>(dstLocal, inputVal, srcDataSize);

Result example:
scalar: 18.0
dstLocal: [18.0 18.0 18.0 ... 18.0 18.0]
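The contiguous and bitwise examples above select the same elements: mask = 128 and mask[2] = { UINT64_MAX, UINT64_MAX } both enable all 128 half elements of a repeat. The sketch below shows that correspondence in plain host-side C++. `ContiguousToBitwiseMask` is a hypothetical helper, not an Ascend C function, and it assumes that a contiguous mask n enables the first n elements of a repeat and that bit i of the bitwise mask enables element i.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical helper, NOT part of the Ascend C API.
// Builds the bitwise mask that enables the first n elements of a repeat,
// which is what a contiguous mask of n is assumed to mean.
std::array<std::uint64_t, 2> ContiguousToBitwiseMask(unsigned n)
{
    assert(n <= 128);  // two 64-bit words hold at most 128 element bits
    std::array<std::uint64_t, 2> bits{0, 0};
    for (unsigned i = 0; i < n; ++i) {
        bits[i / 64] |= (std::uint64_t{1} << (i % 64));
    }
    return bits;
}
```

For instance, `ContiguousToBitwiseMask(64)` yields `{ UINT64_MAX, 0 }`. If such a helper were used with the bitwise overload, the array contents would need to be passed as a `uint64_t` array, for example through `data()`.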
More Samples
The following examples show how to use the high-dimensional tensor sharding computation APIs of the Duplicate instruction for more flexible operations and more advanced functions. Each example shows only part of the code in the computation process. To run the code, copy the snippet and paste it to the corresponding position in the Compute function in Template Samples.
- Use the contiguous mask mode of a high-dimensional tensor sharding computation API to implement discontinuous data calculation.

        uint64_t mask = 64; // Only the first 64 elements are calculated in each repeat.
        half scalar = 18.0;
        // repeatTime = 2, 128 elements one repeat, 256 elements total
        // dstBlockStride = 1, dstRepeatStride = 8
        AscendC::Duplicate(dstLocal, scalar, mask, 2, 1, 8);

Result example:
[18.0 18.0 18.0 ... 18.0 undefined ... undefined 18.0 18.0 18.0 ... 18.0 undefined ... undefined]
(Each written segment and each undefined segment is 64 elements long.)
- Use the bitwise mask mode of a high-dimensional tensor sharding computation API to implement discontinuous data calculation.

        uint64_t mask[2] = { UINT64_MAX, 0 }; // mask[0] has all bits set and mask[1] is empty, so only the first 64 elements are calculated in each repeat.
        half scalar = 18.0;
        // repeatTime = 2, 128 elements one repeat, 256 elements total
        // dstBlockStride = 1, no gap between blocks in one repeat
        // dstRepeatStride = 8, no gap between repeats
        AscendC::Duplicate(dstLocal, scalar, mask, 2, 1, 8);

Result example:
Input (scalar): 18.0
Output (dstLocal): [18.0 18.0 18.0 ... 18.0 undefined ... undefined 18.0 18.0 18.0 ... 18.0 undefined ... undefined]
(Each written segment and each undefined segment is 64 elements long.)
- Set the dstBlockStride parameter of a high-dimensional tensor sharding computation API to implement discontinuous data computation.

        uint64_t mask = 128;
        half scalar = 18.0;
        // repeatTime = 1, 128 elements written in the single repeat, spread over a 256-element dstLocal
        // dstBlockStride = 2, 1 block gap between blocks in one repeat
        // dstRepeatStride = 0, has no effect because repeatTime = 1
        AscendC::Duplicate(dstLocal, scalar, mask, 1, 2, 0);

Result example:
Input (scalar): 18.0
Output (dstLocal): [18.0 18.0 18.0 ... 18.0 undefined ... undefined 18.0 18.0 18.0 ... 18.0 undefined ... undefined 18.0 18.0 18.0 ... 18.0 undefined ... undefined 18.0 18.0 18.0 ... 18.0 undefined ... undefined 18.0 18.0 18.0 ... 18.0 undefined ... undefined 18.0 18.0 18.0 ... 18.0 undefined ... undefined 18.0 18.0 18.0 ... 18.0 undefined ... undefined 18.0 18.0 18.0 ... 18.0 undefined ... undefined]
(Each written segment and each undefined segment is 16 elements long, that is, one data block.)
- Set the dstRepeatStride parameter of a high-dimensional tensor sharding computation API to implement discontinuous data computation.

        uint64_t mask = 64;
        half scalar = 18.0;
        // repeatTime = 2, 64 elements written per repeat (mask = 64), 128 elements written in total
        // dstBlockStride = 1, no gap between blocks in one repeat
        // dstRepeatStride = 12, repeats start 12 blocks apart, 4 blocks more than the 8 blocks of one repeat
        AscendC::Duplicate(dstLocal, scalar, mask, 2, 1, 12);

Result example:
Input (scalar): 18.0
Output (dstLocal): [18.0 18.0 18.0 ... 18.0 undefined ... undefined 18.0 18.0 18.0 ... 18.0]
(Each written segment is 64 elements long, and the undefined segment between them is 128 elements long.)
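When strides leave gaps, it helps to check that dstLocal is large enough for the last element that is written. The sketch below is plain host-side C++ rather than kernel code. `RequiredDstElems` is a hypothetical helper, and it assumes the contiguous mask mode, strides counted in 32-byte data blocks, and 8 data blocks per repeat, as described in the parameter table.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical sizing helper, NOT part of the Ascend C API.
// Returns the smallest dst length (in elements) that contains every element
// written by a contiguous-mask call, under the stride model assumed above.
// elemsPerBlock is 32 / sizeof(T), for example 16 for half.
std::size_t RequiredDstElems(std::size_t elemsPerBlock, unsigned contiguousMask,
                             unsigned repeatTime, unsigned dstBlockStride,
                             unsigned dstRepeatStride)
{
    assert(elemsPerBlock >= 1);
    assert(contiguousMask <= 8 * elemsPerBlock);  // at most 8 blocks per repeat

    std::size_t needed = 0;
    for (unsigned r = 0; r < repeatTime; ++r) {
        for (std::size_t lane = 0; lane < contiguousMask; ++lane) {
            const std::size_t block = lane / elemsPerBlock;   // block within the repeat
            const std::size_t offset = lane % elemsPerBlock;  // element within the block
            const std::size_t index =
                (static_cast<std::size_t>(r) * dstRepeatStride + block * dstBlockStride) * elemsPerBlock
                + offset;
            needed = std::max(needed, index + 1);
        }
    }
    return needed;
}
```

With half data (16 elements per block), the dstRepeatStride = 12 sample above gives `RequiredDstElems(16, 64, 2, 1, 12) == 256`, which matches the 256-element dstLocal in Template Samples, and the dstBlockStride = 2 sample gives 240.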
Template Samples
    #include "kernel_operator.h"

    class KernelDuplicate {
    public:
        __aicore__ inline KernelDuplicate() {}
        __aicore__ inline void Init(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
        {
            srcGlobal.SetGlobalBuffer((__gm__ half*)src);
            dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
            pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half));
            pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }

    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
            AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
            inQueueSrc.EnQue(srcLocal);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
            AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
            half inputVal(18.0);
            AscendC::Duplicate<half>(dstLocal, inputVal, srcDataSize);
            outQueueDst.EnQue<half>(dstLocal);
            inQueueSrc.FreeTensor(srcLocal);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
            AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
            outQueueDst.FreeTensor(dstLocal);
        }

    private:
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc;
        AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
        AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
        int srcDataSize = 256;
        int dstDataSize = 256;
    };

    extern "C" __global__ __aicore__ void duplicate_kernel(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
    {
        KernelDuplicate op;
        op.Init(src, dstGm);
        op.Process();
    }