Duplicate

Function Usage

Copies a variable or an immediate for multiple times and fill it in the vector (PAR indicates the number of elements that can be processed by a vector unit in an iteration.):

Prototype

  • Computation of the first n data elements of a tensor
    1
    2
    template <typename T>
    void Duplicate(const LocalTensor<T>& dstLocal, const T& scalarValue, const int32_t& calCount)
    
  • High-dimensional tensor sharding computation
    • Bitwise mask mode
      1
      2
      template <typename T, bool isSetMask = true>
      void Duplicate(const LocalTensor<T>& dstLocal, const T& scalarValue, uint64_t mask[], const uint8_t repeatTimes, const uint16_t dstBlockStride, const uint8_t dstRepeatStride)
      
    • Contiguous mask mode
      1
      2
      template <typename T, bool isSetMask = true>
      void Duplicate(const LocalTensor<T>& dstLocal, const T& scalarValue, uint64_t mask, const uint8_t repeatTimes, const uint16_t dstBlockStride, const uint8_t dstRepeatStride)
      

Parameters

Table 1 Parameters in the template

Parameter

Description

T

Operand data type.

For the Atlas Training Series Product , the supported data types are uint16_t, int16_t, half, uint32_t, int32_t, and float.

isSetMask

Indicates whether to set mask inside the API.

  • true: sets mask inside the API.
  • false: sets mask outside the API. Developers need to use the SetVectorMask API to set the mask value. In this mode, the mask value in the input parameter of this API must be set to MASK_PLACEHOLDER.
Table 2 Parameters

Parameter

Input/Output

Meaning

dstLocal

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The start address of the LocalTensor must be 32-byte aligned.

scalarValue

Input

A variable or an immediate, for the source operand to be copied. The data type must be the same as that of the element in dstLocal.

calCount

Input

Number of elements of the input data.

mask

Input

mask is used to control the elements that participate in computation in each iteration.

  • Contiguous mode: indicates the number of contiguous elements that participate in computation. The value range is related to the operand data type. The maximum number of elements that can be processed in each iteration varies according to the data type. When the operand is 16-bit, mask ∈ [1, 128]. When the operand is 32-bit, mask ∈ [1, 64]. When the operand is 64-bit, mask ∈ [1, 32].
  • Bitwise mode: controls the elements that participate in computation by bit. If a bit is set to 1, the corresponding element participates in the computation. If a bit is set to 0, the corresponding element is masked in the computation. The parameter type is a uint64_t array whose length is 2.

    For example, if mask = [0, 8] and 8 = 0b1000, only the fourth element participates in computation.

    The parameter value range is related to the operand data type. The maximum number of elements that can be processed in each iteration varies according to the data type. When the operand is 16-bit, mask[0] and mask[1] ∈ [0, 264 -1] and cannot be 0 at the same time. When the operand is 32-bit, mask[1] is 0 and mask[0] ∈ (0, 264 – 1]. When the operand is 64-bit, mask[1] is 0 and mask[0] ∈ (0, 232 – 1].

repeatTimes

Input

The Vector Unit reads 8 data blocks (32 bytes each and 256 bytes in total) of contiguous data each time, and has to go through several repeats before all data can be read and computed. repeatTimes indicates the number of repeats (iterations).

dstBlockStride

Input

Address stride of the vector destination operand between different data blocks in a single iteration

dstRepeatStride

Input

Address stride of the vector destination operand for the same data block between adjacent iterations

Availability

Atlas Training Series Product

Precautions

  • For details about the alignment requirements of the operand address offset, see General Restrictions.
  • Ensure that the immediate data does not exceed the size range corresponding to the element data type in dstLocal.

Returns

None

Example

This example shows only part of the code used in the computation process. To run the sample code, copy the code snippet and replace the Compute function in Template Sample.

  • Example of high-dimensional tensor sharding computation (contiguous mask mode)
    1
    2
    3
    4
    5
    6
    uint64_t mask = 128;
    half scalar = 18.0;
    // repeatTimes = 2, 128 elements one repeat, 256 elements total
    // dstBlkStride = 1, no gap between blocks in one repeat
    // dstRepStride = 8, no gap between repeats
    AscendC::Duplicate(dstLocal, scalar, mask, 2, 1, 8 );
    
  • Example of high-dimensional tensor sharding computation (bitwise mask mode)
    1
    2
    3
    4
    5
    6
    uint64_t mask[2] = { UINT64_MAX, UINT64_MAX };
    half scalar = 18.0;
    // repeatTimes = 2, 128 elements one repeat, 256 elements total
    // dstBlkStride = 1, no gap between blocks in one repeat
    // dstRepStride = 8, no gap between repeats
    AscendC::Duplicate(dstLocal, scalar, mask, 2, 1, 8 );
    
  • Example of computing the first n data elements of a tensor
    1
    2
    half inputVal(18.0);
    AscendC::Duplicate<half>(dstLocal, inputVal, srcDataSize);
    
Result example:
Input: [0 1.0 2.0... 254.0 255.0] // The input data is not concerned and will be overwritten by Duplicate.
Output: [18.0 18.0 18.0... 18.0 18.0]

More Examples

You can refer to the following examples to learn how to use the high-dimensional tensor sharding computation APIs of the Duplicate instruction to perform more flexible operations and implement more advanced functions. This example shows only part of the code used in the computation process. To run the example code, copy the code snippet and replace the content in bold of the Compute function in the example template. (Pay attention to the data type.)

  • Use the contiguous mask mode of a high-dimensional tensor sharding computation API to implement discontinuous data computation.
    1
    2
    3
    4
    5
    uint64_t mask = 64;  // Only the first 64 bits are computed in each iteration.
    half scalar = 18.0;
    // repeatTimes = 2, 128 elements one repeat, 256 elements total
    // dstBlkStride = 2, dstRepStride = 8 
    AscendC::Duplicate(dstLocal, scalar, mask, 2, 1, 8 );
    

    Result example:

    1
    2
    [18.0 18.0 18.0 ... 18.0  undefined ... undefined 
     18.0 18.0 18.0 ... 18.0 undefined ... undefined ](The length of each segment of the computed result or undefined data is 64.)
    
  • Use the bitwise mask mode of a high-dimensional tensor sharding computation API to implement discontinuous data computation.
    1
    2
    3
    4
    5
    6
    uint64_t mask[2] = { UINT64_MAX, 0 }; // mask[0] is set to max, mask[1] is set to empty, and only the first 64 bits are computed each time.
    half scalar = 18.0;
    // repeatTimes = 2, 128 elements one repeat, 512 elements total
    // dstBlkStride = 1, no gap between blocks in one repeat
    // dstRepStride = 8, no gap between repeats
    AscendC::Duplicate(dstLocal, scalar, mask, 2, 1, 8);
    
    Result example:
    Input (src0Local): [1.0 2.0 3.0... 256.0]
    Input (src1Local): half scalar = 18.0;
    Output (dstLocal):
    [18.0 18.0 18.0 ... 18.0 undefined ... undefined
     18.0 18.0 18.0 ... 18.0 undefined ... undefined](The length of each segment of the computed result or undefined data is 64.)
  • Set the dataBlockStride parameter of a high-dimensional tensor sharding computation API to implement discontinuous data computation.
    1
    2
    3
    4
    5
    6
    uint64_t mask = 128;
    half scalar = 18.0;
    // repeatTimes = 1, 128 elements one repeat, 256 elements total
    // dstBlkStride = 2, 1 block gap between blocks in one repeat
    // dstRepStride = 0, repeatTimes = 1
    AscendC::Duplicate(dstLocal, scalar, mask, 1, 2, 0);
    
    Result example:
    Input (src0Local): [1.0 2.0 3.0... 256.0]
    Input (src1Local): half scalar = 18.0;
    Output (dstLocal):
    [18.0 18.0 18.0 ... 18.0 undefined ... undefined
     18.0 18.0 18.0 ... 18.0 undefined ... undefined
     18.0 18.0 18.0 ... 18.0 undefined ... undefined
     18.0 18.0 18.0 ... 18.0 undefined ... undefined
     18.0 18.0 18.0 ... 18.0 undefined ... undefined
     18.0 18.0 18.0 ... 18.0 undefined ... undefined
     18.0 18.0 18.0 ... 18.0 undefined ... undefined
     18.0 18.0 18.0 ... 18.0 undefined ... undefined](The length of each segment of the computed result is 16.)
  • Set the repeatStride parameter of a high-dimensional tensor sharding computation API to implement discontinuous data computation.
    1
    2
    3
    4
    5
    6
    uint64_t mask = 64;
    half scalar = 18.0;
    // repeatTimes = 2, 128 elements one repeat, 256 elements total
    // dstBlkStride = 1, no gap between blocks in one repeat
    // dstRepStride = 12, 4 blocks gap between repeats
    AscendC::Duplicate(dstLocal, scalar, mask, 2, 1, 12);
    
    Result example:
    Input (src0Local): [1.0 2.0 3.0... 256.0]
    Input (src1Local): half scalar = 18.0;
    Output (dstLocal):
    [18.0 18.0 18.0 ... 18.0 undefined ... undefined 18.0 18.0 18.0 ... 18.0](The length of each segment of the computed result is 64, and that of undefined data is 128.)

Template Sample

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
#include "kernel_operator.h"
class KernelDuplicate {
public:
    __aicore__ inline KernelDuplicate() {}
    __aicore__ inline void Init(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
    {
        srcGlobal.SetGlobalBuffer((__gm__ half*)src);
        dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
        pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half));
        pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
        AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
        inQueueSrc.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
        AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
        half inputVal(18.0);
        AscendC::Duplicate<half>(dstLocal, inputVal, srcDataSize);
        outQueueDst.EnQue<half>(dstLocal);
        inQueueSrc.FreeTensor(srcLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
        AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
    int srcDataSize = 256;
    int dstDataSize = 256;
};
extern "C" __global__ __aicore__ void duplicate_kernel(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
{
    KernelDuplicate op;
    op.Init(src, dstGm);
    op.Process();
}