CreateVecIndex

Function Usage

Creates an index vector in dstLocal: elements start at firstValue and each subsequent element increases by 1, that is, dstLocal[i] = firstValue + i.

Prototype

  • Computation of the first n data elements of a tensor

    template <typename T>
    __aicore__ inline void CreateVecIndex(LocalTensor<T> dstLocal, const T &firstValue, uint32_t calCount)

  • High-dimensional tensor sharding computation
    • Contiguous mask mode

      template <typename T>
      __aicore__ inline void CreateVecIndex(LocalTensor<T> &dstLocal, const T &firstValue, uint64_t mask, uint8_t repeatTimes, uint16_t dstBlkStride, uint8_t dstRepStride)

    • Bitwise mask mode

      template <typename T>
      __aicore__ inline void CreateVecIndex(LocalTensor<T> &dstLocal, const T &firstValue, uint64_t mask[], uint8_t repeatTimes, uint16_t dstBlkStride, uint8_t dstRepStride)


Parameters

Table 1 Parameters

Parameter

Input/Output

Description

dstLocal

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The start address of the LocalTensor must be 32-byte aligned.

firstValue

Input

The first value of the index. The data type must be the same as that of the element in dstLocal.

calCount

Input

Number of index elements to generate in dstLocal.

mask

Input

mask is used to control the elements that participate in computation in each iteration.

  • Contiguous mode: indicates the number of contiguous elements that participate in computation. The value range is related to the operand data type. The maximum number of elements that can be processed in each iteration varies according to the data type. When the operand is 16-bit, mask ∈ [1, 128]. When the operand is 32-bit, mask ∈ [1, 64]. When the operand is 64-bit, mask ∈ [1, 32].
  • Bitwise mode: controls the elements that participate in computation by bit. If a bit is set to 1, the corresponding element participates in the computation. If a bit is set to 0, the corresponding element is masked in the computation. The parameter type is a uint64_t array whose length is 2.

    For example, if mask = [8, 0] and 8 = 0b1000, only the fourth element participates in the computation.

    The parameter value range is related to the operand data type. The maximum number of elements that can be processed in each iteration varies according to the data type. When the operand is 16-bit, mask[0] and mask[1] ∈ [0, 2^64 − 1] and cannot both be 0. When the operand is 32-bit, mask[1] is 0 and mask[0] ∈ (0, 2^64 − 1]. When the operand is 64-bit, mask[1] is 0 and mask[0] ∈ (0, 2^32 − 1].

repeatTimes

Input

Number of iteration repeats. The Vector unit reads 256 bytes of contiguous data in each iteration. To process the complete input, the unit may need multiple iterations; repeatTimes indicates that number of iterations.

For details about this parameter, see Common Parameters.

dstBlkStride

Input

Address stride of the destination operand between different data blocks in a single repeat. For details, see dataBlockStride.

dstRepStride

Input

Address stride of the destination operand for the same data block between adjacent repeats. For details, see repeatStride.

Availability

Precautions

  • For details about the alignment requirements of the operand address offset, see General Restrictions.
  • Ensure that the value of firstValue does not exceed the size range corresponding to the element data type in dstLocal.

Returns

None

Example

This example shows only part of the code used in the computation process. To run the sample code, copy the code snippet and replace related code snippets of the Compute function in Template Sample.

  • Example of high-dimensional tensor sharding computation (contiguous mask mode)
    uint64_t mask = 128;
    uint8_t repeatTimes = 1;
    uint16_t dstBlkStride = 1; // data is written contiguously within a single repeat
    uint8_t dstRepStride = 8;  // data is written contiguously across adjacent iterations
    AscendC::CreateVecIndex(dstLocal, (T)0, mask, repeatTimes, dstBlkStride, dstRepStride);
  • Example of high-dimensional tensor sharding computation (bitwise mask mode)
    uint64_t mask[2] = { UINT64_MAX, UINT64_MAX };
    uint8_t repeatTimes = 1;
    uint16_t dstBlkStride = 1; // data is written contiguously within a single repeat
    uint8_t dstRepStride = 8;  // data is written contiguously across adjacent iterations
    AscendC::CreateVecIndex(dstLocal, (T)0, mask, repeatTimes, dstBlkStride, dstRepStride);
  • Example of computing the first n data elements of a tensor
    AscendC::CreateVecIndex(dstLocal, (T)0, 128);
Result example:
Input data (firstValue): 0
Output data (dstLocal): [0 1 2... 127]

Template Sample

#include "kernel_operator.h"
template <typename T>
class CreateVecIndexTest {
public:
    __aicore__ inline CreateVecIndexTest() {}
    __aicore__ inline void Init(GM_ADDR dstGm, uint64_t mask, uint8_t repeatTimes,
        uint16_t dstBlkStride, uint8_t dstRepStride)
    {
        m_mask = mask;
        m_repeatTimes = repeatTimes;
        m_dstBlkStride = dstBlkStride;
        m_dstRepStride = dstRepStride;
        m_elementCount = m_dstBlkStride * m_dstRepStride * 32 * m_repeatTimes / sizeof(T);
        m_dstGlobal.SetGlobalBuffer((__gm__ T*)dstGm);
        m_pipe.InitBuffer(m_queOut, 1, m_dstBlkStride * m_dstRepStride * 32 * m_repeatTimes);
        m_pipe.InitBuffer(m_queTmp, 1, 1024);
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        // CreateVecIndex has no input tensor, so there is nothing to copy in.
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<T> dstLocal = m_queOut.AllocTensor<T>();
        AscendC::LocalTensor<uint8_t> tmpLocal = m_queTmp.AllocTensor<uint8_t>();
        AscendC::Duplicate(dstLocal, (T)0, m_elementCount);
        AscendC::PipeBarrier<PIPE_ALL>();
        AscendC::CreateVecIndex(dstLocal, (T)0, m_repeatTimes * 256 / sizeof(T));
        m_queOut.EnQue(dstLocal);
        m_queTmp.FreeTensor(tmpLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<T> dstLocal = m_queOut.DeQue<T>();

        AscendC::DataCopy(m_dstGlobal, dstLocal, m_elementCount);
        m_queOut.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe m_pipe;
    uint32_t m_elementCount;
    uint64_t m_mask; // must match the uint64_t mask parameter of Init; uint32_t would truncate
    uint32_t m_repeatTimes;
    uint32_t m_dstBlkStride;
    uint32_t m_dstRepStride;
    AscendC::GlobalTensor<T> m_dstGlobal;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> m_queOut;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> m_queTmp;
}; // class CreateVecIndexTest
template <typename T>
__global__ __aicore__ void testCreateVecIndex(GM_ADDR dstGm, uint64_t mask, uint8_t repeatTimes,
        uint16_t dstBlkStride, uint8_t dstRepStride)
{
    CreateVecIndexTest<T> op;
    op.Init(dstGm, mask, repeatTimes, dstBlkStride, dstRepStride);
    op.Process();
}