CastDeq

Applicability

Product	Supported
Atlas A3 training products / Atlas A3 inference products	√
Atlas A2 training products / Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	x
Atlas inference product 's AI Core	√
Atlas inference product 's Vector Core	x
Atlas training products	x

Function

Quantizes the input and converts its precision. The conversion formula varies with the data type.

For int16_t input, this API quantizes and converts the int16_t data to int8_t or uint8_t data. Before using this API, call SetDeqScale to configure quantization parameters such as scale, offset, and signMode.
The template parameter isVecDeq controls whether to enable vector quantization.
- When isVecDeq is set to false, this API quantizes the input and converts its precision according to the scale, offset, and signMode values set via SetDeqScale. The formula is as follows:
  $\text{[math]}$
- When isVecDeq is set to true, this API quantizes the input and converts its precision in a cyclic manner based on the 16 sets of quantization parameters scale₀ to scale₁₅, offset₀ to offset₁₅, and signMode₀ to signMode₁₅ stored in a 128-byte UB region set via SetDeqScale. The formula is as follows:
  $\text{[math]}$
For int32_t input, this API quantizes and converts the int32_t data to half data. Before using this API, call SetDeqScale to set the scale parameter.
. $\text{[math]}$

Prototype

Computation of the first n data elements of a tensor

        
             template <typename T, typename U, bool isVecDeq = true, bool halfBlock = true>
__aicore__ inline void CastDeq(const LocalTensor<T>& dst, const LocalTensor<U>& src, const uint32_t count)

High-dimensional tensor sharding computation

Bitwise mask mode

          
               template <typename T, typename U, bool isSetMask = true, bool isVecDeq = true, bool halfBlock = true>
__aicore__ inline void CastDeq(const LocalTensor<T>& dst, const LocalTensor<U>& src, const uint64_t mask[], uint8_t repeatTime, const UnaryRepeatParams& repeatParams)

Contiguous mask mode

          
               template <typename T, typename U, bool isSetMask = true, bool isVecDeq = true, bool halfBlock = true>
__aicore__ inline void CastDeq(const LocalTensor<T>& dst, const LocalTensor<U>& src, const int32_t mask, uint8_t repeatTime, const UnaryRepeatParams& repeatParams)

Parameters

**Table 1** Template parameters
Parameter	Description
T	Data type of the output tensor. For Atlas A3 training products / Atlas A3 inference products , the supported data types are int8_t, uint8_t, and half. For Atlas A2 training products / Atlas A2 inference products , the supported data types are int8_t, uint8_t, and half. For the Atlas inference product 's AI Core, the supported data types are int8_t and uint8_t. This parameter is used together with the input parameter signMode of the SetDeqScale API. When signMode is set to true, the output data type is int8_t. When signMode is set to false, the output data type is uint8_t.
U	Data type of the input tensor. For Atlas A3 training products / Atlas A3 inference products , the supported data types are int16_t and int32_t. For Atlas A2 training products / Atlas A2 inference products , the supported data types are int16_t and int32_t. For the Atlas inference product 's AI Core, the supported data type is int16_t.
isSetMask	Indicates whether to set mask inside the API. true: sets mask inside the API. false: sets mask outside the API. Developers need to use the SetVectorMask API to set the mask value. In this mode, the mask value in the input parameter of this API must be set to the placeholder MASK_PLACEHOLDER.
isVecDeq	Whether to enable vector quantization. This parameter must be used together with the SetDeqScale(const LocalTensor<T>& src) API. When a tensor is passed to SetDeqScale, isVecDeq must be set to true.
halfBlock	When the int16_t input is quantized and its precision is converted to int8_t or uint8_t, the halfBlock parameter specifies whether the output element is stored in the lower or upper half of the block. When halfBlock is set to true, the result is stored in the lower half of the block; when set to false, the result is stored in the upper half of the block, as shown in Figure 1.

**Table 2** API parameters
Parameter	Input/Output	Description
dst	Output	Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned.
src	Input	Source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned.
mask/mask[]	Input	mask is used to control the elements that participate in computation in each iteration. Bitwise mode: controls which elements are involved in computation bit by bit. A bit value of 1 means the corresponding element participates in computation, while 0 means it does not. The mask value is an array. The array length and the value range of the array elements are related to the operand data type. When the operand is 16-bit, the array length is 2, mask[0] and mask[1] ∈ [0, 2⁶⁴ -1] and cannot be 0 at the same time. When the operand is 32-bit, the array length is 1 and mask[0] ∈ (0, 2⁶⁴ – 1]. When the operand is 64-bit, the array length is 1 and mask[0] ∈ (0, 2³² – 1]. For example, if mask = [0, 8] and 8 = 0b1000, only the fourth element participates in computation. Contiguous mode: indicates the number of contiguous elements that participate in computation. The value range is related to the operand data type. The maximum number of elements that can be processed in each repeat varies according to the data type. When the operand is 16-bit, mask ∈ [1, 128]. When the operand is 32-bit, mask ∈ [1, 64]. When the operand is 64-bit, mask ∈ [1, 32]. When the number of bits of the source operand is different from that of the destination operand, the data type with more bytes is used for the computation. For example, when the source operand is of type int16_t and the destination operand is of type int8_t, the mask is calculated based on the int16_t type.
repeatTime	Input	Number of repeat iterations. The vector compute unit reads 256 bytes of contiguous data for computation each time. To process the input data, the data needs to be read and computed over multiple repeats. repeatTime indicates the number of repeats. For details about this parameter, see High-dimensional Sharding APIs.
repeatParams	Input	Parameters that control the operand address strides. This parameter is of the UnaryRepeatParams type, including the address stride of the same DataBlock between adjacent iterations of the operand and the address stride of different DataBlocks within the same iteration of the operand. For details about the address stride of the operand between adjacent iterations, see repeatStride. For details about the address stride of the operand between different data blocks in a single iteration, see dataBlockStride.
count	Input	Number of elements involved in the computation.

Figure 1 Description of halfBlock

Returns

None

Restrictions

For details about the operand address alignment requirements, see General Address Alignment Restrictions.
For details about the constraints on operand address overlapping, see General Address Overlapping Restrictions.

Example

To run the sample code, copy the code snippet and replace some code of the Compute function in Template Sample.

Example of high-dimensional sharding computation API - contiguous mask mode

        
             int32_t mask = 256 / sizeof(int16_t);
// repeatTime = 2, 128 elements one repeat, 256 elements total
// dstBlkStride, srcBlkStride = 1, no gap between blocks in one repeat
// dstRepStride, srcRepStride = 8, no gap between repeats
AscendC::CastDeq<uint8_t, int16_t, true, true, true>(dstLocal, srcLocal, mask, 2, { 1, 1, 8, 8 });

Example of high-dimensional sharding computation API - bitwise mask mode

        
             uint64_t mask[2] = { UINT64_MAX, UINT64_MAX };
// repeatTime = 2, 128 elements one repeat, 256 elements total
// dstBlkStride, srcBlkStride = 1, no gap between blocks in one repeat
// dstRepStride, srcRepStride = 8, no gap between repeats
AscendC::CastDeq<uint8_t, int16_t, true, true, true>(dstLocal, srcLocal, mask, 2, { 1, 1, 8, 8 });

Example of first n pieces of tensor data computation API

        
             AscendC::CastDeq<uint8_t, int16_t, true, true>(dstLocal, srcLocal, 256);

Result example:

Input (srcLocal):
[20 53 26 12 36  6 20 93 66 30 56 99 59 92  7 37 22 47 98 10 85 29 14 46
 17 34 45 17 25 45 82 17 66 94 68 23 67  8 89  8 92  6 10 80 87 20  9 81
 70 62 11 58 38 83 32 14 38 47 41 63 94 26 96 89 88 35 86 55 60 82 15 65
 92 67 83 23 63 25 85 93 50 91 75 60 80 10 55 20 71 14 67 23 31 63  7 93
 69 45 61 23 43 86 11 81 81 36 76 58 53 25 23 51 59 78 82 10 39 40 24 50
 68 49 79 40  4 53 22 38 45 17 29 54  9 66 98 47 12 47 47 20 98  0 59 77
  1 21 39 70 66 20 68  8 77 77 54  0  3 33 37 37 48 60 83 88 27 70 31 49
 75 21 59  3 99 84 92 84 14 44 26 56 72 56 37 52 39 11  2 59 59 65 71 64
 10 65 62 48 42 79 69 69 27 99  8 38 36 77 34 34 60 50 52 50 41 31 95 68
 27 16 42 64 19 47  0 10 36 36 33 62 98 64 32 81 49 53 27 70 35  9 63  7
 10 89  3 39 94 23 89 16 23 60 71 42 46 58 65 90]
Output (dstLocal):
[ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 20 53 26 12 36  6 20 93
 66 30 56 99 59 92  7 37  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 22 47 98 10 85 29 14 46 17 34 45 17 25 45 82 17  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0 66 94 68 23 67  8 89  8 92  6 10 80 87 20  9 81
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 70 62 11 58 38 83 32 14
 38 47 41 63 94 26 96 89  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 88 35 86 55 60 82 15 65 92 67 83 23 63 25 85 93  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0 50 91 75 60 80 10 55 20 71 14 67 23 31 63  7 93
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 69 45 61 23 43 86 11 81
 81 36 76 58 53 25 23 51  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 59 78 82 10 39 40 24 50 68 49 79 40  4 53 22 38  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0 45 17 29 54  9 66 98 47 12 47 47 20 98  0 59 77
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  1 21 39 70 66 20 68  8
 77 77 54  0  3 33 37 37  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 48 60 83 88 27 70 31 49 75 21 59  3 99 84 92 84  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0 14 44 26 56 72 56 37 52 39 11  2 59 59 65 71 64
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10 65 62 48 42 79 69 69
 27 99  8 38 36 77 34 34  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
 60 50 52 50 41 31 95 68 27 16 42 64 19 47  0 10  0  0  0  0  0  0  0  0
  0  0  0  0  0  0  0  0 36 36 33 62 98 64 32 81 49 53 27 70 35  9 63  7
  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0 10 89  3 39 94 23 89 16
 23 60 71 42 46 58 65 90]

Template Sample

This section provides a template sample to help you quickly run reference instruction samples.

       
        
          
          
            #include "kernel_operator.h"
template <typename srcType, typename dstType>
class KernelCastDeq {
public:
    __aicore__ inline KernelCastDeq() {}
    __aicore__ inline void Init(GM_ADDR src_gm, GM_ADDR dst_gm, uint32_t inputSize, bool halfBlock, bool isVecDeq)
    {
        srcSize = inputSize;
        dstSize = inputSize * 2;
        this->halfBlock = halfBlock;
        this->isVecDeq = isVecDeq;
        src_global.SetGlobalBuffer(reinterpret_cast<__gm__ srcType*>(src_gm), srcSize);
        dst_global.SetGlobalBuffer(reinterpret_cast<__gm__ dstType*>(dst_gm), dstSize);
        pipe.InitBuffer(inQueueX, 1, srcSize * sizeof(srcType));
        pipe.InitBuffer(outQueue, 1, dstSize * sizeof(dstType));
        pipe.InitBuffer(tmpQueue, 1, 128);
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<srcType> srcLocal = inQueueX.AllocTensor<srcType>();
        AscendC::DataCopy(srcLocal, src_global, srcSize);
        inQueueX.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<dstType> dstLocal = outQueue.AllocTensor<dstType>();
        AscendC::LocalTensor<uint64_t> tmpBuffer = tmpQueue.AllocTensor<uint64_t>();
        AscendC::Duplicate(tmpBuffer.ReinterpretCast<int32_t>(), static_cast<int32_t>(0), 32);
        AscendC::PipeBarrier<PIPE_V>();
        AscendC::Duplicate<int32_t>(dstLocal.template ReinterpretCast<int32_t>(), static_cast<int32_t>(0), dstSize / sizeof(int32_t));
        AscendC::PipeBarrier<PIPE_ALL>();
        bool signMode = false;
        if constexpr (AscendC::Std::is_same<dstType, int8_t>::value) {
            signMode = true;
        }
        AscendC::LocalTensor<srcType> srcLocal = inQueueX.DeQue<srcType>();
        if (halfBlock) {
            if (isVecDeq) {
                float vdeqScale[16] = { 1.0 };
                int16_t vdeqOffset[16] = { 0 };
                bool vdeqSignMode[16] = { signMode };
                AscendC::VdeqInfo vdeqInfo(vdeqScale, vdeqOffset, vdeqSignMode);
                AscendC::SetDeqScale(tmpBuffer, vdeqInfo);
                AscendC::CastDeq<dstType, srcType, true, true>(dstLocal, srcLocal, srcSize);
            } else {
                float scale = 1.0;
                int16_t offset = 0;
                AscendC::SetDeqScale(scale, offset, signMode);
                AscendC::CastDeq<dstType, srcType, false, true>(dstLocal, srcLocal, srcSize);
            }
        } else {
            if (isVecDeq) {
                float vdeqScale[16] = { 1.0 };
                int16_t vdeqOffset[16] = { 0 };
                bool vdeqSignMode[16] = { signMode };
                AscendC::VdeqInfo vdeqInfo(vdeqScale, vdeqOffset, vdeqSignMode);
                AscendC::SetDeqScale(tmpBuffer, vdeqInfo);
                AscendC::CastDeq<dstType, srcType, true, false>(dstLocal, srcLocal, srcSize);
            } else {
                float scale = 1.0;
                int16_t offset = 0;
                AscendC::SetDeqScale(scale, offset, signMode);
                AscendC::CastDeq<dstType, srcType, false, false>(dstLocal, srcLocal, srcSize);
            }
        }
        outQueue.EnQue<dstType>(dstLocal);
        tmpQueue.FreeTensor(tmpBuffer);
        inQueueX.FreeTensor(srcLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<dstType> dstLocal = outQueue.DeQue<dstType>();
        AscendC::DataCopy(dst_global, dstLocal, dstSize);
        outQueue.FreeTensor(dstLocal);
    }
private:
    AscendC::GlobalTensor<srcType> src_global;
    AscendC::GlobalTensor<dstType> dst_global;
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueX;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> tmpQueue;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueue;
    bool halfBlock = false;
    bool isVecDeq = false;
    uint32_t srcSize = 0;
    uint32_t dstSize = 0;
};
template <typename srcType, typename dstType>
__aicore__ void kernel_cast_deqscale_operator(GM_ADDR src_gm, GM_ADDR dst_gm, uint32_t dataSize, bool halfBlock, bool isVecDeq)
{
    KernelCastDeq<srcType, dstType> op;
    op.Init(src_gm, dst_gm, dataSize, halfBlock, isVecDeq);
    op.Process();
}
extern "C" __global__ __aicore__ void kernel_cast_deqscale_operator_256_int16_t_uint8_t_true_true(GM_ADDR src_gm, GM_ADDR dst_gm)
{
    kernel_cast_deqscale_operator<int16_t, uint8_t>(src_gm, dst_gm, 256, true, true);
}

           

         

       
      

Parent topic: Compound Computation