AscendAntiQuant

Applicability

Product	Supported
Atlas A3 training products / Atlas A3 inference products	√
Atlas A2 training products / Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	x
Atlas inference product 's AI Core	√
Atlas inference product 's Vector Core	x
Atlas training products	x

Function

Performs fake quantization by element. For example, apply fake quantization to convert the int8_t data type to the half data type. The formulas are as follows:

PER_CHANNEL scenario (quantization by channel)
- Input transposing disabled
  groupSize = src.shape[0]/offset.shape[0]
  
  dst[i][j] = scale[i/groupSize][j] x (src[i][j] + offset[i/groupSize][j])
- Input transposing enabled
  groupSize = src.shape[1]/offset.shape[1]
  
  dst[i][j] = scale[i][j/groupSize] x (src[i][j] + offset[i][j/groupSize])

PER_TENSOR scenario (quantization by tensor)
dst[i][j] = scale * (src[i][j] + offset)l

Principles

Figure 1 AscendAntiQuant algorithm block diagram

The preceding figure shows the algorithm block diagram of AscendAntiQuant in typical scenarios. The computation process is divided into the following steps, all of which are performed on vectors:

Precision conversion: Convert the input src to the half type.
Offset calculation: Perform Add calculation when offset is a vector, and perform Adds calculation when offset is a scalar.
Scale calculation: Perform Mul calculation when scale is a vector, and perform Muls calculation when scale is a scalar.

Figure 2 AscendAntiQuant algorithm block diagram when isTranspose is False and the output is of the bfloat16 type

On the Atlas A2 training products / Atlas A2 inference products , when the output is of the bfloat16 type, the computation process is divided into the following steps:

src precision conversion: Convert the input src to the half type, and then convert it to the float type and store it in tmp1.
Offset precision conversion: When the input offset is a vector, it is converted to the float type and stored in tmp2. When the input offset is a scalar, it is converted to the float type through ToFloat.
Offset calculation: When the input offset is a vector, perform the Add calculation with tmp2. When it is a scalar, perform the Adds calculation.
Scale precision conversion: When the input scale is a vector, it is converted to the float type and stored in tmp2. When the input scale is a scalar, it is converted to the float type through ToFloat.
Scale calculation: When the input scale is a vector, perform the Mul calculation with tmp2. When it is a scalar, perform the Muls calculation.
dst precision conversion: Convert tmp1 to the bf16 type.

Prototype

Pass to the temporary space through the sharedTmpBuffer input parameter.

PER_CHANNEL scenario (quantization by channel)

          
               template <typename InputDataType, typename OutputDataType, bool isTranspose>
__aicore__ inline void AscendAntiQuant(const LocalTensor<OutputDataType>& dst, const LocalTensor<InputDataType>& src, const LocalTensor<OutputDataType>& offset, const LocalTensor<OutputDataType>& scale, const LocalTensor<uint8_t>& sharedTmpBuffer, const uint32_t k, const AntiQuantShapeInfo& shapeInfo = {})

PER_CHANNEL scenario (quantization by channel without offset)

          
               template <typename InputDataType, typename OutputDataType, bool isTranspose>
__aicore__ inline void AscendAntiQuant(const LocalTensor<OutputDataType>& dst, const LocalTensor<InputDataType>& src, const LocalTensor<OutputDataType>& scale, const LocalTensor<uint8_t>& sharedTmpBuffer, const uint32_t k, const AntiQuantShapeInfo& shapeInfo = {})

PER_TENSOR scenario (quantization by tensor)

          
               template <typename InputDataType, typename OutputDataType, bool isTranspose>
__aicore__ inline void AscendAntiQuant(const LocalTensor<OutputDataType>& dst, const LocalTensor<InputDataType>& src, const OutputDataType offset, const OutputDataType scale, const LocalTensor<uint8_t>& sharedTmpBuffer, const uint32_t k, const AntiQuantShapeInfo& shapeInfo = {})

PER_TENSOR scenario (quantization by tensor without offset)

          
               template <typename InputDataType, typename OutputDataType, bool isTranspose>
__aicore__ inline void AscendAntiQuant(const LocalTensor<OutputDataType> &dst, const LocalTensor<InputDataType> &src, const OutputDataType scale, const LocalTensor<uint8_t> &sharedTmpBuffer, const uint32_t k, const AntiQuantShapeInfo& shapeInfo = {})

Allocate the temporary space through the API framework.

PER_CHANNEL scenario

          
               template <typename InputDataType, typename OutputDataType, bool isTranspose>
__aicore__ inline void AscendAntiQuant(const LocalTensor<OutputDataType>& dst, const LocalTensor<InputDataType>& src, const LocalTensor<OutputDataType>& offset, const LocalTensor<OutputDataType>& scale, const uint32_t k, const AntiQuantShapeInfo& shapeInfo = {})

PER_TENSOR scenario

          
               template <typename InputDataType, typename OutputDataType, bool isTranspose>
__aicore__ inline void AscendAntiQuant(const LocalTensor<OutputDataType>& dst, const LocalTensor<InputDataType>& src, const OutputDataType offset, const OutputDataType scale, const uint32_t k, const AntiQuantShapeInfo& shapeInfo = {})

Due to the complex mathematical computation involved in the internal implementation of this API, extra temporary space is required to store intermediate variables generated during computation. The temporary space can be allocated through the API framework or passed by developers through the sharedTmpBuffer input parameter.

When the API framework is used for temporary space allocation, you do not need to allocate the space, but must reserve the required size for the temporary space.

When the sharedTmpBuffer input parameter is used for passing the temporary space, the tensor serves as the temporary space. In this case, the API framework is not required for temporary space allocation. This enables developers to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization.

If the API framework is used, developers must reserve the temporary space. If sharedTmpBuffer is used, developers must allocate space for sharedTmpBuffer. To obtain the size of the temporary space (BufferSize) to be reserved, use the API provided in GetAscendAntiQuantMaxMinTmpSize.

Parameters

**Table 1** Template parameters
Parameter	Description
InputDataType	Input data type.
OutputDataType	Output data type.
isTranspose	Whether to enable input data transposing.

Table 2 API parameters

Parameter

Input/Output

Description

dst

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

For the Atlas A3 training products / Atlas A3 inference products , the supported data types are half and bfloat16_t.

For the Atlas A2 training products / Atlas A2 inference products , the supported data types are half and bfloat16_t.

For the Atlas inference product 's AI Core, the supported data type is half.

src

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

For the Atlas A3 training products / Atlas A3 inference products , the supported data types are int8_t and int4b_t.

For the Atlas A2 training products / Atlas A2 inference products , the supported data types are int8_t and int4b_t.

For the Atlas inference product 's AI Core, the supported data type is int8_t.

offset

Input

Offset when the input data is dequantized.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

For the Atlas A3 training products / Atlas A3 inference products , the supported data types are half and bfloat16_t.

For the Atlas A2 training products / Atlas A2 inference products , the supported data types are half and bfloat16_t.

For the Atlas inference product 's AI Core, the supported data type is half.

scale

Input

Scaling factor when the input data is dequantized.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

For the Atlas A3 training products / Atlas A3 inference products , the supported data types are half and bfloat16_t.

For the Atlas A2 training products / Atlas A2 inference products , the supported data types are half and bfloat16_t.

For the Atlas inference product 's AI Core, the supported data type is half.

sharedTmpBuffer

Input

Temporary buffer.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

For details about how to obtain the temporary space size (BufferSize), see GetAscendAntiQuantMaxMinTmpSize.

Input

If isTranspose is set to true, the shape of src is [N, K]. If isTranspose is set to false, the shape of src is [K, N].

The parameter k corresponds to the K value.

shapeInfo

Input

Shape information of offset and scale. This parameter is configured only in the PER_CHANNEL scenario.

This parameter is optional. In the PER_CHANNEL scenario, if this parameter is not specified or the data in the struct is set to 0, the shape information of offset and scale is obtained from ShapeInfo.

The AntiQuantShapeInfo type is defined as follows:

           
                struct AntiQuantShapeInfo {
    uint32_t offsetHeight{0};  // Offset height
    uint32_t offsetWidth{0};  // Offset width
    uint32_t scaleHeight{0};  // Scale height
    uint32_t scaleWidth{0};  // Scale width
};

Returns

None

Restrictions

The source operand address must not overlap the destination operand address.
For details about the operand address alignment requirements, see General Address Alignment Restrictions.
The length of the data involved in computation of the input and output operands must be 32-byte aligned.
When inputs are transposed, k must be 32-byte aligned.
Before calling the API, ensure that the size of the input data is correct, and the size and shape of offset and scale are correct.

Example

      
       
         
         
           #include "kernel_operator.h"

template <typename InputType, typename OutType>
class AntiQuantTest
{
public:
    __aicore__ inline AntiQuantTest() {}
    __aicore__ inline void Init(GM_ADDR dstGm, GM_ADDR srcGm, GM_ADDR offsetGm, GM_ADDR scaleGm,
                                uint32_t elementCountOfInput, uint32_t elementCountOfOffset, uint32_t K)
    {
        elementCountOfInput = elementCountOfInput;
        elementCountOfOffset = elementCountOfOffset;
        k = K;

        dstGlobal.SetGlobalBuffer((__gm__ OutType *)dstGm);
        srcGlobal.SetGlobalBuffer((__gm__ InputType *)srcGm);
        offsetGlobal.SetGlobalBuffer((__gm__ OutType *)offsetGm);
        scaleGlobal.SetGlobalBuffer((__gm__ OutType *)scaleGm);

        pipe.InitBuffer(queInSrc, 1, elementCountOfInput * sizeof(InputType));
        pipe.InitBuffer(queInOffset, 1, elementCountOfOffset * sizeof(OutType));
        pipe.InitBuffer(queInScale, 1, elementCountOfOffset * sizeof(OutType));
        pipe.InitBuffer(queOut, 1, elementCountOfInput * sizeof(OutType));
        pipe.InitBuffer(queTmp, 1, 67584);
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<InputType> srcLocal = queInSrc.AllocTensor<InputType>();
        AscendC::DataCopy(srcLocal, srcGlobal, elementCountOfInput);
        queInSrc.EnQue(srcLocal);

        AscendC::LocalTensor<OutType> offsetLocal = queInOffset.AllocTensor<OutType>();
        AscendC::DataCopy(offsetLocal, offsetGlobal, elementCountOfOffset);
        queInOffset.EnQue(offsetLocal);

        AscendC::LocalTensor<OutType> scaleLocal = queInScale.AllocTensor<OutType>();
        AscendC::DataCopy(scaleLocal, scaleGlobal, elementCountOfOffset);
        queInScale.EnQue(scaleLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<InputType> srcLocal = queInSrc.DeQue<InputType>();
        AscendC::LocalTensor<OutType> offsetLocal = queInOffset.DeQue<OutType>();
        AscendC::LocalTensor<OutType> scaleLocal = queInScale.DeQue<OutType>();
        AscendC::LocalTensor<OutType> dstLocal = queOut.AllocTensor<OutType>();
        AscendC::LocalTensor<uint8_t> sharedTmpBuffer = queTmp.AllocTensor<uint8_t>();

        AscendC::AntiQuantShapeInfo shapeInfo = {1, elementCountOfOffset, 1, elementCountOfOffset};
        AscendC::AscendAntiQuant<InputType, OutType, false>(dstLocal, srcLocal, offsetLocal, scaleLocal, sharedTmpBuffer, k, shapeInfo);
        queInSrc.FreeTensor(srcLocal);
        queInOffset.FreeTensor(offsetLocal);
        queInScale.FreeTensor(scaleLocal);
        queTmp.FreeTensor(sharedTmpBuffer);
        queOut.EnQue(dstLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<OutType> dstLocal = queOut.DeQue<OutType>();
        AscendC::DataCopy(dstGlobal, dstLocal, elementCountOfInput);
        queOut.FreeTensor(dstLocal);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> queInSrc;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> queInOffset;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> queInScale;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> queTmp;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> queOut;
    AscendC::GlobalTensor<OutType> dstGlobal;
    AscendC::GlobalTensor<InputType> srcGlobal;
    AscendC::GlobalTensor<OutType> offsetGlobal;
    AscendC::GlobalTensor<OutType> scaleGlobal;
    uint32_t elementCountOfInput;
    uint32_t elementCountOfOffset;
    uint32_t k;
}; // class AntiQuantTest

extern "C" __global__ __aicore__ void kernel_anti_quant(GM_ADDR dst, GM_ADDR src, GM_ADDR offset, GM_ADDR scale,
                                                        uint32_t elementCountOfInput, uint32_t elementCountOfOffset, uint32_t K)
{
    AscendC::AntiQuantTest<InputType, OutType> op;
    op.Init(dst, src, offset, scale, elementCountOfInput, elementCountOfOffset, K);
    op.Process();
}

          

        

      
     

Result example:

       
        
          
          
            Input data src (shape: [2, 64], non-transposing scenario):
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
offset (shape: [1, 64]):
[2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
scale (shape: [1, 64]):
[3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.]
Output data dstLocal (shape: [2, 64]):
[9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9.
 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9.
 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9.
 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9.
 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9.
 9. 9. 9. 9. 9. 9. 9. 9.]

           

         

       
      

Parent topic: Quantization