AscendAntiQuant

Function Description

Performs element-wise fake quantization (anti-quantization), for example, converting int8_t data to the half data type. The calculation formulas are as follows:

  • Per_channel scenario
    • Input transpose disabled

      groupSize = src.shape[0] / offset.shape[0]

      dst[i][j] = scale[i / groupSize][j] * (src[i][j] + offset[i / groupSize][j])

    • Input transpose enabled

      groupSize = src.shape[1] / offset.shape[1]

      dst[i][j] = scale[i][j / groupSize] * (src[i][j] + offset[i][j / groupSize])

  • Per_tensor scenario

    dst[i][j] = scale * (src[i][j] + offset)
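
The per_channel formula above (transpose disabled) can be illustrated with a host-side reference sketch. This is not the device API; it is a plain C++ approximation for checking expected values, with half arithmetic approximated by float and tensors flattened to row-major vectors (the function name and layout are illustrative assumptions):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Per_channel, transpose disabled: offset/scale have offsetRows rows and
// are indexed by row group, groupSize = src.shape[0] / offset.shape[0].
std::vector<float> AntiQuantPerChannelRef(const std::vector<int8_t>& src,
                                          const std::vector<float>& offset,
                                          const std::vector<float>& scale,
                                          int rows, int cols, int offsetRows) {
    int groupSize = rows / offsetRows;
    std::vector<float> dst(rows * cols);
    for (int i = 0; i < rows; ++i) {
        for (int j = 0; j < cols; ++j) {
            // dst[i][j] = scale[i / groupSize][j] * (src[i][j] + offset[i / groupSize][j])
            dst[i * cols + j] =
                scale[(i / groupSize) * cols + j] *
                (static_cast<float>(src[i * cols + j]) + offset[(i / groupSize) * cols + j]);
        }
    }
    return dst;
}
```

With src filled with 1, offset with 2, and scale with 3, every output element is 3 * (1 + 2) = 9, matching the result example at the end of this page.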

Principles

Figure 1 AscendAntiQuant algorithm block diagram

The preceding figure shows the algorithm block diagram of AscendAntiQuant in typical scenarios. The computation process is divided into the following steps, all of which are performed on the Vector Core:

  1. Precision conversion: Convert the input src to the half type.
  2. Offset calculation: Perform Add calculation when offset is a vector, and perform Adds calculation when offset is a scalar.
  3. Scale calculation: Perform Mul calculation when scale is a vector, and perform Muls calculation when scale is a scalar.

Prototype

  • Pass the temporary space through the sharedTmpBuffer input parameter.
    • Per_channel scenario
      template <typename InputDataType, typename OutputDataType, bool isTranspose>
      __aicore__ inline void AscendAntiQuant(const LocalTensor<OutputDataType> &dst, const LocalTensor<InputDataType> &src, const LocalTensor<OutputDataType> &offset, const LocalTensor<OutputDataType> &scale, const LocalTensor<uint8_t> &sharedTmpBuffer, const uint32_t K, const AntiQuantShapeInfo& shapeInfo = {})
      
    • Per_channel scenario (without offset)
      template <typename InputDataType, typename OutputDataType, bool isTranspose>
      __aicore__ inline void AscendAntiQuant(const LocalTensor<OutputDataType> &dst, const LocalTensor<InputDataType> &src, const LocalTensor<OutputDataType> &scale, const LocalTensor<uint8_t> &sharedTmpBuffer, const uint32_t K, const AntiQuantShapeInfo& shapeInfo = {})
      
    • Per_tensor scenario
      template <typename InputDataType, typename OutputDataType, bool isTranspose>
      __aicore__ inline void AscendAntiQuant(const LocalTensor<OutputDataType> &dst, const LocalTensor<InputDataType> &src, const OutputDataType offset, const OutputDataType scale, const LocalTensor<uint8_t> &sharedTmpBuffer, const uint32_t K, const AntiQuantShapeInfo& shapeInfo = {})
      
    • Per_tensor scenario (without offset)
      template <typename InputDataType, typename OutputDataType, bool isTranspose>
      __aicore__ inline void AscendAntiQuant(const LocalTensor<OutputDataType> &dst, const LocalTensor<InputDataType> &src, const OutputDataType scale, const LocalTensor<uint8_t> &sharedTmpBuffer, const uint32_t K, const AntiQuantShapeInfo& shapeInfo = {})
      
  • Allocate the temporary space through the API framework.
    • Per_channel scenario
      template <typename InputDataType, typename OutputDataType, bool isTranspose>
      __aicore__ inline void AscendAntiQuant(const LocalTensor<OutputDataType> &dst, const LocalTensor<InputDataType> &src, const LocalTensor<OutputDataType> &offset, const LocalTensor<OutputDataType> &scale, const uint32_t K, const AntiQuantShapeInfo& shapeInfo = {})
      
    • Per_tensor scenario
      template <typename InputDataType, typename OutputDataType, bool isTranspose>
      __aicore__ inline void AscendAntiQuant(const LocalTensor<OutputDataType> &dst, const LocalTensor<InputDataType> &src, const OutputDataType offset, const OutputDataType scale, const uint32_t K, const AntiQuantShapeInfo& shapeInfo = {})
      

Due to the complex mathematical computation involved in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. The temporary space can be allocated through the API framework or passed by developers through the sharedTmpBuffer input parameter.

  • When the API framework allocates the temporary space, developers do not need to allocate it themselves, but must reserve enough space for the framework to use.
  • When the temporary space is passed through the sharedTmpBuffer input parameter, that tensor serves as the temporary space and the API framework performs no allocation. Developers can then manage the sharedTmpBuffer space themselves and reuse the buffer across API calls, so that it is not repeatedly allocated and deallocated, improving flexibility and buffer utilization.

If the API framework is used, developers must reserve the temporary space. If sharedTmpBuffer is used, developers must allocate space for sharedTmpBuffer. To obtain the size of the temporary space (BufferSize) to be reserved, use the API provided in GetAscendAntiQuantMaxMinTmpSize.

Parameters

Table 1 Parameters in the template

Parameter

Description

InputDataType

Input data type.

OutputDataType

Output data type.

isTranspose

Whether to enable input data transpose.

Table 2 API parameters

Parameter

Input/Output

Description

dst

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

src

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

offset

Input

Offset when the input data is dequantized.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

scale

Input

Scaling factor when the input data is dequantized.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

sharedTmpBuffer

Input

Temporary buffer.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

For details about how to obtain the temporary space size (BufferSize), see GetAscendAntiQuantMaxMinTmpSize.

K

Input

If isTranspose is set to true, the shape of src is [N, K]; if it is set to false, the shape is [K, N]. This parameter passes that K value.

shapeInfo

Input

Shape information of offset and scale. This parameter is configured only in the per_channel scenario.

This parameter is optional. In the per_channel scenario, if this parameter is not specified or the data in the struct is set to 0, the shape information of offset and scale is obtained from ShapeInfo.

The AntiQuantShapeInfo type is defined as follows:

struct AntiQuantShapeInfo {
    uint32_t offsetHeight{0};  // Offset height
    uint32_t offsetWidth{0};  // Offset width
    uint32_t scaleHeight{0};  // Scale height
    uint32_t scaleWidth{0};  // Scale width
};
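
For a non-transposed per_channel call where offset and scale both have shape [1, N] (as in the example below), the struct is filled as {1, N, 1, N}. A minimal sketch, with the struct definition copied from above and a hypothetical helper name:

```cpp
#include <cassert>
#include <cstdint>

// Definition copied from the documentation above.
struct AntiQuantShapeInfo {
    uint32_t offsetHeight{0};  // Offset height
    uint32_t offsetWidth{0};   // Offset width
    uint32_t scaleHeight{0};   // Scale height
    uint32_t scaleWidth{0};    // Scale width
};

// Builds the shape info for offset/scale of shape [1, n] (illustrative helper).
AntiQuantShapeInfo MakeRowVectorShapeInfo(uint32_t n) {
    return AntiQuantShapeInfo{1, n, 1, n};
}
```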

Returns

None

Availability

Constraints

  • The source operand address must not overlap the destination operand address.
  • For details about the alignment requirements of the operand address offset, see General Restrictions.
  • The length of the data involved in computation of the input and output operands must be 32-byte aligned.
  • When input transpose is enabled, K must be 32-byte aligned.
  • Before calling the API, ensure that the size of the input data is correct, and the size and shape of offset and scale are correct.
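
The 32-byte alignment constraints above can be checked on the host before launching the kernel. A small sketch (the helper name is an illustrative assumption):

```cpp
#include <cassert>
#include <cstdint>

// Returns true when elementCount elements of elementSize bytes occupy a
// multiple of 32 bytes, matching the 32-byte alignment constraint.
constexpr bool IsBlockAligned(uint32_t elementCount, uint32_t elementSize) {
    return (elementCount * elementSize) % 32u == 0u;
}
```

For example, 64 int8_t elements (64 bytes) are aligned, while 3 half elements (6 bytes) are not.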

Example

#include "kernel_operator.h"

template <typename InputType, typename OutType>
class AntiQuantTest
{
public:
    __aicore__ inline AntiQuantTest() {}
    __aicore__ inline void Init(GM_ADDR dstGm, GM_ADDR srcGm, GM_ADDR offsetGm, GM_ADDR scaleGm,
                                uint32_t elementCountOfInput, uint32_t elementCountOfOffset, uint32_t K)
    {
        this->elementCountOfInput = elementCountOfInput;
        this->elementCountOfOffset = elementCountOfOffset;
        k = K;

        dstGlobal.SetGlobalBuffer((__gm__ OutType *)dstGm);
        srcGlobal.SetGlobalBuffer((__gm__ InputType *)srcGm);
        offsetGlobal.SetGlobalBuffer((__gm__ OutType *)offsetGm);
        scaleGlobal.SetGlobalBuffer((__gm__ OutType *)scaleGm);

        pipe.InitBuffer(queInSrc, 1, elementCountOfInput * sizeof(InputType));
        pipe.InitBuffer(queInOffset, 1, elementCountOfOffset * sizeof(OutType));
        pipe.InitBuffer(queInScale, 1, elementCountOfOffset * sizeof(OutType));
        pipe.InitBuffer(queOut, 1, elementCountOfInput * sizeof(OutType));
        pipe.InitBuffer(queTmp, 1, 67584); // temporary space size; obtain the required size via GetAscendAntiQuantMaxMinTmpSize
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<InputType> srcLocal = queInSrc.AllocTensor<InputType>();
        AscendC::DataCopy(srcLocal, srcGlobal, elementCountOfInput);
        queInSrc.EnQue(srcLocal);

        AscendC::LocalTensor<OutType> offsetLocal = queInOffset.AllocTensor<OutType>();
        AscendC::DataCopy(offsetLocal, offsetGlobal, elementCountOfOffset);
        queInOffset.EnQue(offsetLocal);

        AscendC::LocalTensor<OutType> scaleLocal = queInScale.AllocTensor<OutType>();
        AscendC::DataCopy(scaleLocal, scaleGlobal, elementCountOfOffset);
        queInScale.EnQue(scaleLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<InputType> srcLocal = queInSrc.DeQue<InputType>();
        AscendC::LocalTensor<OutType> offsetLocal = queInOffset.DeQue<OutType>();
        AscendC::LocalTensor<OutType> scaleLocal = queInScale.DeQue<OutType>();
        AscendC::LocalTensor<OutType> dstLocal = queOut.AllocTensor<OutType>();
        AscendC::LocalTensor<uint8_t> sharedTmpBuffer = queTmp.AllocTensor<uint8_t>();

        AscendC::AntiQuantShapeInfo shapeInfo = {1, elementCountOfOffset, 1, elementCountOfOffset};
        AscendC::AscendAntiQuant<InputType, OutType, false>(dstLocal, srcLocal, offsetLocal, scaleLocal, sharedTmpBuffer, k, shapeInfo);
        queInSrc.FreeTensor(srcLocal);
        queInOffset.FreeTensor(offsetLocal);
        queInScale.FreeTensor(scaleLocal);
        queTmp.FreeTensor(sharedTmpBuffer);
        queOut.EnQue(dstLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<OutType> dstLocal = queOut.DeQue<OutType>();
        AscendC::DataCopy(dstGlobal, dstLocal, elementCountOfInput);
        queOut.FreeTensor(dstLocal);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> queInSrc;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> queInOffset;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> queInScale;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> queTmp;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> queOut;
    AscendC::GlobalTensor<OutType> dstGlobal;
    AscendC::GlobalTensor<InputType> srcGlobal;
    AscendC::GlobalTensor<OutType> offsetGlobal;
    AscendC::GlobalTensor<OutType> scaleGlobal;
    uint32_t elementCountOfInput;
    uint32_t elementCountOfOffset;
    uint32_t k;
}; // class AntiQuantTest

extern "C" __global__ __aicore__ void kernel_anti_quant(GM_ADDR dst, GM_ADDR src, GM_ADDR offset, GM_ADDR scale,
                                                        uint32_t elementCountOfInput, uint32_t elementCountOfOffset, uint32_t K)
{
    AntiQuantTest<int8_t, half> op;
    op.Init(dst, src, offset, scale, elementCountOfInput, elementCountOfOffset, K);
    op.Process();
}
Result example:
Input data src (shape: [2, 64], non-transpose scenario):
[1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
offset (shape: [1, 64]):
[2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.
 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2. 2.]
scale (shape: [1, 64]):
[3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.
 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3. 3.]
Output data dstLocal (shape: [2, 64]):
[9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9.
 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9.
 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9.
 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9.
 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9. 9.
 9. 9. 9. 9. 9. 9. 9. 9.]