AscendAntiQuant
Function Description
Performs fake quantization element by element, for example, applying fake quantization to convert int8_t input data to half output data. The calculation formulas are as follows:
- Per_channel scenario: dst(i, j) = (src(i, j) + offset(j)) * scale(j)
- Per_tensor scenario: dst(i, j) = (src(i, j) + offset) * scale
Principles

The algorithm block diagram of AscendAntiQuant in typical scenarios divides the computation into the following steps, all of which are performed on the Vector Core:
- Precision conversion: Convert the input src to the half type.
- Offset calculation: Perform Add calculation when offset is a vector, and perform Adds calculation when offset is a scalar.
- Scale calculation: Perform Mul calculation when scale is a vector, and perform Muls calculation when scale is a scalar.
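
These steps map directly onto basic Ascend C vector APIs. The following is a minimal sketch of the per_tensor flow under stated assumptions: srcLocal (int8_t), tmpHalf and dstLocal (half) are pre-allocated LocalTensors of `count` elements, and offsetVal/scaleVal are half scalars; the names are illustrative, and the real implementation additionally manages tiling and temporary buffers internally.

```cpp
// Minimal sketch of the per_tensor computation steps (illustrative only).
AscendC::Cast(tmpHalf, srcLocal, AscendC::RoundMode::CAST_NONE, count); // 1. precision conversion: int8_t -> half
AscendC::Adds(tmpHalf, tmpHalf, offsetVal, count);                      // 2. offset calculation (scalar offset -> Adds)
AscendC::Muls(dstLocal, tmpHalf, scaleVal, count);                      // 3. scale calculation (scalar scale -> Muls)
```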
Prototype
- Pass the temporary space through the sharedTmpBuffer input parameter.
- Per_channel scenario

  ```cpp
  template <typename InputDataType, typename OutputDataType, bool isTranspose>
  __aicore__ inline void AscendAntiQuant(const LocalTensor<OutputDataType> &dst, const LocalTensor<InputDataType> &src,
      const LocalTensor<OutputDataType> &offset, const LocalTensor<OutputDataType> &scale,
      const LocalTensor<uint8_t> &sharedTmpBuffer, const uint32_t K, const AntiQuantShapeInfo& shapeInfo = {})
  ```
- Per_channel scenario (without offset)

  ```cpp
  template <typename InputDataType, typename OutputDataType, bool isTranspose>
  __aicore__ inline void AscendAntiQuant(const LocalTensor<OutputDataType> &dst, const LocalTensor<InputDataType> &src,
      const LocalTensor<OutputDataType> &scale, const LocalTensor<uint8_t> &sharedTmpBuffer,
      const uint32_t K, const AntiQuantShapeInfo& shapeInfo = {})
  ```
- Per_tensor scenario

  ```cpp
  template <typename InputDataType, typename OutputDataType, bool isTranspose>
  __aicore__ inline void AscendAntiQuant(const LocalTensor<OutputDataType> &dst, const LocalTensor<InputDataType> &src,
      const OutputDataType offset, const OutputDataType scale, const LocalTensor<uint8_t> &sharedTmpBuffer,
      const uint32_t K, const AntiQuantShapeInfo& shapeInfo = {})
  ```
- Per_tensor scenario (without offset)

  ```cpp
  template <typename InputDataType, typename OutputDataType, bool isTranspose>
  __aicore__ inline void AscendAntiQuant(const LocalTensor<OutputDataType> &dst, const LocalTensor<InputDataType> &src,
      const OutputDataType scale, const LocalTensor<uint8_t> &sharedTmpBuffer, const uint32_t K,
      const AntiQuantShapeInfo& shapeInfo = {})
  ```
- Allocate the temporary space through the API framework.
- Per_channel scenario

  ```cpp
  template <typename InputDataType, typename OutputDataType, bool isTranspose>
  __aicore__ inline void AscendAntiQuant(const LocalTensor<OutputDataType> &dst, const LocalTensor<InputDataType> &src,
      const LocalTensor<OutputDataType> &offset, const LocalTensor<OutputDataType> &scale,
      const uint32_t K, const AntiQuantShapeInfo& shapeInfo = {})
  ```
- Per_tensor scenario

  ```cpp
  template <typename InputDataType, typename OutputDataType, bool isTranspose>
  __aicore__ inline void AscendAntiQuant(const LocalTensor<OutputDataType> &dst, const LocalTensor<InputDataType> &src,
      const OutputDataType offset, const OutputDataType scale, const uint32_t K,
      const AntiQuantShapeInfo& shapeInfo = {})
  ```
Due to the complex mathematical computation involved in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. The temporary space can either be allocated by the API framework or passed in by the developer through the sharedTmpBuffer input parameter.
- When the API framework allocates the temporary space, developers do not need to allocate it themselves, but must reserve the required size for it.
- When the temporary space is passed in through the sharedTmpBuffer input parameter, that tensor serves as the temporary space and the API framework performs no allocation. This lets developers manage the sharedTmpBuffer space themselves and reuse the buffer across API calls, so that it is not repeatedly allocated and deallocated, which improves flexibility and buffer utilization.
In either case, the required temporary space size (BufferSize) can be obtained through the API described in GetAscendAntiQuantMaxMinTmpSize: if the API framework allocates the space, developers must reserve that size; if sharedTmpBuffer is used, developers must allocate a tensor of at least that size.
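
For reference, the per_tensor overload that takes sharedTmpBuffer might be called as sketched below. The tensor names and scalar values are illustrative (they mirror the example at the end of this page), not prescribed by the API:

```cpp
// Hedged sketch of a per_tensor call: srcLocal holds int8_t input, dstLocal holds half
// output, and sharedTmpBuffer has been sized per GetAscendAntiQuantMaxMinTmpSize.
AscendC::AscendAntiQuant<int8_t, half, false>(
    dstLocal, srcLocal,
    static_cast<half>(2.0f),  // scalar offset
    static_cast<half>(3.0f),  // scalar scale
    sharedTmpBuffer, K);
```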
Parameters

| Parameter | Description |
|---|---|
| InputDataType | Input data type. |
| OutputDataType | Output data type. |
| isTranspose | Whether to enable input data transpose. |

| Parameter | Input/Output | Description |
|---|---|---|
| dst | Output | Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. |
| src | Input | Source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. |
| offset | Input | Offset used when the input data is dequantized. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. |
| scale | Input | Scaling factor used when the input data is dequantized. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. |
| sharedTmpBuffer | Input | Temporary buffer. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. For details about how to obtain the temporary space size (BufferSize), see GetAscendAntiQuantMaxMinTmpSize. |
| K | Input | If isTranspose is true, the shape of src is [N, K]; if isTranspose is false, the shape of src is [K, N]. This parameter corresponds to the K value of that shape. |
| shapeInfo | Input | Optional. Shape information of offset and scale, configured only in the per_channel scenario. If this parameter is not specified, or the data in the struct is set to 0, the shape information of offset and scale is obtained from ShapeInfo. The AntiQuantShapeInfo type is defined after this table. |
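
The AntiQuantShapeInfo struct describes the height and width of offset and scale. The sketch below is inferred from the four-field aggregate initializer used in the example at the end of this page ({offsetHeight, offsetWidth, scaleHeight, scaleWidth}); check the field names against the headers of your CANN version:

```cpp
// Assumed layout of AntiQuantShapeInfo, matching the initializer
// {1, elementCountOfOffset, 1, elementCountOfOffset} used in the example below.
struct AntiQuantShapeInfo {
    uint32_t offsetHeight = 0; // number of rows in offset
    uint32_t offsetWidth = 0;  // number of columns in offset
    uint32_t scaleHeight = 0;  // number of rows in scale
    uint32_t scaleWidth = 0;   // number of columns in scale
};
```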
Returns
None
Availability
Constraints
- The source operand address must not overlap the destination operand address.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
- The length of the data involved in computation of the input and output operands must be 32-byte aligned.
- When inputs are transposed, K must be 32-byte aligned.
- Before calling the API, ensure that the input data size is correct and that the sizes and shapes of offset and scale are correct.
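
The 32-byte alignment constraints can be verified with simple host-side arithmetic. The helper below is a hypothetical illustration, not part of the Ascend C API, and assumes the constraint refers to the byte length of the elements involved:

```cpp
#include <cstdint>

// Hypothetical helper: returns true when `count` elements of `elemSize` bytes
// occupy a byte length that is a multiple of 32.
constexpr bool IsLength32ByteAligned(uint32_t count, uint32_t elemSize) {
    return (count * elemSize) % 32u == 0u;
}

// Example: K = 64 int8_t elements -> 64 bytes -> satisfies the transposed-K rule.
static_assert(IsLength32ByteAligned(64u, sizeof(int8_t)), "K is not 32-byte aligned");
```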
Example
```cpp
#include "kernel_operator.h"

template <typename InputType, typename OutType>
class AntiQuantTest {
public:
    __aicore__ inline AntiQuantTest() {}
    __aicore__ inline void Init(GM_ADDR dstGm, GM_ADDR srcGm, GM_ADDR offsetGm, GM_ADDR scaleGm,
        uint32_t elementCountOfInput, uint32_t elementCountOfOffset, uint32_t K)
    {
        this->elementCountOfInput = elementCountOfInput;   // fix: assign to the member, not the parameter
        this->elementCountOfOffset = elementCountOfOffset;
        k = K;
        dstGlobal.SetGlobalBuffer((__gm__ OutType *)dstGm);
        srcGlobal.SetGlobalBuffer((__gm__ InputType *)srcGm);
        offsetGlobal.SetGlobalBuffer((__gm__ OutType *)offsetGm);
        scaleGlobal.SetGlobalBuffer((__gm__ OutType *)scaleGm);
        pipe.InitBuffer(queInSrc, 1, elementCountOfInput * sizeof(InputType));
        pipe.InitBuffer(queInOffset, 1, elementCountOfOffset * sizeof(OutType));
        pipe.InitBuffer(queInScale, 1, elementCountOfOffset * sizeof(OutType));
        pipe.InitBuffer(queOut, 1, elementCountOfInput * sizeof(OutType));
        pipe.InitBuffer(queTmp, 1, 67584); // temporary space; size obtained per GetAscendAntiQuantMaxMinTmpSize
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<InputType> srcLocal = queInSrc.AllocTensor<InputType>();
        AscendC::DataCopy(srcLocal, srcGlobal, elementCountOfInput);
        queInSrc.EnQue(srcLocal);
        AscendC::LocalTensor<OutType> offsetLocal = queInOffset.AllocTensor<OutType>();
        AscendC::DataCopy(offsetLocal, offsetGlobal, elementCountOfOffset);
        queInOffset.EnQue(offsetLocal);
        AscendC::LocalTensor<OutType> scaleLocal = queInScale.AllocTensor<OutType>();
        AscendC::DataCopy(scaleLocal, scaleGlobal, elementCountOfOffset);
        queInScale.EnQue(scaleLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<InputType> srcLocal = queInSrc.DeQue<InputType>();
        AscendC::LocalTensor<OutType> offsetLocal = queInOffset.DeQue<OutType>();
        AscendC::LocalTensor<OutType> scaleLocal = queInScale.DeQue<OutType>();
        AscendC::LocalTensor<OutType> dstLocal = queOut.AllocTensor<OutType>();
        AscendC::LocalTensor<uint8_t> sharedTmpBuffer = queTmp.AllocTensor<uint8_t>();
        // offset and scale both have shape [1, elementCountOfOffset] in this per_channel example.
        AscendC::AntiQuantShapeInfo shapeInfo = {1, elementCountOfOffset, 1, elementCountOfOffset};
        AscendC::AscendAntiQuant<InputType, OutType, false>(dstLocal, srcLocal, offsetLocal, scaleLocal,
            sharedTmpBuffer, k, shapeInfo);
        queInSrc.FreeTensor(srcLocal);
        queInOffset.FreeTensor(offsetLocal);
        queInScale.FreeTensor(scaleLocal);
        queTmp.FreeTensor(sharedTmpBuffer);
        queOut.EnQue(dstLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<OutType> dstLocal = queOut.DeQue<OutType>();
        AscendC::DataCopy(dstGlobal, dstLocal, elementCountOfInput);
        queOut.FreeTensor(dstLocal);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> queInSrc;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> queInOffset;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> queInScale;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> queTmp;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> queOut;
    AscendC::GlobalTensor<OutType> dstGlobal;
    AscendC::GlobalTensor<InputType> srcGlobal;
    AscendC::GlobalTensor<OutType> offsetGlobal;
    AscendC::GlobalTensor<OutType> scaleGlobal;
    uint32_t elementCountOfInput;
    uint32_t elementCountOfOffset;
    uint32_t k;
}; // class AntiQuantTest

extern "C" __global__ __aicore__ void kernel_anti_quant(GM_ADDR dst, GM_ADDR src, GM_ADDR offset, GM_ADDR scale,
    uint32_t elementCountOfInput, uint32_t elementCountOfOffset, uint32_t K)
{
    // fix: the class is defined in the global namespace, not AscendC; int8_t -> half
    // matches the fake-quantization example described in Function Description.
    AntiQuantTest<int8_t, half> op;
    op.Init(dst, src, offset, scale, elementCountOfInput, elementCountOfOffset, K);
    op.Process();
}
```
```
Input data src  (shape [2, 64], non-transpose scenario): all 128 elements are 1
offset          (shape [1, 64]): all 64 elements are 2.0
scale           (shape [1, 64]): all 64 elements are 3.0
Output dstLocal (shape [2, 64]): all 128 elements are 9.0
```

Each output element is computed as (1 + 2) * 3 = 9, matching the per_channel formula in Function Description.