ReduceXorSum
Applicability
| Product | Supported |
|---|---|
|  | √ |
|  | √ |
|  | x |
|  | √ |
|  | x |
|  | x |
Function
Performs an element-wise bitwise XOR on the two source tensors and then computes the sum of the XOR results using ReduceSum.
If the final calculation result exceeds the int16 range [-32768, 32767], the output will be -32768 or 32767.
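The behavior described above can be modeled on the host for checking expected values. The following is a minimal sketch, not the device implementation: `ReduceXorSumReference` is a hypothetical helper name, and it assumes that an out-of-range sum is clamped to the nearest bound of the int16 range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical host-side reference model; not part of the Ascend C API.
// XORs the first calCount element pairs, sums the results in a wide
// accumulator, and clamps the sum to the int16 range.
inline int16_t ReduceXorSumReference(const std::vector<int16_t>& src0,
                                     const std::vector<int16_t>& src1,
                                     std::size_t calCount)
{
    assert(calCount <= src0.size() && calCount <= src1.size());
    int64_t sum = 0;
    for (std::size_t i = 0; i < calCount; ++i) {
        // Bitwise XOR of two int16 operands; the result fits in int16.
        sum += static_cast<int16_t>(src0[i] ^ src1[i]);
    }
    // Assumption: an overflowing sum is clamped to [-32768, 32767].
    const int64_t lo = std::numeric_limits<int16_t>::min();
    const int64_t hi = std::numeric_limits<int16_t>::max();
    if (sum < lo) return static_cast<int16_t>(lo);
    if (sum > hi) return static_cast<int16_t>(hi);
    return static_cast<int16_t>(sum);
}
```

With 32 zeros in one source and 32 ones in the other, each XOR yields 1 and the model returns 32.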
Prototype
- Pass the temporary space through the sharedTmpBuffer input parameter.

```cpp
template <typename T, bool isReuseSource = false>
__aicore__ inline void ReduceXorSum(LocalTensor<T>& dstTensor, const LocalTensor<T>& src0Tensor, const LocalTensor<T>& src1Tensor, LocalTensor<uint8_t>& sharedTmpBuffer, const uint32_t calCount);
```
- Allocate the temporary space through the API framework.

```cpp
template <typename T, bool isReuseSource = false>
__aicore__ inline void ReduceXorSum(LocalTensor<T>& dstTensor, const LocalTensor<T>& src0Tensor, const LocalTensor<T>& src1Tensor, const uint32_t calCount);
```
Internally, this API must store the XOR results and perform further operations on them, so it needs additional temporary space for the intermediate variables generated during computation. You can pass the temporary space through the sharedTmpBuffer input parameter or let the API framework allocate it.
- When the temporary space is passed through sharedTmpBuffer, that tensor serves as the temporary space and the API framework does not allocate any. You manage the sharedTmpBuffer space yourself and can reuse the buffer after the call, which avoids repeated allocation and deallocation and improves flexibility and buffer utilization.
- When the API framework is used for temporary space allocation, you do not need to allocate the space, but must reserve the required size for the temporary space.
In both cases, obtain the size of the temporary space (BufferSize) through GetReduceXorSumMaxMinTmpSize. If sharedTmpBuffer is used, allocate a tensor of that size. If the API framework is used, reserve that much temporary space.
Parameters
| Parameter | Description |
|---|---|
| T | Data type of the operand. |
| isReuseSource | Whether the source operands can be modified. The default value is false. If set to true, the src0Tensor and src1Tensor memory space is reused during internal computation, which reduces memory usage. If set to false, the src0Tensor and src1Tensor memory space is not reused. For details about how to use isReuseSource, see Example 4. |
| Parameter | Input/Output | Description |
|---|---|---|
| dstTensor | Output | Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The output value occupies sizeof(T) bytes. Allocate the buffer for dstTensor based on this size and the framework's alignment requirements. NOTE: The allocated buffer size must be 32-byte aligned, as the framework requires. If sizeof(T) is not a multiple of 32 bytes, round it up to the next multiple of 32 bytes. The extra space added for alignment should not be filled with values; it is left holding random values. |
| src0Tensor | Input | Source operand 0. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The source operand must have the same data type as the destination operand. |
| src1Tensor | Input | Source operand 1. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The source operand must have the same data type as the destination operand. |
| sharedTmpBuffer | Input | Temporary buffer. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. It stores intermediate variables during ReduceXorSum computation and is provided by the developer. For details about how to obtain the temporary space size (BufferSize), see GetReduceXorSumMaxMinTmpSize. |
| calCount | Input | Number of elements involved in the computation. |
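The 32-byte rounding described for dstTensor can be sketched in plain C++. This is an illustrative host-side helper only: `AlignUp32` and `DstElementCount` are hypothetical names and are not part of the Ascend C API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Alignment granularity stated in the dstTensor note.
constexpr std::size_t kAlignBytes = 32;

// Rounds a byte count up to the next multiple of 32 (hypothetical helper).
constexpr std::size_t AlignUp32(std::size_t bytes)
{
    return (bytes + kAlignBytes - 1) / kAlignBytes * kAlignBytes;
}

// Number of T elements held by the padded destination buffer.
// Only the first element carries the result; the rest is padding.
template <typename T>
std::size_t DstElementCount()
{
    const std::size_t bytes = AlignUp32(sizeof(T));
    assert(bytes % sizeof(T) == 0);
    return bytes / sizeof(T);
}
```

For int16_t, sizeof(T) is 2 bytes, so the buffer is padded to 32 bytes and holds 16 elements, of which only the first is a valid result.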
Returns
None
Restrictions
- For details about the operand address alignment requirements, see General Address Alignment Restrictions.
- The source operand address must not overlap the destination operand address.
- sharedTmpBuffer must not overlap the addresses of the source operand and destination operand.
- calCount must not exceed the number of elements in src0Tensor or src1Tensor.
- If the final calculation result exceeds the int16 range [-32768, 32767], the output will be -32768 or 32767.
- For the AI Core of the Atlas inference product, intermediate computation data is stored in half type, so the final result has a larger error than on other processors.
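The overlap and calCount restrictions above can be expressed as a simple host-side check. The sketch below is an assumption-laden illustration: `ByteRange`, `Overlaps`, and `SatisfiesRestrictions` are hypothetical names, buffers are modeled as non-empty byte ranges, and the address alignment and precision restrictions are not covered.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical description of a buffer as a byte range; not an Ascend C type.
struct ByteRange {
    std::uintptr_t start;
    std::size_t length;
};

// True if two non-empty half-open ranges [start, start + length) share a byte.
inline bool Overlaps(const ByteRange& a, const ByteRange& b)
{
    return a.start < b.start + b.length && b.start < a.start + a.length;
}

// Checks that dst does not overlap either source, that the temporary buffer
// overlaps none of the operands, and that calCount elements fit in both sources.
inline bool SatisfiesRestrictions(const ByteRange& dst, const ByteRange& src0,
                                  const ByteRange& src1, const ByteRange& tmp,
                                  std::size_t elemSize, std::size_t calCount)
{
    assert(elemSize > 0);
    const bool noOverlap = !Overlaps(dst, src0) && !Overlaps(dst, src1) &&
                           !Overlaps(tmp, dst) && !Overlaps(tmp, src0) &&
                           !Overlaps(tmp, src1);
    const bool countFits = calCount * elemSize <= src0.length &&
                           calCount * elemSize <= src1.length;
    return noOverlap && countFits;
}
```

A check like this fits in host-side test code, where buffer offsets are known before the kernel is launched.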
Examples
```cpp
AscendC::TPipe pipe;
AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueX;
AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueY;
AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueue;
AscendC::TQue<AscendC::TPosition::VECCALC, 1> tmpQue;

pipe.InitBuffer(inQueueX, 1, 32 * sizeof(int16_t));
pipe.InitBuffer(inQueueY, 1, 32 * sizeof(int16_t));
pipe.InitBuffer(outQueue, 1, 32);  // sizeof(int16_t) rounded up to 32 bytes
pipe.InitBuffer(tmpQue, 1, bufferSize);  // bufferSize is obtained through the tiling parameter on the host.

AscendC::LocalTensor<int16_t> dstLocal = outQueue.AllocTensor<int16_t>();
AscendC::LocalTensor<int16_t> src0Local = inQueueX.AllocTensor<int16_t>();
AscendC::LocalTensor<int16_t> src1Local = inQueueY.AllocTensor<int16_t>();
AscendC::LocalTensor<uint8_t> sharedTmpBuffer = tmpQue.AllocTensor<uint8_t>();

// The source buffers are not reused (isReuseSource = false). The input shape is 32 and the
// operator's input data type is int16_t. The first 32 elements are computed.
AscendC::ReduceXorSum<int16_t, false>(dstLocal, src0Local, src1Local, sharedTmpBuffer, 32);
```
```
The input and output data type is int16_t.
Input data (src0Local): [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0]
Input data (src1Local): [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
Output data (dstLocal): [32 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0] // Only the first element (32) is a valid value.
```