Xor

Function Description

Performs a bitwise XOR operation element-wise. The concept and rule of XOR are as follows:

  • Concept: XOR is a binary operation that applies to two data elements.
  • Rule: 0^0 = 0; 0^1 = 1; 1^0 = 1; 1^1 = 0. That is, if two corresponding bits differ, the result bit is 1; if they are the same, the result bit is 0.
For example, 3^5 = 6, because 0000 0011 ^ 0000 0101 = 0000 0110.

Prototype

  • Pass the temporary space through the sharedTmpBuffer input parameter.
    • All or part of the source operand tensors are involved in computation.
      template <typename T, bool isReuseSource = false>
      __aicore__ inline void Xor(const LocalTensor<T>& dstTensor, const LocalTensor<T>& src0Tensor, const LocalTensor<T>& src1Tensor, const LocalTensor<uint8_t>& sharedTmpBuffer, const uint32_t calCount)
      
    • All source operand tensors are involved in computation.
      template <typename T, bool isReuseSource = false>
      __aicore__ inline void Xor(const LocalTensor<T>& dstTensor, const LocalTensor<T>& src0Tensor, const LocalTensor<T>& src1Tensor, const LocalTensor<uint8_t>& sharedTmpBuffer)
      
  • Allocate the temporary space through the API framework.
    • All or part of the source operand tensors are involved in computation.
      template <typename T, bool isReuseSource = false>
      __aicore__ inline void Xor(const LocalTensor<T>& dstTensor, const LocalTensor<T>& src0Tensor, const LocalTensor<T>& src1Tensor, const uint32_t calCount)
      
    • All source operand tensors are involved in computation.
      template <typename T, bool isReuseSource = false>
      __aicore__ inline void Xor(const LocalTensor<T>& dstTensor, const LocalTensor<T>& src0Tensor, const LocalTensor<T>& src1Tensor)
      

Due to the complex mathematical computation involved in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. The temporary space can be passed by developers through the sharedTmpBuffer input parameter or allocated through the API framework.

  • When the temporary space is passed through the sharedTmpBuffer input parameter, that tensor serves as the temporary space and the API framework does not need to allocate it. Developers can manage the sharedTmpBuffer space themselves and reuse the buffer across API calls, so that the buffer is not repeatedly allocated and deallocated, improving flexibility and buffer utilization.
  • When the API framework allocates the temporary space, developers do not need to allocate it themselves, but must reserve space of the required size.

If sharedTmpBuffer is used, developers must allocate space for that tensor themselves. If the API framework is used, developers must reserve the temporary space. To obtain the size (BufferSize) of the temporary space to be reserved, use the API described in GetXorMaxMinTmpSize.

Parameters

Table 1 Parameters in the template

  Parameter       Description
  T               Data type of the operands.
  isReuseSource   Whether the source operand can be modified. This parameter is reserved; pass the default value false.

Table 2 API parameters

  dstTensor (Output)
    Destination operand.
    The type is LocalTensor, and the supported TPosition values are VECIN, VECCALC, and VECOUT.

  src0Tensor (Input)
    Source operand 0.
    The type is LocalTensor, and the supported TPosition values are VECIN, VECCALC, and VECOUT.
    The source operand must have the same data type as the destination operand.

  src1Tensor (Input)
    Source operand 1.
    The type is LocalTensor, and the supported TPosition values are VECIN, VECCALC, and VECOUT.
    The source operand must have the same data type as the destination operand.

  sharedTmpBuffer (Input)
    Temporary buffer, provided by the developer, used to store intermediate variables generated during the complex computation in Xor.
    The type is LocalTensor, and the supported TPosition values are VECIN, VECCALC, and VECOUT.
    For details about how to obtain the temporary space size (BufferSize), see GetXorMaxMinTmpSize.

  calCount (Input)
    Number of data elements actually computed. The value range is [0, min(src0Tensor.GetSize(), src1Tensor.GetSize())].

Returns

None

Availability

Constraints

  • The source operand addresses must not overlap the destination operand address.
  • Currently, only the ND format is supported.
  • Ensure that calCount does not exceed the number of elements stored in src0Tensor, src1Tensor, and dstTensor.
  • For the API overloads without calCount, ensure that src0Tensor and src1Tensor have the same shape size.
  • sharedTmpBuffer must not overlap the addresses of the source operands or the destination operand.
  • For details about the alignment requirements of the operand address offset, see General Restrictions.

Example

xor_custom.cpp on the kernel side:

#include "kernel_operator.h"

constexpr int32_t BUFFER_NUM = 1;
class KernelXor {
public:
    __aicore__ inline KernelXor() {}
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z, uint32_t totalLength, uint32_t totalLength2, uint32_t tilenum, uint32_t tmpSize, uint32_t mcount)
    {
        this->totalLength = totalLength;
        this->blockLength = totalLength / AscendC::GetBlockNum();
        this->blockLength2 = totalLength2 / AscendC::GetBlockNum();
        this->tilenum = tilenum;
        this->tmpSize = tmpSize;
        this->mcount = mcount;
        this->tileLength = this->blockLength / tilenum / BUFFER_NUM;
        this->tileLength2 = this->blockLength2 / tilenum / BUFFER_NUM;

        xGm.SetGlobalBuffer((__gm__ int16_t *)x + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
        yGm.SetGlobalBuffer((__gm__ int16_t *)y + this->blockLength2 * AscendC::GetBlockIdx(), this->blockLength2);
        zGm.SetGlobalBuffer((__gm__ int16_t *)z + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);

        if (this->tmpSize != 0) {
            pipe.InitBuffer(tmpQueue, BUFFER_NUM, this->tmpSize);
        }
        pipe.InitBuffer(inQueueX, BUFFER_NUM, this->tileLength * sizeof(int16_t));
        pipe.InitBuffer(inQueueY, BUFFER_NUM, this->tileLength2 * sizeof(int16_t));
        pipe.InitBuffer(outQueueZ, BUFFER_NUM, this->tileLength * sizeof(int16_t));
    }
    __aicore__ inline void Process()
    {
        int32_t loopCount = this->tilenum * BUFFER_NUM;
        for (int32_t i = 0; i < loopCount; i++) {
            CopyIn(i);
            Compute(i);
            CopyOut(i);
        }
    }

private:
    __aicore__ inline void CopyIn(int32_t progress)
    {
        AscendC::LocalTensor<int16_t> xLocal = inQueueX.AllocTensor<int16_t>();
        AscendC::DataCopy(xLocal, xGm[progress * this->tileLength], this->tileLength);
        inQueueX.EnQue(xLocal);
        AscendC::LocalTensor<int16_t> yLocal = inQueueY.AllocTensor<int16_t>();
        AscendC::DataCopy(yLocal, yGm[progress * this->tileLength2], this->tileLength2);
        inQueueY.EnQue(yLocal);
    }
    __aicore__ inline void Compute(int32_t progress)
    {
        AscendC::LocalTensor<int16_t> xLocal = inQueueX.DeQue<int16_t>();
        AscendC::LocalTensor<int16_t> yLocal = inQueueY.DeQue<int16_t>();
        AscendC::LocalTensor<int16_t> zLocal = outQueueZ.AllocTensor<int16_t>();
        if (this->tmpSize != 0) {
            // Pass the temporary space explicitly through sharedTmpBuffer.
            AscendC::LocalTensor<uint8_t> tmpLocal = tmpQueue.AllocTensor<uint8_t>();
            if (this->mcount != this->totalLength) {
                // Partial computation: only the first mcount elements are involved.
                AscendC::Xor(zLocal, xLocal, yLocal, tmpLocal, this->mcount);
            } else {
                AscendC::Xor(zLocal, xLocal, yLocal, tmpLocal);
            }
            tmpQueue.FreeTensor(tmpLocal);
        } else {
            // Let the API framework allocate the temporary space.
            if (this->mcount != this->totalLength) {
                AscendC::Xor(zLocal, xLocal, yLocal, this->mcount);
            } else {
                AscendC::Xor(zLocal, xLocal, yLocal);
            }
        }
        outQueueZ.EnQue<int16_t>(zLocal);
        inQueueX.FreeTensor(xLocal);
        inQueueY.FreeTensor(yLocal);
    }
    __aicore__ inline void CopyOut(int32_t progress)
    {
        AscendC::LocalTensor<int16_t> zLocal = outQueueZ.DeQue<int16_t>();
        AscendC::DataCopy(zGm[progress * this->tileLength], zLocal, this->tileLength);
        outQueueZ.FreeTensor(zLocal);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueueX;
    AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueueY;
    AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> tmpQueue;
    AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> outQueueZ;
    AscendC::GlobalTensor<int16_t> xGm;
    AscendC::GlobalTensor<int16_t> yGm;
    AscendC::GlobalTensor<int16_t> zGm;
    uint32_t blockLength;
    uint32_t blockLength2;
    uint32_t tilenum;
    uint32_t tileLength;
    uint32_t tileLength2;
    uint32_t tmpSize;
    uint32_t mcount;
    uint32_t totalLength;
};

extern "C" __global__ __aicore__ void xor_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z, GM_ADDR workspace, GM_ADDR tiling)
{
    GET_TILING_DATA(tilingData, tiling);
    KernelXor op;
    op.Init(x, y, z, tilingData.totalLength, tilingData.totalLength2, tilingData.tilenum, tilingData.tmpSize, tilingData.mcount);
    if (TILING_KEY_IS(1)) {
        op.Process();
    }
}

xor_custom_tiling.h on the host side:

#include "register/op_def_registry.h"
#include "register/tilingdata_base.h"
namespace optiling {
  BEGIN_TILING_DATA_DEF(XorCustomTilingData)
  TILING_DATA_FIELD_DEF(uint32_t, totalLength);
  TILING_DATA_FIELD_DEF(uint32_t, totalLength2);
  TILING_DATA_FIELD_DEF(uint32_t, tmpSize);
  TILING_DATA_FIELD_DEF(uint32_t, tilenum);
  TILING_DATA_FIELD_DEF(uint32_t, mcount);
  END_TILING_DATA_DEF;
  REGISTER_TILING_DATA_CLASS(XorCustom, XorCustomTilingData)
}

xor_custom.cpp on the host side:

#include "xor_custom_tiling.h"
#include "register/op_def_registry.h"
#include "tiling/tiling_api.h"

namespace optiling
{
    static ge::graphStatus TilingFunc(gert::TilingContext *context)
    {
        XorCustomTilingData tiling;
        const gert::RuntimeAttrs *xorAttrs = context->GetAttrs();
        const uint32_t tilenum = *(xorAttrs->GetAttrPointer<uint32_t>(0));
        const uint32_t blockdim = *(xorAttrs->GetAttrPointer<uint32_t>(1));
        const uint32_t sizeflag = *(xorAttrs->GetAttrPointer<uint32_t>(2));
        const uint32_t countflag = *(xorAttrs->GetAttrPointer<uint32_t>(3));
        uint32_t totalLength = context->GetInputTensor(0)->GetShapeSize();
        uint32_t totalLength2 = context->GetInputTensor(1)->GetShapeSize();
        context->SetBlockDim(blockdim);
        tiling.set_totalLength(totalLength);
        tiling.set_totalLength2(totalLength2);
        tiling.set_tilenum(tilenum);

        if (countflag == 0) {
            tiling.set_mcount(totalLength2);
        } else if (countflag == 1) {
            tiling.set_mcount(totalLength);
        }

        std::vector<int64_t> shapeVec = {totalLength};
        ge::Shape srcShape(shapeVec);
        uint32_t typeSize = sizeof(int16_t);
        uint32_t maxValue = 0;
        uint32_t minValue = 0;
        bool isReuseSource = false;
        AscendC::GetXorMaxMinTmpSize(srcShape, typeSize, isReuseSource, maxValue, minValue);
        // sizeflag 0: use the minimum tmpBuffer size; 1: use the maximum tmpBuffer size;
        // 2: set tmpSize to 0 so that the kernel lets the API framework allocate the temporary space.
        if (sizeflag == 0) {
            tiling.set_tmpSize(minValue);
        } else if (sizeflag == 1) {
            tiling.set_tmpSize(maxValue);
        } else if (sizeflag == 2) {
            tiling.set_tmpSize(0);
        }
        tiling.SaveToBuffer(context->GetRawTilingData()->GetData(), context->GetRawTilingData()->GetCapacity());
        context->GetRawTilingData()->SetDataSize(tiling.GetDataSize());
        context->SetTilingKey(1);
        size_t *currentWorkspace = context->GetWorkspaceSizes(1);
        currentWorkspace[0] = 0;
        return ge::GRAPH_SUCCESS;
    }
}
namespace ge
{
    static ge::graphStatus InferShape(gert::InferShapeContext *context)
    {
        const gert::Shape *xShape = context->GetInputShape(0);
        gert::Shape *yShape = context->GetOutputShape(0);
        *yShape = *xShape;
        return GRAPH_SUCCESS;
    }
}
namespace ops
{
    class XorCustom : public OpDef
    {
    public:
        explicit XorCustom(const char *name) : OpDef(name)
        {
            this->Input("x")
                .ParamType(REQUIRED)
                .DataType({ge::DT_INT16})
                .Format({ge::FORMAT_ND});
            this->Input("y")
                .ParamType(REQUIRED)
                .DataType({ge::DT_INT16})
                .Format({ge::FORMAT_ND});
            this->Output("z")
                .ParamType(REQUIRED)
                .DataType({ge::DT_INT16})
                .Format({ge::FORMAT_ND});
            this->SetInferShape(ge::InferShape);
            this->Attr("tilenum")
                .AttrType(REQUIRED)
                .Int(0);
            this->Attr("blockdim")
                .AttrType(REQUIRED)
                .Int(0);
            this->Attr("sizeflag")
                .AttrType(REQUIRED)
                .Int(0);
            this->Attr("countflag")
                .AttrType(REQUIRED)
                .Int(0);
            this->AICore()
                .SetTiling(optiling::TilingFunc);
            this->AICore().AddConfig("ascendxxx"); // Replace ascendxxx with the actual Ascend AI Processor model.
        }
    };
    OP_ADD(XorCustom);
} // namespace ops
Result example:

The input and output data types are int16_t. The one-dimensional vector contains 32 data elements. For the first element of the vector: (-5753) xor 18745 = -24386.
Input data (src0Local): [-5753 28501 20334 -5845  ... -20817 3403 21261 22241]
Input data (src1Local): [18745 -24448 20873 10759 ... 21940 -26342 9251 31019]
Output data (dstLocal): [-24386 -12331 7911 -15572 ... -1253 -27567 30510 12234]