Compare

Applicability

Product	Supported
Atlas A3 training products / Atlas A3 inference products	√
Atlas A2 training products / Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	√
Atlas inference product's AI Core	√
Atlas inference product's Vector Core	x
Atlas training products	√

Function

Compares two tensors element by element. Each comparison produces one bit of output: the bit is set to 1 if the comparison is true and to 0 otherwise.

The following comparison modes are supported:

LT: less than
GT: greater than
GE: greater than or equal to
EQ: equal to
NE: not equal to
LE: less than or equal to
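
The semantics described above can be sketched on the host in plain C++. This is a reference model for checking results, not the AscendC implementation: `CmpMode`, `CompareOne`, and `CompareReference` are hypothetical names, and the sketch assumes that element i maps to bit (i % 8) of byte (i / 8), least significant bit first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side stand-in for CMPMODE; not the AscendC enum.
enum class CmpMode { LT, GT, GE, EQ, NE, LE };

// Evaluates a single comparison in the given mode.
template <typename T>
bool CompareOne(T a, T b, CmpMode mode) {
    switch (mode) {
        case CmpMode::LT: return a < b;
        case CmpMode::GT: return a > b;
        case CmpMode::GE: return a >= b;
        case CmpMode::EQ: return a == b;
        case CmpMode::NE: return a != b;
        case CmpMode::LE: return a <= b;
    }
    return false;
}

// Packs one result bit per element into uint8_t bytes.
// Assumption: element i is stored in bit (i % 8) of byte (i / 8), LSB first.
template <typename T>
std::vector<uint8_t> CompareReference(const std::vector<T>& src0,
                                      const std::vector<T>& src1,
                                      CmpMode mode) {
    assert(src0.size() == src1.size());
    std::vector<uint8_t> dst((src0.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < src0.size(); ++i) {
        if (CompareOne(src0[i], src1[i], mode)) {
            dst[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
        }
    }
    return dst;
}
```

A reference like this is mainly useful for validating kernel output in a host-side test.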

Prototype

Computation of the entire tensor

        
dst = src0 < src1;
dst = src0 > src1;
dst = src0 <= src1;
dst = src0 >= src1;
dst = src0 == src1;
dst = src0 != src1;

Atlas 200I/500 A2 inference products do not currently support the overloaded operators for whole-tensor computation.

Computation of the first n data elements of a tensor

        
template <typename T, typename U>
__aicore__ inline void Compare(const LocalTensor<U>& dst, const LocalTensor<T>& src0, const LocalTensor<T>& src1, CMPMODE cmpMode, uint32_t count)

High-dimensional tensor sharding computation

Bitwise mask mode

          
template <typename T, typename U, bool isSetMask = true>
__aicore__ inline void Compare(const LocalTensor<U>& dst, const LocalTensor<T>& src0, const LocalTensor<T>& src1, CMPMODE cmpMode, const uint64_t mask[], uint8_t repeatTime, const BinaryRepeatParams& repeatParams)

Contiguous mask mode

          
template <typename T, typename U, bool isSetMask = true>
__aicore__ inline void Compare(const LocalTensor<U>& dst, const LocalTensor<T>& src0, const LocalTensor<T>& src1, CMPMODE cmpMode, const uint64_t mask, uint8_t repeatTime, const BinaryRepeatParams& repeatParams)

Parameters

**Table 1** Template parameters
Parameter	Description
T	Data type of the source operands. Atlas A3 training products / Atlas A3 inference products: half (all comparison modes), float (all comparison modes), and int32_t (CMPMODE::EQ only). Atlas A2 training products / Atlas A2 inference products: half (all comparison modes), float (all comparison modes), and int32_t (CMPMODE::EQ only). Atlas 200I/500 A2 inference products: half and float. Atlas inference product's AI Core: half and float. Atlas training products: half and float.
U	Data type of the destination operand. All supported products (Atlas A3 training products / Atlas A3 inference products, Atlas A2 training products / Atlas A2 inference products, Atlas 200I/500 A2 inference products, the Atlas inference product's AI Core, and Atlas training products) support int8_t and uint8_t.
isSetMask	Reserved. Retain the default value.

**Table 2** API parameters
Parameter	Input/Output	Description
dst	Output	Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. dst stores the comparison results. The uint8_t data in dst is expanded bitwise. Bits from left to right indicate the element-wise comparison results between src0 and src1. If the comparison result is true, the corresponding bit is set to 1; otherwise, it is set to 0.
src0 and src1	Input	Source operands. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of each LocalTensor must be 32-byte aligned.
cmpMode	Input	Comparison mode. LT: src0 < src1; GT: src0 > src1; GE: src0 ≥ src1; EQ: src0 = src1; NE: src0 ≠ src1; LE: src0 ≤ src1.
mask/mask[]	Input	Controls which elements participate in computation in each iteration. This parameter takes effect only on Atlas 200I/500 A2 inference products. On Atlas A3 training products / Atlas A3 inference products, Atlas A2 training products / Atlas A2 inference products, the Atlas inference product's AI Core, and Atlas training products, it is reserved and has no effect. Contiguous mode: mask is the number of contiguous elements that participate in computation. The value range depends on the operand data type, because the maximum number of elements processed per iteration depends on the element width: for 16-bit operands, mask ∈ [1, 128]; for 32-bit operands, mask ∈ [1, 64]. Bitwise mode: each bit of mask[] controls one element; 1 means the element participates in computation and 0 means it does not. The parameter is a uint64_t array with a length of 2 or 4. For example, if mask[0] = 8 (0b1000) and mask[1] = 0, only the fourth element participates in computation. The value range again depends on the operand data type: for 16-bit operands, mask[0] and mask[1] ∈ [0, 2⁶⁴ – 1] and cannot both be 0; for 32-bit operands, mask[1] = 0 and mask[0] ∈ (0, 2⁶⁴ – 1].
repeatTime	Input	Number of iteration repeats. The vector compute unit reads 256 bytes of contiguous data for computation each time. To process the input data, the data needs to be read and computed over multiple repeats. repeatTime indicates the number of repeats. For details about this parameter, see High-dimensional Sharding APIs.
repeatParams	Input	Address strides of the operands, passed as a BinaryRepeatParams structure. It specifies, for each operand, the stride of the same data block between adjacent iterations (see repeatStride) and the stride between data blocks within a single iteration (see dataBlockStride).
count	Input	Number of elements involved in the computation. When setting count, ensure that the memory occupied by count elements is 256-byte aligned.
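
The mask and repeatTime rules in Table 2 can be illustrated with a small host-side sketch. The helper names (`ElementsPerRepeat`, `RepeatTimeFor`, `SingleElementMask`) and the `BitMask` struct are hypothetical and are not part of AscendC. The sketch assumes that mask[0] covers elements 0 to 63 and mask[1] covers elements 64 to 127, which is consistent with the rule that mask[1] = 0 for 32-bit operands.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Each iteration reads 256 bytes of contiguous data (see repeatTime).
constexpr std::size_t kBytesPerRepeat = 256;

// Elements handled per iteration: 128 for 16-bit operands, 64 for 32-bit operands.
constexpr std::size_t ElementsPerRepeat(std::size_t elemBytes) {
    return kBytesPerRepeat / elemBytes;
}

// Iterations needed to cover `count` elements (hypothetical helper).
// The API takes repeatTime as uint8_t, so the caller must check that the value fits.
constexpr std::size_t RepeatTimeFor(std::size_t count, std::size_t elemBytes) {
    return (count + ElementsPerRepeat(elemBytes) - 1) / ElementsPerRepeat(elemBytes);
}

// Two-word bitwise mask. Assumption: bits[0] covers elements 0..63,
// bits[1] covers elements 64..127.
struct BitMask {
    uint64_t bits[2];
};

// Builds a bitwise mask that enables only the element at `index`.
inline BitMask SingleElementMask(std::size_t index, std::size_t elemBytes) {
    assert(index < ElementsPerRepeat(elemBytes));
    BitMask m{{0, 0}};
    m.bits[index / 64] = uint64_t{1} << (index % 64);
    return m;
}
```

For instance, `SingleElementMask(3, 2)` selects the fourth element of a 16-bit operand, and `RepeatTimeFor(256, sizeof(float))` gives the number of iterations needed for 256 float elements.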

Returns

None

Restrictions

For details about the operand address alignment requirements, see General Address Alignment Restrictions.

dst stores the comparison results for corresponding positions of src0 and src1 as binary values in little-endian order.
When using operator overloading for computations involving entire tensors, both src0 and src1 must be 256-byte aligned. When using APIs that compute the first n elements of a tensor, the count must be set such that the memory occupied by the count elements is 256-byte aligned.
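
The count restriction above can be checked on the host before launching a kernel. The following is a minimal sketch; `IsCountAligned` and `DstBytesFor` are hypothetical helpers, not AscendC functions, and the destination size assumes one result bit per compared element.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Alignment required for the memory occupied by `count` elements.
constexpr std::size_t kCompareAlignBytes = 256;

// True if `count` elements of type T occupy a non-zero multiple of 256 bytes.
template <typename T>
constexpr bool IsCountAligned(std::size_t count) {
    return count > 0 && (count * sizeof(T)) % kCompareAlignBytes == 0;
}

// Bytes of dst needed for `count` elements, assuming one result bit per element.
template <typename T>
inline std::size_t DstBytesFor(std::size_t count) {
    assert(IsCountAligned<T>(count));
    return count / 8;
}
```

With float operands, a count of 256 occupies 1024 bytes and passes the check, while a count of 100 occupies 400 bytes and does not.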

Example

In this example, the source operands src0 and src1 each hold 256 float elements. The example compares src0 and src1 element by element: if an element of src0 is less than the corresponding element of src1, the matching bit in dst is set to 1; otherwise it is set to 0. The result in dst is stored as uint8_t data.

This example shows only part of the code used in the computation process (Compute). To run the sample code, copy the code snippet and replace some code of the Compute function in Template Sample.

Computation of the entire tensor

dstLocal = src0Local < src1Local;

Computation of the first n data elements of a tensor

        
AscendC::Compare(dstLocal, src0Local, src1Local, AscendC::CMPMODE::LT, srcDataSize);
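
Once the packed result has been copied back to the host, individual bits can be read as sketched below. `ResultAt` and `CountTrue` are hypothetical helpers for illustration, and the bit layout (element i in bit (i % 8) of byte (i / 8), least significant bit first) is an assumption based on the little-endian description in Restrictions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the comparison result for element i.
// Assumption: element i is stored in bit (i % 8) of byte (i / 8), LSB first.
inline bool ResultAt(const std::vector<uint8_t>& dst, std::size_t i) {
    assert(i / 8 < dst.size());
    return ((dst[i / 8] >> (i % 8)) & 1u) != 0;
}

// Counts how many of the first `count` elements compared true.
inline std::size_t CountTrue(const std::vector<uint8_t>& dst, std::size_t count) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < count; ++i) {
        n += ResultAt(dst, i) ? 1 : 0;
    }
    return n;
}
```

For the 256-element example, dst holds 32 bytes, so valid element indices run from 0 to 255.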

Template Sample

      
       
         
         
#include "kernel_operator.h"
class KernelCmp {
public:
    __aicore__ inline KernelCmp() {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
    {
        src0Global.SetGlobalBuffer((__gm__ float*)src0Gm);
        src1Global.SetGlobalBuffer((__gm__ float*)src1Gm);
        dstGlobal.SetGlobalBuffer((__gm__ uint8_t*)dstGm);
        pipe.InitBuffer(inQueueSrc0, 1, srcDataSize * sizeof(float));
        pipe.InitBuffer(inQueueSrc1, 1, srcDataSize * sizeof(float));
        pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(uint8_t));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<float> src0Local = inQueueSrc0.AllocTensor<float>();
        AscendC::LocalTensor<float> src1Local = inQueueSrc1.AllocTensor<float>();
        AscendC::DataCopy(src0Local, src0Global, srcDataSize);
        AscendC::DataCopy(src1Local, src1Global, srcDataSize);
        inQueueSrc0.EnQue(src0Local);
        inQueueSrc1.EnQue(src1Local);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<float> src0Local = inQueueSrc0.DeQue<float>();
        AscendC::LocalTensor<float> src1Local = inQueueSrc1.DeQue<float>();
        AscendC::LocalTensor<uint8_t> dstLocal = outQueueDst.AllocTensor<uint8_t>();
 
        // Replace the call below with the Compare variant you want to test.
        AscendC::Compare(dstLocal, src0Local, src1Local, AscendC::CMPMODE::LT, srcDataSize);
 
        outQueueDst.EnQue<uint8_t>(dstLocal);
        inQueueSrc0.FreeTensor(src0Local);
        inQueueSrc1.FreeTensor(src1Local);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<uint8_t> dstLocal = outQueueDst.DeQue<uint8_t>();
        AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc0, inQueueSrc1;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<float> src0Global, src1Global;
    AscendC::GlobalTensor<uint8_t> dstGlobal;
    uint32_t srcDataSize = 256;
    uint32_t dstDataSize = srcDataSize / 8;
};
extern "C" __global__ __aicore__ void main_cpu_cmp_sel_demo(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm)
{
    KernelCmp op;
    op.Init(src0Gm, src1Gm, dstGm);
    op.Process();
}

          

        

      
     
