CompareScalar

Applicability

Product	Supported
Atlas A3 training products / Atlas A3 inference products	√
Atlas A2 training products / Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	√
Atlas inference product 's AI Core	√
Atlas inference product 's Vector Core	x
Atlas training products	x

Function

Compares each element of a tensor with a scalar. If the comparison result is true, the corresponding bit in the output is set to 1; otherwise, it is set to 0.

The following comparison modes are supported:

LT: less than
GT: greater than

GE: greater than or equal to
EQ: equal to
NE: not equal to
LE: less than or equal to

Prototype

Computation of the first n data elements of a tensor

        
             template <typename T, typename U>
__aicore__ inline void CompareScalar(const LocalTensor<U>& dst, const LocalTensor<T>& src0, const T src1Scalar, CMPMODE cmpMode, uint32_t count)

High-dimensional tensor sharding computation

Bitwise mask mode

          
               template <typename T, typename U, bool isSetMask = true>
__aicore__ inline void CompareScalar(const LocalTensor<U>& dst, const LocalTensor<T>& src0, const T src1Scalar, CMPMODE cmpMode, const uint64_t mask[], uint8_t repeatTime, const UnaryRepeatParams& repeatParams)

Contiguous mask mode

          
               template <typename T, typename U, bool isSetMask = true>
__aicore__ inline void CompareScalar(const LocalTensor<U>& dst, const LocalTensor<T>& src0, const T src1Scalar, CMPMODE cmpMode, const uint64_t mask, uint8_t repeatTime, const UnaryRepeatParams& repeatParams)

Parameters

**Table 1** Template parameters
Parameter	Description
T	Data type of the source operand.
U	Data type of the destination operand.
isSetMask	Indicates whether to set mask inside the API. true: sets mask inside the API. false: sets mask outside the API. Developers need to use the SetVectorMask API to set the mask value. In this mode, the mask value in the input parameter of this API must be set to the placeholder MASK_PLACEHOLDER.

**Table 2** API parameters
Parameter	Input/Output	Meaning
dst	Output	Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. dst stores the comparison results. The uint8_t data in dst is expanded bitwise. Bits from left to right indicate the element-wise comparison results between src0 and src1. If the comparison result is true, the corresponding bit is set to 1; otherwise, it is set to 0. For the Atlas A3 training products / Atlas A3 inference products , the supported data type is uint8_t. For the Atlas A2 training products / Atlas A2 inference products , the supported data type is uint8_t. For Atlas 200I/500 A2 inference products , the supported data type is uint8_t. For the Atlas inference product 's AI Core, the supported data type is uint8_t.
src0	Input	Source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. For Atlas A3 training products / Atlas A3 inference products , the supported data types are half, float (all CMPMODE options are supported), and int32_t (only CMPMODE::EQ is supported). For the Atlas A2 training products / Atlas A2 inference products , the supported data types are half, float, and int32_t (only CMPMODE::EQ is supported). For Atlas 200I/500 A2 inference products , the supported data types are int16_t, uint16_t, half, float, int32_t, and uint32_t. For the Atlas inference product 's AI Core, the supported data types are half and float.
src1Scalar	Input	Source operand (scalar). Its data type must be the same as that of src0.
cmpMode	Input	Comparison mode, which can be EQ, NE, GE, LE, GT, and LT. LT: src0 < src1 GT: src0 > src1 GE: src0 ≥ src1 EQ: src0 = src1 NE: src0 ≠ src1 LE: src0 ≤ src1
mask/mask[]	Input	mask is used to control the elements that participate in computation in each iteration. For the Atlas A3 training products / Atlas A3 inference products , this parameter is reserved. Setting this parameter is invalid. For the Atlas A2 training products / Atlas A2 inference products , this parameter is reserved. Setting this parameter is invalid. For the Atlas 200I/500 A2 inference products , setting this parameter is valid. For the Atlas inference product 's AI Core, this parameter is reserved. Setting this parameter is invalid. Contiguous mode: indicates the number of contiguous elements that participate in computation. The value range is related to the operand data type. The maximum number of elements that can be processed in each repeat varies according to the data type. When the operand is 16-bit, mask ∈ [1, 128]. When the operand is 32-bit, mask ∈ [1, 64]. Bitwise mode: controls which elements are involved in computation bit by bit. A bit value of 1 means the corresponding element participates in computation, while 0 means it does not. The parameter is a uint64_t array with a length of 2 or 4. For example, if mask = [0, 8] and 8 = 0b1000, only the fourth element participates in computation. The value range is related to the operand data type. The maximum number of elements that can be processed in each repeat varies according to the data type. When the operand is 16-bit, mask[0]/mask[1] ∈ [0, 2⁶⁴ – 1] and cannot be 0 at the same time. When the operand is 32-bit, mask[1] = 0 and mask[0] ∈ (0, 2⁶⁴ – 1].
repeatTime	Input	Number of repeat iterations. The vector compute unit reads 256 bytes of contiguous data for computation each time. To process the input data, the data needs to be read and computed over multiple repeats. repeatTime indicates the number of repeats. For details about this parameter, see High-dimensional Sharding APIs.
repeatParams	Input	Parameters that control the operand address strides. This parameter is of the UnaryRepeatParams type, including the address stride of the same DataBlock between adjacent iterations of the operand and the address stride of different DataBlocks within the same iteration of the operand. For details about the address stride of the operand between adjacent iterations, see repeatStride. For details about the address stride of the operand between different data blocks in a single iteration, see dataBlockStride.
count	Input	Number of elements involved in the computation. When setting count, ensure that the memory occupied by count elements is 256-byte aligned.

Returns

None

Restrictions

For details about the operand address alignment requirements, see General Address Alignment Restrictions.

dst stores the comparison results of corresponding positions in src as binary values in little-endian order.
If the API for computing the first n data elements of a tensor is used, ensure that the memory occupied by count elements is 256-byte aligned.

Example

In this example, the source operand src0Local stores 256 data elements of the float type. It compares each element in src0Local with the value from src1Local.GetValue(0). The corresponding bit in dstLocal is set to 1 if the src0Local element is smaller than that in src1Local.GetValue(0); otherwise, it is set to 0. The dst result is stored in uint8_t format.

Example of API for computing the first n data elements of a tensor

        
             AscendC::CompareScalar(dstLocal, src0Local, src1Scalar, AscendC::CMPMODE::LT, srcDataSize);

High-dimensional tensor sharding computation (contiguous mask mode)

        
             uint64_t mask = 256 / sizeof(float); // 256 indicates the number of bytes processed in each iteration.
int repeat = 4;
AscendC::UnaryRepeatParams repeatParams = { 1, 1, 8, 8 };
// repeat = 4, 64 elements one repeat, 256 elements total
// dstBlkStride, srcBlkStride = 1, no gap between blocks in one repeat
// dstRepStride, srcRepStride = 8, no gap between repeats
AscendC::CompareScalar(dstLocal, src0Local, src1Scalar, AscendC::CMPMODE::LT, mask, repeat, repeatParams);

High-dimensional tensor sharding computation (bitwise mask mode)

        
             uint64_t mask[2] = { UINT64_MAX, 0};
int repeat = 4;
AscendC::UnaryRepeatParams repeatParams = { 1, 1, 8, 8 };
// repeat = 4, 64 elements one repeat, 256 elements total
// srcBlkStride, = 1, no gap between blocks in one repeat
// dstRepStride, srcRepStride = 8, no gap between repeats
AscendC::CompareScalar(dstLocal, src0Local, src1Scalar, AscendC::CMPMODE::LT, mask, repeat, repeatParams);

Result example:

Input (src0_gm):
[ 16.604824    45.069473    65.108345   -59.68792     21.043684
  75.90726    -27.046307   -40.10546     -5.933778    83.56574
  58.87062    -12.77814     28.17882     62.549377   -22.310246
 -67.69001     81.06072     69.988945    69.10082     -6.667376
  96.20256     18.532446   -66.56364    -32.531246    49.980835
  35.668995   -16.847628     1.3236234   10.0143795   43.878166
  26.628105    31.774637    47.9279      79.7291     -54.09651
  95.49459    -18.404795   -86.84594      9.406091   -79.54437
   0.49116692 -48.151714   -12.97062    -99.89055     23.475513
 -27.366564   -69.229675    83.613304    52.14729     40.98426
 -23.422009   -53.386215     1.6576616  -62.36946     54.693733
  66.2058      -4.0042257  -25.351263     1.0000885   -6.458584
  25.447659    71.647316    82.31162     -7.7359715   28.107353
 -79.22045     20.292479    67.7434     -76.054085    -7.754251
  38.632687    -4.8460293  -69.791954   -57.574455   -99.96178
 -73.29611    -68.57477     98.200035   -55.30482    -55.590027
  79.53274     -1.862139   -37.60953    -12.225406   -35.2875
 -24.047668   -66.07609     21.9362      80.603516    28.928387
  26.579298    97.6649      78.94723    -89.86824     73.29788
  18.957182   -73.87053    -23.508097   -51.02931     39.158726
 -96.61422    -41.192455    54.973663    47.58695     -3.9818003
 -81.05088    -67.62415    -17.491713   -34.916042   -95.993744
  -3.4719822  -55.956417     6.223455    12.240832    15.055512
  94.70584    -13.33949    -50.46866     54.612816   -28.521824
 -87.63997     59.53054     41.000504   -31.266075   -31.419422
 -32.940186    53.449913    50.012768   -13.663364    40.931725
 -68.80396    -86.63726     76.866585   -83.76385      3.7227867
  58.443035   -74.333046   -92.52674     24.249512    -7.935491
  24.197245   -34.85033     67.854645    72.65312     13.622443
 -70.94266     15.401667    -9.332295   -86.61463     72.659676
 -83.63352      9.279887    81.037964    46.285606   -12.967846
 -48.72901     69.07614    -40.355286   -94.257034   -45.514374
  24.966864    -9.657219    61.803864   -83.09603     77.769035
 -97.44226    -89.71987    -53.969315    43.892918    73.88798
  67.23104     36.65282    -93.70069    -87.48934    -27.679005
 -36.825226   -30.117033   -41.579655   -97.325325    77.1972
 -49.883194    33.061394   -63.844925    89.74327     64.549416
  80.16943     73.26347    -87.307175   -96.62777     81.8532
   7.5365276   28.357092    59.896378   -15.95738    -77.42723
   0.03529428 -20.263502    45.59324    -90.160835    89.478004
  57.608685    60.71819     45.8125      39.94484    -48.77375
 -56.897358     5.2580256   -6.937905   -49.80309    -42.527523
  72.91772     89.53271    -62.181187    18.490683   -69.40782
   6.141204    13.938042    75.312515    21.766457    -8.157599
  55.53147    -30.789118   -12.087165    82.435684    23.4884
  82.73172     -2.026827    -8.124383   -10.707488   -74.32759
 -54.702602    14.209252    93.73145     98.93554     52.803623
  32.200726    41.823833    90.193756   -34.512424   -85.64022
  97.47763     33.353424    94.84875     23.03139     99.97347
 -72.47978     19.51753    -88.28579    -88.70721    -18.659292
 -79.5277      62.90431     21.837631    45.989056    -9.62086
  11.4855795 ]
Input (src1_gm):
[-95.16087   -71.4676     51.817818  -12.358237   96.60704   -12.0067835
 -44.128048    7.5811195  84.61196   -60.303513   21.470125   98.96244
  18.262054   80.014244   48.37233   -75.03457  ]
Output (dst_gm):
[ 0  0  0  0  0  8  0  0  0  4  0  0 16 32  0  0  0  0  0  0 32  0  4 16
  0  0  0  0  0  0  0  0]

Template Sample

      
       
         
         
           #include "kernel_operator.h"
template <typename T> class KernelCmp {
public:
    __aicore__ inline KernelCmp() {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm,
        uint32_t dataSize, AscendC::CMPMODE mode)
    {
        srcDataSize = dataSize;
        dstDataSize = srcDataSize / 8;
        cmpMode = mode;
        src0Global.SetGlobalBuffer((__gm__ T*)src0Gm);
        src1Global.SetGlobalBuffer((__gm__ T*)src1Gm);
        dstGlobal.SetGlobalBuffer((__gm__ uint8_t*)dstGm);
        pipe.InitBuffer(inQueueSrc0, 1, srcDataSize * sizeof(T));
        pipe.InitBuffer(inQueueSrc1, 1, 16 * sizeof(T));
        pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(uint8_t));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<T> src0Local = inQueueSrc0.AllocTensor<T>();
        AscendC::LocalTensor<T> src1Local = inQueueSrc1.AllocTensor<T>();
        AscendC::DataCopy(src0Local, src0Global, srcDataSize);
        AscendC::DataCopy(src1Local, src1Global, 16);
        inQueueSrc0.EnQue(src0Local);
        inQueueSrc1.EnQue(src1Local);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<T> src0Local = inQueueSrc0.DeQue<T>();
        AscendC::LocalTensor<T> src1Local = inQueueSrc1.DeQue<T>();
        AscendC::LocalTensor<uint8_t> dstLocal = outQueueDst.AllocTensor<uint8_t>();
        AscendC::PipeBarrier<PIPE_ALL>();
        T src1Scalar = src1Local.GetValue(0);
        AscendC::PipeBarrier<PIPE_ALL>();
        AscendC::CompareScalar(dstLocal, src0Local, static_cast<T>(src1Scalar), cmpMode, srcDataSize);
        outQueueDst.EnQue<uint8_t>(dstLocal);
        inQueueSrc0.FreeTensor(src0Local);
        inQueueSrc1.FreeTensor(src1Local);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<uint8_t> dstLocal = outQueueDst.DeQue<uint8_t>();
        AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc0, inQueueSrc1;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<T> src0Global, src1Global;
    AscendC::GlobalTensor<uint8_t> dstGlobal;
    uint32_t srcDataSize = 0;
    uint32_t dstDataSize = 0;
    AscendC::CMPMODE cmpMode;
};
template <typename T>
__aicore__ void main_cpu_cmp_sel_demo(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm, uint32_t dataSize, AscendC::CMPMODE mode)
{
    KernelCmp<T> op;
    op.Init(src0Gm, src1Gm, dstGm, dataSize, mode);
    op.Process();
}
extern "C" __global__ __aicore__ void kernel_vec_compare_scalar_256_LT_float(GM_ADDR src0_gm, GM_ADDR src1_gm, GM_ADDR dst_gm)
{
    main_cpu_cmp_sel_demo<float>(src0_gm, src1_gm, dst_gm, 256, AscendC::CMPMODE::LT);
}

          

        

      
     

Parent topic: Comparison and Selection