CompareScalar

Applicability

Product

Supported/Unsupported

Atlas A3 training products / Atlas A3 inference products

Atlas A2 training products / Atlas A2 inference products

Atlas 200I/500 A2 inference products

Atlas inference product 's AI Core

Atlas inference product 's Vector Core

x

Atlas training products

x

Functions

Compares the sizes of elements in a tensor with those in a scalar. If the comparison result is true, the corresponding bit of the output result is 1. Otherwise, the bit is 0.

The following comparison modes are supported:

  • LT: less than
  • GT: greater than
  • GE: greater than or equal to (greater than or equal to)
  • EQ: equal to
  • NE: not equal to
  • LE: less than or equal to (less than or equal to)

Prototype

  • Computation of the first n data elements of a tensor
    1
    2
    template <typename T, typename U>
    __aicore__ inline void CompareScalar(const LocalTensor<U>& dst, const LocalTensor<T>& src0, const T src1Scalar, CMPMODE cmpMode, uint32_t count)
    
  • High-dimensional tensor sharding computation
    • Bitwise mask mode
      1
      2
      template <typename T, typename U, bool isSetMask = true>
      __aicore__ inline void CompareScalar(const LocalTensor<U>& dst, const LocalTensor<T>& src0, const T src1Scalar, CMPMODE cmpMode, const uint64_t mask[], uint8_t repeatTime, const UnaryRepeatParams& repeatParams)
      
    • Contiguous mask mode
      1
      2
      template <typename T, typename U, bool isSetMask = true>
      __aicore__ inline void CompareScalar(const LocalTensor<U>& dst, const LocalTensor<T>& src0, const T src1Scalar, CMPMODE cmpMode, const uint64_t mask, uint8_t repeatTime, const UnaryRepeatParams& repeatParams)
      

Parameters

Table 1 Parameters in the template

Parameter

Description

T

Data type of the source operand.

U

Data type of the destination operand.

isSetMask

Indicates whether to set mask inside the API.

  • true: sets mask inside the API.
  • false: sets mask outside the API. Developers need to use the SetVectorMask API to set the mask value. In this mode, the mask value in the input parameter of this API must be set to the placeholder MASK_PLACEHOLDER.
Table 2 API parameters

Parameter

Input/Output

Meaning

dst

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The start address of the LocalTensor must be 32-byte aligned.

dstLocal is used to store the comparison result. The bits of uint8_t data in dstLocal from left to right indicate the comparison result of src0 and src1. If the comparison result is true, the corresponding bit value is 1. Otherwise, the value is 0.

For the Atlas A3 training products / Atlas A3 inference products , the supported data type is uint8_t.

For the Atlas A2 training products / Atlas A2 inference products , the supported data type is uint8_t.

For the Atlas A3 training products / Atlas A3 inference products , the supported data type is uint8_t.

For the Atlas inference product 's AI Core, the supported data type is uint8_t.

src0

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The start address of the LocalTensor must be 32-byte aligned.

For the Atlas A3 training products / Atlas A3 inference products , the supported data types are half, float, and int32_t (only CMPMODE::EQ is supported).

For the Atlas A2 training products / Atlas A2 inference products , the supported data types are half, float, and int32_t (only CMPMODE::EQ is supported).

For the Atlas 200I/500 A2 inference products , the supported data types are int16_t, uint16_t, half, float, int32_t, and uint32_t.

For the Atlas inference product 's AI Core, the supported data types are half and float.

src1Scalar

Input

Source operand (scalar). The data type is the same as that of src0.

cmpMode

Input

Comparison mode, including EQ, NE, GE, LE, GT, and LT.

  • LT: src0 < src1
  • GT: src0 > src1
  • GE: src0 is greater than or equal to (greater than or equal to) src1.
  • EQ: src0 is equal to src1.
  • NE: src0 is not equal to src1.
  • LE: src0 is less than or equal to (less than or equal to) src1.

mask/mask[]

Input

The mask parameter is used to control the elements involved in computation in each iteration.

For the Atlas A3 training products / Atlas A3 inference products , this parameter is reserved. Setting this parameter is invalid.

For the Atlas A2 training products / Atlas A2 inference products , this parameter is reserved. Setting this parameter is invalid.

For the Atlas 200I/500 A2 inference products , setting this parameter is valid.

For the Atlas inference product 's AI Core, this parameter is reserved. Setting this parameter is invalid.

  • Contiguous mode: indicates the number of contiguous elements that participate in computation. The value range is related to the operand data type. The maximum number of elements that can be processed in each iteration varies according to the data type. When the operand is 16-bit, mask ∈ [1, 128]. When the operand is 32-bit, mask ∈ [1, 64].
  • Bitwise mode: Elements to be computed can be controlled by bits. If the value of a bit is 1, the corresponding element is computed. If the value of a bit is 0, the corresponding element is not computed. The parameter is a uint64_t array with a length of 2 or 4.

    For example, if mask = [0, 8] and 8 = 0b1000, only the fourth element participates in computation.

    The parameter value range is related to the operand data type. The maximum number of elements that can be processed in each iteration varies according to the data type. When the operand is 16-bit, mask[0]/mask[1] ∈ [0, 264 -1] and cannot be 0 at the same time. When the operand is 32-bit, mask[1] = 0 and mask[0] ∈ (0, 264 -1].

repeatTime

Input

Number of iteration repeats. The Vector Unit reads 256 bytes of contiguous data for computation each time. To read the complete data for processing, the unit needs to read the input data in multiple repeats. repeatTime indicates the number of repeats.

For details about this parameter, see High-dimensional Sharding APIs.

repeatParams

Input

Parameters that control the operand address strides. They are of the UnaryRepeatParams type, and contain such parameters as those that specify the address stride of the operand for the same data block between adjacent iterations and address stride of the operand between different data blocks in a single iteration.

For details about the address stride parameters between adjacent iterations, see repeatStride. For details about the address stride parameters of DataBlock in the same iteration, see dataBlockStride.

count

Input

Number of elements involved in the computation. When setting count, ensure that the space occupied by count elements is 256-byte aligned.

Returns

None

Restrictions

  • dstLocal stores binary results in little-endian order, corresponding to the comparison result of data in the corresponding position in src.
  • For the API that uses the first n pieces of data in the tensor for computation, when setting count, ensure that the space occupied by count elements is 256-byte aligned.

Examples

In this example, the source operand src0Local stores 256 pieces of data of the float type. The example compares the elements in src0Local and src1Local.GetValue(0). If the element in src0Local is smaller than that in src1Local.GetValue(0), the corresponding bit in the dstLocal result is set to 1. Otherwise, the bit is set to 0. The dst result is stored in uint8_t format.

  • API example of computing the first n data elements of a tensor
    1
    AscendC::CompareScalar(dstLocal, src0Local, src1Scalar, AscendC::CMPMODE::LT, srcDataSize);
    
  • High-dimensional tensor sharding computation (contiguous mask mode)
    1
    2
    3
    4
    5
    6
    7
    uint64_t mask = 256 / sizeof(float); // 256 indicates the number of bytes processed in each iteration.
    int repeat = 4;
    AscendC::UnaryRepeatParams repeatParams = { 1, 1, 8, 8 };
    // repeat = 4, 64 elements one repeat, 256 elements total
    // dstBlkStride, srcBlkStride = 1, no gap between blocks in one repeat
    // dstRepStride, srcRepStride = 8, no gap between repeats
    AscendC::CompareScalar(dstLocal, src0Local, src1Scalar, AscendC::CMPMODE::LT, mask, repeat, repeatParams);
    
  • High-dimensional tensor sharding computation (bitwise mask mode)
    1
    2
    3
    4
    5
    6
    7
    uint64_t mask[2] = { UINT64_MAX, 0};
    int repeat = 4;
    AscendC::UnaryRepeatParams repeatParams = { 1, 1, 8, 8 };
    // repeat = 4, 64 elements one repeat, 256 elements total
    // srcBlkStride, = 1, no gap between blocks in one repeat
    // dstRepStride, srcRepStride = 8, no gap between repeats
    AscendC::CompareScalar(dstLocal, src0Local, src1Scalar, AscendC::CMPMODE::LT, mask, repeat, repeatParams);
    
Result example:
Input (src0_gm):
[ 16.604824    45.069473    65.108345   -59.68792     21.043684
  75.90726    -27.046307   -40.10546     -5.933778    83.56574
  58.87062    -12.77814     28.17882     62.549377   -22.310246
 -67.69001     81.06072     69.988945    69.10082     -6.667376
  96.20256     18.532446   -66.56364    -32.531246    49.980835
  35.668995   -16.847628     1.3236234   10.0143795   43.878166
  26.628105    31.774637    47.9279      79.7291     -54.09651
  95.49459    -18.404795   -86.84594      9.406091   -79.54437
   0.49116692 -48.151714   -12.97062    -99.89055     23.475513
 -27.366564   -69.229675    83.613304    52.14729     40.98426
 -23.422009   -53.386215     1.6576616  -62.36946     54.693733
  66.2058      -4.0042257  -25.351263     1.0000885   -6.458584
  25.447659    71.647316    82.31162     -7.7359715   28.107353
 -79.22045     20.292479    67.7434     -76.054085    -7.754251
  38.632687    -4.8460293  -69.791954   -57.574455   -99.96178
 -73.29611    -68.57477     98.200035   -55.30482    -55.590027
  79.53274     -1.862139   -37.60953    -12.225406   -35.2875
 -24.047668   -66.07609     21.9362      80.603516    28.928387
  26.579298    97.6649      78.94723    -89.86824     73.29788
  18.957182   -73.87053    -23.508097   -51.02931     39.158726
 -96.61422    -41.192455    54.973663    47.58695     -3.9818003
 -81.05088    -67.62415    -17.491713   -34.916042   -95.993744
  -3.4719822  -55.956417     6.223455    12.240832    15.055512
  94.70584    -13.33949    -50.46866     54.612816   -28.521824
 -87.63997     59.53054     41.000504   -31.266075   -31.419422
 -32.940186    53.449913    50.012768   -13.663364    40.931725
 -68.80396    -86.63726     76.866585   -83.76385      3.7227867
  58.443035   -74.333046   -92.52674     24.249512    -7.935491
  24.197245   -34.85033     67.854645    72.65312     13.622443
 -70.94266     15.401667    -9.332295   -86.61463     72.659676
 -83.63352      9.279887    81.037964    46.285606   -12.967846
 -48.72901     69.07614    -40.355286   -94.257034   -45.514374
  24.966864    -9.657219    61.803864   -83.09603     77.769035
 -97.44226    -89.71987    -53.969315    43.892918    73.88798
  67.23104     36.65282    -93.70069    -87.48934    -27.679005
 -36.825226   -30.117033   -41.579655   -97.325325    77.1972
 -49.883194    33.061394   -63.844925    89.74327     64.549416
  80.16943     73.26347    -87.307175   -96.62777     81.8532
   7.5365276   28.357092    59.896378   -15.95738    -77.42723
   0.03529428 -20.263502    45.59324    -90.160835    89.478004
  57.608685    60.71819     45.8125      39.94484    -48.77375
 -56.897358     5.2580256   -6.937905   -49.80309    -42.527523
  72.91772     89.53271    -62.181187    18.490683   -69.40782
   6.141204    13.938042    75.312515    21.766457    -8.157599
  55.53147    -30.789118   -12.087165    82.435684    23.4884
  82.73172     -2.026827    -8.124383   -10.707488   -74.32759
 -54.702602    14.209252    93.73145     98.93554     52.803623
  32.200726    41.823833    90.193756   -34.512424   -85.64022
  97.47763     33.353424    94.84875     23.03139     99.97347
 -72.47978     19.51753    -88.28579    -88.70721    -18.659292
 -79.5277      62.90431     21.837631    45.989056    -9.62086
  11.4855795 ]
Input (src1_gm):
[-95.16087   -71.4676     51.817818  -12.358237   96.60704   -12.0067835
 -44.128048    7.5811195  84.61196   -60.303513   21.470125   98.96244
  18.262054   80.014244   48.37233   -75.03457  ]
Output (dst_gm):
[ 0  0  0  0  0  8  0  0  0  4  0  0 16 32  0  0  0  0  0  0 32  0  4 16
  0  0  0  0  0  0  0  0]

Template Sample

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
#include "kernel_operator.h"
template <typename T> class KernelCmp {
public:
    __aicore__ inline KernelCmp() {}
    __aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm,
        uint32_t dataSize, AscendC::CMPMODE mode)
    {
        srcDataSize = dataSize;
        dstDataSize = srcDataSize / 8;
        cmpMode = mode;
        src0Global.SetGlobalBuffer((__gm__ T*)src0Gm);
        src1Global.SetGlobalBuffer((__gm__ T*)src1Gm);
        dstGlobal.SetGlobalBuffer((__gm__ uint8_t*)dstGm);
        pipe.InitBuffer(inQueueSrc0, 1, srcDataSize * sizeof(T));
        pipe.InitBuffer(inQueueSrc1, 1, 16 * sizeof(T));
        pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(uint8_t));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<T> src0Local = inQueueSrc0.AllocTensor<T>();
        AscendC::LocalTensor<T> src1Local = inQueueSrc1.AllocTensor<T>();
        AscendC::DataCopy(src0Local, src0Global, srcDataSize);
        AscendC::DataCopy(src1Local, src1Global, 16);
        inQueueSrc0.EnQue(src0Local);
        inQueueSrc1.EnQue(src1Local);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<T> src0Local = inQueueSrc0.DeQue<T>();
        AscendC::LocalTensor<T> src1Local = inQueueSrc1.DeQue<T>();
        AscendC::LocalTensor<uint8_t> dstLocal = outQueueDst.AllocTensor<uint8_t>();
        AscendC::PipeBarrier<PIPE_ALL>();
        T src1Scalar = src1Local.GetValue(0);
        AscendC::PipeBarrier<PIPE_ALL>();
        AscendC::CompareScalar(dstLocal, src0Local, static_cast<T>(src1Scalar), cmpMode, srcDataSize);
        outQueueDst.EnQue<uint8_t>(dstLocal);
        inQueueSrc0.FreeTensor(src0Local);
        inQueueSrc1.FreeTensor(src1Local);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<uint8_t> dstLocal = outQueueDst.DeQue<uint8_t>();
        AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc0, inQueueSrc1;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<T> src0Global, src1Global;
    AscendC::GlobalTensor<uint8_t> dstGlobal;
    uint32_t srcDataSize = 0;
    uint32_t dstDataSize = 0;
    AscendC::CMPMODE cmpMode;
};
template <typename T>
__aicore__ void main_cpu_cmp_sel_demo(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm, uint32_t dataSize, AscendC::CMPMODE mode)
{
    KernelCmp<T> op;
    op.Init(src0Gm, src1Gm, dstGm, dataSize, mode);
    op.Process();
}
extern "C" __global__ __aicore__ void kernel_vec_compare_scalar_256_LT_float(GM_ADDR src0_gm, GM_ADDR src1_gm, GM_ADDR dst_gm)
{
    main_cpu_cmp_sel_demo<float>(src0_gm, src1_gm, dst_gm, 256, AscendC::CMPMODE::LT);
}