CompareScalar
Function Usage
Compares the sizes of an element in a tensor with that of a scalar element by element. If the comparison result is true, the corresponding bit of the output result is 1. Otherwise, the bit is 0.
The following comparison modes are supported:
- LT: lower than
- GT: greater than
- GE: greater than or equal to
- EQ: equal
- NE: not equal to
- LE: lower than or equal to
Prototype
- Computation of the first n data elements of a tensor
1 2
template <typename T, typename U> __aicore__ inline void CompareScalar(const LocalTensor<U>& dstLocal, const LocalTensor<T>& src0Local, const T src1Scalar, CMPMODE cmpMode, uint32_t calCount)
- High-dimensional tensor sharding computation
- Bitwise mask mode
1 2
template <typename T, typename U, bool isSetMask = true> __aicore__ inline void CompareScalar(const LocalTensor<U>& dstLocal, const LocalTensor<T>& src0Local, const T src1Scalar, CMPMODE cmpMode, const uint64_t mask[], uint8_t repeatTimes, const UnaryRepeatParams& repeatParams)
- Contiguous mask mode
1 2
template <typename T, typename U, bool isSetMask = true> __aicore__ inline void CompareScalar(const LocalTensor<U>& dstLocal, const LocalTensor<T>& src0Local, const T src1Scalar, CMPMODE cmpMode, const uint64_t mask, uint8_t repeatTimes, const UnaryRepeatParams& repeatParams)
- Bitwise mask mode
Parameters
|
Parameter |
Description |
|---|---|
|
T |
Data type of the source operand. The value can be half or float. |
|
U |
Data type of the destination operand. uint8_t is supported. |
|
isSetMask |
Indicates whether to set mask inside the API.
|
|
Parameter |
Input/Output |
Meaning |
|---|---|---|
|
dstLocal |
Output |
Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. dstLocal is used to store the comparison result. The bits of uint8_t data in dstLocal from left to right indicate the comparison result of src0 and src1. If the comparison result is true, the corresponding bit value is 1. Otherwise, the value is 0. |
|
src0Local |
Input |
Source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. |
|
src1Scalar |
Input |
Source operand (scalar). The data type is the same as that of srcLocal. |
|
cmpMode |
Input |
Comparison mode, including EQ, NE, GE, LE, GT, and LT.
|
|
mask |
Input |
mask is used to control the elements that participate in computation in each iteration. |
|
repeatTimes |
Input |
Number of iteration repeats. The Vector Unit reads 256 bytes of contiguous data for computation each time. To read the complete data for processing, the unit needs to read the input data in multiple repeats. repeatTimes indicates the number of iterations. |
|
repeatParams |
Input |
Parameters that control the operand address strides. They are of the UnaryRepeatParams type, and contain such parameters as those that specify the address stride of the operand for the same data block between adjacent iterations and address stride of the operand between different data blocks in a single iteration. For details about the address stride of the operand between adjacent iterations, see repeatStride. For details about the address stride of the operand between different data blocks in a single iteration, see dataBlockStride. |
|
calCount |
Input |
Number of elements of the input data. When setting calCount, ensure that the space occupied by calCount elements is 256-byte aligned. |
Returns
None
Availability
Precautions
- For details about the alignment requirements of the operand address offset, see General Restrictions.
- dstLocal stores binary results in little-endian order, corresponding to the comparison result of data in the corresponding position in src.
- If the API for computing the first n pieces of tensor data is used, ensure that the space occupied by calCount elements is 256-byte aligned.
Example
In this example, the source operand src0Local stores 256 pieces of data of the float type. The example compares the elements in src0Local and src1Local.GetValue(0). If the element in src0Local is smaller than that in src1Local.GetValue(0), the corresponding bit in the dstLocal result is set to 1. Otherwise, the bit is set to 0. The dst result is stored in uint8_t format.
- API example of computing the first n data elements of a tensor
1AscendC::CompareScalar(dstLocal, src0Local, src1Scalar, AscendC::CMPMODE::LT, srcDataSize);
- High-dimensional tensor sharding computation (contiguous mask mode)
1 2 3 4 5 6 7
uint64_t mask = 256 / sizeof(float); // 256 indicates the number of bytes processed in each iteration. int repeat = 4; AscendC::UnaryRepeatParams repeatParams = { 1, 1, 8, 8 }; // repeat = 4, 64 elements one repeat, 256 elements total // dstBlkStride, srcBlkStride = 1, no gap between blocks in one repeat // dstRepStride, srcRepStride = 8, no gap between repeats AscendC::CompareScalar(dstLocal, src0Local, src1Scalar, AscendC::CMPMODE::LT, mask, repeat, repeatParams);
- High-dimensional tensor sharding computation (bitwise mask mode)
1 2 3 4 5 6 7
uint64_t mask[2] = { UINT64_MAX, 0}; int repeat = 4; AscendC::UnaryRepeatParams repeatParams = { 1, 1, 8, 8 }; // repeat = 4, 64 elements one repeat, 256 elements total // srcBlkStride, = 1, no gap between blocks in one repeat // dstRepStride, srcRepStride = 8, no gap between repeats AscendC::CompareScalar(dstLocal, src0Local, src1Scalar, AscendC::CMPMODE::LT, mask, repeat, repeatParams);
Input (src0_gm): [ 16.604824 45.069473 65.108345 -59.68792 21.043684 75.90726 -27.046307 -40.10546 -5.933778 83.56574 58.87062 -12.77814 28.17882 62.549377 -22.310246 -67.69001 81.06072 69.988945 69.10082 -6.667376 96.20256 18.532446 -66.56364 -32.531246 49.980835 35.668995 -16.847628 1.3236234 10.0143795 43.878166 26.628105 31.774637 47.9279 79.7291 -54.09651 95.49459 -18.404795 -86.84594 9.406091 -79.54437 0.49116692 -48.151714 -12.97062 -99.89055 23.475513 -27.366564 -69.229675 83.613304 52.14729 40.98426 -23.422009 -53.386215 1.6576616 -62.36946 54.693733 66.2058 -4.0042257 -25.351263 1.0000885 -6.458584 25.447659 71.647316 82.31162 -7.7359715 28.107353 -79.22045 20.292479 67.7434 -76.054085 -7.754251 38.632687 -4.8460293 -69.791954 -57.574455 -99.96178 -73.29611 -68.57477 98.200035 -55.30482 -55.590027 79.53274 -1.862139 -37.60953 -12.225406 -35.2875 -24.047668 -66.07609 21.9362 80.603516 28.928387 26.579298 97.6649 78.94723 -89.86824 73.29788 18.957182 -73.87053 -23.508097 -51.02931 39.158726 -96.61422 -41.192455 54.973663 47.58695 -3.9818003 -81.05088 -67.62415 -17.491713 -34.916042 -95.993744 -3.4719822 -55.956417 6.223455 12.240832 15.055512 94.70584 -13.33949 -50.46866 54.612816 -28.521824 -87.63997 59.53054 41.000504 -31.266075 -31.419422 -32.940186 53.449913 50.012768 -13.663364 40.931725 -68.80396 -86.63726 76.866585 -83.76385 3.7227867 58.443035 -74.333046 -92.52674 24.249512 -7.935491 24.197245 -34.85033 67.854645 72.65312 13.622443 -70.94266 15.401667 -9.332295 -86.61463 72.659676 -83.63352 9.279887 81.037964 46.285606 -12.967846 -48.72901 69.07614 -40.355286 -94.257034 -45.514374 24.966864 -9.657219 61.803864 -83.09603 77.769035 -97.44226 -89.71987 -53.969315 43.892918 73.88798 67.23104 36.65282 -93.70069 -87.48934 -27.679005 -36.825226 -30.117033 -41.579655 -97.325325 77.1972 -49.883194 33.061394 -63.844925 89.74327 64.549416 80.16943 73.26347 -87.307175 -96.62777 81.8532 7.5365276 28.357092 59.896378 -15.95738 -77.42723 0.03529428 -20.263502 45.59324 -90.160835 89.478004 57.608685 60.71819 45.8125 39.94484 -48.77375 -56.897358 5.2580256 -6.937905 -49.80309 -42.527523 72.91772 89.53271 -62.181187 18.490683 -69.40782 6.141204 13.938042 75.312515 21.766457 -8.157599 55.53147 -30.789118 -12.087165 82.435684 23.4884 82.73172 -2.026827 -8.124383 -10.707488 -74.32759 -54.702602 14.209252 93.73145 98.93554 52.803623 32.200726 41.823833 90.193756 -34.512424 -85.64022 97.47763 33.353424 94.84875 23.03139 99.97347 -72.47978 19.51753 -88.28579 -88.70721 -18.659292 -79.5277 62.90431 21.837631 45.989056 -9.62086 11.4855795 ] Input (src1_gm): [-95.16087 -71.4676 51.817818 -12.358237 96.60704 -12.0067835 -44.128048 7.5811195 84.61196 -60.303513 21.470125 98.96244 18.262054 80.014244 48.37233 -75.03457 ] Output (dst_gm): [ 0 0 0 0 0 8 0 0 0 4 0 0 16 32 0 0 0 0 0 0 32 0 4 16 0 0 0 0 0 0 0 0]
Template Sample
#include "kernel_operator.h"
template <typename T> class KernelCmp {
public:
__aicore__ inline KernelCmp() {}
__aicore__ inline void Init(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm,
uint32_t dataSize, AscendC::CMPMODE mode)
{
srcDataSize = dataSize;
dstDataSize = srcDataSize / AscendC::AscendCUtils::GetBitSize(sizeof(uint8_t));
cmpMode = mode;
src0Global.SetGlobalBuffer((__gm__ T*)src0Gm);
src1Global.SetGlobalBuffer((__gm__ T*)src1Gm);
dstGlobal.SetGlobalBuffer((__gm__ uint8_t*)dstGm);
pipe.InitBuffer(inQueueSrc0, 1, srcDataSize * sizeof(T));
pipe.InitBuffer(inQueueSrc1, 1, 16 * sizeof(T));
pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(uint8_t));
}
__aicore__ inline void Process()
{
CopyIn();
Compute();
CopyOut();
}
private:
__aicore__ inline void CopyIn()
{
AscendC::LocalTensor<T> src0Local = inQueueSrc0.AllocTensor<T>();
AscendC::LocalTensor<T> src1Local = inQueueSrc1.AllocTensor<T>();
AscendC::DataCopy(src0Local, src0Global, srcDataSize);
AscendC::DataCopy(src1Local, src1Global, 16);
inQueueSrc0.EnQue(src0Local);
inQueueSrc1.EnQue(src1Local);
}
__aicore__ inline void Compute()
{
AscendC::LocalTensor<T> src0Local = inQueueSrc0.DeQue<T>();
AscendC::LocalTensor<T> src1Local = inQueueSrc1.DeQue<T>();
AscendC::LocalTensor<uint8_t> dstLocal = outQueueDst.AllocTensor<uint8_t>();
AscendC::PipeBarrier<PIPE_ALL>();
T src1Scalar = src1Local.GetValue(0);
AscendC::PipeBarrier<PIPE_ALL>();
AscendC::CompareScalar(dstLocal, src0Local, static_cast<T>(src1Scalar), cmpMode, srcDataSize);
outQueueDst.EnQue<uint8_t>(dstLocal);
inQueueSrc0.FreeTensor(src0Local);
inQueueSrc1.FreeTensor(src1Local);
}
__aicore__ inline void CopyOut()
{
AscendC::LocalTensor<uint8_t> dstLocal = outQueueDst.DeQue<uint8_t>();
AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
outQueueDst.FreeTensor(dstLocal);
}
private:
AscendC::TPipe pipe;
AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc0, inQueueSrc1;
AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
AscendC::GlobalTensor<T> src0Global, src1Global;
AscendC::GlobalTensor<uint8_t> dstGlobal;
uint32_t srcDataSize = 0;
uint32_t dstDataSize = 0;
AscendC::CMPMODE cmpMode;
};
template <typename T>
__aicore__ void main_cpu_cmp_sel_demo(__gm__ uint8_t* src0Gm, __gm__ uint8_t* src1Gm, __gm__ uint8_t* dstGm, uint32_t dataSize, AscendC::CMPMODE mode)
{
KernelCmp<T> op;
op.Init(src0Gm, src1Gm, dstGm, dataSize, mode);
op.Process();
}
extern "C" __global__ __aicore__ void kernel_vec_compare_scalar_256_LT_float(GM_ADDR src0_gm, GM_ADDR src1_gm, GM_ADDR dst_gm)
{
main_cpu_cmp_sel_demo<float>(src0_gm, src1_gm, dst_gm, 256, AscendC::CMPMODE::LT);
}