ReduceMin

Applicability

Product	Supported
Atlas A3 training products / Atlas A3 inference products	√
Atlas A2 training products / Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	√
Atlas inference product 's AI Core	√
Atlas inference product 's Vector Core	x
Atlas training products	√

Function

Obtains the minimum value and its corresponding index position among the input data. For details about reduction instructions, see How to Use Reduction Compute APIs. For details about the ReduceMin computation principle, see ReduceMax.

Prototype

Computation of the first n data elements of a tensor

        
             template <typename T>
__aicore__ inline void ReduceMin(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LocalTensor<T>& sharedTmpBuffer, const int32_t count, bool calIndex = 0)

High-dimensional tensor sharding computation

Bitwise mask mode

          
               template <typename T>
__aicore__ inline void ReduceMin(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LocalTensor<T>& sharedTmpBuffer, const uint64_t mask[], const int32_t repeatTime, const int32_t srcRepStride, bool calIndex = 0)

Contiguous mask mode

          
               template <typename T>
__aicore__ inline void ReduceMin(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LocalTensor<T>& sharedTmpBuffer, const int32_t mask, const int32_t repeatTime, const int32_t srcRepStride, bool calIndex = 0)

Parameters

**Table 1** Template parameters
Parameter	Description
T	Operand data type. For the Atlas A3 training products / Atlas A3 inference products , the supported data types are half and float. For the Atlas A2 training products / Atlas A2 inference products , the supported data types are half and float. For Atlas 200I/500 A2 inference products , the supported data types are half and float. For the Atlas inference product 's AI Core, the supported data types are half and float. For the Atlas training products , the supported data type is half.

**Table 2** Parameters
Parameter	Input/Output	Meaning
dst	Output	Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 4-byte aligned (for data of the half type) or 8-byte aligned (for data of the float type).
src	Input	Source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. The data type of the source operand must be the same as that of the destination operand.
sharedTmpBuffer	Input	Space required by some hardware models to store intermediate results during API execution. The space must meet the minimum space requirement. For details about the computation method, see the ReduceMax computation diagram. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. Its data type must match that of the destination operand. For Atlas A3 training products / Atlas A3 inference products , sharedTmpBuffer is required. For Atlas A2 training products / Atlas A2 inference products , sharedTmpBuffer is required. For Atlas 200I/500 A2 inference products , sharedTmpBuffer is required. For the Atlas inference product 's AI Core, sharedTmpBuffer is required. For Atlas training products , sharedTmpBuffer is required.
count	Input	Number of elements involved in the computation. The valid value range of the parameter depends on the data type of the operand. Different data types support different maximum numbers of elements that can be processed. The maximum amount of data processed must not exceed the UB size limit.
mask/mask[]	Input	mask is used to control the elements that participate in computation in each iteration. Bitwise mode: controls which elements are involved in computation bit by bit. A bit value of 1 means the corresponding element participates in computation, while 0 means it does not. The mask value is an array. The array length and the value range of the array elements are related to the operand data type. When the operand is 16-bit, the array length is 2, mask[0] and mask[1] ∈ [0, 2⁶⁴ -1] and cannot be 0 at the same time. When the operand is 32-bit, the array length is 1 and mask[0] ∈ (0, 2⁶⁴ – 1]. When the operand is 64-bit, the array length is 1 and mask[0] ∈ (0, 2³² – 1]. For example, if mask = [0, 8] and 8 = 0b1000, only the fourth element participates in computation. Contiguous mode: indicates the number of contiguous elements that participate in computation. The value range is related to the operand data type. The maximum number of elements that can be processed in each repeat varies according to the data type. When the operand is 16-bit, mask ∈ [1, 128]. When the operand is 32-bit, mask ∈ [1, 64]. When the operand is 64-bit, mask ∈ [1, 32].
repeatTime	Input	Number of iteration repeats. Unlike High-dimensional Sharding APIs, a larger value range is supported. Ensure that the value does not exceed the maximum value of int32_t.
srcRepStride	Input	Address stride between adjacent iterations of the source operand, that is, the number of data blocks skipped of the source operand in each iteration. For details, see repeatStride.
calIndex	Input	Whether to obtain the index of the minimum value. The value is of type bool and defaults to false. true: Obtain both the minimum value and its index. false: Obtain only the minimum value.

Returns

None

Restrictions

For details about the operand address alignment requirements, see General Address Alignment Restrictions.
For details about the constraints on operand address overlapping, see General Address Overlapping Restrictions. If sharedTmpBuffer is required, address overlapping between dst and sharedTmpBuffer is supported (usually dst requires less space than sharedTmpBuffer). However, sharedTmpBuffer must meet the minimum space requirement; otherwise, address overlapping is not supported.

Data is stored in dst in the order of minimum value and its index. If the index is not required, only the minimum value is stored. In the returned result, the indexes are stored based on the data type of dst. For example, if dst uses the half type, the indexes are stored based on the half type. However, the index values would be incorrect if they are read based on the half type and must be converted to the integer type using the reinterpret_cast method. If the input type is half, reinterpret_cast<uint16_t*> is required. If the input type is float, reinterpret_cast<uint32_t*> is required. For example, in the complete example of calling the high-dimensional tensor sharding computation API, the computation result is [0.01034, 2.104e-05]. The reinterpret_cast method is required to convert 2.104e-05 to the index value 353. The following is a conversion example:

        
             float minIndex = dst.GetValue(1);
uint32_t realIndex = *reinterpret_cast<uint32_t*>(&minIndex);

If multiple minimum values exist, the index of the first minimum value is returned.
When the input type is half, the obtained index value cannot be greater than 65535 (the maximum that can be represented by uint16_t).

Example

Example of high-dimensional tensor sharding computation (contiguous mask mode)

        
             // dstLocal, srcLocal, and sharedTmpBuffer are of the half type. For srcLocal, the computation data is of size 8320 and is continuously arranged. It requires indexes and uses the high-dimensional tensor sharding computation API. repeatTime is set to 65. mask is set to involving all elements in the computation.
int32_t mask = 128;
AscendC::ReduceMin<half>(dstLocal, srcLocal, sharedTmpBuffer, mask, 65, 8, true);

Example of high-dimensional tensor sharding computation (bitwise mask mode)

        
             // dstLocal, srcLocal, and sharedTmpBuffer are of the half type. For srcLocal, the computation data is of size 8320 and is continuously arranged. It requires indexes and uses the high-dimensional tensor sharding computation API. repeatTime is set to 65. mask is set to involving all elements in the computation.
uint64_t mask[2] = { 0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF };
AscendC::ReduceMin<half>(dstLocal, srcLocal, sharedTmpBuffer, mask, 65, 8, true);

Example of computing the first n data elements of a tensor

        
             // dstLocal, srcLocal, and sharedTmpBuffer are of the half type. For srcLocal, the computation data is of size 8320 and is continuously arranged. It requires indexes and uses the API for computing the first n data elements of a tensor.
AscendC::ReduceMin<half>(dstLocal, srcLocal, sharedTmpBuffer, 8320, true);

The following is a complete example of calling the high-dimensional tensor sharding computation API:

        
         
           
           
             #include "kernel_operator.h"
class KernelReduce {
public:
    __aicore__ inline KernelReduce() {}
    __aicore__ inline void Init(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
    {
        srcGlobal.SetGlobalBuffer((__gm__ half*)src);
        dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
        repeat = srcDataSize / mask;
        pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half));
        pipe.InitBuffer(workQueue, 1, 32 * sizeof(half)); // Based on the formula, the required minimum work space is 32, that is, 64 bytes.
        pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
        AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
        inQueueSrc.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
        AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
        AscendC::LocalTensor<half> sharedTmpBuffer = workQueue.AllocTensor<half>();
 
        AscendC::ReduceMin<half>(dstLocal, srcLocal, sharedTmpBuffer, mask, repeat, repStride, true);
        outQueueDst.EnQue<half>(dstLocal);
        inQueueSrc.FreeTensor(srcLocal);
        workQueue.FreeTensor(sharedTmpBuffer);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
        AscendC::DataCopy(dstGlobal, dstLocal, srcDataSize);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> workQueue;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
    int srcDataSize = 512;
    int dstDataSize = 512;
    int mask = 128;
    int repStride = 8;
    int repeat = 0;
};
extern "C" __global__ __aicore__ void kernel_ReduceMin_lv0_half_512(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
{
    KernelReduce op;
    op.Init(src, dstGm);
    op.Process();
}

            

          

        
       

The following is an example:

Input (src_gm):
[0.769    0.8584   0.1082   0.2715   0.1759   0.7646   0.6406   0.2944   0.4255   0.927    0.8022   0.04507  0.9688   0.919    0.3008   0.7144   0.3206   0.6753   0.8276
 0.3374   0.4636   0.3591   0.112    0.93     0.822    0.7314   0.01165  0.31     0.5586   0.2808   0.3997   0.04544  0.0931   0.8438   0.612    0.03052  0.3652   0.1153
 0.06213  0.12103  0.4421   0.8003   0.1583   0.845    0.125    0.6934   0.4592   0.871    0.573    0.4133   0.885    0.6875   0.2854   0.7007   0.1294   0.2092   0.3794
 0.7534   0.5923   0.03888  0.2412   0.8584   0.6704   0.429    0.77     0.427    0.6323   0.524    0.0519   0.514    0.2408   0.09357  0.1702   0.3694   0.665    0.2651
 0.9507   0.661    0.459    0.1317   0.7334   0.289    0.0325   0.1187   0.6626   0.2769   0.3083   0.923    0.826    0.7275   0.976    0.4854   0.724    0.7783   0.8022
 0.677    0.2401   0.377    0.839    0.2297   0.54     0.743    0.511    0.1346   0.7183   0.4775   0.3442   0.561    0.2935   0.04065  0.1001   0.753    0.6816   0.8955
 0.07324  0.5947   0.508    0.2229   0.468    0.3135   0.0898   0.5625   0.7407   0.803    0.1071   0.6724   0.797    0.8296   0.807    0.8604   0.7437   0.967    0.4307
 0.3833   0.03394  0.02478  0.9385   0.3105   0.43     0.0706   0.4363   0.05832  0.0812   0.2418   0.03967  0.557    0.2705   0.963    0.8125   0.342    0.8853   0.3047
 0.7197   0.7173   0.02887  0.7695   0.4304   0.691    0.4285   0.9917   0.3994   0.19     0.3984   0.1888   0.83     0.0644   0.9766   0.857    0.09784  0.831    0.224
 0.8228   0.8975   0.1775   0.725    0.882    0.7188   0.3257   0.05347  0.1026   0.05902  0.9697   0.445    0.728    0.626    0.3577   0.711    0.2343   0.3865   0.03888
 0.3318   0.855    0.891    0.3647   0.9297   0.5083   0.7163   0.5737   0.2155   0.804    0.2118   0.525    0.1116   0.558    0.05203  0.6343   0.5796   0.5605   0.449
 0.4475   0.3713   0.3708   0.11017  0.2048   0.087    0.265    0.937    0.933    0.4683   0.5884   0.4312   0.9326   0.839    0.592    0.566    0.4229   0.05493  0.4578
 0.353    0.2915   0.8345   0.888    0.8394   0.8774   0.3582   0.2913   0.798    0.87     0.3372   0.6914   0.9185   0.4368   0.3276   0.8125   0.782    0.885    0.6543
 0.1626   0.0965   0.8247   0.03952  0.459    0.5596   0.694    0.59     0.02153  0.3762   0.2428   0.9727   0.3672   0.732    0.2676   0.2102   0.128    0.5957   0.988
 0.583    0.9097   0.144    0.3845   0.2151   0.327    0.2925   0.974    0.771    0.9224   0.147    0.6206   0.1774   0.1415   0.7637   0.573    0.9736   0.183    0.837
 0.0753   0.098    0.8184   0.08527  0.889    0.528    0.2207   0.1852   0.5903   0.594    0.04865  0.5806   0.6006   0.2048   0.4934   0.1302   0.7217   0.949    0.04105
 0.6875   0.3975   0.845    0.6045   0.4077   0.01927  0.1505   0.4407   0.8457   0.9614   0.4504   0.7134   0.07837  0.3557   0.521    0.545    0.02188  0.581    0.3215
 0.4458   0.853    0.4656   0.928    0.2927   0.3467   0.3516   0.1686   0.88     0.1509   0.2993   0.4006   0.611    0.1251   0.0887   0.896    0.2651   0.5596   0.0359
 0.6895   0.3494   0.871    0.673    0.1486   0.7812   0.0925   0.434    0.09985  0.02402  0.2932   0.01034  0.744    0.6357   0.658    0.1487   0.3416   0.1171   0.3088
 0.557    0.837    0.10944  0.7036   0.9097   0.3706   0.73     0.2844   0.78     0.5117   0.5537   0.776    0.6553   0.128    0.3184   0.8022   0.686    0.1785   0.2212
 0.74     0.8955   0.4773   0.6084   0.7827   0.239    0.4849   0.1816   0.2854   0.166    0.012505 0.4421   0.2179   0.06094  0.2124   0.409    0.641    0.1841   0.776
 0.4685   0.2334   0.4094   0.3447   0.6836   0.434    0.10516  0.514    0.8345   0.371    0.8555   0.5396   0.844    0.7554   0.171    0.749    0.7344   0.05936  0.4482
 0.9873   0.3137   0.7627   0.871    0.5503   0.956    0.2607   0.0904   0.535    0.3079   0.762    0.793    0.545    0.889    0.8936   0.6094   0.6533   0.5737   0.945
 0.4434   0.2686   0.05872  0.0776   0.0915   0.5386   0.6777   0.3164   0.8955   0.3398   0.3801   0.3784   0.3904   0.4849   0.816    0.962    0.335    0.705    0.1871
 0.3643   0.7163   0.6484   0.4526   0.8096   0.2408   0.608    0.0215   0.7246   0.412    0.609    0.03342  0.653    0.0424   0.672    0.627    0.3025   0.9424   0.3784
 0.1012   0.4192   0.7695   0.7383   0.9395   0.06494  0.3027   0.11523  0.6035   0.1727   0.4048   0.932    0.4053   0.3528   0.8193   0.0355   0.01953  0.574    0.509
 0.1443   0.0848   0.568    0.8716   0.968    0.613    0.535    0.0389   0.84     0.0655   0.127    0.06104  0.526    0.504    0.4175   0.8027   0.482    0.304   ]
Output (dst_gm):
In [0.01034, 2.104e-05], 2.104e-05 is converted using the reinterpret_cast method to obtain the index value 353.

The following is a complete example of calling the API for computing the first n data elements of a tensor:

        
         
           
           
             #include "kernel_operator.h"
class KernelReduce {
public:
    __aicore__ inline KernelReduce() {}
    __aicore__ inline void Init(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
    {
        srcGlobal.SetGlobalBuffer((__gm__ half*)src);
        dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
        repeatTime = srcDataSize / mask;
        pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half));
        pipe.InitBuffer(workQueue, 1, 32 * sizeof(half)); // Based on the formula, the required minimum work space is 32, that is, 64 bytes.
        pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
        AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
        inQueueSrc.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
        AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
        AscendC::LocalTensor<half> sharedTmpBuffer = workQueue.AllocTensor<half>();
       
        // level2
        AscendC::ReduceMin<half>(dstLocal, srcLocal, sharedTmpBuffer, srcDataSize, true);
        outQueueDst.EnQue<half>(dstLocal);
        inQueueSrc.FreeTensor(srcLocal);
        workQueue.FreeTensor(sharedTmpBuffer);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
        AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
        outQueueDst.FreeTensor(dstLocal);
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> workQueue;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
    int srcDataSize = 288;
    int dstDataSize = 16;
    int mask = 128;
    int repStride = 8;
    int repeatTime = 0;
};
extern "C" __global__ __aicore__ void kernel_ReduceMin_lv2_half_288(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
{
    KernelReduce op;
    op.Init(src, dstGm);
    op.Process();
}

            

          

        
       

The following is an example:

Result example:
Input (src_gm):
[0.556    0.5225   0.3623   0.214    0.556    0.0643   0.769    0.594    0.261    0.3652   0.911    0.924    0.386    0.3696   0.2296   0.5957   0.1709   0.79     0.8516
 0.341    0.705    0.728    0.8135   0.7534   0.5874   0.771    0.05835  0.7456   0.1049   0.3105   0.1729   0.9253   0.8003   0.918    0.5005   0.7744   0.688    0.6807
 0.1456   0.4136   0.1055   0.12054  0.275    0.3848   0.08405  0.3843   0.3218   0.6904   0.878    0.3706   0.3586   0.3518   0.429    0.7275   0.6123   0.8096   0.563
 0.54     0.8857   0.8594   0.4143   0.525    0.2744   0.1376   0.382    0.6406   0.1534   0.134    0.2993   0.365    0.8843   0.2986   0.00393  0.6577   0.313    0.8164
 0.8706   0.7686   0.873    0.3286   0.03787  0.8145   0.4656   0.66     0.1362   0.1075   0.1376   0.9097   0.9214   0.833    0.3657   0.8438   0.006973 0.2408   0.801
 0.1862   0.864    0.8745   0.1805   0.4324   0.8647   0.844    0.8936   0.8496   0.311    0.0334   0.3967   0.579    0.43     0.2332   0.5366   0.3557   0.3542   0.945
 0.9336   0.252    0.4375   0.9727   0.859    0.6294   0.6787   0.8887   0.1884   0.524    0.787    0.04755  0.3984   0.0508   0.4065   0.716    0.3184   0.21     0.10645
 0.7544   0.2827   0.7856   0.4878   0.5903   0.12146  0.6426   0.8438   0.063    0.7617   0.6396   0.1995   0.6475   0.1464   0.7617   0.514    0.3506   0.2708   0.8643
 0.1204   0.04337  0.21     0.528    0.0644   0.2133   0.0643   0.0125   0.602    0.654    0.866    0.225    0.9473   0.408    0.4597   0.2793   0.11145  0.293    0.04156
 0.7705   0.3555   0.3977   0.7485   0.76     0.9824   0.2832   0.1239   0.4915   0.878    0.5986   0.7217   0.832    0.6206   0.6455   0.0639   0.772    0.01854  0.7437
 0.1962   0.485    0.5483   0.414    0.9253   0.2452   0.2942   0.9478   0.879    0.586    0.659    0.635    0.7197   0.933    0.08905  0.02892  0.74     0.499    0.02054
 0.2241   0.5137   0.8325   0.185    0.6196   0.949    0.935    0.5605   0.04108  0.3672   0.5566   0.3958   0.4565   0.8135   0.3015   0.46     0.1196   0.5044   0.54
 0.05203  0.687    0.8525   0.501    0.3464   0.307    0.804    0.0926   0.202    0.999    0.955    0.581    0.06216  0.271    0.9365   0.854    0.4202   0.269    0.985
 0.04547  1.       0.1208   0.5225   0.00935  0.4128   0.644    0.3826   0.6963   0.2942   0.007626 0.7144   0.609    0.3206   0.694    0.393    0.6265   0.6904   0.2487
 0.9478   0.798    0.891    0.8867   0.9414   0.395    0.11285  0.515    0.919    0.013855 0.749    0.5527   0.465    0.451    0.1458   0.59     0.893    0.0146   0.062
 0.06604  0.934    0.2242  ]
Output (dst_gm):
In [0.00393, 4.3e-06], 4.3e-06 is converted using the reinterpret_cast method to obtain the index value 72.

Parent topic: Reduction Computation