ReduceMax

Applicability

Product

Supported/Unsupported

Atlas A3 training products / Atlas A3 inference products

Atlas A2 training products / Atlas A2 inference products

Atlas 200I/500 A2 inference products

Atlas inference product 's AI Core

Atlas inference product 's Vector Core

x

Atlas training products

Functions

Obtains the maximum value and its corresponding index position among the input data. For details about reduction instructions, see Reduction Instructions.

Prototype

  • Computation of the first n data elements of a tensor
    1
    2
    template <typename T>
    __aicore__ inline void ReduceMax(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LocalTensor<T>& sharedTmpBuffer, const int32_t count, bool calIndex = 0)
    
  • High-dimensional tensor sharding computation
    • Bitwise mask mode
      1
      2
      template <typename T>
      __aicore__ inline void ReduceMax(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LocalTensor<T>& sharedTmpBuffer, const uint64_t mask[], const int32_t repeatTime, const int32_t srcRepStride, bool calIndex = 0)
      
    • Contiguous mask mode
      1
      2
      template <typename T>
      __aicore__ inline void ReduceMax(const LocalTensor<T>& dst, const LocalTensor<T>& src, const LocalTensor<T>& sharedTmpBuffer, const int32_t mask, const int32_t repeatTime, const int32_t srcRepStride, bool calIndex = 0)
      

Parameters

Table 1 Parameters in the template

Parameter

Description

T

Operand data type.

For the Atlas A3 training products / Atlas A3 inference products , the supported data types are half and float.

For the Atlas A2 training products / Atlas A2 inference products , the supported data types are half and float.

For the Atlas 200I/500 A2 inference products , the supported data types are half and float.

For the Atlas inference product 's AI Core, the supported data types are half and float.

For the Atlas training products , the supported data type is half.

Table 2 Parameters

Parameter

Input/Output

Meaning

dst

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The start address of the LocalTensor must be 4-byte aligned (for data of the half type) or 8-byte aligned (for data of the float type).

src

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The start address of the LocalTensor must be 32-byte aligned.

The source operand must have the same data type as the destination operand.

sharedTmpBuffer

Input

Space required by some hardware models to store intermediate results during API execution. The space must meet the minimum space requirement. For details about the computation method, see the ReduceMax computation diagram.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The start address of the LocalTensor must be 32-byte aligned.

The source operand must have the same data type as the destination operand.

Atlas A3 training products / Atlas A3 inference products : The sharedTmpBuffer is required.

Atlas A2 training products / Atlas A2 inference products : The sharedTmpBuffer is required.

Atlas 200I/500 A2 inference products : The sharedTmpBuffer is required.

Atlas inference product 's AI Core: The sharedTmpBuffer is required.

Atlas training products : The sharedTmpBuffer is required.

count

Input

Number of elements involved in the computation.

The parameter value range is related to the operand data type. The maximum number of elements that can be processed varies according to the data type. However, the maximum size of data that can be processed cannot exceed the UB size limit.

calIndex

Input

A bool that specifies whether to obtain the maximum value with index. Defaults to false.

  • true: obtains the maximum value with index.
  • false: obtains the maximum value without index.

mask/mask[]

Input

The mask parameter is used to control the elements involved in computation in each iteration.

  • Bitwise mode: controls the elements that participate in computation by bit. If a bit is set to 1, the corresponding element participates in the computation. If a bit is set to 0, the corresponding element is masked in the computation.

    The mask is in array form. The array length and the value range of the array elements are related to the data type of the operand. When the operand is 16-bit, the array length is 2. In this case, mask[0] and mask[1] must be in the range of [0, 264 – 1] and cannot be 0 at the same time. When the operand is 32-bit, the array length is 1. In this case, mask[0] must be in the range of (0, 264 – 1]. When the operand is 64-bit, the array length is 1. In this case, mask[0] must be in the range of (0, 232 – 1].

    For example, if mask = [0, 8] and 8 = 0b1000, only the fourth element participates in computation.

  • Contiguous mode: indicates the number of contiguous elements that participate in computation. The value range is related to the operand data type. The maximum number of elements that can be processed in each repeat varies according to the data type. When the operand is 16-bit, mask ∈ [1, 128]. When the operand is 32-bit, mask ∈ [1, 64]. When the operand is 64-bit, mask ∈ [1, 32].

repeatTime

Input

Number of iteration repeats. Different from General Parameters, a larger value range is supported. Ensure that the value does not exceed the maximum value range of int32_t.

srcRepStride

Input

Address stride between adjacent iterations of the source operand, that is, the number of data blocks skipped of the source operand in each iteration. For details, see repeatStride.

As shown in the ReduceMax computation diagram, the maximum value and its index are obtained in each repeat and stored in workLocal as intermediate results. Then, the maximum values of the intermediate results are computed in each repeat. As the rule applies, the final maximum value and its index are computed in the destination operand. Note that the index of the maximum value obtained in each repeat is an internal index of the repeat. When the final result is returned, the index in the full data needs to be derived based on the iteration positions and internal indexes.

Figure 1 ReduceMax computation

If the data size is large, the final result can be obtained through multiple rounds of computation rather than a single round. Similarly, the index of the maximum value obtained in each repeat is an internal index of the repeat. When the final result is returned, the index in the full data needs to be derived based on the iteration positions and internal indexes.

Figure 2 Multi-round computation

The workLocal space needs to be requested and passed by developers. The method for computing this space varies depending on whether the index needs to be returned. If the index needs to be returned, the space required for each round of computation needs to be accumulated, with 32-byte alignment of the UB space satisfied. If the index does not need to be returned, only the space required for the first round of computation needs to be provided, with 32-byte alignment satisfied. The space can be directly used for subsequent rounds. In this case, index derivation is not required, and the intermediate data of the previous rounds can be directly overwritten. The algorithm for computing the minimum required space is as follows:

  • If the index of the maximum value does not need to be returned:
    1
    2
    3
    4
    int firstMaxRepeat = repeatTime; // For the tensor high-dimensional splitting and computation API, firstMaxRepeat is repeatTime. For the API for computing the first n pieces of data in the tensor, firstMaxRepeat is count/elementsPerRepeat.
    int iter1OutputCount = firstMaxRepeat * 2;                                            // Number of elements generated in the first repeat. The underlying instruction always returns the index regardless of the developer's need. Therefore, a space must be reserved for the index. The number of generated elements is the number of repeat times multiplied by 2.
    int iter1AlignEnd = RoundUp(iter1OutputCount, elementsPerBlock) * elementsPerBlock;   // The number of elements generated in the first round is rounded up based on the data block (32 bytes).
    int finalWorkLocalNeedSize = iter1AlignEnd;                                           // After the first round of computation is complete, more iterations may be required. However, the same space can be reused. Therefore, the space required by the first round of computation is the workLocal space.
    
  • If the index of the maximum value needs to be returned:
    1
    2
    3
    4
    5
    6
    7
    8
    9
    int firstMaxRepeat = repeatTime; // For the tensor high-dimensional splitting and computation API, firstMaxRepeat is repeatTime. For the API for computing the first n pieces of data in the tensor, firstMaxRepeat is count/elementsPerRepeat.
    int iter1OutputCount = firstMaxRepeat * 2;                                            // Number of elements generated in the first repeat.
    int iter2AlignStart = RoundUp(iter1OutputCount, elementsPerBlock) * elementsPerBlock; // Start position offset of the second repeat, that is, the number of elements generated in the first repeat is rounded up based on the data block (32 bytes).
    // After the first round of computation is complete, more iterations may be required. In this case, the same space cannot be reused because the intermediate result indexes of the first round need to be used again. Therefore, you need to prepare the space for the second and third rounds.
    int iter2OutputCount = RoundUp(iter1OutputCount, elementsPerRepeat) * 2;              // Number of elements generated in the second repeat.
    int iter3AlignStart = RoundUp(iter2OutputCount, elementsPerBlock) * elementsPerBlock; // Start position offset of the third repeat, that is, the number of elements generated in the second repeat is rounded up based on the data block (32 bytes).
    int iter3OutputCount = RoundUp(iter2OutputCount, elementsPerRepeat) * 2;              // Number of elements generated in the third repeat.
    int iter3AlignEnd = RoundUp(iter3OutputCount, elementsPerBlock) * elementsPerBlock;   // The number of elements generated in the third round is rounded up based on the data block (32 bytes).
    int finalWorkLocalNeedSize = iter2AlignStart + iter3AlignStart + iter3AlignEnd; // Final size of the sharedTmpBuffer
    

The unit of the computed space size is the number of elements. If the space size is converted into bytes, it is expressed as finalWorkLocalNeedSize * typeSize (Bytes). For details, see the workLocal computation example in Examples.

To save the address space, workLocal can share the same space as the source operand. The minimum worklocal space is always less than the space of the source operand, and therefore can be ignored.

Returns

None

Restrictions

  • For details about the operand address alignment requirements, see General Address Alignment Restrictions.
  • For details about the operand address overlapping restrictions, see General Address Overlap Restrictions. When sharedTmpBuffer is required, the address of dst can overlap with that of sharedTmpBuffer. In most cases, the space required by dst is smaller than that required by sharedTmpBuffer. In this case, sharedTmpBuffer must meet the minimum space requirement. Otherwise, address overlapping is not supported.
  • The dstLocal result is stored in the sequence of maximum value and index. If no index is required, only the maximum value is stored. In the returned result, the indexes are stored based on the data type of dstLocal. For example, if dstLocal uses the half type, the indexes are stored based on the half type. However, the index values would be incorrect if they are read based on the half type and must be converted to the integer type using the reinterpret_cast method. If the input type is half, reinterpret_cast<uint16_t*> is required. If the input type is float, reinterpret_cast<uint32_t*> is required. For example, in the complete example of the high-dimensional tensor sharding computation API, the input type is half, and the computation result is [0.9985, 6.8e-06]. The reinterpret_cast<uint16_t*> method is required for converting 6.8e-06 to yield the index value 114. The following is a conversion example:
    1
    2
    float maxIndex = dst.GetValue(1);
    uint32_t realIndex = *reinterpret_cast<uint32_t*>(&maxIndex);
    
  • If multiple maximum values exist, the index of the first maximum value is returned.
  • When the input type is half, the obtained index value cannot be greater than 65535 (the maximum that can be represented by uint16_t).

Examples

  • Example of high-dimensional tensor sharding computation (contiguous mask mode)
    1
    2
    3
    // dstLocal, srcLocal, and workLocal are of the half type. For srcLocal, the computation data is of size 8320 and is continuously arranged. It requires indexes and uses the high-dimensional tensor sharding computation API. repeatTimes is set to 65. mask is set to involving all elements in the computation.
    int32_t mask = 128;
    AscendC::ReduceMax<half>(dstLocal, srcLocal, sharedTmpBuffer, mask, 65, 8, true);
    
  • Example of high-dimensional tensor sharding computation (bitwise mask mode)
    1
    2
    3
    // dstLocal, srcLocal, and workLocal are of the half type. For srcLocal, the computation data is of size 8320 and is continuously arranged. It requires indexes and uses the high-dimensional tensor sharding computation API. repeatTimes is set to 65. mask is set to involving all elements in the computation.
    uint64_t mask[2] = { 0xFFFFFFFFFFFFFFFF, 0xFFFFFFFFFFFFFFFF };
    AscendC::ReduceMax<half>(dstLocal, srcLocal, sharedTmpBuffer, mask, 65, 8, true);
    
  • Example of computing the first n data elements of a tensor
    1
    2
    // dstLocal, srcLocal, and workLocal are of the half type. For srcLocal, the computation data is of size 8320 and is continuously arranged. It requires indexes and uses the computation API for the first n tensor elements.
    AscendC::ReduceMax<half>(dstLocal, srcLocal, sharedTmpBuffer, 8320, true);
    
  • Example of calculating the sharedTmpBuffer space
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    67
    68
    69
    70
    71
    72
    73
    74
    75
    76
    77
    78
    79
    80
    81
    82
    83
    84
    85
    86
    87
    88
    89
    90
    91
    92
    93
    // Example 1 of calculating sharedTmpBuffer using the ReduceMax API:
    // dstLocal, srcLocal, and sharedTmpBuffer are all of the half type. The amount of data to be computed by srcLocal is 8320. The high-dimensional tensor tiling computation API is used, with repeatTime set to 65 and mask set to 128. The index value is required.
    // The following is an example of calling the high-dimensional tensor sharding computation API:
    AscendC::ReduceMax<half>(dstLocal, srcLocal, sharedTmpBuffer, 128, 65, 8, true);
    // The calculation process of the minimum space required by sharedTmpBuffer is as follows:
    int RoundUp(int a, int b)
    { 
        return (a + b - 1) / b;
    }
    int typeSize = 2;
    int elementsPerBlock = 32 / typeSize = 16; 
    int elementsPerRepeat = 256 / typeSize = 128; 
    int firstMaxRepeat = repeatTime;
    int iter1OutputCount = firstMaxRepeat * 2 = 130;                                          // Number of elements generated in the first repeat
    int iter2AlignStart = RoundUp(iter1OutputCount, elementsPerBlock)*elementsPerBlock = 144; // Round up the number of elements generated in the first repeat.
    int iter2OutputCount = RoundUp(iter1OutputCount, elementsPerRepeat)*2 = 4;                // Number of elements generated in the second repeat
    int iter3AlignStart = RoundUp(iter2OutputCount, elementsPerBlock)*elementsPerBlock = 16;  // Round up the number of elements generated in the second repeat.
    int iter3OutputCount = RoundUp(iter2OutputCount, elementsPerRepeat)*2 = 2;                // Number of elements generated in the third repeat
    int iter3AlignEnd = RoundUp(iter3OutputCount, elementsPerBlock) * elementsPerBlock = 16;  // Round up the number of elements generated in the third repeat.
    // The minimum space required by sharedTmpBuffer is iter2AlignStart + iter3AlignStart + iter3AlignEnd = 144 + 16 + 16 = 176, that is, 352 bytes.
    // Example 2 of calculating sharedTmpBuffer for the ReduceMax API:
    // dstLocal, srcLocal, and sharedTmpBuffer are all of the half type. The amount of data to be computed by srcLocal is 32,640. The high-dimensional tensor tiling and computation API is used. The repeat time is 255, the mask is 128, and the index value is required.
    // The following is an example of calling the high-dimensional tensor sharding computation API:
    AscendC::ReduceMax<half>(dstLocal, srcLocal, sharedTmpBuffer, 128, 255, 8, true);
    // The calculation process of the minimum space required by sharedTmpBuffer is as follows:
    int typeSize = 2;
    int elementsPerBlock = 32 / typeSize = 16; 
    int elementsPerRepeat = 256 / typeSize = 128; 
    int firstMaxRepeat = repeatTime;
    int iter1OutputCount = firstMaxRepeat * 2 = 510;                                          // Number of elements generated in the first repeat
    int iter2AlignStart = RoundUp(iter1OutputCount, elementsPerBlock)*elementsPerBlock = 512; // Round up the number of elements generated in the first repeat.
    int iter2OutputCount = RoundUp(iter1OutputCount, elementsPerRepeat)*2 = 8;                // Number of elements generated in the second repeat
    int iter3AlignStart = RoundUp(iter2OutputCount, elementsPerBlock)*elementsPerBlock = 16;  // Round up the number of elements generated in the second repeat.
    int iter3OutputCount = RoundUp(iter2OutputCount, elementsPerRepeat)*2 = 2;                // Number of elements generated in the third repeat
    int iter3AlignEnd = RoundUp(iter3OutputCount, elementsPerBlock) * elementsPerBlock = 16;  // Round up the number of elements generated in the third repeat.
    // The required space is iter2AlignStart + iter3AlignStart + iter3AlignEnd = 512 + 16 + 16 = 544, that is, 1088 bytes.
    // Example 3 for calculating sharedTmpBuffer using the ReduceMax API:
    // dstLocal, srcLocal, and workLocal are of the half type. For srcLocal, the computation data is of size 65408. It requires indexes and uses the computation API for the first n data elements of a tensor. The value of count is 65408.
    // The following is an example of the computation API for the first n data elements of a tensor:
    AscendC::ReduceMax<half>(dstLocal, srcLocal, sharedTmpBuffer, 65408, true);
    // In this case, the minimum space required by sharedTmpBuffer is calculated as follows:
    int typeSize = 2;
    int elementsPerBlock = 32 / typeSize = 16; 
    int elementsPerRepeat = 256 / typeSize = 128; 
    int firstMaxRepeat = count / elementsPerRepeat = 511;
    int iter1OutputCount = firstMaxRepeat * 2 = 1022;                                          // Number of elements generated in the first repeat
    int iter2AlignStart = RoundUp(iter1OutputCount, elementsPerBlock)*elementsPerBlock = 1024; // Round up the output value of iter1OutputCount.
    int iter2OutputCount = RoundUp(iter1OutputCount, elementsPerRepeat)*2 = 16;                // Number of elements generated in the second repeat
    int iter3AlignStart = RoundUp(iter2OutputCount, elementsPerBlock)*elementsPerBlock = 16;   // Round up the output value of iter2OutputCount.
    int iter3OutputCount = RoundUp(iter2OutputCount, elementsPerRepeat)*2 = 2;                 // Number of elements generated in the third repeat
    int iter3AlignEnd = RoundUp(iter3OutputCount, elementsPerBlock) * elementsPerBlock = 16;   // Round up the number of elements generated in the third repeat.
    // The required space is iter2AlignStart + iter3AlignStart + iter3AlignEnd = 1024 + 16 + 16 = 1056, that is, 2112 bytes.
    // Example 4 for calculating the sharedTmpBuffer of the ReduceMax API:
    // dstLocal, srcLocal, and sharedTmpBuffer are all of the half type. The amount of data to be computed by srcLocal is 512. The high-dimensional tensor segmentation and computation API is used. The values of repeatTime and mask are 4 and 128, respectively. The index value is required.
    // The following is an example of calling the high-dimensional tensor sharding computation API:
    AscendC::ReduceMax<half>(dstLocal, srcLocal, sharedTmpBuffer, 128, 4, 8, true);
    // The calculation process of the minimum space required by sharedTmpBuffer is as follows:
    int typeSize = 2;
    int elementsPerBlock = 32 / typeSize = 16; 
    int elementsPerRepeat = 256 / typeSize = 128; 
    int firstMaxRepeat = repeatTime;
    int iter1OutputCount = firstMaxRepeat * 2 = 8;                                           // Number of elements generated in the first repeat
    int iter2AlignStart = RoundUp(iter1OutputCount, elementsPerBlock)*elementsPerBlock = 16; // Round up the output value of iter1OutputCount.
    int iter2OutputCount = RoundUp(iter1OutputCount, elementsPerRepeat)*2 = 2;               // Number of elements generated in the second repeat
    // In this test case, the number of elements generated in the second repeat is 2. That is, the maximum value and index can be obtained at the end of the second repeat. Therefore, the required space is iter2AlignStart + RoundUp(iter2OutputCount, elementsPerBlock) * elementsPerBlock = 16 + 16 = 32, that is, 64 bytes.
    // Example 5 of the sharedTmpBuffer computation using the ReduceMax API:
    // dstLocal, srcLocal, and workLocal are of the half type. For srcLocal, the computation data is of size 65408. It does not requires indexes but uses the computation API for the first n data elements of a tensor. The value of count is 65408.
    // The following is an example of the computation API for the first n data elements of a tensor:
    AscendC::ReduceMax<half>(dstLocal, srcLocal, sharedTmpBuffer, 65408, false);
    // In this case, the minimum space required by sharedTmpBuffer is calculated as follows:
    int typeSize = 2;
    int elementsPerBlock = 32 / typeSize = 16; 
    int elementsPerRepeat = 256 / typeSize = 128; 
    int firstMaxRepeat = count / elementsPerRepeat = 511;
    int iter1OutputCount = firstMaxRepeat * 2 = 1022;                                          // Number of elements generated in the first repeat
    int iter1AlignEnd = RoundUp(iter1OutputCount, elementsPerBlock) * elementsPerBlock = 1024; // Round up the number of elements generated in the first repeat.
    // Because calIndex is false, the minimum workLocal space is the result obtained after the number of elements generated in the first repeat is rounded up. In this sample, the minimum space is 1024, that is, 2048 bytes.
    // Example 6 for calculating sharedTmpBuffer using the ReduceMax API:
    // dstLocal, srcLocal, and sharedTmpBuffer are all of the float type. The amount of data to be computed by srcLocal is 8320. The high-dimensional tensor segmentation and computation API is used. The values of repeatTime, mask, and index are 130 and 64, respectively.
    // The following is an example of calling the high-dimensional tensor sharding computation API:
    AscendC::ReduceMax<float>(dstLocal, srcLocal, sharedTmpBuffer, 64, 130, 8, true);
    // In this case, the minimum size of sharedTmpBuffer is calculated as follows:
    int typeSize = 4;
    int elementsPerBlock = 32 / typeSize = 8; 
    int elementsPerRepeat = 256 / typeSize = 64; 
    int firstMaxRepeat = repeatTime;
    int iter1OutputCount = firstMaxRepeat * 2 = 260;                                          // Number of elements generated in the first repeat
    int iter2AlignStart = RoundUp(iter1OutputCount, elementsPerBlock)*elementsPerBlock = 264; // Round up the number of elements generated in the first repeat.
    int iter2OutputCount = RoundUp(iter1OutputCount, elementsPerRepeat)*2 = 10;               // Number of elements generated in the second repeat
    int iter3AlignStart = RoundUp(iter2OutputCount, elementsPerBlock)*elementsPerBlock = 16;  // Round up the number of elements generated in the second repeat.
    int iter3OutputCount = RoundUp(iter2OutputCount, elementsPerRepeat)*2 = 2;                // Number of elements generated in the third repeat
    int iter3AlignEnd = RoundUp(iter3OutputCount, elementsPerBlock) * elementsPerBlock = 8;   // Round up the number of elements generated in the third repeat.
    // The minimum space required by sharedTmpBuffer is iter2AlignStart + iter3AlignStart + iter3AlignEnd = 264 + 16 + 8 = 288, that is, 1152 bytes.
    
  • The following is a complete example of the high-dimensional tensor sharding computation API:

     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    #include "kernel_operator.h"
    class KernelReduce {
    public:
        __aicore__ inline KernelReduce() {}
        __aicore__ inline void Init(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
        {
            srcGlobal.SetGlobalBuffer((__gm__ half*)src);
            dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
            repeat = srcDataSize / mask;
            pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half));
            pipe.InitBuffer(workQueue, 1, 32 * sizeof(half)); // Based on the formula, the required minimum work space is 32, that is, 64 bytes.
            pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
            AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
            inQueueSrc.EnQue(srcLocal);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
            AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
            AscendC::LocalTensor<half> sharedTmpBuffer = workQueue.AllocTensor<half>();
            AscendC::ReduceMax<half>(dstLocal, srcLocal, sharedTmpBuffer, mask, repeat, repStride, true);
            outQueueDst.EnQue<half>(dstLocal);
            inQueueSrc.FreeTensor(srcLocal);
            workQueue.FreeTensor(sharedTmpBuffer);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
            AscendC::DataCopy(dstGlobal, dstLocal, srcDataSize);
            outQueueDst.FreeTensor(dstLocal);
        }
    private:
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc;
        AscendC::TQue<AscendC::TPosition::VECOUT, 1> workQueue;
        AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
        AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
        int srcDataSize = 512;
        int dstDataSize = 512;
        int mask = 128;
        int repStride = 8;
        int repeat = 0;
    };
    extern "C" __global__ __aicore__ void kernel_ReduceMax_lv0_half_512(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
    {
        KernelReduce op;
        op.Init(src, dstGm);
        op.Process();
    }
    

    The following is an example:

    Example result: input (src_gm)
    [0.4795   0.951    0.866    0.008545 0.8037   0.551    0.754    0.73     0.6035   0.251    0.4841   0.05914  0.9414   0.379    0.664    0.6914   0.9307   0.3853   0.4048
     0.7754   0.1265   0.709    0.7695   0.8057   0.9673   0.2566   0.8696   0.243    0.871    0.123    0.76     0.1844   0.7324   0.5757   0.0172   0.7188   0.556    0.3699
     0.7334   0.655    0.919    0.4219   0.82     0.1046   0.5796   0.4773   0.1405   0.3777   0.4421   0.983    0.728    0.642    0.37     0.9473   0.52     0.7783   0.699
     0.716    0.1791   0.1272   0.2471   0.3298   0.3518   0.9756   0.2268   0.6167   0.742    0.4185   0.8193   0.919    0.03827  0.02957  0.2598   0.798    0.3752   0.2109
     0.1753   0.7227   0.829    0.6978   0.347    0.463    0.685    0.1992   0.847    0.941    0.835    0.03336  0.1359   0.04736  0.758    0.5347   0.616    0.869    0.582
     0.694    0.2035   0.3613   0.8413   0.68     0.0896   0.3833   0.0768   0.292    0.11053  0.5586   0.578    0.3286   0.09314  0.5845   0.7124   0.2058   0.6523   0.784
     0.9985   0.6626   0.8975   0.405    0.884    0.7744   0.0258   0.484    0.768    0.7197   0.577    0.03143  0.9185   0.3608   0.3352   0.9077   0.709    0.85     0.4607
     0.61     0.4277   0.1004   0.1995   0.1608   0.2852   0.8887   0.813    0.3396   0.272    0.703    0.1312   0.734    0.2612   0.6895   0.8647   0.9165   0.1455   0.9233
     0.3027   0.7163   0.927    0.1995   0.155    0.6953   0.66     0.04163  0.99     0.544    0.4243   0.804    0.4612   0.01912  0.5127   0.8755   0.6665   0.707    0.01018
     0.874    0.8545   0.9375   0.9844   0.578    0.934    0.683    0.4668   0.63     0.2032   0.3188   0.9478   0.9375   0.03357  0.9927   0.996    0.451    0.1105   0.762
     0.82     0.8047   0.911    0.926    0.1973   0.9175   0.4521   0.4487   0.1273   0.718    0.737    0.305    0.922    0.1396   0.618    0.753    0.5913   0.874    0.08905
     0.003582 0.05252  0.674    0.3923   0.527    0.4106   0.7812   0.113    0.965    0.6157   0.4368   0.6646   0.7944   0.7964   0.531    0.6665   0.517    0.04468  0.5737
     0.752    0.4      0.4463   0.05496  0.939    0.6353   0.2036   0.667    0.3994   0.2573   0.118    0.973    0.5923   0.558    0.7114   0.785    0.714    0.7485   0.854
     0.2585   0.274    0.9824   0.4158   0.283    0.2194   0.3074   0.2793   0.531    0.8965   0.01456  0.5264   0.992    0.856    0.5986   0.734    0.908    0.12317  0.8374
     0.6665   0.1904   0.97     0.2546   0.364    0.6914   0.462    0.05353  0.02975  0.6235   0.4941   0.4714   0.788    0.06537  0.8423   0.2527   0.7734   0.591    0.443
     0.3022   0.02116  0.01605  0.772    0.6924   0.01032  0.594    0.1865   0.7393   0.8887   0.916    0.9653   0.696    0.901    0.1255   0.5513   0.2742   0.5586   0.988
     0.0954   0.4365   0.677    0.894    0.8413   0.05655  0.932    0.4426   0.336    0.848    0.9434   0.1976   0.813    0.773    0.2605   0.1543   0.8555   0.3596   0.997
     0.10315  0.5796   0.5327   0.2283   0.7583   0.3674   0.513    0.9126   0.751    0.532    0.399    0.832    0.549    0.2358   0.6655   0.477    0.5864   0.3528   0.989
     0.1412   0.748    0.3652   0.05292  0.3552   0.5767   0.826    0.4792   0.8477   0.03488  0.8267   0.2345   0.931    0.0884   0.6816   0.4685   0.618    0.09973  0.4385
     0.782    0.6465   0.03882  0.4158   0.1422   0.822    0.8203   0.95     0.3274   0.724    0.929    0.8726   0.004307 0.815    0.67     0.4368   0.7793   0.593    0.4663
     0.2207   0.01773  0.39     0.008896 0.4238   0.716    0.1155   0.601    0.9214   0.3708   0.4285   0.951    0.00431  0.726    0.977    0.1254   0.6484   0.4648   0.891
     0.723    0.6333   0.9077   0.4849   0.3008   0.0495   0.4575   0.266    0.2014   0.1106   0.6914   0.2744   0.4956   0.532    0.1752   0.709    0.3464   0.6104   0.4067
     0.1317   0.8647   0.8      0.4832   0.013855 0.6733   0.4524   0.6865   0.7017   0.9385   0.2957   0.2444   0.4167   0.55     0.8926   0.8364   0.506    0.9966   0.7207
     0.51     0.8745   0.3188   0.847    0.86     0.64     0.08453  0.59     0.2062   0.1031   0.1459   0.3806   0.2096   0.469    0.1492   0.10065  0.536    0.572    0.353
     0.068    0.07855  0.6177   0.3408   0.1538   0.2732   0.997    0.1158   0.4028   0.9536   0.7197   0.585    0.0899   0.3994   0.1835   0.737    0.4639   0.3071   0.47
     0.993    0.3862   0.293    0.1813   0.8193   0.745    0.064    0.7407   0.329    0.198    0.596    0.3      0.6562   0.819    0.2803   0.04095  0.703    0.3425   0.9224
     0.776    0.8057   0.734    0.2534   0.1824   0.793    0.3542   0.2595   0.2607   0.838    0.39     0.631    0.3542   0.1968   0.643    0.015366 0.4106   0.604   ]
    Output (dst_gm):
    In [0.9985, 6.8e-06], 6.8e-06 is converted using the reinterpret_cast method to obtain the index value 114.
  • The following is a complete example of calling the computation API for the first n data elements of a tensor:
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    #include "kernel_operator.h"
    class KernelReduce {
    public:
        __aicore__ inline KernelReduce() {}
        __aicore__ inline void Init(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
        {
            srcGlobal.SetGlobalBuffer((__gm__ half*)src);
            dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
    
            repeatTime = srcDataSize / mask;
            pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half));
            pipe.InitBuffer(workQueue, 1, 32 * sizeof(half)); // Based on the formula, the required minimum work space is 32, that is, 64 bytes.
            pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }
    
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
            AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
            inQueueSrc.EnQue(srcLocal);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
            AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
            AscendC::LocalTensor<half> sharedTmpBuffer = workQueue.AllocTensor<half>();
    
            // level2
            AscendC::ReduceMax<half>(dstLocal, srcLocal, sharedTmpBuffer, srcDataSize, true);
    
            outQueueDst.EnQue<half>(dstLocal);
            inQueueSrc.FreeTensor(srcLocal);
            workQueue.FreeTensor(sharedTmpBuffer);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
            AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
            outQueueDst.FreeTensor(dstLocal);
        }
    
    private:
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc;
        AscendC::TQue<AscendC::TPosition::VECOUT, 1> workQueue;
        AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
        AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
        int srcDataSize = 288;
        int dstDataSize = 16;
        int mask = 128;
        int repStride = 8;
        int repeatTime = 0;
    };
    extern "C" __global__ __aicore__ void kernel_ReduceMax_lv2_half_288(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
    {
        KernelReduce op;
        op.Init(src, dstGm);
        op.Process();
    }
    

    The following is an example:

    Example result: input (src_gm)
    [0.4778   0.5903   0.2433   0.698    0.1943   0.407    0.891    0.1766   0.5977   0.9473   0.6523   0.10913  0.0143   0.86     0.2366   0.625    0.3696   0.708    0.946
     0.538    0.3826   0.08215  0.516    0.9116   0.1548   0.507    0.8145   0.89     0.5435   0.563    0.1125   0.543    0.3142   0.8086   0.6885   0.874    0.855    0.4019
     0.1613   0.04462  0.945    0.6064   0.6904   0.00758  0.9463   0.528    0.9966   0.629    0.714    0.03134  0.4407   0.0322   0.5376   0.04443  0.03778  0.522    0.793
     0.3086   0.4      0.3984   0.5693   0.8203   0.673    0.796    0.2747   0.2246   0.468    0.1146   0.4468   0.419    0.3816   0.1636   0.1414   0.4028   0.9785   0.8984
     0.4355   0.874    0.864    0.7856   0.739    0.895    0.2487   0.5034   0.958    0.661    0.8755   0.302    0.802    0.563    0.9067   0.1562   0.1337   0.1844   0.3047
     0.543    0.3855   0.9536   0.8633   0.5435   0.002748 0.8916   0.9614   0.3665   0.1588   0.51     0.77     0.552    0.84     0.2798   0.7217   0.8633   0.3794   0.5376
     0.03     0.7783   0.9297   0.9556   0.609    0.1776   0.5957   0.2954   0.6675   0.7183   0.4182   0.8804   0.1837   0.3235   0.3486   0.43     0.8633   0.3972   0.1307
     0.7915   0.43     0.2544   0.827    0.04843  0.1637   0.3376   0.4087   0.4993   0.5923   0.3057   0.04306  0.4905   0.693    0.7393   0.777    0.01379  0.2742   0.669
     0.6826   0.04028  0.0423   0.281    0.12476  0.5366   0.2098   0.559    0.8833   0.82     0.0745   0.7485   0.04004  0.776    0.863    0.1909   0.7876   0.734    0.4727
     0.3655   0.944    0.006794 0.01872  0.687    0.5664   0.9697   0.2437   0.2014   0.0269   0.3975   0.08405  0.36     0.0751   0.02632  0.135    0.531    0.554    0.378
     0.9365   0.5254   0.8687   0.181    0.329    0.322    0.3076   0.508    0.638    0.3462   0.3882   0.7705   0.5933   0.994    0.1188   0.0782   0.94     0.00856  0.1396
     0.2191   0.00648  0.8994   0.6714   0.6724   0.57     0.3127   0.4905   0.2119   0.3938   0.5957   0.1493   0.9424   0.716    0.3699   0.829    0.647    0.8286   0.04514
     0.4028   0.5786   0.148    0.3425   0.999    0.869    0.04288  0.817    0.7075   0.03098  0.621    0.612    0.0774   0.532    0.4395   0.0711   0.4805   0.5835   0.5947
     0.1768   0.52     0.3428   0.9146   0.7324   0.5054   0.7397   0.2737   0.6313   0.1704   0.5093   0.8105   0.1312   0.752    0.3647   0.781    0.4197   0.2329   0.787
     0.762    0.63     0.9263   0.2673   0.1846   0.765    0.921    0.2913   0.3135   0.337    0.2598   0.1782   0.8013   0.641    0.6865   0.736    0.618    0.8755   0.2756
     0.9854   0.8296   0.262   ]
    Output (dst_gm):
    In [0.999, 1.38e-05], 1.38e-05 is converted using the reinterpret_cast method to obtain the index value 232.