MrgSort4
Function Usage
Merges at most four sorted Region Proposal lists into one. The results are sorted in descending order of the score fields.
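To check kernel output on the host, the merge can be modeled in standard C++. This is a minimal sketch, not Ascend C. `MergeFourDescending` is a hypothetical helper that works on plain float scores instead of Region Proposals, and it assumes every input list is already sorted in descending order.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <iterator>
#include <vector>

// Hypothetical host-side reference (not an Ascend C API): merges four score lists,
// each already sorted in descending order, into one descending list.
// std::merge is stable: on equal scores, elements of its first range come first,
// so the element from the lower-numbered list is kept ahead.
std::vector<float> MergeFourDescending(const std::vector<float>& a, const std::vector<float>& b,
                                       const std::vector<float>& c, const std::vector<float>& d)
{
    std::vector<float> ab, cd, out;
    std::merge(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(ab), std::greater<float>());
    std::merge(c.begin(), c.end(), d.begin(), d.end(), std::back_inserter(cd), std::greater<float>());
    std::merge(ab.begin(), ab.end(), cd.begin(), cd.end(), std::back_inserter(out), std::greater<float>());
    // Every input element appears exactly once in the result.
    assert(out.size() == a.size() + b.size() + c.size() + d.size());
    return out;
}
```

Merging {9, 5, 1}, {8, 2}, {} and {7, 6} this way yields {9, 8, 7, 6, 5, 2, 1}. Whether the hardware breaks ties across lists in the same order as this pairwise stable merge is an assumption.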
Prototype
```cpp
template <typename T>
__aicore__ inline void MrgSort4(const LocalTensor<T>& dstLocal, const MrgSortSrcList<T>& srcLocal, const MrgSort4Info& params)
```
Parameters
| Parameter | Input/Output | Meaning |
|---|---|---|
| dstLocal | Output | Destination operand, which stores the sorted Region Proposals. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 16-byte aligned (for data of the half type) or 32-byte aligned (for data of the float type). For the … |
| srcLocal | Input | Source operand of the MrgSortSrcList structure type, which contains the four sorted Region Proposal lists. For details, see Table 2. For the … |
| params | Input | Parameters required for sorting, of the MrgSort4Info structure type. For details, see Table 3. |
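The two alignment values each equal the size of one eight-element Region Proposal: 8 × 2 bytes for half and 8 × 4 bytes for float. The eight-element grouping is visible in the result example. The following minimal host-side check assumes that relationship. `RequiredAlignment` and `IsProposalAligned` are hypothetical helpers, not Ascend C APIs.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: a Region Proposal holds 8 elements, so the required alignment equals
// one proposal: 16 bytes for half (2-byte) data, 32 bytes for float (4-byte) data.
constexpr std::size_t kElemsPerProposal = 8;

constexpr std::size_t RequiredAlignment(std::size_t elemBytes)
{
    return kElemsPerProposal * elemBytes;
}

// Hypothetical helper: true if a start address satisfies the alignment rule.
constexpr bool IsProposalAligned(std::uintptr_t addr, std::size_t elemBytes)
{
    return addr % RequiredAlignment(elemBytes) == 0;
}

static_assert(RequiredAlignment(2) == 16, "half: 16-byte alignment");
static_assert(RequiredAlignment(4) == 32, "float: 32-byte alignment");
```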
Table 2 MrgSortSrcList parameters

| Parameter | Input/Output | Meaning |
|---|---|---|
| src1 | Input | Source operand, which stores the first sorted Region Proposal list. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 16-byte aligned (for data of the half type) or 32-byte aligned (for data of the float type). The source operand must have the same data type as the destination operand. For the … |
| src2 | Input | Source operand, which stores the second sorted Region Proposal list. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 16-byte aligned (for data of the half type) or 32-byte aligned (for data of the float type). The source operand must have the same data type as the destination operand. For the … |
| src3 | Input | Source operand, which stores the third sorted Region Proposal list. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 16-byte aligned (for data of the half type) or 32-byte aligned (for data of the float type). The source operand must have the same data type as the destination operand. For the … |
| src4 | Input | Source operand, which stores the fourth sorted Region Proposal list. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 16-byte aligned (for data of the half type) or 32-byte aligned (for data of the float type). The source operand must have the same data type as the destination operand. For the … |
Table 3 MrgSort4Info parameters

| Parameter | Input/Output | Meaning |
|---|---|---|
| elementLengths | Input | Lengths of the four source Region Proposal lists (that is, the number of Region Proposals in each list), given as a uint16_t array of length 4. Each element nominally ranges over [0, 4095], but the lists must also fit in the Unified Buffer (UB). |
| ifExhaustedSuspension | Input | A bool specifying whether the instruction stops as soon as any one of the lists is exhausted. The default value is false. |
| validBit | Input | Number of valid lists. The values are as follows: … |
| repeatTimes | Input | Number of repeated iterations. In each iteration, the source and destination operands advance by the total length of the four lists. Value range: [1, 255]. repeatTimes takes effect only when the following conditions are met: … |
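The following host-side sketch models elementLengths, ifExhaustedSuspension, and repeatTimes as they are described above. It is a model under assumptions, not the Ascend C implementation:

- `MergeWithSuspension` and `RepeatOffsetInProposals` are hypothetical helpers.
- Plain scores stand in for whole proposals.
- "A list is exhausted" is taken to mean that a list with a nonzero length has been fully consumed.

validBit is not modeled.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint16_t kMaxElementLength = 4095;  // documented upper bound per list

// Hypothetical model: merges four descending score lists, reading only the first
// elementLengths[k] entries of list k. If exhaustedSuspension is true, merging stops
// as soon as any list that started non-empty runs out (assumed semantics).
std::vector<float> MergeWithSuspension(const std::array<std::vector<float>, 4>& lists,
                                       const std::array<std::uint16_t, 4>& elementLengths,
                                       bool exhaustedSuspension)
{
    for (int k = 0; k < 4; ++k) {
        assert(elementLengths[k] <= kMaxElementLength);
        assert(elementLengths[k] <= lists[k].size());
    }
    std::array<std::size_t, 4> pos{};  // next unread index in each list
    std::vector<float> out;
    for (;;) {
        int best = -1;
        for (int k = 0; k < 4; ++k) {
            const std::size_t len = elementLengths[k];
            // Strict '>' keeps the lower list index when scores are equal.
            if (pos[k] < len && (best < 0 || lists[k][pos[k]] > lists[best][pos[best]])) {
                best = k;
            }
        }
        if (best < 0) {
            break;  // every list is exhausted
        }
        out.push_back(lists[best][pos[best]]);
        ++pos[best];
        const std::size_t bestLen = elementLengths[best];
        if (exhaustedSuspension && pos[best] == bestLen) {
            break;  // the list just consumed is now exhausted: stop early
        }
    }
    return out;
}

// Assumed reading of repeatTimes: iteration r starts r times the total length of
// the four lists (counted in proposals) past the base of source and destination.
std::size_t RepeatOffsetInProposals(const std::array<std::uint16_t, 4>& elementLengths,
                                    std::size_t iteration)
{
    std::size_t total = 0;
    for (std::uint16_t len : elementLengths) {
        total += len;
    }
    return iteration * total;
}
```

With lists {9, 5, 1}, {8, 2}, {} and {7, 6} and lengths {3, 2, 0, 2}, the model returns {9, 8, 7, 6, 5, 2, 1} when suspension is off. With suspension on, it stops after {9, 8, 7, 6}, because the fourth list runs out first.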
Availability
Constraints
- If proposal[i] and proposal[j] have the same score and i is greater than j, proposal[j] is output first.
- For details about the alignment requirements of the operand address offset, see General Restrictions.
- The destination operand address must not overlap the source operand address.
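The tie-breaking rule can be modeled on the host as a stable sort by descending score over the concatenated input. Under that model, proposal[j] stays ahead of an equal-scored proposal[i] whenever j < i. The sketch below makes two assumptions:

- Proposals are numbered in concatenation order, with list 1 first.
- `Proposal`, `StableDescending`, and `RangesOverlap` are hypothetical helpers. `RangesOverlap` checks the no-overlap rule on byte ranges.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a Region Proposal: only the score and input position matter here.
struct Proposal {
    float score;
    int index;  // position in the concatenated input (assumed proposal numbering)
};

// Stable sort by descending score: equal scores keep their input order,
// so proposal[j] precedes proposal[i] when j < i.
std::vector<Proposal> StableDescending(std::vector<Proposal> v)
{
    std::stable_sort(v.begin(), v.end(),
                     [](const Proposal& x, const Proposal& y) { return x.score > y.score; });
    return v;
}

// True if byte ranges [a, a + aBytes) and [b, b + bBytes) intersect.
// The constraint above requires this to be false for the destination and each source.
bool RangesOverlap(std::uintptr_t a, std::size_t aBytes, std::uintptr_t b, std::size_t bBytes)
{
    return a < b + bBytes && b < a + aBytes;
}
```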
Example
- API usage example
```cpp
// vconcatWorkLocal indicates the four created and sorted Region Proposal lists.
// Each list has 16 Region Proposals.
struct MrgSortSrcList<half> srcList(vconcatWorkLocal[0], vconcatWorkLocal[1], vconcatWorkLocal[2], vconcatWorkLocal[3]);
uint16_t elementLengths[4] = {16, 16, 16, 16};
// Arguments: elementLengths, ifExhaustedSuspension = false, validBit = 15, repeatTimes = 1.
struct MrgSort4Info srcInfo(elementLengths, false, 15, 1);
AscendC::MrgSort4(dstLocal, srcList, srcInfo);
```
- Complete example
```cpp
#include "kernel_operator.h"

class KernelVecProposal {
public:
    __aicore__ inline KernelVecProposal() {}
    __aicore__ inline void Init(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
    {
        srcGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ half*>(src), srcDataSize);
        dstGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ half*>(dstGm));
        pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half));
        pipe.InitBuffer(workQueue, 1, dstDataSize * sizeof(half));  // workspace for the four lists
        pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
        AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
        inQueueSrc.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
        AscendC::LocalTensor<half> vconcatWorkLocal = workQueue.AllocTensor<half>();
        AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();

        // Construct four Region Proposal lists and then sort and merge them.
        // For each list, ProposalConcat writes its scores into proposals and RpSort16 sorts them.
        AscendC::ProposalConcat(vconcatWorkLocal[0], srcLocal[0], repeat, mode);
        AscendC::RpSort16(vconcatWorkLocal[0], vconcatWorkLocal[0], repeat);
        AscendC::ProposalConcat(vconcatWorkLocal[workDataSize], srcLocal[singleDataSize], repeat, mode);
        AscendC::RpSort16(vconcatWorkLocal[workDataSize], vconcatWorkLocal[workDataSize], repeat);
        AscendC::ProposalConcat(vconcatWorkLocal[workDataSize * 2], srcLocal[singleDataSize * 2], repeat, mode);
        AscendC::RpSort16(vconcatWorkLocal[workDataSize * 2], vconcatWorkLocal[workDataSize * 2], repeat);
        AscendC::ProposalConcat(vconcatWorkLocal[workDataSize * 3], srcLocal[singleDataSize * 3], repeat, mode);
        AscendC::RpSort16(vconcatWorkLocal[workDataSize * 3], vconcatWorkLocal[workDataSize * 3], repeat);

        // Merge the four sorted lists into dstLocal:
        // elementLengths, ifExhaustedSuspension = false, validBit = 15, repeatTimes = 1.
        AscendC::MrgSortSrcList<half> srcList(vconcatWorkLocal[0], vconcatWorkLocal[workDataSize],
            vconcatWorkLocal[workDataSize * 2], vconcatWorkLocal[workDataSize * 3]);
        uint16_t elementLengths[4] = {singleDataSize, singleDataSize, singleDataSize, singleDataSize};
        AscendC::MrgSort4Info srcInfo(elementLengths, false, 15, 1);
        AscendC::MrgSort4(dstLocal, srcList, srcInfo);

        outQueueDst.EnQue<half>(dstLocal);
        inQueueSrc.FreeTensor(srcLocal);
        workQueue.FreeTensor(vconcatWorkLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
        AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
        outQueueDst.FreeTensor(dstLocal);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> workQueue;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
    int srcDataSize = 64;                       // number of input scores
    uint16_t singleDataSize = srcDataSize / 4;  // scores per list: 16
    int dstDataSize = 512;                      // 64 proposals x 8 half elements
    int workDataSize = dstDataSize / 4;         // one list of 16 proposals: 128 elements
    int repeat = srcDataSize / 4 / 16;          // repeats per list: 1
    int mode = 4;                               // scores go to element 4 of each proposal (see the result example)
};

extern "C" __global__ __aicore__ void vec_proposal_kernel(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
{
    KernelVecProposal op;
    op.Init(src, dstGm);
    op.Process();
}
```
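The buffer sizes in the complete example follow from eight half elements per Region Proposal, as the result example shows. The constants below restate that arithmetic with compile-time checks. They mirror the example's member variables. `kElemsPerProposal` and `kElemsPerRepeat` are names introduced here for illustration. `kElemsPerRepeat` assumes that ProposalConcat and RpSort16 process 16 elements per repeat, as the example's `repeat` formula implies.

```cpp
#include <cassert>
#include <cstdint>

// Mirrors the member variables of KernelVecProposal; the k* names are hypothetical.
constexpr int kElemsPerProposal = 8;                              // half elements per Region Proposal
constexpr int kElemsPerRepeat = 16;                               // assumed elements per repeat
constexpr int srcDataSize = 64;                                   // input scores read from src_gm
constexpr std::uint16_t singleDataSize = srcDataSize / 4;         // scores per list: 16
constexpr int workDataSize = singleDataSize * kElemsPerProposal;  // one list in the workspace: 128
constexpr int dstDataSize = 4 * workDataSize;                     // merged output: 512 half elements
constexpr int repeat = singleDataSize / kElemsPerRepeat;          // repeats per list: 1

static_assert(singleDataSize == 16, "four lists of 16 scores");
static_assert(workDataSize == 128, "matches dstDataSize / 4 in the example");
static_assert(dstDataSize == 512, "64 proposals x 8 elements");
static_assert(repeat == 1, "matches srcDataSize / 4 / 16 in the example");
static_assert(dstDataSize * 2 == 1024, "output buffer size in bytes (2-byte half)");
```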
Result example:

Input (src_gm):
[-38.1 82.7 -40.75 -54.62 21.67 -58.53 25.94 -79.5 -61.44 26.7 -27.45 48.78 86.75 -18.1 -58.8 62.38 46.38 -78.94 -87.7 -13.81 -13.25 46.94 -47.8 -50.44 34.16 20.3 80.1 -94.1 52.4 -42.75 83.4 80.44 -66.8 -82.7 -91.44 -95.6 66.2 -30.97 -36.53 61.66 24.92 -45.1 38.97 -34.62 -69.8 59.1 34.22 11.695 -33.47 52.1 -4.832 46.88 56.78 71.4 13.29 -35.78 52.44 -46.03 83.8 83.56 71.3 -9.086 -65.06 46.25 ]

Output (dst_gm):
[ 0. 0. 0. 0. 86.75 0. 0. 0. 0. 0. 0. 0. 83.8 0. 0. 0. 0. 0. 0. 0. 83.56 0. 0. 0. 0. 0. 0. 0. 83.4 0. 0. 0. 0. 0. 0. 0. 82.7 0. 0. 0. 0. 0. 0. 0. 80.44 0. 0. 0. 0. 0. 0. 0. 80.1 0. 0. 0. 0. 0. 0. 0. 71.4 0. 0. 0. 0. 0. 0. 0. 71.3 0. 0. 0. 0. 0. 0. 0. 66.2 0. 0. 0. 0. 0. 0. 0. 62.38 0. 0. 0. 0. 0. 0. 0. 61.66 0. 0. 0. 0. 0. 0. 0. 59.1 0. 0. 0. 0. 0. 0. 0. 56.78 0. 0. 0. 0. 0. 0. 0. 52.44 0. 0. 0. 0. 0. 0. 0. 52.4 0. 0. 0. 0. 0. 0. 0. 52.1 0. 0. 0. 0. 0. 0. 0. 48.78 0. 0. 0. 0. 0. 0. 0. 46.94 0. 0. 0. 0. 0. 0. 0. 46.88 0. 0. 0. 0. 0. 0. 0. 46.38 0. 0. 0. 0. 0. 0. 0. 46.25 0. 0. 0. 0. 0. 0. 0. 38.97 0. 0. 0. 0. 0. 0. 0. 34.22 0. 0. 0. 0. 0. 0. 0. 34.16 0. 0. 0. 0. 0. 0. 0. 26.7 0. 0. 0. 0. 0. 0. 0. 25.94 0. 0. 0. 0. 0. 0. 0. 24.92 0. 0. 0. 0. 0. 0. 0. 21.67 0. 0. 0. 0. 0. 0. 0. 20.3 0. 0. 0. 0. 0. 0. 0. 13.29 0. 0. 0. 0. 0. 0. 0. 11.695 0. 0. 0. 0. 0. 0. 0. -4.832 0. 0. 0. 0. 0. 0. 0. -9.086 0. 0. 0. 0. 0. 0. 0. -13.25 0. 0. 0. 0. 0. 0. 0. -13.81 0. 0. 0. 0. 0. 0. 0. -18.1 0. 0. 0. 0. 0. 0. 0. -27.45 0. 0. 0. 0. 0. 0. 0. -30.97 0. 0. 0. 0. 0. 0. 0. -33.47 0. 0. 0. 0. 0. 0. 0. -34.62 0. 0. 0. 0. 0. 0. 0. -35.78 0. 0. 0. 0. 0. 0. 0. -36.53 0. 0. 0. 0. 0. 0. 0. -38.1 0. 0. 0. 0. 0. 0. 0. -40.75 0. 0. 0. 0. 0. 0. 0. -42.75 0. 0. 0. 0. 0. 0. 0. -45.1 0. 0. 0. 0. 0. 0. 0. -46.03 0. 0. 0. 0. 0. 0. 0. -47.8 0. 0. 0. 0. 0. 0. 0. -50.44 0. 0. 0. 0. 0. 0. 0. -54.62 0. 0. 0. 0. 0. 0. 0. -58.53 0. 0. 0. 0. 0. 0. 0. -58.8 0. 0. 0. 0. 0. 0. 0. -61.44 0. 0. 0. 0. 0. 0. 0. -65.06 0. 0. 0. 0. 0. 0. 0. -66.8 0. 0. 0. 0. 0. 0. 0. -69.8 0. 0. 0. 0. 0. 0. 0. -78.94 0. 0. 0. 0. 0. 0. 0. -79.5 0. 0. 0. 0. 0. 0. 0. -82.7 0. 0. 0. 0. 0. 0. 0. -87.7 0. 0. 0. 0. 0. 0. 0. -91.44 0. 0. 0. 0. 0. 0. 0. -94.1 0. 0. 0. 0. 0. 0. 0. -95.6 0. 0. 0. ]
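To check a run like this one on the host, the expected dst_gm contents can be rebuilt from the input scores. Sort the scores in descending order, then place each one at element 4 of an otherwise zero eight-element proposal, which is the layout the output shows. This is a minimal sketch with three assumptions:

- `ExpectedOutput` is a hypothetical helper.
- It uses float in place of half.
- The non-score fields stay zero, as they do in this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical host-side oracle for the complete example (float stands in for half).
// Returns the expected dst buffer: one 8-element proposal per score, ordered by
// descending score, with the score at element 4 and every other element 0.
std::vector<float> ExpectedOutput(std::vector<float> scores)
{
    constexpr std::size_t kElemsPerProposal = 8;
    constexpr std::size_t kScoreIndex = 4;
    std::stable_sort(scores.begin(), scores.end(), std::greater<float>());
    std::vector<float> dst(scores.size() * kElemsPerProposal, 0.0f);
    for (std::size_t p = 0; p < scores.size(); ++p) {
        dst[p * kElemsPerProposal + kScoreIndex] = scores[p];
    }
    return dst;
}
```

When comparing this against real kernel output, convert both sides to half or compare with a tolerance, because the kernel stores its results in half precision.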