Sort

Applicability

Product	Supported
Atlas A3 training products/Atlas A3 inference products	√
Atlas A2 training products/Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	x
Atlas inference product's AI Core	√
Atlas inference product's Vector Core	x
Atlas training products	x

Function

Sorts data in descending order by value. The layout of the sorted data depends on the product:

For the Atlas A3 training products/Atlas A3 inference products, layout mode 1 is used.

For the Atlas A2 training products/Atlas A2 inference products, layout mode 1 is used.

For the Atlas inference product's AI Core, layout mode 2 is used.

Layout mode 1:
A maximum of 32 scores can be sorted in one iteration. The sorted scores and their corresponding indexes are stored in dst as (score, index) structures. Regardless of whether the scores are of the half or float type, each (score, index) structure in dst always occupies 8 bytes. See the following examples:
- When the score type is float and the index type is uint32, in the computation result, the indexes are stored in the upper 4 bytes and the scores are stored in the lower 4 bytes.
- When the score type is half and the index type is uint32, in the computation result, the indexes are stored in the upper 4 bytes, the scores are stored in the lower 2 bytes, and the middle 2 bytes are reserved.
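The 8-byte layout described above can be sketched as a host-side decoder. This is a minimal illustration rather than part of the Ascend C API: the names SortedEntryFloat, SortedEntryHalf, DecodeFloatEntry, DecodeHalfEntry, and EncodeFloatEntry are hypothetical, and the sketch assumes that "lower bytes" means the lower byte addresses within each 8-byte entry.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical host-side view of one 8-byte (score, index) entry in layout mode 1.
// Assumption: the "lower" bytes of an entry sit at the lower byte addresses.
struct SortedEntryFloat {
    float score;     // bytes [3:0]
    uint32_t index;  // bytes [7:4]
};

struct SortedEntryHalf {
    uint16_t scoreBits;  // bytes [1:0], raw half bit pattern (bytes [3:2] are reserved)
    uint32_t index;      // bytes [7:4]
};

// Decode entry number `pos` from a raw copy of dst that holds float scores.
inline SortedEntryFloat DecodeFloatEntry(const uint8_t* dst, size_t pos)
{
    assert(dst != nullptr);
    SortedEntryFloat e{};
    std::memcpy(&e.score, dst + pos * 8, 4);
    std::memcpy(&e.index, dst + pos * 8 + 4, 4);
    return e;
}

// Decode entry number `pos` from a raw copy of dst that holds half scores.
inline SortedEntryHalf DecodeHalfEntry(const uint8_t* dst, size_t pos)
{
    assert(dst != nullptr);
    SortedEntryHalf e{};
    std::memcpy(&e.scoreBits, dst + pos * 8, 2);
    std::memcpy(&e.index, dst + pos * 8 + 4, 4);
    return e;
}

// Helper that writes one float entry; only used to build sample data for the decoder.
inline void EncodeFloatEntry(uint8_t* dst, size_t pos, float score, uint32_t index)
{
    assert(dst != nullptr);
    std::memcpy(dst + pos * 8, &score, 4);
    std::memcpy(dst + pos * 8 + 4, &index, 4);
}
```

In both cases the index is read from the same offset, so only the score field differs between the half and float variants.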

Layout mode 2: region proposal
The input and output data are Region Proposals. 16 Region Proposals are sorted in each iteration. Each Region Proposal consumes eight contiguous elements of the half or float data type. Its format is as follows:
[x1, y1, x2, y2, score, label, reserved_0, reserved_1]
When a Region Proposal is of the half data type, it occupies 16 bytes. Byte[15:12] is invalid and Byte[11:0] contains six half elements. To be specific, Byte[11:10] is defined as the label, Byte[9:8] is defined as the score, Byte[7:6] is defined as y2, Byte[5:4] is defined as x2, Byte[3:2] is defined as y1, and Byte[1:0] is defined as x1.

The table in the following figure contains 16 Region Proposals.

When a Region Proposal is of the float data type, it occupies 32 bytes. Byte[31:24] is invalid and Byte[23:0] contains six float elements. To be specific, Byte[23:20] is defined as the label, Byte[19:16] is defined as the score, Byte[15:12] is defined as y2, Byte[11:8] is defined as x2, Byte[7:4] is defined as y1, and Byte[3:0] is defined as x1.

The table in the following figure contains 16 Region Proposals.
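As a rough illustration of the byte layout described above, the Region Proposal can be mirrored on the host by a struct of eight contiguous elements. The names RegionProposal, HalfBits, ProposalByteOffset, and ScoreOf are hypothetical, and a raw 16-bit integer is assumed as a stand-in for the half type, which has no standard host-side equivalent.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical host-side mirror of one Region Proposal: eight contiguous elements of type T.
template <typename T>
struct RegionProposal {
    T x1;
    T y1;
    T x2;
    T y2;
    T score;
    T label;
    T reserved_0;  // invalid bytes
    T reserved_1;  // invalid bytes
};

// Assumption: raw 16-bit storage stands in for the half type on the host.
using HalfBits = uint16_t;

// Sizes and offsets match the byte ranges given in the text.
static_assert(sizeof(RegionProposal<float>) == 32, "float proposal occupies 32 bytes");
static_assert(sizeof(RegionProposal<HalfBits>) == 16, "half proposal occupies 16 bytes");
static_assert(offsetof(RegionProposal<float>, score) == 16, "float score is Byte[19:16]");
static_assert(offsetof(RegionProposal<float>, label) == 20, "float label is Byte[23:20]");
static_assert(offsetof(RegionProposal<HalfBits>, score) == 8, "half score is Byte[9:8]");
static_assert(offsetof(RegionProposal<HalfBits>, label) == 10, "half label is Byte[11:10]");

// Byte offset of proposal number `i` in a buffer of contiguous proposals.
template <typename T>
constexpr size_t ProposalByteOffset(size_t i)
{
    return i * sizeof(RegionProposal<T>);
}

// Read the score of proposal number `i` from a host-side copy of the buffer.
template <typename T>
inline T ScoreOf(const RegionProposal<T>* proposals, size_t i)
{
    assert(proposals != nullptr);
    return proposals[i].score;
}
```

With this layout, one iteration of 16 Region Proposals spans 256 bytes for half and 512 bytes for float.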

Prototype

template <typename T, bool isFullSort>
__aicore__ inline void Sort(const LocalTensor<T>& dst, const LocalTensor<T>& concat, const LocalTensor<uint32_t>& index, LocalTensor<T>& tmp, const int32_t repeatTime)

Parameters

**Table 1** Template parameters
Parameter	Description
T	Data type of the operand. For the Atlas A3 training products/Atlas A3 inference products, the supported data types are half and float. For the Atlas A2 training products/Atlas A2 inference products, the supported data types are half and float. For the Atlas inference product's AI Core, the supported data types are half and float.
isFullSort	Whether to enable the full sorting mode. In full sorting mode, all inputs are sorted in descending order. For details about non-full sorting mode, see the description of repeatTime in Table 2.

**Table 2** Parameters
Parameter	Input/Output	Description
dst	Output	Destination operand, with shape [2n]. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned.
concat	Input	Source operand, that is, score in the API function description, with shape [n]. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. The source operand must have the same data type as the destination operand.
index	Input	Source operand, with shape [n]. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. This source operand is fixed at the uint32_t data type.
tmp	Input	Temporary space. This parameter is used to store intermediate variables during complex internal computation of the API. The temporary space is provided by developers. For details about how to obtain the size of the temporary space BufferSize, see GetSortTmpSize. The data type must be the same as that of the source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned.
repeatTime	Input	Number of iteration repeats. The value is of the int32_t type. Atlas A3 training products/Atlas A3 inference products: 32 elements are sorted in each iteration. In the next iteration, 32 elements are skipped in concat and index, and 32 × 8 bytes are skipped in dst. Value range: repeatTime ∈ [0, 255]. Atlas A2 training products/Atlas A2 inference products: 32 elements are sorted in each iteration. In the next iteration, 32 elements are skipped in concat and index, and 32 × 8 bytes are skipped in dst. Value range: repeatTime ∈ [0, 255]. Atlas inference product's AI Core: 16 Region Proposals are sorted in each iteration. In the next iteration, 16 Region Proposals are skipped in concat and dst. Value range: repeatTime ∈ [0, 255].
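
The repeatTime arithmetic in Table 2 can be sketched with a small host-side helper. The constants and the functions CalcRepeatTime and DstByteOffsetMode1 are hypothetical names used for illustration. Treating an element count that is not a multiple of the per-iteration size as invalid is an assumption of this sketch, not a rule stated by the API.

```cpp
#include <cassert>
#include <cstdint>

constexpr int32_t kElemsPerRepeat = 32;      // layout mode 1: elements sorted per iteration
constexpr int32_t kProposalsPerRepeat = 16;  // layout mode 2: Region Proposals per iteration
constexpr int32_t kMaxRepeatTime = 255;      // upper bound of the documented value range
constexpr int32_t kDstBytesPerElem = 8;      // size of one (score, index) structure in dst

// Returns the repeatTime for `count` items, or -1 if the count does not divide
// evenly (sketch assumption) or the result falls outside [0, 255].
inline int32_t CalcRepeatTime(int32_t count, int32_t perRepeat)
{
    if (count < 0 || perRepeat <= 0 || count % perRepeat != 0) {
        return -1;
    }
    const int32_t repeatTime = count / perRepeat;
    return repeatTime <= kMaxRepeatTime ? repeatTime : -1;
}

// Byte offset in dst at which iteration number `iter` starts in layout mode 1:
// each iteration skips 32 x 8 bytes.
inline int32_t DstByteOffsetMode1(int32_t iter)
{
    assert(iter >= 0 && iter <= kMaxRepeatTime);
    return iter * kElemsPerRepeat * kDstBytesPerElem;
}
```

For example, 128 elements in layout mode 1 and 64 Region Proposals in layout mode 2 both give a repeatTime of 4.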

Returns

None

Restrictions

When score[i] equals score[j] and i > j, score[j] is placed first. That is, elements with equal scores keep their input order.
In non-full sorting mode, data within each iteration is sorted, but data across different iterations is not sorted.
For details about the operand address alignment requirements, see General Address Alignment Restrictions.
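
The ordering rules above can be modeled on the host with std::stable_sort. The sketch below is a reference model for reasoning about expected results, not the device implementation. The function ReferenceSort and the Entry alias are hypothetical names, and the block size of 32 corresponds to layout mode 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

using Entry = std::pair<float, uint32_t>;  // (score, index)

// Host-side reference model of the ordering rules:
// - descending by score;
// - equal scores keep their input order (stable sort);
// - in non-full mode, each block of `blockSize` elements is sorted independently.
inline std::vector<Entry> ReferenceSort(const std::vector<float>& score,
                                        const std::vector<uint32_t>& index,
                                        bool isFullSort,
                                        size_t blockSize = 32)
{
    assert(score.size() == index.size());
    assert(blockSize > 0);

    std::vector<Entry> out;
    out.reserve(score.size());
    for (size_t i = 0; i < score.size(); ++i) {
        out.emplace_back(score[i], index[i]);
    }

    // Compare scores only, so ties are resolved by stability rather than by index.
    auto byScoreDesc = [](const Entry& a, const Entry& b) { return a.first > b.first; };

    if (isFullSort) {
        std::stable_sort(out.begin(), out.end(), byScoreDesc);
        return out;
    }
    for (size_t begin = 0; begin < out.size(); begin += blockSize) {
        const size_t end = std::min(begin + blockSize, out.size());
        std::stable_sort(out.begin() + static_cast<std::ptrdiff_t>(begin),
                         out.begin() + static_cast<std::ptrdiff_t>(end),
                         byScoreDesc);
    }
    return out;
}
```

Comparing device output against such a model is one way to check both the tie-breaking rule and the per-iteration behavior of non-full sorting mode.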

Example

To obtain an operator sample project, click sort.

Processing 128 pieces of half-type data

This example applies to:

Atlas A2 training products/Atlas A2 inference products

Atlas A3 training products/Atlas A3 inference products

uint32_t m_elementCount = 128;
// 32 elements are sorted in each iteration.
uint32_t m_sortRepeatTimes = m_elementCount / 32;
uint32_t m_extractRepeatTimes = m_elementCount / 32;
// m_concatRepeatTimes and the LocalTensor objects are defined elsewhere in the sample project.
AscendC::Concat(concatLocal, valueLocal, concatTmpLocal, m_concatRepeatTimes);
AscendC::Sort<T, isFullSort>(sortedLocal, concatLocal, indexLocal, sortTmpLocal, m_sortRepeatTimes);
AscendC::Extract(dstValueLocal, dstIndexLocal, sortedLocal, m_extractRepeatTimes);

Result example:
Input data (srcValueGm): 128 pieces of half-type data
[31 30 29 ... 2 1 0
 63 62 61 ... 34 33 32
 95 94 93 ... 66 65 64
 127 126 125 ... 98 97 96]
Input data (srcIndexGm):
[31 30 29 ... 2 1 0
 63 62 61 ... 34 33 32
 95 94 93 ... 66 65 64
 127 126 125 ... 98 97 96]
Output data (dstValueGm):
[127 126 125 ... 2 1 0]
Output data (dstIndexGm):
[127 126 125 ... 2 1 0]

Processing 64 pieces of half-type data

This example applies to:

Atlas inference product's AI Core

uint32_t m_elementCount = 64;
// 16 Region Proposals are sorted in each iteration.
uint32_t m_sortRepeatTimes = m_elementCount / 16;
uint32_t m_extractRepeatTimes = m_elementCount / 16;
// m_concatRepeatTimes and the LocalTensor objects are defined elsewhere in the sample project.
AscendC::Concat(concatLocal, valueLocal, concatTmpLocal, m_concatRepeatTimes);
AscendC::Sort<T, isFullSort>(sortedLocal, concatLocal, indexLocal, sortTmpLocal, m_sortRepeatTimes);
AscendC::Extract(dstValueLocal, dstIndexLocal, sortedLocal, m_extractRepeatTimes);

Result example:
Input data (srcValueGm): 64 pieces of half-type data
[15 14 13 ... 2 1 0
 31 30 29 ... 18 17 16
 47 46 45 ... 34 33 32
 63 62 61 ... 50 49 48]
Input data (srcIndexGm):
[15 14 13 ... 2 1 0
 31 30 29 ... 18 17 16
 47 46 45 ... 34 33 32
 63 62 61 ... 50 49 48]
Output data (dstValueGm):
[63 62 61 ... 2 1 0]
Output data (dstIndexGm):
[63 62 61 ... 2 1 0]
