ShiftRight
Applicability
| Product | Supported/Unsupported |
|---|---|
| | √ |
| | √ |
| | √ |
| | x |
| | x |
| | x |
Function Usage
Performs a right shift on each element of the source operand. The number of bits to shift is given by scalarValue. The kind of right shift depends on the data type of the source operand:
- If the data type is unsigned, logical right shift is performed. In this case, the binary number is shifted rightward by a specified number of bits. The least significant bit is discarded, and the most significant bit is padded with 0. For example, after the binary number 1010101010101010 (uint16_t) is logically shifted rightward by one bit, the result is 0101010101010101.
- If the data type is signed, arithmetic right shift is performed. In this case, the binary number is shifted rightward by a specified number of bits. The least significant bit is discarded, and the sign bit is copied to the most significant bit. For example, after the binary number 1010101010101010 (int16_t) is arithmetically shifted rightward by one bit, the result is 1101010101010101; after the same number is arithmetically shifted rightward by three bits, the result is 1111010101010101.
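The two shift types described above can be checked on the host with a small reference model. This is a minimal sketch in standard C++, not Ascend C device code: the name `RefShiftRight` is hypothetical, and it assumes the shift distance is smaller than the bit width of the element type.

```cpp
#include <cstdint>
#include <type_traits>

// Hypothetical host-side reference model (not part of the Ascend C API).
// Unsigned T: logical right shift, high bits are filled with 0.
// Signed T:   arithmetic right shift, high bits are filled with the sign bit.
// Precondition (assumption): bits < number of bits in T.
template <typename T>
T RefShiftRight(T value, unsigned bits) {
    static_assert(std::is_integral<T>::value, "integer types only");
    using U = typename std::make_unsigned<T>::type;
    constexpr unsigned kWidth = sizeof(T) * 8;
    if (bits == 0) {
        return value;  // nothing to shift; also avoids a full-width shift below
    }
    const U raw = static_cast<U>(value);
    U shifted = static_cast<U>(raw >> bits);  // logical shift on the raw bit pattern
    const bool negative = std::is_signed<T>::value && ((raw >> (kWidth - 1)) & 1) != 0;
    if (negative) {
        // Copy the sign bit into the vacated high bits.
        const U fill = static_cast<U>(static_cast<U>(~U(0)) << (kWidth - bits));
        shifted = static_cast<U>(shifted | fill);
    }
    // Converting back to a signed type wraps modulo 2^N on mainstream compilers
    // (guaranteed since C++20).
    return static_cast<T>(shifted);
}
```

For the bit pattern 1010101010101010 (0xAAAA), this model returns 0x5555 for uint16_t shifted by one bit, and 0xD555 and 0xF555 for int16_t shifted by one and three bits, matching the examples in the list above.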

Prototype
- Computation of the first n data elements of a tensor
    template <typename T, bool isSetMask = true>
    __aicore__ inline void ShiftRight(const LocalTensor<T>& dst, const LocalTensor<T>& src, const T& scalarValue, const int32_t& count)
- High-dimensional tensor sharding computation
  - Bitwise mask mode
    template <typename T, bool isSetMask = true>
    __aicore__ inline void ShiftRight(const LocalTensor<T>& dst, const LocalTensor<T>& src, const T& scalarValue, uint64_t mask[], const uint8_t repeatTime, const UnaryRepeatParams& repeatParams, bool roundEn = false)
  - Contiguous mask mode
    template <typename T, bool isSetMask = true>
    __aicore__ inline void ShiftRight(const LocalTensor<T>& dst, const LocalTensor<T>& src, const T& scalarValue, uint64_t mask, const uint8_t repeatTime, const UnaryRepeatParams& repeatParams, bool roundEn = false)
- Computation of the first n data elements of a tensor
    template <typename T, typename U, bool isSetMask = true, typename Std::enable_if<Std::is_same<PrimT<T>, U>::value, bool>::type = true>
    __aicore__ inline void ShiftRight(const LocalTensor<T>& dst, const LocalTensor<T>& src, const U& scalarValue, const int32_t& count)
- High-dimensional tensor sharding computation
  - Bitwise mask mode
    template <typename T, typename U, bool isSetMask = true, typename Std::enable_if<Std::is_same<PrimT<T>, U>::value, bool>::type = true>
    __aicore__ inline void ShiftRight(const LocalTensor<T>& dst, const LocalTensor<T>& src, const U& scalarValue, uint64_t mask[], const uint8_t repeatTime, const UnaryRepeatParams& repeatParams, bool roundEn)
  - Contiguous mask mode
    template <typename T, typename U, bool isSetMask = true, typename Std::enable_if<Std::is_same<PrimT<T>, U>::value, bool>::type = true>
    __aicore__ inline void ShiftRight(const LocalTensor<T>& dst, const LocalTensor<T>& src, const U& scalarValue, uint64_t mask, const uint8_t repeatTime, const UnaryRepeatParams& repeatParams, bool roundEn)
Parameters
| Parameter | Description |
|---|---|
| T | Operand data type. |
| U | Data type of scalarValue. |
| isSetMask | Whether to set the mask mode and mask value inside the API. For the following models, the isSetMask parameter in the API for calculating the first n pieces of data in a tensor does not take effect. Retain the default value. |
| Parameter | Input/Output | Meaning |
|---|---|---|
| dst | Output | Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. |
| src | Input | Source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. Its data type must be the same as that of the destination operand. |
| scalarValue | Input | Shift distance, in bits. Its data type must be the same as the element type of the destination operand. |
| count | Input | Number of elements involved in the computation. |
| mask/mask[] | Input | Controls which elements take part in the computation in each iteration. |
| repeatTime | Input | Number of iterations. The Vector Unit reads 256 bytes of contiguous data per iteration, so processing the complete input takes multiple iterations; repeatTime specifies how many. For details about this parameter, see High-dimensional Sharding APIs. |
| repeatParams | Input | Control structure information of element operations. For details, see UnaryRepeatParams. |
| roundEn | Input | Whether to enable rounding: true enables it, false disables it. This parameter is valid only when src is of type int16_t or int32_t. For example, with rounding enabled and src of type int16_t, a 5-bit arithmetic right shift of src is computed as follows: src_ele = 17 = 0b0000000000010001 (the fifth bit is 1); dst_ele = arithmetic_right_shift(src_ele, 5) + 1 = 0b0000000000000000 + 1 = 0b0000000000000001. |
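The roundEn example in the table can be reproduced with a short host-side model. This is a sketch under stated assumptions, not the hardware definition: the name `RefShiftRightRound` is hypothetical, and the general rule it applies (add the value of the last bit shifted out) is inferred from the single worked example above, so its behavior for negative inputs in particular is an assumption.

```cpp
#include <cstdint>

// Hypothetical host-side model of roundEn for int16_t (not the Ascend C API).
// Assumption: with rounding enabled, the last bit shifted out (bit index
// bits - 1 of src) is added to the arithmetically shifted result.
// Precondition (assumption): bits < 16.
inline int16_t RefShiftRightRound(int16_t src, unsigned bits, bool roundEn) {
    const int32_t wide = src;
    // Portable arithmetic right shift: for negative values, shift the
    // complement (which is non-negative) and complement the result again.
    const int32_t shifted = (wide < 0) ? ~((~wide) >> bits) : (wide >> bits);
    int32_t roundBit = 0;
    if (roundEn && bits > 0) {
        roundBit = (static_cast<uint16_t>(src) >> (bits - 1)) & 1;
    }
    return static_cast<int16_t>(shifted + roundBit);
}
```

With src = 17 and a shift of 5 bits, the model returns 1 when roundEn is true and 0 when it is false, which agrees with the table entry.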
Returns
None
Restrictions
- For details about the operand address alignment requirements, see General Address Alignment Restrictions.
- For details about the operand address overlapping restrictions, see General Address Overlap Restrictions.
Examples
- Example of high-dimensional tensor sharding computation (contiguous mask mode)
    uint64_t mask = 128;
    int16_t scalar = 2;
    // repeatTime = 4. 128 elements are processed in a single iteration, and four iterations are required to compute 512 elements.
    // dstBlkStride, srcBlkStride = 1. Within an iteration, adjacent data blocks are one block apart, so data is read and written contiguously in a single iteration.
    // dstRepStride, srcRepStride = 8. Adjacent iterations start eight data blocks apart, so data is read and written contiguously between adjacent iterations.
    AscendC::ShiftRight(dstLocal, srcLocal, scalar, mask, 4, { 1, 1, 8, 8 }, false);
- Example of high-dimensional tensor sharding computation (bitwise mask mode)
    uint64_t mask[2] = { UINT64_MAX, UINT64_MAX };
    int16_t scalar = 2;
    // repeatTime = 4. 128 elements are processed in a single iteration, and four iterations are required to compute 512 elements.
    // dstBlkStride, srcBlkStride = 1. Within an iteration, adjacent data blocks are one block apart, so data is read and written contiguously in a single iteration.
    // dstRepStride, srcRepStride = 8. Adjacent iterations start eight data blocks apart, so data is read and written contiguously between adjacent iterations.
    AscendC::ShiftRight(dstLocal, srcLocal, scalar, mask, 4, { 1, 1, 8, 8 }, false);
- Example of computing the first n data elements of a tensor
    int16_t scalar = 2;
    AscendC::ShiftRight(dstLocal, srcLocal, scalar, 512);
Input (srcLocal): [1 2 3 ... 512]
Input (scalar): 2
Output (dstLocal): [0 0 0 1 1 1 1 ... 128]
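To see how mask, repeatTime, and the stride values in the examples select elements, the following host-side sketch walks the same iteration pattern for int16_t data. It is an illustrative model, not Ascend C device code, and all names in it (`RefRepeatParams`, `ContiguousToBitmask`, `RefShiftRightRepeat`, `RunDocExample`) are hypothetical. It assumes a data block is 32 bytes (so a 256-byte iteration covers eight blocks), that a contiguous mask of n selects the first n elements of an iteration, and that bit i of mask[i / 64] selects element i in bitwise mode.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical mirror of the four stride values named in the example comments.
struct RefRepeatParams {
    std::size_t dstBlkStride;
    std::size_t srcBlkStride;
    std::size_t dstRepStride;
    std::size_t srcRepStride;
};

// Assumed expansion of a contiguous mask (first n of 128 lanes) into two words.
inline void ContiguousToBitmask(unsigned n, uint64_t out[2]) {
    out[0] = (n >= 64) ? UINT64_MAX : ((uint64_t{1} << n) - 1);
    out[1] = (n <= 64) ? 0 : (n >= 128) ? UINT64_MAX : ((uint64_t{1} << (n - 64)) - 1);
}

// Portable arithmetic right shift for int16_t (bits < 16 assumed).
inline int16_t ArithShiftRight(int16_t v, unsigned bits) {
    return static_cast<int16_t>(v < 0 ? ~((~v) >> bits) : (v >> bits));
}

// Model of the high-dimensional sharding form for int16_t elements.
inline void RefShiftRightRepeat(std::vector<int16_t>& dst, const std::vector<int16_t>& src,
                                unsigned bits, const uint64_t mask[2],
                                std::size_t repeatTime, const RefRepeatParams& p) {
    constexpr std::size_t kElemsPerBlock = 32 / sizeof(int16_t);  // 16 elements per block
    constexpr std::size_t kBlocksPerRepeat = 8;                   // 256 bytes per iteration
    for (std::size_t rep = 0; rep < repeatTime; ++rep) {
        for (std::size_t blk = 0; blk < kBlocksPerRepeat; ++blk) {
            for (std::size_t e = 0; e < kElemsPerBlock; ++e) {
                const std::size_t lane = blk * kElemsPerBlock + e;
                if (((mask[lane / 64] >> (lane % 64)) & 1) == 0) {
                    continue;  // masked-off element: dst is left unchanged
                }
                const std::size_t s = (rep * p.srcRepStride + blk * p.srcBlkStride) * kElemsPerBlock + e;
                const std::size_t d = (rep * p.dstRepStride + blk * p.dstBlkStride) * kElemsPerBlock + e;
                dst.at(d) = ArithShiftRight(src.at(s), bits);
            }
        }
    }
}

// Reproduces the documented example: src = [1..512], scalar = 2, mask = 128,
// repeatTime = 4, strides {1, 1, 8, 8}.
inline std::vector<int16_t> RunDocExample() {
    std::vector<int16_t> src(512), dst(512, 0);
    for (std::size_t i = 0; i < src.size(); ++i) {
        src[i] = static_cast<int16_t>(i + 1);
    }
    uint64_t mask[2];
    ContiguousToBitmask(128, mask);
    RefShiftRightRepeat(dst, src, 2, mask, 4, RefRepeatParams{1, 1, 8, 8});
    return dst;
}
```

Under these assumptions, `RunDocExample()` yields a vector that starts with 0 0 0 1 1 1 1 and ends with 128, the same values as the output listed above. Passing a smaller contiguous mask to `ContiguousToBitmask` leaves the tail of each 128-element iteration untouched in this model.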