AddDeqRelu
Applicability
| Product | Supported/Unsupported |
|---|---|
| | √ |
| | √ |
| | x |
| | √ |
| | x |
| | x |
Function Usage
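Adds the inputs element-wise, performs Deq quantization on the sum, and then applies ReLU to the result (takes the larger of the result and 0). The formula is as follows:

$$\mathrm{dst}_i = \mathrm{ReLU}\big(\mathrm{Deq}(\mathrm{src0}_i + \mathrm{src1}_i)\big) = \max\big(\mathrm{Deq}(\mathrm{src0}_i + \mathrm{src1}_i),\ 0\big)$$
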
The formula for calculating Deq is as follows:

$$\mathrm{Deq}(x) = \frac{x}{2^{17}} \times \mathrm{DeqScale} \times 2^{17}$$

In the preceding formula, x is divided by 2^17 and then multiplied by 2^17 to prevent overflow when x is multiplied by DeqScale. DeqScale must be set by using SetDeqScale. For details, see SetDeqScale.
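To make the arithmetic concrete, the following host-side C++ sketch models the per-element computation. It is not Ascend C kernel code: `AddDeqReluRef` is a hypothetical helper, `float` stands in for `half`, and the 2^17 rescaling is left out because it cancels mathematically. The hardware's intermediate precision, rounding, and overflow behavior are not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side reference model (not part of Ascend C).
// Computes dst[i] = max((src0[i] + src1[i]) * deqScale, 0).
std::vector<float> AddDeqReluRef(const std::vector<int32_t>& src0,
                                 const std::vector<int32_t>& src1,
                                 float deqScale)
{
    assert(src0.size() == src1.size());
    std::vector<float> dst(src0.size());
    for (std::size_t i = 0; i < src0.size(); ++i) {
        // Modeling choice: widen to 64 bits so the host-side sum cannot overflow.
        const int64_t sum = static_cast<int64_t>(src0[i]) + src1[i];
        const float deq = static_cast<float>(sum) * deqScale;  // Deq step
        dst[i] = std::max(deq, 0.0f);                          // ReLU step
    }
    return dst;
}
```

With a scale of 0.1, inputs 70 and 19 give (70 + 19) × 0.1 = 8.9, and a negative sum is clamped to 0.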
Prototype
- Computation of the first n pieces of data of a tensor
    __aicore__ inline void AddDeqRelu(const LocalTensor<half>& dst, const LocalTensor<int32_t>& src0, const LocalTensor<int32_t>& src1, const int32_t& count)
- High-dimensional tensor sharding computation
  - Bitwise mask mode
      template <bool isSetMask = true>
      __aicore__ inline void AddDeqRelu(const LocalTensor<half>& dst, const LocalTensor<int32_t>& src0, const LocalTensor<int32_t>& src1, uint64_t mask[], const uint8_t repeatTime, const BinaryRepeatParams& repeatParams)
  - Contiguous mask mode
      template <bool isSetMask = true>
      __aicore__ inline void AddDeqRelu(const LocalTensor<half>& dst, const LocalTensor<int32_t>& src0, const LocalTensor<int32_t>& src1, uint64_t mask, const uint8_t repeatTime, const BinaryRepeatParams& repeatParams)
- Computation of the first n pieces of data of a tensor
    template <typename T, typename U>
    __aicore__ inline void AddDeqRelu(const LocalTensor<T>& dst, const LocalTensor<U>& src0, const LocalTensor<U>& src1, const int32_t& count)
- High-dimensional tensor sharding computation
  - Bitwise mask mode
      template <typename T, typename U, bool isSetMask = true>
      __aicore__ inline void AddDeqRelu(const LocalTensor<T>& dst, const LocalTensor<U>& src0, const LocalTensor<U>& src1, uint64_t mask[], const uint8_t repeatTime, const BinaryRepeatParams& repeatParams)
  - Contiguous mask mode
      template <typename T, typename U, bool isSetMask = true>
      __aicore__ inline void AddDeqRelu(const LocalTensor<T>& dst, const LocalTensor<U>& src0, const LocalTensor<U>& src1, uint64_t mask, const uint8_t repeatTime, const BinaryRepeatParams& repeatParams)
Parameters
| Parameter | Description |
|---|---|
| isSetMask | Indicates whether to set the mask inside the API. |
| T | Data type of the destination operand. |
| U | Data type of the source operand. |

| Parameter | Input/Output | Description |
|---|---|---|
| dst | Output | Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. |
| src0, src1 | Input | Source operands. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. |
| count | Input | Number of elements involved in the computation. |
| mask[]/mask | Input | Controls which elements are involved in the computation in each iteration. When the source and destination operands have different bit widths, the mask is calculated based on the data type with more bytes. |
| repeatTime | Input | Number of iterations. The Vector Unit reads 256 bytes of contiguous data for computation in each iteration, so processing the complete input takes multiple iterations. For details about this parameter, see High-dimensional Sharding APIs. |
| repeatParams | Input | Parameters that control the operand address strides, of the BinaryRepeatParams type. They include the address stride of the same data block between adjacent iterations and the address stride between data blocks within a single iteration. For details about the stride between adjacent iterations, see repeatStride. For details about the stride between data blocks within an iteration, see dataBlockStride. |
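
The relationship between the two mask modes can be illustrated with a small host-side C++ sketch. It assumes that a contiguous mask is the number of leading elements that take part in each iteration, that bit i of a bitwise mask enables element i, and that the mask is counted in int32_t elements (64 per 256-byte iteration), so the sketch only populates mask[0]. `ContiguousToBitwise` and `IsActive` are hypothetical helper names, not Ascend C APIs.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Elements per iteration when the mask is counted in int32_t: 256 B / 4 B.
constexpr uint64_t kElemsPerRepeat = 256 / sizeof(int32_t);  // 64

// Hypothetical helper: bitwise equivalent of a contiguous mask value.
std::array<uint64_t, 2> ContiguousToBitwise(uint64_t count)
{
    assert(count >= 1 && count <= kElemsPerRepeat);
    // Shifting a 64-bit value by 64 is undefined, so the full mask is handled separately.
    const uint64_t low = (count == 64) ? UINT64_MAX : ((uint64_t{1} << count) - 1);
    return {low, 0};
}

// Hypothetical helper: reports whether element `index` of an iteration takes part.
bool IsActive(const std::array<uint64_t, 2>& mask, unsigned index)
{
    assert(index < kElemsPerRepeat);
    return ((mask[0] >> index) & 1u) != 0;
}
```

Under these assumptions, a contiguous mask of 3 corresponds to the bit pattern 0b111 in mask[0], and a contiguous mask of 64 corresponds to all 64 bits of mask[0] being set.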
Returns
None
Constraints
- For details about the operand address alignment requirements, see General Address Alignment Restrictions.
- The destination operand and source operand cannot overlap in address.
Examples
In the following examples, src0Local and src1Local are of the int32_t type, and dstLocal is of the half type. The mask is calculated based on int32_t.
- Example of high-dimensional tensor sharding computation (contiguous mask mode)
    uint64_t mask = 256 / sizeof(int32_t); // 64
    // repeatTime = 4. 64 elements are computed in each iteration, and 256 elements are computed in total.
    // dstBlkStride, src0BlkStride, src1BlkStride = 1. Data is continuously read and written in a single repeat.
    // dstRepStride = 4, src0RepStride, src1RepStride = 8. Data is continuously read and written between adjacent iterations.
    half scale = 0.1;
    AscendC::SetDeqScale(scale);
    AscendC::AddDeqRelu(dstLocal, src0Local, src1Local, mask, 4, { 1, 1, 1, 4, 8, 8 });
- Example of high-dimensional tensor sharding computation (bitwise mask mode)
    uint64_t mask[2] = { UINT64_MAX, UINT64_MAX };
    // repeatTime = 4. 64 elements are computed in each iteration, and 256 elements are computed in total.
    // dstBlkStride, src0BlkStride, src1BlkStride = 1. Data is continuously read and written in a single repeat.
    // dstRepStride = 4, src0RepStride, src1RepStride = 8. Data is continuously read and written between adjacent iterations.
    half scale = 0.1;
    AscendC::SetDeqScale(scale);
    AscendC::AddDeqRelu(dstLocal, src0Local, src1Local, mask, 4, { 1, 1, 1, 4, 8, 8 });
- Example of computing the first n pieces of data of a tensor
    half scale = 0.1;
    AscendC::SetDeqScale(scale);
    AscendC::AddDeqRelu(dstLocal, src0Local, src1Local, 512);
    Input (src0Local): [70 36 43 54 28 49 27 82 95 ...]
    Input (src1Local): [19 33 34 50 42 2 97 93 99 ...]
    Output (dstLocal): [8.9 6.9 7.7 10.4 7.0 5.1 12.4 17.5 19.4 ...]
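
The repeatTime and repeat stride values used in the high-dimensional examples follow from simple arithmetic, sketched below as host-side C++. The sketch assumes that repeat strides are counted in 32-byte data blocks and that the data is densely packed. `ContiguousParams` and `MakeContiguousParams` are hypothetical names, not Ascend C types or APIs.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kRepeatBytes = 256;  // bytes the Vector Unit reads per iteration
constexpr uint32_t kBlockBytes  = 32;   // assumed size of one data block

// Hypothetical container for the derived values (not BinaryRepeatParams).
struct ContiguousParams {
    uint32_t elemsPerRepeat;
    uint32_t repeatTime;
    uint32_t dstRepStride;  // in data blocks
    uint32_t srcRepStride;  // in data blocks
};

// Derives the values for densely packed int32_t sources and a 2-byte (half) destination.
ContiguousParams MakeContiguousParams(uint32_t totalElems)
{
    constexpr uint32_t kSrcElemBytes = sizeof(int32_t);  // 4
    constexpr uint32_t kDstElemBytes = 2;                // size of half on the device
    ContiguousParams p{};
    p.elemsPerRepeat = kRepeatBytes / kSrcElemBytes;     // 64
    assert(totalElems % p.elemsPerRepeat == 0);          // sketch handles whole iterations only
    p.repeatTime   = totalElems / p.elemsPerRepeat;
    p.srcRepStride = p.elemsPerRepeat * kSrcElemBytes / kBlockBytes;  // 8
    p.dstRepStride = p.elemsPerRepeat * kDstElemBytes / kBlockBytes;  // 4
    return p;
}
```

For 256 elements this yields repeatTime = 4 with repeat strides of 4 (dst) and 8 (src0, src1), which matches the `{ 1, 1, 1, 4, 8, 8 }` argument in the examples above. Note that repeatTime is a uint8_t in the prototypes, so a real call must keep the value within that range.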