LeakyRelu
Applicability
|
Product |
Supported/Unsupported |
|---|---|
|
|
√ |
|
|
√ |
|
|
√ |
|
|
√ |
|
|
x |
|
|
x |
Function Usage
Performs the Leaky Rectified Linear Unit (Leaky ReLU) operation by element. The calculation formula is as follows:

Leaky ReLU is a common activation function in artificial neural networks. Its mathematical expression is as follows:

The difference between Leaky ReLU and ReLU is that ReLU sets all negative values to zero, while Leaky ReLU assigns a slope to all negative values. The following figure shows the difference between ReLU and Leaky ReLU.


Prototype
- Computation of the first n pieces of data of a tensor
1 2
template <typename T, bool isSetMask = true> __aicore__ inline void LeakyRelu(const LocalTensor<T>& dst, const LocalTensor<T>& src, const T& scalarValue, const int32_t& count)
- High-dimensional tensor sharding computation
- Bitwise mask mode
1 2
template <typename T, bool isSetMask = true> __aicore__ inline void LeakyRelu(const LocalTensor<T>& dst, const LocalTensor<T>& src, const T& scalarValue, uint64_t mask[], const uint8_t repeatTime, const UnaryRepeatParams& repeatParams)
- Contiguous mask mode
1 2
template <typename T, bool isSetMask = true> __aicore__ inline void LeakyRelu(const LocalTensor<T>& dst, const LocalTensor<T>& src, const T& scalarValue, uint64_t mask, const uint8_t repeatTime, const UnaryRepeatParams& repeatParams)
- Bitwise mask mode
- Computation of the first n pieces of data of a tensor
1 2
template <typename T, typename U, bool isSetMask = true, typename Std::enable_if<Std::is_same<PrimT<T>, U>::value, bool>::type = true> __aicore__ inline void LeakyRelu(const LocalTensor<T>& dst, const LocalTensor<T>& src, const U& scalarValue, const int32_t& count)
- High-dimensional tensor sharding computation
- Bitwise mask mode
1 2
template <typename T, typename U, bool isSetMask = true, typename Std::enable_if<Std::is_same<PrimT<T>, U>::value, bool>::type = true> __aicore__ inline void LeakyRelu(const LocalTensor<T>& dst, const LocalTensor<T>& src, const U& scalarValue, uint64_t mask[], const uint8_t repeatTime, const UnaryRepeatParams& repeatParams)
- Contiguous mask mode
1 2
template <typename T, typename U, bool isSetMask = true, typename Std::enable_if<Std::is_same<PrimT<T>, U>::value, bool>::type = true> __aicore__ inline void LeakyRelu(const LocalTensor<T>& dst, const LocalTensor<T>& src, const U& scalarValue, uint64_t mask, const uint8_t repeatTime, const UnaryRepeatParams& repeatParams)
- Bitwise mask mode
Parameters
|
Parameter |
Description |
|---|---|
|
T |
Operand data type. For the For the For the For the |
|
U |
Data type of scalarValue. For the For the For the For the |
|
isSetMask |
Whether to set the mask mode and mask value inside the API.
For the following models, the isSetMask parameter in the API for calculating the first n pieces of data in a tensor does not take effect. Retain the default value.
|
|
Parameter |
Input/Output |
Meaning |
|---|---|---|
|
dst |
Output |
Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. |
|
src |
Input |
Source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The start address of the LocalTensor must be 32-byte aligned. Its data type must be the same as that of the destination operand. |
|
scalarValue |
Input |
Source operand, and its data type must be the same as the element type of the tensor in the destination operand. |
|
count |
Input |
Number of elements involved in the computation. |
|
mask/mask[] |
Input |
The mask parameter is used to control the elements involved in computation in each iteration.
|
|
repeatTime |
Input |
Number of iteration repeats. The Vector Unit reads 256 bytes of contiguous data for computation each time. To read the complete data for processing, the unit needs to read the input data in multiple repeats. repeatTime indicates the number of repeats. For details about this parameter, see High-dimensional Sharding APIs. |
|
repeatParams |
Input |
Control structure information of element operations. For details, see UnaryRepeatParams. |
Returns
None
Restrictions
- For details about the operand address alignment requirements, see General Address Alignment Restrictions.
- For details about the operand address overlapping restrictions, see General Address Overlap Restrictions.
Examples
For more examples, see here.
- Example of high-dimensional tensor sharding computation (contiguous mask mode)
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57
#include "kernel_operator.h" class KernelBinaryScalar { public: __aicore__ inline KernelBinaryScalar() {} __aicore__ inline void Init(__gm__ uint8_t* src, __gm__ uint8_t* dstGm) { srcGlobal.SetGlobalBuffer((__gm__ half*)src); dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm); pipe.InitBuffer(inQueueSrc, 1, 512 * sizeof(half)); pipe.InitBuffer(outQueueDst, 1, 512 * sizeof(half)); } __aicore__ inline void Process() { CopyIn(); Compute(); CopyOut(); } private: __aicore__ inline void CopyIn() { AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>(); AscendC::DataCopy(srcLocal, srcGlobal, 512); inQueueSrc.EnQue(srcLocal); } __aicore__ inline void Compute() { AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>(); AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>(); uint64_t mask = 128; half scalar = 2; // repeatTime = 4, 128 elements one repeat, 512 elements total // dstBlkStride, srcBlkStride = 1, no gap between blocks in one repeat // dstRepStride, srcRepStride =8, no gap between repeats AscendC::LeakyRelu(dstLocal, srcLocal, scalar, mask, 4, {1, 1, 8, 8}); outQueueDst.EnQue<half>(dstLocal); inQueueSrc.FreeTensor(srcLocal); } __aicore__ inline void CopyOut() { AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>(); AscendC::DataCopy(dstGlobal, dstLocal, 512); outQueueDst.FreeTensor(dstLocal); } private: AscendC::TPipe pipe; AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc; AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst; AscendC::GlobalTensor<half> srcGlobal, dstGlobal; }; extern "C" __global__ __aicore__ void binary_scalar_simple_kernel(__gm__ uint8_t* src, __gm__ uint8_t* dstGm) { KernelBinaryScalar op; op.Init(src, dstGm); op.Process(); }
- Example of high-dimensional tensor sharding computation (bitwise mask mode)
1 2 3 4 5 6
uint64_t mask[2] = { UINT64_MAX, UINT64_MAX }; half scalar = 2; // repeatTime = 4. 128 elements are processed in a single iteration, and four iterations are required to compute 512 elements. // dstBlkStride, srcBlkStride = 1. The interval between src0 data addresses involved in calculation in each iteration is one data block, indicating that data is continuously read and written in a single iteration. // dstRepStride, srcRepStride = 8. The interval between addresses of adjacent iterations is eight data blocks, indicating that data is continuously read and written between adjacent iterations. AscendC::LeakyRelu(dstLocal, srcLocal, scalar, mask, 4, {1, 1, 8, 8});
- Example of computing the first n data elements of a tensor
1 2
half scalar = 2; AscendC::LeakyRelu(dstLocal, srcLocal, scalar, 512);
Input (src0Local): [1 2 3 ... 512] Input (scalar): 2 Output (dstLocal): [1. 2. 3. ... 512.]