BroadCast
Function Usage
Broadcasts the input based on the output shape.
For example, if the shape of A is (2, 1) and the target shape after broadcasting is (2, 16), then the original single column will be expanded to 16 identical columns.
```
Input:
[[ 1]
 [ 2]]
Output:
[[ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [ 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]]
```
Principles
The internal algorithm of the BroadCast high-level API works as follows, taking the float type, ND format, and broadcasting from [m, 1] to [m, k] as an example.

The computation process is divided into the following steps, all of which are performed on vectors:
- Brcb step: broadcast each element into one full data block (32 bytes).
- Copy step: copy each data block into multiple data blocks so that every row holds k' elements, where k' is the size obtained by padding k up to 32-byte alignment. In the k-aligned scenario (k == k'), this result is already y.
- GatherMask step: in the non-k-aligned scenario, use GatherMask to truncate the [m, k'] intermediate result to [m, k]. A host-side sketch of these steps follows this list.
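To make the three steps concrete, the following is a minimal host-side C++ sketch that mimics the same pipeline on plain float arrays, assuming a 32-byte data block of 8 floats. It models the algorithm only; it does not use the actual Brcb/Copy/GatherMask instructions.

```cpp
#include <cstdio>
#include <vector>

// Illustrative host-side model of the BroadCast pipeline for [m, 1] -> [m, k].
// BLOCK is the number of floats in one 32-byte data block.
constexpr int BLOCK = 8;

std::vector<float> BroadcastRows(const std::vector<float> &src, int m, int k) {
    int kAligned = (k + BLOCK - 1) / BLOCK * BLOCK;  // k' padded to 32 bytes

    // Brcb step: expand each source element into one full data block.
    std::vector<float> blocks(m * BLOCK);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < BLOCK; ++j)
            blocks[i * BLOCK + j] = src[i];

    // Copy step: replicate each block until a row holds k' elements.
    std::vector<float> padded(m * kAligned);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < kAligned; ++j)
            padded[i * kAligned + j] = blocks[i * BLOCK + j % BLOCK];

    if (k == kAligned) return padded;  // k-aligned scenario: result is already y

    // GatherMask step: truncate each [1, k'] row to [1, k].
    std::vector<float> dst(m * k);
    for (int i = 0; i < m; ++i)
        for (int j = 0; j < k; ++j)
            dst[i * k + j] = padded[i * kAligned + j];
    return dst;
}

int main() {
    std::vector<float> src = {1.0f, 2.0f};
    std::vector<float> dst = BroadcastRows(src, 2, 3);  // non-aligned k
    for (int i = 0; i < 2; ++i) {
        for (int j = 0; j < 3; ++j) printf("%g ", dst[i * 3 + j]);
        printf("\n");
    }
    return 0;
}
```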
Prototype
- Pass the temporary space through the sharedTmpBuffer input parameter.

```cpp
template <typename T, int32_t dim, int32_t axis, bool isReuseSource = false>
__aicore__ inline void BroadCast(LocalTensor<T> &dstLocal, const LocalTensor<T> &srcLocal, const uint32_t dstShape[dim], const uint32_t srcShape[dim], LocalTensor<uint8_t> &sharedTmpBuffer)
```
- Allocate the temporary space through the API framework.

```cpp
template <typename T, int32_t dim, int32_t axis, bool isReuseSource = false>
__aicore__ inline void BroadCast(LocalTensor<T> &dstLocal, const LocalTensor<T> &srcLocal, const uint32_t dstShape[dim], const uint32_t srcShape[dim])
```
This API requires extra temporary space to store intermediate variables during computation. The temporary space can be passed by developers through the sharedTmpBuffer input parameter or allocated through the API framework.
- When the temporary space is passed through the sharedTmpBuffer input parameter, that tensor serves as the temporary space and the API framework performs no allocation. Developers can manage the sharedTmpBuffer space themselves and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, which improves flexibility and buffer utilization.
- When the API framework is used for temporary space allocation, developers do not need to allocate the space, but must reserve the required size for the space.
If sharedTmpBuffer is used, developers must allocate space for the tensor. If the API framework is used, developers must reserve the temporary space. To obtain the size of the temporary space (BufferSize) to be reserved, use the API provided in GetBroadCastMaxMinTmpSize.
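As a usage sketch of the first prototype, the following assumes an existing TPipe (pipe), prepared dstLocal/srcLocal tensors, and a tmpBufferSize obtained from GetBroadCastMaxMinTmpSize; the shape values are illustrative.

```cpp
// Sketch only: pipe, dstLocal, and srcLocal are assumed to exist already, and
// tmpBufferSize should come from GetBroadCastMaxMinTmpSize.
AscendC::TBuf<AscendC::TPosition::VECCALC> tmpBuf;
pipe.InitBuffer(tmpBuf, tmpBufferSize);
AscendC::LocalTensor<uint8_t> sharedTmpBuffer = tmpBuf.Get<uint8_t>();

constexpr int32_t dim = 2;
constexpr int32_t axis = 1;  // broadcast along the last dimension
const uint32_t dstShape[dim] = {16, 16};
const uint32_t srcShape[dim] = {16, 1};
AscendC::BroadCast<float, dim, axis>(dstLocal, srcLocal, dstShape, srcShape, sharedTmpBuffer);
```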
Parameters
| Parameter | Function |
|---|---|
| T | Data type of the operand. Currently, uint8_t/int8_t/half/float is supported. |
| dim | Dimension of the input/output tensor. Currently, only 1-dimensional and 2-dimensional tensors are supported. |
| axis | Dimension to be broadcasted. Currently, only dimensions 0 and 1 are supported. |
| isReuseSource | Whether the source operand can be modified. This parameter is reserved. Pass the default value false. |
| Parameter | Input/Output | Description |
|---|---|---|
| dstLocal | Output | Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. |
| srcLocal | Input | Source operand. The source operand must have the same data type as the destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. |
| dstShape | Input | Shape of the output tensor. It is an array of the uint32_t type with a length of 1 or 2. The number of input shape dimensions must be the same as that of output shape dimensions. |
| srcShape | Input | Shape of the input tensor. It is an array of the uint32_t type with a length of 1 or 2. The number of input shape dimensions must be the same as that of output shape dimensions. |
| sharedTmpBuffer | Input | Temporary buffer. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. This parameter is used to store intermediate variables during complex computation of BroadCast and is provided by developers. For details about how to obtain the temporary space size (BufferSize), see BroadCast Tiling. |
Returns
None
Availability
Constraints
- For details about the alignment requirements of the operand address offset, see General Restrictions.
- The source operand address must not overlap the destination operand address.
- Currently, only the ND format is supported.
- Currently, dim can only be set to 1 or 2, and axis can only be set to 0 or 1.
- When dim is 2 and axis is 0, srcShape[1] must be 32-byte aligned (see the alignment sketch below).
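For clarity, a hypothetical helper showing what 32-byte alignment of srcShape[1] means for a given element type; the function name is ours, not part of the Ascend C API.

```cpp
// Hypothetical helper (not part of the Ascend C API): checks whether a row of
// `cols` elements of type T occupies a multiple of 32 bytes, as required when
// dim == 2 and axis == 0.
template <typename T>
constexpr bool IsRowSize32ByteAligned(uint32_t cols) {
    return (cols * sizeof(T)) % 32U == 0U;
}
// For float: IsRowSize32ByteAligned<float>(8) is true (32 bytes),
// IsRowSize32ByteAligned<float>(10) is false (40 bytes).
```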
Example
```cpp
#include "kernel_operator.h"

template <typename T, int32_t dim, int32_t axis>
class KernelBroadCast {
public:
    __aicore__ inline KernelBroadCast() {}
    __aicore__ inline void Init(
        GM_ADDR srcGm, GM_ADDR dstGm, const uint32_t dstShape[dim], const uint32_t srcShape[dim])
    {
        for (uint32_t i = 0; i < dim; i++) {
            srcSize *= srcShape[i];
            dstSize *= dstShape[i];
        }

        srcGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ T *>(srcGm), srcSize);
        dstGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ T *>(dstGm), dstSize);

        pipe.InitBuffer(inQueueX, 1, srcSize * sizeof(T));
        pipe.InitBuffer(outQueue, 1, dstSize * sizeof(T));

        dstShape_ = dstShape;
        srcShape_ = srcShape;
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<T> srcLocal = inQueueX.AllocTensor<T>();
        AscendC::DataCopy(srcLocal, srcGlobal, srcSize);
        inQueueX.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<T> dstLocal = outQueue.AllocTensor<T>();
        AscendC::LocalTensor<T> srcLocal = inQueueX.DeQue<T>();

        AscendC::BroadCast<T, dim, axis>(dstLocal, srcLocal, dstShape_, srcShape_);

        outQueue.EnQue<T>(dstLocal);
        inQueueX.FreeTensor(srcLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<T> dstLocal = outQueue.DeQue<T>();
        AscendC::DataCopy(dstGlobal, dstLocal, dstSize);
        outQueue.FreeTensor(dstLocal);
    }

private:
    AscendC::GlobalTensor<T> srcGlobal;
    AscendC::GlobalTensor<T> dstGlobal;

    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueX;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueue;

    const uint32_t *dstShape_{nullptr};
    const uint32_t *srcShape_{nullptr};
    int32_t srcSize{1};
    int32_t dstSize{1};
};

template <typename T, int32_t dim, int32_t axis>
__aicore__ void kernel_broadcast_operator(
    GM_ADDR srcGm, GM_ADDR dstGm, const uint32_t dstShape[dim], const uint32_t srcShape[dim])
{
    KernelBroadCast<T, dim, axis> op;
    op.Init(srcGm, dstGm, dstShape, srcShape);
    op.Process();
}
```
```
Input (srcTensor):
[[ 1]
 [ 2]
 [ 3]
 [ 4]
 [ 5]
 [ 6]
 [ 7]
 [ 8]
 [ 9]
 [10]
 [11]
 [12]
 [13]
 [14]
 [15]
 [16]]
Output (dstLocal):
[[ 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1]
 [ 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2]
 [ 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3]
 [ 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4]
 [ 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5]
 [ 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6 6]
 [ 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7]
 [ 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8 8]
 [ 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9 9]
 [10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10]
 [11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11]
 [12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12]
 [13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13]
 [14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14]
 [15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15]
 [16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16]]
```
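The sample data above corresponds to an instantiation of the kernel along the lines of the following sketch; the element type (half) and the srcGm/dstGm global-memory handles are illustrative assumptions.

```cpp
// Hedged sketch: srcGm and dstGm are assumed to be prepared global-memory
// addresses holding the [16, 1] input and receiving the [16, 16] output.
constexpr int32_t dim = 2;
constexpr int32_t axis = 1;
const uint32_t dstShape[dim] = {16, 16};
const uint32_t srcShape[dim] = {16, 1};
kernel_broadcast_operator<half, dim, axis>(srcGm, dstGm, dstShape, srcShape);
```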