BroadCast

Function Usage

Broadcasts the input based on the output shape.

For example, if the shape of A is (2, 1) and the target shape after broadcasting is (2, 16), then the original single column will be expanded to 16 identical columns.

Input:
[[ 1]
 [ 2]]
Output:
[[ 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1]
 [ 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2]]

Principles

The figure below illustrates the internal algorithm block diagram of the BroadCast high-level API, taking the float type, ND format, and a broadcast from [m, 1] to [m, k] as an example.

Figure 1 BroadCast algorithm block diagram

The computation process is divided into the following steps, all of which are performed on the Vector unit:

  1. Brcb step: Broadcast each source element into a full data block.
  2. Copy step: Copy each data block into multiple data blocks. In the k-aligned scenario, the result is already y.
  3. GatherMask step: In the non-k-aligned scenario, the previous step yields [m, k'] elements, where k' is the size obtained by padding k upwards to 32-byte alignment; use GatherMask to extract the final [m, k] result from [m, k'].
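The steps above can be sketched on the host in plain C++ (a minimal reference for float [m, 1] -> [m, k], not the device implementation; the 8-float block width assumes float data and 32-byte data blocks):

```cpp
#include <cassert>
#include <cstring>
#include <vector>

// Host-side reference sketch of the three BroadCast steps for a float
// [m, 1] -> [m, k] broadcast.
std::vector<float> BroadcastRef(const std::vector<float>& src, int m, int k) {
    // Steps 1 and 2 (Brcb + Copy): each source element is broadcast into a
    // data block and the block is replicated across the row. On the host the
    // two steps collapse into replicating src[i] k' times, with
    // k' = k rounded up to a 32-byte (8-float) boundary.
    const int kPadded = (k + 7) / 8 * 8;
    std::vector<float> wide(static_cast<size_t>(m) * kPadded);
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < kPadded; ++j) {
            wide[i * kPadded + j] = src[i];
        }
    }
    if (kPadded == k) {
        return wide;  // k-aligned scenario: this is already the result y
    }
    // Step 3 (GatherMask): extract the leading k columns from each padded row.
    std::vector<float> out(static_cast<size_t>(m) * k);
    for (int i = 0; i < m; ++i) {
        std::memcpy(&out[i * k], &wide[i * kPadded], k * sizeof(float));
    }
    return out;
}
```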

Prototype

  • Pass the temporary space through the sharedTmpBuffer input parameter.
    template <typename T, int32_t dim, int32_t axis, bool isReuseSource = false>
    __aicore__ inline void BroadCast(LocalTensor<T> &dstLocal, const LocalTensor<T> &srcLocal, const uint32_t dstShape[dim], const uint32_t srcShape[dim], LocalTensor<uint8_t> &sharedTmpBuffer)
    
  • Allocate the temporary space through the API framework.
    template <typename T, int32_t dim, int32_t axis, bool isReuseSource = false>
    __aicore__ inline void BroadCast(LocalTensor<T> &dstLocal, const LocalTensor<T> &srcLocal, const uint32_t dstShape[dim], const uint32_t srcShape[dim])
    

This API requires extra temporary space to store intermediate variables during computation. The temporary space can be passed in by developers through the sharedTmpBuffer input parameter or allocated by the API framework.

  • When the temporary space is passed in through the sharedTmpBuffer input parameter, that tensor serves as the temporary space and the API framework does not need to allocate it. Developers can manage the sharedTmpBuffer space themselves and reuse the buffer across API calls, so that it is not repeatedly allocated and deallocated, improving flexibility and buffer utilization.
  • When the API framework allocates the temporary space, developers do not need to allocate it, but must reserve enough space for it.

If sharedTmpBuffer is used, developers must allocate space for the tensor. If the API framework is used, developers must reserve the temporary space. To obtain the size of the temporary space (BufferSize) to be reserved, use the API provided in GetBroadCastMaxMinTmpSize.

Parameters

Table 1 Parameters in the template

  • T: Data type of the operand. Currently, uint8_t/int8_t/half/float is supported.
  • dim: Dimension of the input/output tensor. Currently, only 1-dimensional and 2-dimensional tensors are supported.
  • axis: Dimension to be broadcast. Currently, only dimensions 0 and 1 are supported.
  • isReuseSource: Whether the source operand can be modified. This parameter is reserved. Pass the default value false.

Table 2 API parameters

  • dstLocal (output): Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
  • srcLocal (input): Source operand. It must have the same data type as the destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
  • dstShape (input): Shape of the output tensor. It is a uint32_t array with a length of 1 or 2. The number of input shape dimensions must be the same as the number of output shape dimensions.
  • srcShape (input): Shape of the input tensor. It is a uint32_t array with a length of 1 or 2. The number of input shape dimensions must be the same as the number of output shape dimensions.
  • sharedTmpBuffer (input): Temporary buffer, provided by developers to store intermediate variables during computation in this API. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. For details about how to obtain the temporary space size (BufferSize), see BroadCast Tiling.
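As a hedged illustration of the shape rules above, a host-side validity check for the 2-dimensional case might look like the sketch below. The function name and the assumption that srcShape must be exactly 1 along the broadcast axis are ours, not from the API:

```cpp
#include <cstdint>

// Hypothetical host-side validity check for a 2-D BroadCast call, based on
// the rules in this table: src and dst must have the same number of
// dimensions, srcShape is assumed to be 1 along the broadcast axis, and the
// other dimension must match dstShape.
inline bool IsValidBroadcast2D(const uint32_t dstShape[2], const uint32_t srcShape[2], int axis)
{
    if (axis != 0 && axis != 1) {
        return false;  // only axes 0 and 1 are supported
    }
    const int other = 1 - axis;
    return srcShape[axis] == 1 && srcShape[other] == dstShape[other];
}
```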

Returns

None

Availability

Constraints

  • For details about the alignment requirements of the operand address offset, see General Restrictions.
  • The source operand address must not overlap the destination operand address.
  • Currently, only the ND format is supported.
  • Currently, dim can only be set to 1 or 2, and axis can only be set to 0 or 1.
  • When dim is 2 and axis is 0, srcShape[1] must be 32-byte aligned.
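The 32-byte alignment constraint can be checked up front. A minimal sketch, assuming "aligned" means the last dimension's byte size (srcShape[1] * sizeof(T)) is a multiple of 32:

```cpp
#include <cstddef>
#include <cstdint>

// Checks the dim == 2, axis == 0 constraint: the last dimension must be
// 32-byte aligned. Interpreted here as the row size in bytes being a
// multiple of 32 (e.g. 8 floats, 16 halves, 32 int8_t elements).
template <typename T>
bool LastDimIs32ByteAligned(uint32_t lastDim)
{
    return (static_cast<size_t>(lastDim) * sizeof(T)) % 32 == 0;
}
```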

Example

#include "kernel_operator.h"

template <typename T, int32_t dim, int32_t axis>
class KernelBroadCast {
public:
    __aicore__ inline KernelBroadCast()
    {}
    __aicore__ inline void Init(
        GM_ADDR srcGm, GM_ADDR dstGm, const uint32_t dstShape[dim], const uint32_t srcShape[dim])
    {
        // Total element counts are the products of the shape dimensions.
        for (uint32_t i = 0; i < dim; i++) {
            srcSize *= srcShape[i];
            dstSize *= dstShape[i];
        }
        srcGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ T *>(srcGm), srcSize);
        dstGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ T *>(dstGm), dstSize);

        // One queue buffer each for the input tensor and the broadcast output.
        pipe.InitBuffer(inQueueX, 1, srcSize * sizeof(T));
        pipe.InitBuffer(outQueue, 1, dstSize * sizeof(T));
        dstShape_ = dstShape;
        srcShape_ = srcShape;
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<T> srcLocal = inQueueX.AllocTensor<T>();
        AscendC::DataCopy(srcLocal, srcGlobal, srcSize);
        inQueueX.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<T> dstLocal = outQueue.AllocTensor<T>();
        AscendC::LocalTensor<T> srcLocal = inQueueX.DeQue<T>();
        AscendC::BroadCast<T, dim, axis>(dstLocal, srcLocal, dstShape_, srcShape_);

        outQueue.EnQue<T>(dstLocal);
        inQueueX.FreeTensor(srcLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<T> dstLocal = outQueue.DeQue<T>();
        AscendC::DataCopy(dstGlobal, dstLocal, dstSize);
        outQueue.FreeTensor(dstLocal);
    }

private:
    AscendC::GlobalTensor<T> srcGlobal;
    AscendC::GlobalTensor<T> dstGlobal;

    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueX;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueue;
    const uint32_t *dstShape_{nullptr};
    const uint32_t *srcShape_{nullptr};
    int32_t srcSize{1};
    int32_t dstSize{1};
};

template <typename T, int32_t dim, int32_t axis>
__aicore__ void kernel_broadcast_operator(
    GM_ADDR srcGm, GM_ADDR dstGm, const uint32_t dstShape[dim], const uint32_t srcShape[dim])
{
    KernelBroadCast<T, dim, axis> op;
    op.Init(srcGm, dstGm, dstShape, srcShape);
    op.Process();
}
Result example:
Input (srcTensor):
[[ 1]
 [ 2]
 [ 3]
 [ 4]
 [ 5]
 [ 6]
 [ 7]
 [ 8]
 [ 9]
 [10]
 [11]
 [12]
 [13]
 [14]
 [15]
 [16]]
Output (dstLocal):
[[ 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1]
 [ 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2]
 [ 3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3]
 [ 4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4]
 [ 5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5]
 [ 6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6]
 [ 7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7]
 [ 8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8]
 [ 9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9]
 [10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10]
 [11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11]
 [12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12]
 [13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13]
 [14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14]
 [15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15]
 [16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16]]