Broadcast

Applicability

Product	Supported
Atlas A3 training products / Atlas A3 inference products	√
Atlas A2 training products / Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	x
Atlas inference product's AI Core	√
Atlas inference product's Vector Core	x
Atlas training products	x

Function

Broadcasts the input tensor to the specified output shape.

For example, if the shape of A is (2, 1) and the target shape after broadcasting is (2, 16), the single column of A is expanded to 16 identical columns.

Input:
[[ 1]
 [ 2]]
Output:
[[ 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1]
 [ 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2]]
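The expansion above can be reproduced on the host in plain C++. The following is a minimal reference sketch, not the Ascend C implementation: the helper name BroadcastColumns is hypothetical, and a row-major (ND) layout is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side reference (not part of Ascend C):
// expands a row-major [m, 1] input to [m, k] by repeating each row's element k times.
std::vector<int32_t> BroadcastColumns(const std::vector<int32_t>& src, std::size_t m, std::size_t k)
{
    assert(src.size() == m);  // the source holds exactly one element per row
    std::vector<int32_t> dst(m * k);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < k; ++j) {
            dst[i * k + j] = src[i];  // dst[i][j] = src[i][0]
        }
    }
    return dst;
}
```

Calling BroadcastColumns({1, 2}, 2, 16) yields 32 elements: sixteen 1s followed by sixteen 2s, matching the output shown above.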

Principles

The figure below shows the internal algorithm block diagram of the Broadcast high-level API, using the float type, the ND format, and a broadcast from [m, 1] to [m, k] as an example.

Figure 1 Broadcast algorithm block diagram

The computation process is divided into the following steps, all of which are performed on vectors:

brcb step: Broadcast each element into one data block.
Copy step: Copy each data block to multiple data blocks. In the k-aligned scenario, the result is y.
In the non-k-aligned scenario, the Copy step yields [m, k'] elements, where k' is the size obtained by padding k upwards to 32-byte alignment. GatherMask is then used to truncate the result to [m, k] elements.
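The steps above can be modeled on the host to make the role of k' concrete. This is an illustrative sketch only: ModelBroadcast and AlignedCount are hypothetical names, the loops stand in for the brcb, Copy, and GatherMask vector instructions rather than calling them, and a data block is assumed to be 32 bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kBlockBytes = 32;  // assumed size of one data block in bytes

// Rounds k elements of type T up to a whole number of 32-byte data blocks (k').
template <typename T>
constexpr std::size_t AlignedCount(std::size_t k)
{
    return (k + kBlockBytes / sizeof(T) - 1) / (kBlockBytes / sizeof(T)) * (kBlockBytes / sizeof(T));
}

// Host-side model of a [m, 1] -> [m, k] broadcast, following the three steps.
template <typename T>
std::vector<T> ModelBroadcast(const std::vector<T>& src, std::size_t m, std::size_t k)
{
    assert(src.size() == m);
    constexpr std::size_t perBlock = kBlockBytes / sizeof(T);
    const std::size_t kAligned = AlignedCount<T>(k);
    const std::size_t blocksPerRow = kAligned / perBlock;

    // brcb step: each element fills one data block.
    std::vector<T> blocks(m * perBlock);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t e = 0; e < perBlock; ++e) {
            blocks[i * perBlock + e] = src[i];
        }
    }

    // Copy step: each data block is replicated blocksPerRow times, giving [m, k'].
    std::vector<T> padded(m * kAligned);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t b = 0; b < blocksPerRow; ++b) {
            for (std::size_t e = 0; e < perBlock; ++e) {
                padded[i * kAligned + b * perBlock + e] = blocks[i * perBlock + e];
            }
        }
    }
    if (kAligned == k) {
        return padded;  // k-aligned scenario: the Copy result is already y
    }

    // GatherMask step (modeled): keep the first k elements of every k'-wide row.
    std::vector<T> dst(m * k);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < k; ++j) {
            dst[i * k + j] = padded[i * kAligned + j];
        }
    }
    return dst;
}
```

For float, one data block holds 8 elements, so k = 10 is padded to k' = 16 and the final truncation drops 6 elements per row.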

Prototype

Pass the temporary space through the sharedTmpBuffer input parameter.

template <typename T, int32_t dim, int32_t axis, bool isReuseSource = false>
__aicore__ inline void Broadcast(const LocalTensor<T>& dstLocal, const LocalTensor<T>& srcLocal, const uint32_t dstShape[dim], const uint32_t srcShape[dim], LocalTensor<uint8_t>& sharedTmpBuffer)

Allocate the temporary space through the API framework.

template <typename T, int32_t dim, int32_t axis, bool isReuseSource = false>
__aicore__ inline void Broadcast(const LocalTensor<T>& dstLocal, const LocalTensor<T>& srcLocal, const uint32_t dstShape[dim], const uint32_t srcShape[dim])

This API requires extra temporary space to store intermediate variables during computation. The temporary space can be passed through the sharedTmpBuffer input parameter or allocated through the API framework.

When the temporary space is passed through the sharedTmpBuffer input parameter, that tensor serves as the temporary space and the API framework does not allocate any. You manage the sharedTmpBuffer space yourself and can reuse the buffer after calling the API, which avoids repeated allocation and deallocation and improves flexibility and buffer utilization.
When the API framework is used for temporary space allocation, you do not need to allocate the space, but must reserve the required size for the space.

In both cases you need to know the size of the temporary space (BufferSize): with sharedTmpBuffer you allocate a tensor of that size, and with the API framework you reserve that size. To obtain the size, use the API provided in GetBroadCastMaxMinTmpSize.

Parameters

**Table 1** Template parameters
Parameter	Function
T	Data type of the operand. For the Atlas A3 training products / Atlas A3 inference products, the supported data types are int8_t, uint8_t, half, and float. For the Atlas A2 training products / Atlas A2 inference products, the supported data types are int8_t, uint8_t, half, and float. For the Atlas inference product's AI Core, the supported data types are int8_t, uint8_t, half, and float.
dim	Dimension of the input/output tensor. Currently, only 1-dimensional and 2-dimensional tensors are supported.
axis	Dimension to broadcast. Currently, only dimensions 0 and 1 are supported. The value 0 indicates that the first dimension is broadcast, and the value 1 indicates that the second dimension is broadcast.
isReuseSource	Whether the source operand can be modified. This parameter is reserved. Pass the default value false.
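To illustrate how dim and axis select the broadcast direction, the sketch below gives a host-side reference for the 2D case. It is not the Ascend C API: ReferenceBroadcast is a hypothetical name that only mirrors the template parameter order, and it assumes that the source extent along the broadcast axis is 1 while the other extent is unchanged (the axis 1 case matches the [m, 1] to [m, k] example in this document; the axis 0 case, [1, n] to [m, n], is inferred by symmetry).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical host-side reference for 2D tensors in row-major (ND) layout.
// axis = 0: [1, n] -> [m, n]; axis = 1: [m, 1] -> [m, k].
template <typename T, int32_t dim, int32_t axis>
std::vector<T> ReferenceBroadcast(
    const std::vector<T>& src, const uint32_t dstShape[dim], const uint32_t srcShape[dim])
{
    static_assert(dim == 2, "this sketch models 2D tensors only");
    static_assert(axis == 0 || axis == 1, "axis must be 0 or 1");
    assert(srcShape[axis] == 1);                       // assumed: broadcast axis has extent 1
    assert(srcShape[1 - axis] == dstShape[1 - axis]);  // assumed: other axis is unchanged
    assert(src.size() == static_cast<std::size_t>(srcShape[0]) * srcShape[1]);

    const std::size_t rows = dstShape[0];
    const std::size_t cols = dstShape[1];
    std::vector<T> dst(rows * cols);
    for (std::size_t i = 0; i < rows; ++i) {
        for (std::size_t j = 0; j < cols; ++j) {
            // The source index along the broadcast axis is always 0.
            dst[i * cols + j] = (axis == 0) ? src[j] : src[i];
        }
    }
    return dst;
}
```

With axis set to 0, a [1, 3] source {7, 8, 9} broadcast to [2, 3] repeats the row; with axis set to 1, a [2, 1] source {1, 2} broadcast to [2, 3] repeats each element across its row.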

**Table 2** API parameters
Parameter	Input/Output	Description
dstLocal	Output	Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
srcLocal	Input	Source operand. The source operand must have the same data type as the destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
dstShape	Input	Shape of the output tensor. It is an array of the uint32_t type with a length of 1 or 2. The number of input shape dimensions must be the same as that of output shape dimensions.
srcShape	Input	Shape of the input tensor. It is an array of the uint32_t type with a length of 1 or 2. The number of input shape dimensions must be the same as that of output shape dimensions.
sharedTmpBuffer	Input	Temporary buffer. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. This parameter is used to store intermediate variables during complex computation in Broadcast and is provided by developers. For details about how to obtain the temporary space size (BufferSize), see GetBroadCastMaxMinTmpSize.

Returns

None

Restrictions

For details about the operand address alignment requirements, see General Address Alignment Restrictions.
The source operand address must not overlap the destination operand address.
Currently, only the ND format is supported.
Currently, dim can only be set to 1 or 2, and axis can only be set to 0 or 1.
For the Atlas inference product's AI Core, when dim is 2 and axis is 1, srcShape[0] must be 32-byte aligned. That is, when the input/output tensor has two dimensions and the broadcast dimension is 1, the data size of dimension 0 of the input tensor must be a multiple of 32 bytes.
When dim is 2 and axis is 0, srcShape[1] must be 32-byte aligned.
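The shape-related restrictions above could be checked on the host before launching a kernel. The following sketch is an assumption-laden illustration: CheckBroadcastShape and the isInferenceAiCore flag are hypothetical, and "32-byte aligned" is interpreted as the element count multiplied by sizeof(T) being a multiple of 32.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed interpretation: a dimension is 32-byte aligned when
// elementCount * sizeof(T) is a multiple of 32.
template <typename T>
constexpr bool Is32ByteAligned(uint32_t elementCount)
{
    return (static_cast<std::size_t>(elementCount) * sizeof(T)) % 32 == 0;
}

// Hypothetical pre-check of the documented dim/axis and alignment restrictions.
// isInferenceAiCore is an illustrative flag for the Atlas inference product's AI Core.
template <typename T, int32_t dim, int32_t axis>
bool CheckBroadcastShape(const uint32_t srcShape[dim], bool isInferenceAiCore)
{
    static_assert(dim == 1 || dim == 2, "dim can only be 1 or 2");
    static_assert(axis == 0 || axis == 1, "axis can only be 0 or 1");
    static_assert(axis < dim, "axis must refer to an existing dimension");

    if (dim == 2 && axis == 1 && isInferenceAiCore) {
        return Is32ByteAligned<T>(srcShape[0]);  // restriction on srcShape[0]
    }
    if (dim == 2 && axis == 0) {
        return Is32ByteAligned<T>(srcShape[1]);  // restriction on srcShape[1]
    }
    return true;
}
```

For float, a srcShape of {8, 1} passes the axis 1 check (8 x 4 = 32 bytes), while {5, 1} fails it when the inference flag is set.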

Example

For more operator examples, see broadcast operator sample.

#include "kernel_operator.h"

template <typename T, int32_t dim, int32_t axis>
class KernelBroadcast {
public:
    __aicore__ inline KernelBroadcast()
    {}
    __aicore__ inline void Init(
        GM_ADDR srcGm, GM_ADDR dstGm, const uint32_t dstShape[dim], const uint32_t srcShape[dim])
    {
        for (uint32_t i = 0; i < dim; i++) {
            srcSize *= srcShape[i];
            dstSize *= dstShape[i];
        }
        srcGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ T *>(srcGm), srcSize);
        dstGlobal.SetGlobalBuffer(reinterpret_cast<__gm__ T *>(dstGm), dstSize);

        pipe.InitBuffer(inQueueX, 1, srcSize * sizeof(T));
        pipe.InitBuffer(outQueue, 1, dstSize * sizeof(T));
        // Only the pointers are kept, so the caller's shape arrays must outlive Process().
        dstShape_ = dstShape;
        srcShape_ = srcShape;
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<T> srcLocal = inQueueX.AllocTensor<T>();
        AscendC::DataCopy(srcLocal, srcGlobal, srcSize);
        inQueueX.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<T> dstLocal = outQueue.AllocTensor<T>();
        AscendC::LocalTensor<T> srcLocal = inQueueX.DeQue<T>();
        // Overload without sharedTmpBuffer: the API framework allocates the temporary space.
        AscendC::Broadcast<T, dim, axis>(dstLocal, srcLocal, dstShape_, srcShape_);

        outQueue.EnQue<T>(dstLocal);
        inQueueX.FreeTensor(srcLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<T> dstLocal = outQueue.DeQue<T>();
        AscendC::DataCopy(dstGlobal, dstLocal, dstSize);
        outQueue.FreeTensor(dstLocal);
    }

private:
    AscendC::GlobalTensor<T> srcGlobal;
    AscendC::GlobalTensor<T> dstGlobal;

    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueX;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueue;
    const uint32_t *dstShape_{nullptr};
    const uint32_t *srcShape_{nullptr};
    uint32_t srcSize{1};
    uint32_t dstSize{1};
};

template <typename T, int32_t dim, int32_t axis>
__aicore__ void kernel_broadcast_operator(
    GM_ADDR srcGm, GM_ADDR dstGm, const uint32_t dstShape[dim], const uint32_t srcShape[dim])
{
    KernelBroadcast<T, dim, axis> op;
    op.Init(srcGm, dstGm, dstShape, srcShape);
    op.Process();
}

Result example:

Input (srcLocal):
[[ 1]
 [ 2]
 [ 3]
 [ 4]
 [ 5]
 [ 6]
 [ 7]
 [ 8]
 [ 9]
 [10]
 [11]
 [12]
 [13]
 [14]
 [15]
 [16]]
dim: 2
axis: 1
Output (dstLocal):
[[ 1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1]
 [ 2  2  2  2  2  2  2  2  2  2  2  2  2  2  2  2]
 [ 3  3  3  3  3  3  3  3  3  3  3  3  3  3  3  3]
 [ 4  4  4  4  4  4  4  4  4  4  4  4  4  4  4  4]
 [ 5  5  5  5  5  5  5  5  5  5  5  5  5  5  5  5]
 [ 6  6  6  6  6  6  6  6  6  6  6  6  6  6  6  6]
 [ 7  7  7  7  7  7  7  7  7  7  7  7  7  7  7  7]
 [ 8  8  8  8  8  8  8  8  8  8  8  8  8  8  8  8]
 [ 9  9  9  9  9  9  9  9  9  9  9  9  9  9  9  9]
 [10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10]
 [11 11 11 11 11 11 11 11 11 11 11 11 11 11 11 11]
 [12 12 12 12 12 12 12 12 12 12 12 12 12 12 12 12]
 [13 13 13 13 13 13 13 13 13 13 13 13 13 13 13 13]
 [14 14 14 14 14 14 14 14 14 14 14 14 14 14 14 14]
 [15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15]
 [16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16]]
