Sum

Function Description

Obtains the sum of elements in the last dimension.

If the input is a vector, all of its elements are added together. If the input is a matrix, the elements in each row are summed along the last dimension. This API accepts inputs with at most two dimensions; higher-dimensional inputs are not supported.

As illustrated in the figure below, the operation is conducted on a two-dimensional matrix with a shape of (2, 3) and yields the output result of [[ 6] [15]].

To describe this operation, a few terms are needed. The number of rows is called the outer axis length (outter). The actual number of elements in each row is called the actual inner axis element count (n). The element count obtained after the byte length required to store n elements is padded up to an integer multiple of 32 is called the padded inner axis element count (inner).

This API requires the input inner axis length to be an integer multiple of 32 bytes. If the byte length occupied by the n elements is not a multiple of 32, developers must pad it up to one. In the example above, the element type is half and each row holds n = 3 elements, occupying 6 bytes, which is not a multiple of 32; after padding up to 32 bytes, the element count becomes 16. Therefore outter = 2, n = 3, and inner = 16. In the figure, padding indicates the padding operation. The relationship between n and inner is: inner = (n × sizeof(T) + 32 – 1) / 32 × 32 / sizeof(T).

Prototype

  • Pass the temporary space through the sharedTmpBuffer input parameter.
    template <typename T, int32_t reduceDim = -1, bool isReuseSource = false, bool isBasicBlock = false>
    __aicore__ inline void Sum(const LocalTensor<T> &dstTensor, const LocalTensor<T> &srcTensor, const LocalTensor<uint8_t> &sharedTmpBuffer, const SumParams &sumParams)
    
  • Allocate the temporary space through the API framework.
    template <typename T, int32_t reduceDim = -1, bool isReuseSource = false, bool isBasicBlock = false>
    __aicore__ inline void Sum(const LocalTensor<T> &dstTensor, const LocalTensor<T> &srcTensor, const SumParams &sumParams)
    

Due to the complex mathematical computation involved in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. The temporary space can be passed by developers through the sharedTmpBuffer input parameter or allocated through the API framework.

  • When the temporary space is passed through the sharedTmpBuffer input parameter, that tensor serves as the temporary space and the API framework does not need to allocate it. Developers can manage the sharedTmpBuffer space themselves and reuse the buffer after the API call, so that it is not repeatedly allocated and deallocated, improving flexibility and buffer utilization.
  • When the API framework is used for temporary space allocation, developers do not need to allocate the space, but must reserve the required size for the space.

If sharedTmpBuffer is used, developers must allocate space for the tensor. If the API framework is used, developers must reserve the temporary space. To obtain the size of the temporary space (BufferSize) to be reserved, use the API provided in Sum Tiling.

Parameters

Table 1 Parameters in the template

Parameter

Description

T

Data type of the operand.

reduceDim

Dimension along which data is summed. This API always operates on the last dimension; other values of reduceDim are not supported. Pass the default value -1.

isReuseSource

Whether the source operand can be modified. This parameter is reserved. Pass the default value false.

isBasicBlock

If the shape information and the tiling policy of both srcTensor and dstTensor meet the basic block requirements, this parameter can be enabled to improve performance. By default, this parameter is disabled. For basic blocks, the shapes of srcTensor and dstTensor must meet the following requirements:

  • The length of the last axis (H axis) is a multiple of 64 but less than 2048.
  • The length (B*S) of a non-last axis is a multiple of 8.
Table 2 API parameters

Parameter

Input/Output

Description

dstTensor

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The output occupies a space of outter * sizeof(T) bytes. Developers need to allocate the actual buffer space for dstTensor based on this size and the framework's alignment requirements.

NOTE:

According to the framework's requirements, the size of the allocated buffer must be 32-byte aligned. If outter * sizeof(T) is not 32-byte aligned, round it up to the nearest multiple of 32 bytes. The extra space allocated for alignment is not written with meaningful values; treat its contents as undefined.

srcTensor

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The source operand must have the same data type as the destination operand.

sharedTmpBuffer

Input

Temporary buffer.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

This parameter is used to store intermediate variables during complex computation in Sum and is provided by developers.

For details about how to obtain the temporary space size (BufferSize), see Sum Tiling.

sumParams

Input

Shape information of srcTensor, of type SumParams, defined as follows:

struct SumParams{
    uint32_t outter = 1;    // Outer axis length of the input.
    uint32_t inner;         // Number of padded elements on the inner axis of the input. The value of inner*sizeof(T) must be an integer multiple of 32 bytes.
    uint32_t n;             // Actual number of elements on the inner axis of the input.
};
  • The value of sumParams.inner*sizeof(T) must be an integer multiple of 32 bytes.
  • sumParams.inner is obtained by padding the byte length of sumParams.n elements up to the nearest multiple of 32 bytes and converting back to an element count: inner = (n × sizeof(T) + 32 – 1) / 32 × 32 / sizeof(T). Therefore, sumParams.n must satisfy 1 ≤ sumParams.n ≤ sumParams.inner.

Returns

None

Availability

Constraints

  • For details about the alignment requirements of the operand address offset, see General Restrictions.
  • The source operand address must not overlap the destination operand address.
  • The address of sharedTmpBuffer must not overlap the addresses of the source operand and destination operand.
  • Currently, only the ND format is supported.
  • For one-dimensional inputs, set outter to 1. For two-dimensional inputs, set outter and n according to the actual shape and calculate inner using the formula above. Failure to do so may produce incorrect results.
  • srcTensor must be large enough to hold the data after inner axis alignment, and dstTensor must be large enough to hold outter aligned results.
  • The underlying addition order used by Sum is the same as that of ReduceSum and WholeReduceSum: all use a binary-tree approach that adds elements two at a time.

    Suppose the source operand consists of 128 float16 elements, denoted [data0, data1, data2, ..., data127]. The calculation completes in one repeat, as follows:

    1. Add data0 and data1 to obtain data00, add data2 and data3 to obtain data01, ..., add data124 and data125 to obtain data62, and add data126 and data127 to obtain data63, giving 64 partial sums.
    2. Add data00 and data01 to obtain data000, add data02 and data03 to obtain data001, ..., and add data62 and data63 to obtain data031, giving 32 partial sums.
    3. Continue in the same way until the destination operand is a single float16 value, denoted [data].

Example

Implementation in the kernel: sum_custom.cpp:

#include "kernel_operator.h"
template <typename T>
class KernelSumCustom
{
public:
    __aicore__ inline KernelSumCustom() {}
    __aicore__ inline void Init(__gm__ uint8_t *src1Gm, __gm__ uint8_t *dstGm, uint32_t tmpSize, uint32_t outter, uint32_t inner, uint32_t n)
    {
        elementNumPerBlk = AscendC::ONE_BLK_SIZE / sizeof(T); // half=16 float=8
        elementNumPerRep = AscendC::ONE_REPEAT_BYTE_SIZE / sizeof(T);
        width = inner;
        height = outter;
        nNum = n;
        src1Global.SetGlobalBuffer((__gm__ T *)src1Gm);
        dstGlobal.SetGlobalBuffer((__gm__ T *)dstGm);
        pipe.InitBuffer(inQueueSrc1, 1, height * width * sizeof(T));
        pipe.InitBuffer(outQueueDst, 1, (height * sizeof(T) + AscendC::ONE_BLK_SIZE - 1) / AscendC::ONE_BLK_SIZE * AscendC::ONE_BLK_SIZE);
        pipe.InitBuffer(workQueue, 1, tmpSize);
    }

    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<T> srcLocal1 = inQueueSrc1.AllocTensor<T>();
        AscendC::DataCopy(srcLocal1, src1Global, height * width);
        inQueueSrc1.EnQue(srcLocal1);
    }

    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<T> srcLocal1 = inQueueSrc1.DeQue<T>();
        AscendC::LocalTensor<uint8_t> workLocal = workQueue.AllocTensor<uint8_t>();
        AscendC::LocalTensor<T> dstLocal = outQueueDst.AllocTensor<T>();

        AscendC::SumParams params{height, width, nNum};
        T scalar(0);
        AscendC::Duplicate<T>(dstLocal, scalar, (height * sizeof(T) + AscendC::ONE_BLK_SIZE - 1) / AscendC::ONE_BLK_SIZE * AscendC::ONE_BLK_SIZE / sizeof(T));
        AscendC::Sum(dstLocal, srcLocal1, workLocal, params); // pass the temporary space through the sharedTmpBuffer parameter

        outQueueDst.EnQue<T>(dstLocal);
        workQueue.FreeTensor(workLocal);
        inQueueSrc1.FreeTensor(srcLocal1);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<T> dstLocal = outQueueDst.DeQue<T>();
        AscendC::DataCopy(dstGlobal, dstLocal, (height * sizeof(T) + AscendC::ONE_BLK_SIZE - 1) / AscendC::ONE_BLK_SIZE * AscendC::ONE_BLK_SIZE / sizeof(T));
        outQueueDst.FreeTensor(dstLocal);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc1;
    AscendC::TQue<AscendC::QuePosition::VECIN, 1> workQueue;
    AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
    AscendC::GlobalTensor<T> src1Global, dstGlobal;
    uint32_t elementNumPerBlk = 0;
    uint32_t elementNumPerRep = 0;
    uint32_t width = 0;
    uint32_t height = 0;
    uint32_t nNum = 0;
};

extern "C" __global__ __aicore__ void sum_custom(GM_ADDR x, GM_ADDR y, GM_ADDR workspace, GM_ADDR tiling)
{
    GET_TILING_DATA(tiling_data, tiling);
    KernelSumCustom<DTYPE_X> op;
    op.Init(x, y, tiling_data.size, tiling_data.outter, tiling_data.inner, tiling_data.n);
    op.Process();
}

Implementation on the host: sum_custom_tiling.h:

#include "register/tilingdata_base.h"

namespace optiling
{
  BEGIN_TILING_DATA_DEF(SumCustomTilingData)
  TILING_DATA_FIELD_DEF(uint32_t, size);
  TILING_DATA_FIELD_DEF(uint32_t, outter);
  TILING_DATA_FIELD_DEF(uint32_t, inner);
  TILING_DATA_FIELD_DEF(uint32_t, n);
  END_TILING_DATA_DEF;

  REGISTER_TILING_DATA_CLASS(SumCustom, SumCustomTilingData)
}

sum_custom.cpp:

#include "sum_custom_tiling.h"
#include "register/op_def_registry.h"
#include "tiling/tiling_api.h"

namespace optiling
{
    static ge::graphStatus TilingFunc(gert::TilingContext *context)
    {
        SumCustomTilingData tiling;
        const gert::RuntimeAttrs *sumattrs = context->GetAttrs();
        const uint32_t tmp = *(sumattrs->GetAttrPointer<uint32_t>(0));
        const uint32_t outter = *(sumattrs->GetAttrPointer<uint32_t>(1));
        const uint32_t inner = *(sumattrs->GetAttrPointer<uint32_t>(2));
        uint32_t n = *(sumattrs->GetAttrPointer<uint32_t>(3));
        auto dt = context->GetInputTensor(0)->GetDataType();

        tiling.set_outter(outter); // configuration in the structure
        tiling.set_inner(inner);
        tiling.set_n(n);
        uint32_t maxValue = 0;
        uint32_t minValue = 0;
        uint32_t dtypesize;
        if (dt == ge::DT_FLOAT16) {
            dtypesize = 2;
        } else {
            dtypesize = 4;
        }
        AscendC::GetSumMaxMinTmpSize(n, dtypesize, false, maxValue, minValue);
        tiling.set_size(minValue);

        context->SetBlockDim(8);
        tiling.SaveToBuffer(context->GetRawTilingData()->GetData(), context->GetRawTilingData()->GetCapacity());
        context->GetRawTilingData()->SetDataSize(tiling.GetDataSize());
        size_t *currentWorkspace = context->GetWorkspaceSizes(1);
        currentWorkspace[0] = 0;
        return ge::GRAPH_SUCCESS;
    }
}

namespace ge
{
    static ge::graphStatus InferShape(gert::InferShapeContext *context)
    {
        const gert::Shape *x1_shape = context->GetInputShape(0);
        gert::Shape *y_shape = context->GetOutputShape(0);
        *y_shape = *x1_shape;
        return GRAPH_SUCCESS;
    }
}

namespace ops
{
    class SumCustom : public OpDef
    {
    public:
        explicit SumCustom(const char *name) : OpDef(name)
        {
            this->Input("x")
                .ParamType(REQUIRED)
                .DataType({ge::DT_FLOAT16})
                .Format({ge::FORMAT_ND});
            this->Output("y")
                .ParamType(REQUIRED)
                .DataType({ge::DT_FLOAT16})
                .Format({ge::FORMAT_ND});
            this->Attr("size")
                .AttrType(REQUIRED)
                .Int(0);
            this->Attr("outter")
                .AttrType(REQUIRED)
                .Int(0);
            this->Attr("inner")
                .AttrType(REQUIRED)
                .Int(0);
            this->Attr("n")
                .AttrType(REQUIRED)
                .Int(0);
            this->SetInferShape(ge::InferShape);
            this->AICore()
                .SetTiling(optiling::TilingFunc);
            this->AICore().AddConfig("ascendxxx"); // Replace ascendxxx with the actual Ascend AI Processor model.
        }
    };
    OP_ADD(SumCustom);
} // namespace ops

Result example:

If the input is two-dimensional data with a size of 2 × 3 and an element type of half, then outter is 2, n is 3, sizeof(T) is 2, and inner is 16 ((3 × 2 + 32 – 1) / 32 × 32 / 2 = 16).
Input (srcLocal):  [[1 2 3 0 0 0 0 0 0 0 0 0 0 0 0 0],
                    [4 5 6 0 0 0 0 0 0 0 0 0 0 0 0 0]]
Output (dstLocal): [6 15 0 0 0 0 0 0 0 0 0 0 0 0 0 0]