Mean

Applicability

Product	Supported
Atlas A3 training products/Atlas A3 inference products	√
Atlas A2 training products/Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	x
Atlas inference product's AI Core	√
Atlas inference product's Vector Core	x
Atlas training products	x

Function

Calculates the mean of elements according to the direction of the last axis.

If the input is a vector, the mean of elements is calculated within the vector. If the input is a matrix, the mean of elements is calculated along the last dimension. This API supports input of data no more than two dimensions.

As illustrated in the figure below, the mean operation is conducted on a two-dimensional matrix with a shape of (4, 5) and yields the output result of [3, 8, 13, 18].

To understand the specific functions of the API, some essential concepts need to be put forth. The number of rows of data is referred to as the outer axis length (outter), and the actual number of elements in each row is dubbed the actual quantity of elements on the inner axis (n). The number of elements obtained after the actual number of elements on the inner axis (n) is aligned with 32 bytes is referred to as the padded quantity of inner axis elements (inner). This API requires that the input inner axis length be 32-byte aligned. If the byte length occupied by n is not a multiple of 32 bytes, you need to pad it to an integer multiple of 32 bytes. In the following example, the element type is float, the actual number of elements (n) in each row is 5, and the occupied byte length is 20 bytes, which is not a multiple of 32 bytes. After padding up, 32 bytes are obtained, and the corresponding number of elements becomes 8. In the figure, padding indicates the padding operation. The relationship between n and inner is as follows: inner = (n x sizeof(T) + 32 –1)/32 x 32/sizeof(T).

Prototype

Pass to the temporary space through the sharedTmpBuffer input parameter.

template <typename T, typename accType = T, bool isReuseSource = false, bool isBasicBlock = false, int32_t reduceDim = -1>
__aicore__ inline void Mean(const LocalTensor<T>& dstTensor, const LocalTensor<T>& srcTensor, const LocalTensor<uint8_t>& sharedTmpBuffer, const MeanParams& meanParams)

Allocate the temporary space through the API framework.

template <typename T, typename accType = T, bool isReuseSource = false, bool isBasicBlock = false, int32_t reduceDim = -1>
__aicore__ inline void Mean(const LocalTensor<T>& dstTensor, const LocalTensor<T>& srcTensor, const MeanParams& meanParams)

Due to the complex mathematical computation involved in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. The temporary space can be passed by developers through the sharedTmpBuffer input parameter or allocated through the API framework.

When the sharedTmpBuffer input parameter is used for passing the temporary space, the tensor serves as the temporary space. In this case, the API framework is not required for temporary space allocation. This enables you to manage the sharedTmpBuffer space and reuse the buffer after calling the API, so that the buffer is not repeatedly allocated and deallocated, improving the flexibility and buffer utilization.
When the API framework is used for temporary space allocation, you do not need to allocate the space, but must reserve the required size for the temporary space.

If sharedTmpBuffer is used, you must allocate space for the tensor. If the API framework is used, you must reserve the temporary space. To obtain the size of the temporary space (BufferSize) to be reserved, use the API provided in GetMeanMaxMinTmpSize.

Parameters

**Table 1** Template parameters
Parameter	Description
T	Data type of the operand. For the Atlas A3 training products/Atlas A3 inference products, the supported data types are half and float. For the Atlas A2 training products/Atlas A2 inference products, the supported data types are half and float. For the Atlas inference product's AI Core, the supported data types are half and float.
accType	Data type actually involved in computation. If the set accType precision is higher than that of the input T, the input will be converted to accType before the computation. After the computation is complete, the input will be converted back to the original data type. Setting the accType value to increase precision can prevent data type overflows. When T is of half type, you can set accType to float, which means that the input half precision will be up-cast to float precision for computation. The precision of accType cannot be lower than that of T.
isReuseSource	Whether the source operand can be modified. This parameter is reserved. Pass the default value false.
isBasicBlock	Reserved parameter, not supported currently.
reduceDim	Dimension along which data is summed. This API is implemented based on the last dimension. The reduceDim parameter is not supported. Pass the default value –1.

Table 2 API parameters

Parameter

Input/Output

Description

dstTensor

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The output value needs to be saved in a space with a size of outter x sizeof(T). You need to allocate the actual buffer space to dstTensor based on this size and the framework's alignment requirements.

srcTensor

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The source operand must have the same data type as the destination operand.

The shape of input data is outter x inner. You need to allocate a space with a size of outter x inner x sizeof(T) for it.

sharedTmpBuffer

Input

Temporary buffer.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

This parameter is used to store intermediate variables during complex computation in Mean and is provided by developers.

For details about how to obtain the temporary space size (BufferSize), see GetMeanMaxMinTmpSize.

MeanParams

Input

Shape of srcTensor, MeanParams type. The specific definition is as follows:

struct MeanParams{
    uint32_t outter = 1;    // outer axis length of input data.
    uint32_t inner;         // number of elements obtained after the actual number of elements on the inner axis of input data is aligned with 32 bytes. The value of inner x sizeof(T) must be an integer multiple of 32 bytes.
    uint32_t n;             // actual number of elements on the inner axis of input data.
};

The value of MeanParams.inner x sizeof(T) must be an integer multiple of 32 bytes.
MeanParams.inner is the value of MeanParams.n after 32-byte alignment upwards, where inner = (n x sizeof(T) + 32 –1)/32 x 32/sizeof(T). Therefore, the size of MeanParams.n should satisfy: 1 ≤ MeanParams.n ≤ MeanParams.inner.

Returns

None

Restrictions

For details about the operand address alignment requirements, see General Address Alignment Restrictions.
The source operand address must not overlap the destination operand address.
sharedTmpBuffer must not overlap the addresses of the source operand and destination operand.
Currently, only the ND format is supported.
For Mean, the sum is performed first and then the division is performed. The internal bottom-layer addition mode used during the summation is the same as that of Sum, ReduceSum, and WholeReduceSum. The binary tree mode is used to add two elements at a time. For details, see Sum.

Example

Call example in the kernel

#include "kernel_operator.h"
template <typename T, typename accType>
class KernelMean
{
public:
    __aicore__ inline KernelMean() {}
    __aicore__ inline void Init(__gm__ uint8_t *srcGm, __gm__ uint8_t *dstGm, uint32_t outter, uint32_t inner, uint32_t n, uint32_t Size)
    {
        meanParams.outter = outter;
        meanParams.inner = inner;
        meanParams.n = n;
        tmpSize = Size;
        srcGlobal.SetGlobalBuffer((__gm__ T *)srcGm);
        dstGlobal.SetGlobalBuffer((__gm__ T *)dstGm);
        pipe.InitBuffer(inQueueX, 1, meanParams.outter * meanParams.inner * sizeof(T));
        pipe.InitBuffer(outQueueY, 1, (meanParams.outter * sizeof(T) + AscendC::ONE_BLK_SIZE - 1) / AscendC::ONE_BLK_SIZE * AscendC::ONE_BLK_SIZE);
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }

private:
    __aicore__ inline void CopyIn()
    {
        AscendC::LocalTensor<T> srcLocal = inQueueX.AllocTensor<T>();
        AscendC::DataCopy(srcLocal, srcGlobal, meanParams.outter * meanParams.inner);
        inQueueX.EnQue(srcLocal);
    }
    __aicore__ inline void Compute()
    {
        AscendC::LocalTensor<T> srcLocal = inQueueX.DeQue<T>();
        AscendC::LocalTensor<T> dstLocal = outQueueY.AllocTensor<T>();
        if (tmpSize != 0) {
            pipe.InitBuffer(tmplocalBuf, tmpSize);
            AscendC::LocalTensor<uint8_t> tmplocalTensor = tmplocalBuf.Get<uint8_t>();
            AscendC::Mean<T, accType>(dstLocal, srcLocal, tmplocalTensor, meanParams);
        } else {
            AscendC::Mean<T, accType>(dstLocal, srcLocal, meanParams);
        }
        outQueueY.EnQue<T>(dstLocal);
    }
    __aicore__ inline void CopyOut()
    {
        AscendC::LocalTensor<T> dstLocal = outQueueY.DeQue<T>();
        AscendC::DataCopy(dstGlobal, dstLocal, (meanParams.outter * sizeof(T) + AscendC::ONE_BLK_SIZE - 1) / AscendC::ONE_BLK_SIZE * AscendC::ONE_BLK_SIZE / sizeof(T));
        outQueueY.FreeTensor(dstLocal);
    }

private:
    AscendC::GlobalTensor<T> srcGlobal;
    AscendC::GlobalTensor<T> dstGlobal;
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueX;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueY;
    AscendC::TBuf<AscendC::TPosition::VECCALC> tmplocalBuf;
    AscendC::MeanParams meanParams;
    uint32_t tmpSize;
};

extern "C" __global__ __aicore__ void mean_custom(GM_ADDR x, GM_ADDR y, GM_ADDR workspace, GM_ADDR tiling)
{
    GET_TILING_DATA(tiling_data, tiling);
    if (TILING_KEY_IS(1)) {
        KernelMean<half, half> op;
        op.Init(x, y, tiling_data.outter, tiling_data.inner, tiling_data.n, tiling_data.tmpSize);
        op.Process();
    } else if (TILING_KEY_IS(2)) {
        KernelMean<float, float> op;
        op.Init(x, y, tiling_data.outter, tiling_data.inner, tiling_data.n, tiling_data.tmpSize);
        op.Process();
    } else if (TILING_KEY_IS(3)) {
        KernelMean<half, float> op;
        op.Init(x, y, tiling_data.outter, tiling_data.inner, tiling_data.n, tiling_data.tmpSize);
        op.Process();
    }
}

The definition of tiling on the host is as follows:

#include "register/tilingdata_base.h"
namespace optiling {
BEGIN_TILING_DATA_DEF(MeanCustomTilingData)
  TILING_DATA_FIELD_DEF(uint32_t, outter);
  TILING_DATA_FIELD_DEF(uint32_t, inner);
  TILING_DATA_FIELD_DEF(uint32_t, n);
  TILING_DATA_FIELD_DEF(uint32_t, tmpSize);
END_TILING_DATA_DEF;
REGISTER_TILING_DATA_CLASS(MeanCustom, MeanCustomTilingData)
}

The tiling implementation on the host is as follows:

#include "mean_custom_tiling.h"
#include "register/op_def_registry.h"
#include "tiling/tiling_api.h"
namespace optiling {
static ge::graphStatus TilingFunc(gert::TilingContext* context)
{
    MeanCustomTilingData tiling;
    const gert::StorageShape *src_shape = context->GetInputShape(0); // Two dimensions for src_shape
    uint32_t outter = src_shape->GetStorageShape().GetDim(0);
    uint32_t inner = src_shape->GetStorageShape().GetDim(1);
    
    const gert::RuntimeAttrs *meanattrs = context->GetAttrs();
    const uint32_t n = *(meanattrs->GetAttrPointer<uint32_t>(0)); 
    const uint32_t iscast = *(meanattrs->GetAttrPointer<uint32_t>(1)); // When iscast is set to 1, accType is used to increase the precision.
    const uint32_t sizeflag = *(meanattrs->GetAttrPointer<uint32_t>(2)); // If sizeflag is set to 0, the framework is used to allocate the tmpbuffer. If sizeflag is set to 1, the temporary space is passed through the sharedTmpBuffer input parameter.
    auto dt = context->GetInputTensor(0)->GetDataType();
    uint32_t typeSize = 0;
    if (iscast == 1) {
        typeSize = 2;
        context->SetTilingKey(3);
    } else if(dt == ge::DT_FLOAT16) {
        typeSize = 2;
        context->SetTilingKey(1);
    } else if (dt == ge::DT_FLOAT) {
        typeSize = 4;
        context->SetTilingKey(2);
    }
    uint32_t maxValue = 0;
    uint32_t minValue = 0;
    if (iscast == 1) {
        AscendC::GetMeanMaxMinTmpSize(n, typeSize, 4, false, maxValue, minValue);
    } else {
        AscendC::GetMeanMaxMinTmpSize(n, typeSize, typeSize, false, maxValue, minValue);
    }
    if (sizeflag == 0) {
        tiling.set_tmpSize(0);
    } else {
        tiling.set_tmpSize(minValue);
    }
    tiling.set_outter(outter);
    tiling.set_inner(inner);
    tiling.set_n(n);
    context->SetBlockDim(1);
    tiling.SaveToBuffer(context->GetRawTilingData()->GetData(), context->GetRawTilingData()->GetCapacity());
    context->GetRawTilingData()->SetDataSize(tiling.GetDataSize());
    size_t *currentWorkspace = context->GetWorkspaceSizes(1);
    currentWorkspace[0] = 0;
    return ge::GRAPH_SUCCESS;
}
}

Result example:

If the input is two-dimensional data with a size of 2 × 3 and an element type of half, then outter is 2, n is 3, sizeof(T) is 2, and inner is 16 {(3 × 2 + 32 –1)/32 x 32/2 = 16}.

Input data (srcLocal): [[1 2 3 0 0 0 0 0 0 0 0 0 0 0 0 0],
                     [4 5 6 0 0 0 0 0 0 0 0 0 0 0 0 0]]
Output data (dstLocal): [2 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0]

Parent topic: Mean Interface