Mean

Function Description

Calculates the mean of the elements along the last axis.

If the input is a vector, the mean is computed over all of its elements. If the input is a matrix, the mean is computed along the last dimension. This API accepts inputs with at most two dimensions; higher-dimensional inputs are not supported.

As illustrated in the figure below, the mean operation is performed on a two-dimensional matrix of shape (4, 5), yielding the output [3, 8, 13, 18].

To understand this API, a few terms must be defined. The number of rows of input data is called the outer axis length (outter). The actual number of elements in each row is called the actual number of elements on the inner axis (n). The number of elements per row after n is padded to a 32-byte boundary is called the padded number of elements on the inner axis (inner). This API requires the inner axis of the input to be 32-byte aligned: if the byte length occupied by n elements is not a multiple of 32 bytes, developers must pad it up to the next multiple of 32 bytes. In the following example, the element type is float and each row actually contains n = 5 elements, occupying 20 bytes, which is not a multiple of 32 bytes. After padding, each row occupies 32 bytes, corresponding to 8 elements. In the preceding figure, padding denotes this padding operation. The relationship between n and inner is: inner = (n × sizeof(T) + 32 − 1) / 32 × 32 / sizeof(T).

Prototype

  • Pass the temporary space through the sharedTmpBuffer input parameter.

    template <typename T, typename accType = T, bool isReuseSource = false, bool isBasicBlock = false, int32_t reduceDim = -1>
    __aicore__ inline void Mean(const LocalTensor<T> &dstTensor, const LocalTensor<T> &srcTensor, const LocalTensor<uint8_t> &sharedTmpBuffer, const MeanParams &meanParams)

  • Allocate the temporary space through the API framework.

    template <typename T, typename accType = T, bool isReuseSource = false, bool isBasicBlock = false, int32_t reduceDim = -1>
    __aicore__ inline void Mean(const LocalTensor<T> &dstTensor, const LocalTensor<T> &srcTensor, const MeanParams &meanParams)

Due to the complex mathematical computation involved in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. The temporary space can be passed by developers through the sharedTmpBuffer input parameter or allocated through the API framework.

  • When the temporary space is passed through the sharedTmpBuffer input parameter, that tensor serves as the temporary space, and the API framework does not need to allocate it. This lets developers manage the sharedTmpBuffer space and reuse the buffer after calling the API, so the buffer is not repeatedly allocated and deallocated, improving flexibility and buffer utilization.
  • When the API framework is used for temporary space allocation, developers do not need to allocate the space, but must reserve the required size for the space.

If sharedTmpBuffer is used, developers must allocate space for the tensor. If the API framework is used, developers must reserve the temporary space. To obtain the size of the temporary space (BufferSize) to be reserved, use the API provided in GetMeanMaxMinTmpSize.

Parameters

Table 1 Template parameters

  • T: Data type of the operand.
  • accType: Data type actually used in the computation. If accType has higher precision than the input type T, the input is converted to accType before the computation, and the result is converted back to the original data type afterwards. Raising the precision through accType can prevent data type overflow. For example, when T is half, accType can be set to float so that the half-precision input is up-cast to float for the computation. The precision of accType must not be lower than that of T.
  • isReuseSource: Whether the source operand can be modified. This parameter is reserved; pass the default value false.
  • isBasicBlock: If the shape information and tiling strategies of both srcTensor and dstTensor meet the basic block requirements, this parameter can be enabled to improve performance. It is disabled by default. For basic blocks, the shapes of srcTensor and dstTensor must meet the following requirements:
      • The length of the last axis (H axis) is a multiple of 64 and less than 2048.
      • The length (B*S) of the non-last axes is a multiple of 8.
  • reduceDim: Dimension along which the reduction is performed. This API always reduces along the last dimension; other values are not supported. Pass the default value -1.

Table 2 API parameters

  • dstTensor (output): Destination operand. The type is LocalTensor, and the supported TPosition values are VECIN, VECCALC, and VECOUT. The output occupies a space of outter * sizeof(T) bytes; developers need to allocate the actual buffer space for dstTensor based on this size and the framework's alignment requirements.
  • srcTensor (input): Source operand. The type is LocalTensor, and the supported TPosition values are VECIN, VECCALC, and VECOUT. The source operand must have the same data type as the destination operand. The input shape is outter * inner, so developers need to allocate a space of outter * inner * sizeof(T) bytes for it.
  • sharedTmpBuffer (input): Temporary buffer. The type is LocalTensor, and the supported TPosition values are VECIN, VECCALC, and VECOUT. It stores intermediate variables produced during the complex computation inside Mean and is provided by developers. For details about how to obtain the temporary space size (BufferSize), see GetMeanMaxMinTmpSize.

  • meanParams (input): Shape description of srcTensor, of type MeanParams. The specific definition is as follows:

    struct MeanParams {
        uint32_t outter = 1;    // Outer axis length of the input.
        uint32_t inner;         // Number of elements on the inner axis after 32-byte alignment. inner * sizeof(T) must be an integer multiple of 32 bytes.
        uint32_t n;             // Actual number of elements on the inner axis of the input.
    };

      • The value of MeanParams.inner * sizeof(T) must be an integer multiple of 32 bytes.
      • MeanParams.inner is MeanParams.n rounded up to 32-byte alignment, where inner = (n × sizeof(T) + 32 − 1) / 32 × 32 / sizeof(T). Therefore, MeanParams.n must satisfy 1 ≤ MeanParams.n ≤ MeanParams.inner.

Returns

None

Availability

Constraints

  • For details about the alignment requirements of the operand address offset, see General Restrictions.
  • The source operand address must not overlap the destination operand address.
  • The address of sharedTmpBuffer must not overlap the addresses of the source operand and destination operand.
  • Currently, only the ND format is supported.
  • Mean first calculates the sum and then performs the division. The underlying addition method used by Mean is consistent with that of Sum, ReduceSum, and WholeReduceSum, all of which use a binary tree approach that adds two elements at a time. For details, see Sum.

Example

  • Call example in the kernel
    #include "kernel_operator.h"
    template <typename T, typename accType>
    class KernelMean
    {
    public:
        __aicore__ inline KernelMean() {}
        __aicore__ inline void Init(__gm__ uint8_t *srcGm, __gm__ uint8_t *dstGm, uint32_t outter, uint32_t inner, uint32_t n, uint32_t Size)
        {
            meanParams.outter = outter;
            meanParams.inner = inner;
            meanParams.n = n;
            tmpSize = Size;
            srcGlobal.SetGlobalBuffer((__gm__ T *)srcGm);
            dstGlobal.SetGlobalBuffer((__gm__ T *)dstGm);
            pipe.InitBuffer(inQueueX, 1, meanParams.outter * meanParams.inner * sizeof(T));
            pipe.InitBuffer(outQueueY, 1, (meanParams.outter * sizeof(T) + AscendC::ONE_BLK_SIZE - 1) / AscendC::ONE_BLK_SIZE * AscendC::ONE_BLK_SIZE);
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }
    
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<T> srcLocal = inQueueX.AllocTensor<T>();
            AscendC::DataCopy(srcLocal, srcGlobal, meanParams.outter * meanParams.inner);
            inQueueX.EnQue(srcLocal);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<T> srcLocal = inQueueX.DeQue<T>();
            AscendC::LocalTensor<T> dstLocal = outQueueY.AllocTensor<T>();
            if (tmpSize != 0) {
                pipe.InitBuffer(tmplocalBuf, tmpSize);
                AscendC::LocalTensor<uint8_t> tmplocalTensor = tmplocalBuf.Get<uint8_t>();
                AscendC::Mean<T, accType>(dstLocal, srcLocal, tmplocalTensor, meanParams);
            } else {
                AscendC::Mean<T, accType>(dstLocal, srcLocal, meanParams);
            }
            outQueueY.EnQue<T>(dstLocal);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<T> dstLocal = outQueueY.DeQue<T>();
            AscendC::DataCopy(dstGlobal, dstLocal, (meanParams.outter * sizeof(T) + AscendC::ONE_BLK_SIZE - 1) / AscendC::ONE_BLK_SIZE * AscendC::ONE_BLK_SIZE / sizeof(T));
            outQueueY.FreeTensor(dstLocal);
        }
    
    private:
        AscendC::GlobalTensor<T> srcGlobal;
        AscendC::GlobalTensor<T> dstGlobal;
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueX;
        AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueY;
        AscendC::TBuf<AscendC::TPosition::VECCALC> tmplocalBuf;
        AscendC::MeanParams meanParams;
        uint32_t tmpSize;
    };
    
    extern "C" __global__ __aicore__ void mean_custom(GM_ADDR x, GM_ADDR y, GM_ADDR workspace, GM_ADDR tiling)
    {
        GET_TILING_DATA(tiling_data, tiling);
        if (TILING_KEY_IS(1)) {
            KernelMean<half, half> op;
            op.Init(x, y, tiling_data.outter, tiling_data.inner, tiling_data.n, tiling_data.tmpSize);
            op.Process();
        } else if (TILING_KEY_IS(2)) {
            KernelMean<float, float> op;
            op.Init(x, y, tiling_data.outter, tiling_data.inner, tiling_data.n, tiling_data.tmpSize);
            op.Process();
        } else if (TILING_KEY_IS(3)) {
            KernelMean<half, float> op;
            op.Init(x, y, tiling_data.outter, tiling_data.inner, tiling_data.n, tiling_data.tmpSize);
            op.Process();
        }
    }
    
  • The definition of tiling on the host is as follows:
    #include "register/tilingdata_base.h"
    namespace optiling {
    BEGIN_TILING_DATA_DEF(MeanCustomTilingData)
      TILING_DATA_FIELD_DEF(uint32_t, outter);
      TILING_DATA_FIELD_DEF(uint32_t, inner);
      TILING_DATA_FIELD_DEF(uint32_t, n);
      TILING_DATA_FIELD_DEF(uint32_t, tmpSize);
    END_TILING_DATA_DEF;
    REGISTER_TILING_DATA_CLASS(MeanCustom, MeanCustomTilingData)
    }
    
  • The tiling implementation on the host is as follows:
    #include "mean_custom_tiling.h"
    #include "register/op_def_registry.h"
    #include "tiling/tiling_api.h"
    namespace optiling {
    static ge::graphStatus TilingFunc(gert::TilingContext* context)
    {
        MeanCustomTilingData tiling;
        const gert::StorageShape *src_shape = context->GetInputShape(0); // two dimensions for src_shape
        uint32_t outter = src_shape->GetStorageShape().GetDim(0);
        uint32_t inner = src_shape->GetStorageShape().GetDim(1);
        
        const gert::RuntimeAttrs *meanattrs = context->GetAttrs();
        const uint32_t n = *(meanattrs->GetAttrPointer<uint32_t>(0)); 
        const uint32_t iscast = *(meanattrs->GetAttrPointer<uint32_t>(1)); // When iscast is set to 1, accType is used to increase the precision.
        const uint32_t sizeflag = *(meanattrs->GetAttrPointer<uint32_t>(2)); // If sizeflag is set to 0, the framework is used to allocate the tmpbuffer. If sizeflag is set to 1, the temporary space is passed through the sharedTmpBuffer input parameter.
        auto dt = context->GetInputTensor(0)->GetDataType();
        uint32_t typeSize = 0;
        if (iscast == 1) {
            typeSize = 2;
            context->SetTilingKey(3);
        } else if(dt == ge::DT_FLOAT16) {
            typeSize = 2;
            context->SetTilingKey(1);
        } else if (dt == ge::DT_FLOAT) {
            typeSize = 4;
            context->SetTilingKey(2);
        }
        uint32_t maxValue = 0;
        uint32_t minValue = 0;
        if (iscast == 1) {
            AscendC::GetMeanMaxMinTmpSize(n, typeSize, 4, false, maxValue, minValue);
        } else {
            AscendC::GetMeanMaxMinTmpSize(n, typeSize, typeSize, false, maxValue, minValue);
        }
        if (sizeflag == 0) {
            tiling.set_tmpSize(0);
        } else {
            tiling.set_tmpSize(minValue);
        }
        tiling.set_outter(outter);
        tiling.set_inner(inner);
        tiling.set_n(n);
        context->SetBlockDim(1);
        tiling.SaveToBuffer(context->GetRawTilingData()->GetData(), context->GetRawTilingData()->GetCapacity());
        context->GetRawTilingData()->SetDataSize(tiling.GetDataSize());
        size_t *currentWorkspace = context->GetWorkspaceSizes(1);
        currentWorkspace[0] = 0;
        return ge::GRAPH_SUCCESS;
    }
    }
    

Result example:

If the input is two-dimensional data with a size of 2 × 3 and an element type of half, then outter is 2, n is 3, sizeof(T) is 2, and inner is 16 ((3 × 2 + 32 − 1) / 32 × 32 / 2 = 16).
Input (srcLocal):  [[1 2 3 0 0 0 0 0 0 0 0 0 0 0 0 0],
                    [4 5 6 0 0 0 0 0 0 0 0 0 0 0 0 0]]
Output (dstLocal): [2 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0]