Sign

Function Description

Performs the Sign operation element-wise, returning the sign of each input element: if the value is 0, 0 is returned; if the value is positive, 1 is returned; if the value is negative, -1 is returned.
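
To make the element-wise mapping concrete, here is a minimal reference sketch in standard C++ (illustrative only; this is plain host code showing the same semantics, not the AscendC `Sign` API or its implementation):

```cpp
#include <vector>
#include <cstddef>

// Reference semantics of the element-wise Sign operation:
// positive -> 1, negative -> -1, zero -> 0.
std::vector<int> ReferenceSign(const std::vector<float>& src)
{
    std::vector<int> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] > 0.0f) {
            dst[i] = 1;       // positive value
        } else if (src[i] < 0.0f) {
            dst[i] = -1;      // negative value
        } else {
            dst[i] = 0;       // zero (including -0.0)
        }
    }
    return dst;
}
```

Note that both +0.0 and -0.0 map to 0 under this rule, which matches the result example at the end of this section.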

Prototype

  • Pass the temporary space through the sharedTmpBuffer input parameter.
    • All or part of the source operand tensors are involved in computation.

      template <typename T, bool isReuseSource = false>
      __aicore__ inline void Sign(const LocalTensor<T> &dstTensor, const LocalTensor<T> &srcTensor, const LocalTensor<uint8_t> &sharedTmpBuffer, const uint32_t calCount)

    • All source operand tensors are involved in computation.

      template <typename T, bool isReuseSource = false>
      __aicore__ inline void Sign(const LocalTensor<T> &dstTensor, const LocalTensor<T> &srcTensor, const LocalTensor<uint8_t> &sharedTmpBuffer)

  • Allocate the temporary space through the API framework.
    • All or part of the source operand tensors are involved in computation.

      template <typename T, bool isReuseSource = false>
      __aicore__ inline void Sign(const LocalTensor<T> &dstTensor, const LocalTensor<T> &srcTensor, const uint32_t calCount)

    • All source operand tensors are involved in computation.

      template <typename T, bool isReuseSource = false>
      __aicore__ inline void Sign(const LocalTensor<T> &dstTensor, const LocalTensor<T> &srcTensor)


Due to the complex mathematical computation involved in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. The temporary space can be passed by developers through the sharedTmpBuffer input parameter or allocated through the API framework.

  • When the temporary space is passed through the sharedTmpBuffer input parameter, that tensor serves as the temporary space and the API framework does not need to allocate it. Developers can manage the sharedTmpBuffer space themselves and reuse the buffer after calling the API, so that it is not repeatedly allocated and deallocated, improving flexibility and buffer utilization.
  • When the API framework allocates the temporary space, developers do not need to allocate it, but must reserve the required size for it.

If sharedTmpBuffer is used, developers must allocate space for the tensor. If the API framework is used, developers must reserve the temporary space. To obtain the size of the temporary space (BufferSize) to be reserved, use the API provided in GetSignMaxMinTmpSize.

Parameters

Table 1 Parameters in the template

  • T: Data type of the operand.
  • isReuseSource: Whether the source operand can be modified. This parameter is reserved; pass the default value false.

Table 2 API parameters

  • dstTensor (output): Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
  • srcTensor (input): Source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The source operand must have the same data type as the destination operand.
  • sharedTmpBuffer (input): Temporary buffer. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. This parameter stores intermediate variables generated during the Sign computation and is provided by developers. For details about how to obtain the temporary space size (BufferSize), see GetSignMaxMinTmpSize.
  • calCount (input): Number of data elements actually computed. The value range is [0, srcTensor.GetSize()].

Returns

None

Availability

Constraints

  • The source operand address must not overlap the destination operand address.
  • Currently, only the ND format is supported.
  • Ensure that calCount does not exceed the number of elements stored in srcTensor and dstTensor.
  • sharedTmpBuffer must not overlap the addresses of the source operand and destination operand.
  • For details about the alignment requirements of the operand address offset, see General Restrictions.
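
The calCount constraint above can be checked before calling the API. The helper below is a hypothetical sketch in standard C++ (IsValidCalCount is not part of the AscendC API; it only illustrates the bound that callers must respect):

```cpp
#include <cstdint>

// Hypothetical validation helper (not an AscendC API): calCount must not
// exceed the number of elements held by either the source or the
// destination tensor, i.e. calCount is in [0, min(srcElems, dstElems)].
inline bool IsValidCalCount(uint32_t calCount, uint32_t srcElems, uint32_t dstElems)
{
    return calCount <= srcElems && calCount <= dstElems;
}
```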

Example

sign_custom.cpp on the kernel side:

#include "kernel_operator.h"

constexpr int32_t BUFFER_NUM = 1;
class KernelSign
{
public:
    __aicore__ inline KernelSign() {}
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, uint32_t totalLength, uint32_t tilenum, uint32_t tmpSize, uint32_t mcount)
    {
        this->totalLength = totalLength;
        this->blockLength = totalLength / AscendC::GetBlockNum();
        this->tilenum = tilenum;
        this->tmpSize = tmpSize;
        this->mcount = mcount;
        this->tileLength = this->blockLength / tilenum / BUFFER_NUM;
        xGm.SetGlobalBuffer((__gm__ half *)x + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
        yGm.SetGlobalBuffer((__gm__ half *)y + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
        if (this->tmpSize != 0)
        {
            pipe.InitBuffer(tmpQueue, BUFFER_NUM, this->tmpSize);
        }
        pipe.InitBuffer(inQueueX, BUFFER_NUM, this->tileLength * sizeof(half));
        pipe.InitBuffer(outQueueY, BUFFER_NUM, this->tileLength * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        int32_t loopCount = this->tilenum * BUFFER_NUM;
        for (int32_t i = 0; i < loopCount; i++)
        {
            CopyIn(i);
            Compute(i);
            CopyOut(i);
        }
    }

private:
    __aicore__ inline void CopyIn(int32_t progress)
    {
        AscendC::LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
        AscendC::DataCopy(xLocal, xGm[progress * this->tileLength], this->tileLength);
        inQueueX.EnQue(xLocal);
    }
    __aicore__ inline void Compute(int32_t progress)
    {
        AscendC::LocalTensor<half> xLocal = inQueueX.DeQue<half>();
        AscendC::LocalTensor<half> yLocal = outQueueY.AllocTensor<half>();
        if (this->tmpSize != 0)
        { // sharedTmpBuffer is passed.
            AscendC::LocalTensor<uint8_t> tmpLocal = tmpQueue.AllocTensor<uint8_t>();
            if (this->mcount != this->totalLength)
            { // Whether to pass calCount.
                AscendC::Sign(yLocal, xLocal, tmpLocal, this->mcount);
            }
            else
            {
                AscendC::Sign(yLocal, xLocal, tmpLocal);
            }
            tmpQueue.FreeTensor(tmpLocal);
        }
        else
        { // sharedTmpBuffer is not passed.
            if (this->mcount != this->totalLength)
            {
                AscendC::Sign(yLocal, xLocal, this->mcount);
            }
            else
            {
                AscendC::Sign(yLocal, xLocal);
            }
        }
        outQueueY.EnQue<half>(yLocal);
        inQueueX.FreeTensor(xLocal);
    }
    __aicore__ inline void CopyOut(int32_t progress)
    {
        AscendC::LocalTensor<half> yLocal = outQueueY.DeQue<half>();
        AscendC::DataCopy(yGm[progress * this->tileLength], yLocal, this->tileLength);
        outQueueY.FreeTensor(yLocal);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueueX;
    AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> tmpQueue;
    AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> outQueueY;
    AscendC::GlobalTensor<half> xGm;
    AscendC::GlobalTensor<half> yGm;
    uint32_t blockLength;
    uint32_t tilenum;
    uint32_t tileLength;
    uint32_t tmpSize;
    uint32_t mcount;
    uint32_t totalLength;
};
extern "C" __global__ __aicore__ void sign_custom(GM_ADDR x, GM_ADDR y, GM_ADDR workspace, GM_ADDR tiling)
{
    GET_TILING_DATA(tilingData, tiling);
    KernelSign op;
    op.Init(x, y, tilingData.totalLength, tilingData.tilenum, tilingData.tmpSize, tilingData.mcount);
    if (TILING_KEY_IS(1))
    {
        op.Process();
    }
}

sign_custom_tiling.h on the host side:

#include "register/tilingdata_base.h"

namespace optiling {
BEGIN_TILING_DATA_DEF(SignCustomTilingData)
  TILING_DATA_FIELD_DEF(uint32_t, totalLength);
  TILING_DATA_FIELD_DEF(uint32_t, tmpSize);
  TILING_DATA_FIELD_DEF(uint32_t, tilenum);
  TILING_DATA_FIELD_DEF(uint32_t, mcount);
END_TILING_DATA_DEF;

REGISTER_TILING_DATA_CLASS(SignCustom, SignCustomTilingData)
}

sign_custom.cpp on the host side:

#include "sign_custom_tiling.h"
#include "register/op_def_registry.h"
#include "tiling/tiling_api.h"
namespace optiling {
static ge::graphStatus TilingFunc(gert::TilingContext* context)
{
  SignCustomTilingData tiling;
  const gert::RuntimeAttrs* attrs = context->GetAttrs();
  const uint32_t tilenum = *(attrs->GetAttrPointer<uint32_t>(0));
  const uint32_t blockdim = *(attrs->GetAttrPointer<uint32_t>(1));
  const uint32_t sizeflag = *(attrs->GetAttrPointer<uint32_t>(2));
  const uint32_t countflag = *(attrs->GetAttrPointer<uint32_t>(3));
  uint32_t totalLength = context->GetInputTensor(0)->GetShapeSize();
  auto dt = context->GetInputTensor(0)->GetDataType();
  context->SetBlockDim(blockdim);
  tiling.set_totalLength(totalLength);
  tiling.set_tilenum(tilenum);
  if (countflag == 1) {
      tiling.set_mcount(totalLength);
  }
  std::vector<int64_t> shape_vec = {totalLength};
  ge::Shape srcShape(shape_vec);
  uint32_t maxValue = 0;
  uint32_t minValue = 0;
  uint32_t dtypesize;
  if (dt == ge::DT_FLOAT16) {
      dtypesize = 2;
  } else {
      dtypesize = 4;
  }
  bool isReuseSource = false;
  AscendC::GetSignMaxMinTmpSize(srcShape, dtypesize, isReuseSource, maxValue, minValue);
  if (sizeflag == 0) {  // sizeflag 0: pass sharedTmpBuffer of the minimum size; 1: pass sharedTmpBuffer of the maximum size; 2: do not pass sharedTmpBuffer.
      tiling.set_tmpSize(minValue);
  } else if (sizeflag == 1) {
      tiling.set_tmpSize(maxValue);
  } else if (sizeflag == 2) {
      tiling.set_tmpSize(0);
  }
  tiling.SaveToBuffer(context->GetRawTilingData()->GetData(), context->GetRawTilingData()->GetCapacity());
  context->GetRawTilingData()->SetDataSize(tiling.GetDataSize());
  context->SetTilingKey(1);
  size_t *currentWorkspace = context->GetWorkspaceSizes(1);
  currentWorkspace[0] = 0;
  return ge::GRAPH_SUCCESS;
}
}
namespace ge {
static ge::graphStatus InferShape(gert::InferShapeContext* context)
{
    const gert::Shape* x1_shape = context->GetInputShape(0);
    gert::Shape* y_shape = context->GetOutputShape(0);
    *y_shape = *x1_shape;
    return GRAPH_SUCCESS;
}
}
namespace ops {
class SignCustom : public OpDef {
public:
    explicit SignCustom(const char* name) : OpDef(name)
    {
        this->Input("x")
            .ParamType(REQUIRED)
            .DataType({ge::DT_FLOAT16})
            .Format({ge::FORMAT_ND});
        this->Output("y")
            .ParamType(REQUIRED)
            .DataType({ge::DT_FLOAT16})
            .Format({ge::FORMAT_ND});
        this->SetInferShape(ge::InferShape);
        this->Attr("tilenum")
            .AttrType(REQUIRED)            
            .Int(0);
        this->Attr("blockdim")
            .AttrType(REQUIRED)            
            .Int(0);
        this->Attr("sizeflag")
            .AttrType(REQUIRED)            
            .Int(0);
        this->Attr("countflag")
            .AttrType(REQUIRED)
            .Int(0);
        this->AICore()
            .SetTiling(optiling::TilingFunc);
        this->AICore().AddConfig("ascendxxx");  // Replace xxx with the actual processor information.
    }
};
}
Result example:
The input and output data type is float. The one-dimensional vector contains eight elements.
Input data (srcLocal): [-np.inf, -2.0, -0.0, 0.0, np.nan, -np.nan, 2.0, np.inf]
Output data (dstLocal): [-1, -1, 0, 0, 0, 0, 1, 1]