Sign

Function Description

Performs the Sign operation element-wise, returning the sign of each input element: if the value is 0, 0 is returned; if the value is positive, 1 is returned; if the value is negative, -1 is returned.
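
To make the element-wise mapping concrete, here is a minimal reference sketch in standard C++ (illustrative only; this is plain host code showing the same semantics, not the AscendC `Sign` API or its implementation):

```cpp
#include <vector>
#include <cstddef>

// Reference semantics of the element-wise Sign operation:
// positive -> 1, negative -> -1, zero -> 0.
std::vector<int> ReferenceSign(const std::vector<float>& src)
{
    std::vector<int> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        if (src[i] > 0.0f) {
            dst[i] = 1;       // positive value
        } else if (src[i] < 0.0f) {
            dst[i] = -1;      // negative value
        } else {
            dst[i] = 0;       // zero (including -0.0)
        }
    }
    return dst;
}
```

Note that both +0.0 and -0.0 map to 0 under this rule, which matches the result example at the end of this section.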

Prototype

  • Pass the temporary space through the sharedTmpBuffer input parameter.
    • All or part of the source operand tensors are involved in computation.

      template <typename T, bool isReuseSource = false>
      __aicore__ inline void Sign(const LocalTensor<T> &dstTensor, const LocalTensor<T> &srcTensor, const LocalTensor<uint8_t> &sharedTmpBuffer, const uint32_t calCount)

    • All source operand tensors are involved in computation.

      template <typename T, bool isReuseSource = false>
      __aicore__ inline void Sign(const LocalTensor<T> &dstTensor, const LocalTensor<T> &srcTensor, const LocalTensor<uint8_t> &sharedTmpBuffer)

  • Allocate the temporary space through the API framework.
    • All or part of the source operand tensors are involved in computation.

      template <typename T, bool isReuseSource = false>
      __aicore__ inline void Sign(const LocalTensor<T> &dstTensor, const LocalTensor<T> &srcTensor, const uint32_t calCount)

    • All source operand tensors are involved in computation.

      template <typename T, bool isReuseSource = false>
      __aicore__ inline void Sign(const LocalTensor<T> &dstTensor, const LocalTensor<T> &srcTensor)


Due to the complex mathematical computation involved in the internal implementation of this API, additional temporary space is required to store intermediate variables generated during computation. The temporary space can be passed by developers through the sharedTmpBuffer input parameter or allocated through the API framework.

  • When the temporary space is passed through the sharedTmpBuffer input parameter, that tensor serves as the temporary space and the API framework does not need to allocate it. Developers can manage the sharedTmpBuffer space themselves and reuse the buffer after calling the API, so that it is not repeatedly allocated and deallocated, improving flexibility and buffer utilization.
  • When the API framework allocates the temporary space, developers do not need to allocate it, but must reserve the required size for it.

If sharedTmpBuffer is used, developers must allocate space for the tensor. If the API framework is used, developers must reserve the temporary space. To obtain the size of the temporary space (BufferSize) to be reserved, use the API provided in GetSignMaxMinTmpSize.

Parameters

Table 1 Parameters in the template

  • T: Data type of the operand.
  • isReuseSource: Whether the source operand can be modified. This parameter is reserved; pass the default value false.

Table 2 API parameters

  • dstTensor (output): Destination operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.
  • srcTensor (input): Source operand. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. The source operand must have the same data type as the destination operand.
  • sharedTmpBuffer (input): Temporary buffer. The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT. This parameter stores intermediate variables generated during the Sign computation and is provided by developers. For details about how to obtain the temporary space size (BufferSize), see GetSignMaxMinTmpSize.
  • calCount (input): Number of data elements actually computed. The value range is [0, srcTensor.GetSize()].

Returns

None

Availability

Constraints

  • The source operand address must not overlap the destination operand address.
  • Currently, only the ND format is supported.
  • Ensure that calCount does not exceed the number of elements stored in srcTensor and dstTensor.
  • sharedTmpBuffer must not overlap the addresses of the source operand and destination operand.
  • For details about the alignment requirements of the operand address offset, see General Restrictions.
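
The calCount constraint above can be checked before calling the API. The helper below is a hypothetical sketch in standard C++ (IsValidCalCount is not part of the AscendC API; it only illustrates the bound that callers must respect):

```cpp
#include <cstdint>

// Hypothetical validation helper (not an AscendC API): calCount must not
// exceed the number of elements held by either the source or the
// destination tensor, i.e. calCount is in [0, min(srcElems, dstElems)].
inline bool IsValidCalCount(uint32_t calCount, uint32_t srcElems, uint32_t dstElems)
{
    return calCount <= srcElems && calCount <= dstElems;
}
```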

Example

sign_custom.cpp on the kernel side:

#include "kernel_operator.h"

constexpr int32_t BUFFER_NUM = 1;
class KernelSign
{
public:
    __aicore__ inline KernelSign() {}
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, uint32_t totalLength, uint32_t tilenum, uint32_t tmpSize, uint32_t mcount)
    {
        this->totalLength = totalLength;
        this->blockLength = totalLength / AscendC::GetBlockNum();
        this->tilenum = tilenum;
        this->tmpSize = tmpSize;
        this->mcount = mcount;
        this->tileLength = this->blockLength / tilenum / BUFFER_NUM;
        xGm.SetGlobalBuffer((__gm__ half *)x + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
        yGm.SetGlobalBuffer((__gm__ half *)y + this->blockLength * AscendC::GetBlockIdx(), this->blockLength);
        if (this->tmpSize != 0)
        {
            pipe.InitBuffer(tmpQueue, BUFFER_NUM, this->tmpSize);
        }
        pipe.InitBuffer(inQueueX, BUFFER_NUM, this->tileLength * sizeof(half));
        pipe.InitBuffer(outQueueY, BUFFER_NUM, this->tileLength * sizeof(half));
    }
    __aicore__ inline void Process()
    {
        int32_t loopCount = this->tilenum * BUFFER_NUM;
        for (int32_t i = 0; i < loopCount; i++)
        {
            CopyIn(i);
            Compute(i);
            CopyOut(i);
        }
    }

private:
    __aicore__ inline void CopyIn(int32_t progress)
    {
        AscendC::LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
        AscendC::DataCopy(xLocal, xGm[progress * this->tileLength], this->tileLength);
        inQueueX.EnQue(xLocal);
    }
    __aicore__ inline void Compute(int32_t progress)
    {
        AscendC::LocalTensor<half> xLocal = inQueueX.DeQue<half>();
        AscendC::LocalTensor<half> yLocal = outQueueY.AllocTensor<half>();
        if (this->tmpSize != 0)
        { // sharedTmpBuffer is passed.
            AscendC::LocalTensor<uint8_t> tmpLocal = tmpQueue.AllocTensor<uint8_t>();
            if (this->mcount != this->totalLength)
            { // Whether to pass calCount.
                AscendC::Sign(yLocal, xLocal, tmpLocal, this->mcount);
            }
            else
            {
                AscendC::Sign(yLocal, xLocal, tmpLocal);
            }
            tmpQueue.FreeTensor(tmpLocal);
        }
        else
        { // sharedTmpBuffer is not passed.
            if (this->mcount != this->totalLength)
            {
                AscendC::Sign(yLocal, xLocal, this->mcount);
            }
            else
            {
                AscendC::Sign(yLocal, xLocal);
            }
        }
        outQueueY.EnQue<half>(yLocal);
        inQueueX.FreeTensor(xLocal);
    }
    __aicore__ inline void CopyOut(int32_t progress)
    {
        AscendC::LocalTensor<half> yLocal = outQueueY.DeQue<half>();
        AscendC::DataCopy(yGm[progress * this->tileLength], yLocal, this->tileLength);
        outQueueY.FreeTensor(yLocal);
    }

private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> inQueueX;
    AscendC::TQue<AscendC::QuePosition::VECIN, BUFFER_NUM> tmpQueue;
    AscendC::TQue<AscendC::QuePosition::VECOUT, BUFFER_NUM> outQueueY;
    AscendC::GlobalTensor<half> xGm;
    AscendC::GlobalTensor<half> yGm;
    uint32_t blockLength;
    uint32_t tilenum;
    uint32_t tileLength;
    uint32_t tmpSize;
    uint32_t mcount;
    uint32_t totalLength;
};
extern "C" __global__ __aicore__ void sign_custom(GM_ADDR x, GM_ADDR y, GM_ADDR workspace, GM_ADDR tiling)
{
    GET_TILING_DATA(tilingData, tiling);
    KernelSign op;
    op.Init(x, y, tilingData.totalLength, tilingData.tilenum, tilingData.tmpSize, tilingData.mcount);
    if (TILING_KEY_IS(1))
    {
        op.Process();
    }
}

sign_custom_tiling.h on the host side:

#include "register/tilingdata_base.h"

namespace optiling {
BEGIN_TILING_DATA_DEF(SignCustomTilingData)
  TILING_DATA_FIELD_DEF(uint32_t, totalLength);
  TILING_DATA_FIELD_DEF(uint32_t, tmpSize);
  TILING_DATA_FIELD_DEF(uint32_t, tilenum);
  TILING_DATA_FIELD_DEF(uint32_t, mcount);
END_TILING_DATA_DEF;

REGISTER_TILING_DATA_CLASS(SignCustom, SignCustomTilingData)
}

sign_custom.cpp on the host side:

#include "sign_custom_tiling.h"
#include "register/op_def_registry.h"
#include "tiling/tiling_api.h"
namespace optiling {
static ge::graphStatus TilingFunc(gert::TilingContext* context)
{
  SignCustomTilingData tiling;
  const gert::RuntimeAttrs* attrs = context->GetAttrs();
  const uint32_t tilenum = *(attrs->GetAttrPointer<uint32_t>(0));
  const uint32_t blockdim = *(attrs->GetAttrPointer<uint32_t>(1));
  const uint32_t sizeflag = *(attrs->GetAttrPointer<uint32_t>(2));
  const uint32_t countflag = *(attrs->GetAttrPointer<uint32_t>(3));
  uint32_t totalLength = context->GetInputTensor(0)->GetShapeSize();
  auto dt = context->GetInputTensor(0)->GetDataType();
  context->SetBlockDim(blockdim);
  tiling.set_totalLength(totalLength);
  tiling.set_tilenum(tilenum);
  if (countflag == 1) {
      tiling.set_mcount(totalLength);
  }
  std::vector<int64_t> shape_vec = {totalLength};
  ge::Shape srcShape(shape_vec);
  uint32_t maxValue = 0;
  uint32_t minValue = 0;
  uint32_t dtypesize;
  if (dt == ge::DT_FLOAT16) {
      dtypesize = 2;
  } else {
      dtypesize = 4;
  }
  bool isReuseSource = false;
  AscendC::GetSignMaxMinTmpSize(srcShape, dtypesize, isReuseSource, maxValue, minValue);
  if (sizeflag == 0) {  // sizeflag 0: pass sharedTmpBuffer of the minimum size; 1: pass sharedTmpBuffer of the maximum size; 2: do not pass sharedTmpBuffer.
      tiling.set_tmpSize(minValue);
  } else if (sizeflag == 1) {
      tiling.set_tmpSize(maxValue);
  } else if (sizeflag == 2) {
      tiling.set_tmpSize(0);
  }
  tiling.SaveToBuffer(context->GetRawTilingData()->GetData(), context->GetRawTilingData()->GetCapacity());
  context->GetRawTilingData()->SetDataSize(tiling.GetDataSize());
  context->SetTilingKey(1);
  size_t *currentWorkspace = context->GetWorkspaceSizes(1);
  currentWorkspace[0] = 0;
  return ge::GRAPH_SUCCESS;
}
}
namespace ge {
static ge::graphStatus InferShape(gert::InferShapeContext* context)
{
    const gert::Shape* x1_shape = context->GetInputShape(0);
    gert::Shape* y_shape = context->GetOutputShape(0);
    *y_shape = *x1_shape;
    return GRAPH_SUCCESS;
}
}
namespace ops {
class SignCustom : public OpDef {
public:
    explicit SignCustom(const char* name) : OpDef(name)
    {
        this->Input("x")
            .ParamType(REQUIRED)
            .DataType({ge::DT_FLOAT16})
            .Format({ge::FORMAT_ND});
        this->Output("y")
            .ParamType(REQUIRED)
            .DataType({ge::DT_FLOAT16})
            .Format({ge::FORMAT_ND});
        this->SetInferShape(ge::InferShape);
        this->Attr("tilenum")
            .AttrType(REQUIRED)            
            .Int(0);
        this->Attr("blockdim")
            .AttrType(REQUIRED)            
            .Int(0);
        this->Attr("sizeflag")
            .AttrType(REQUIRED)            
            .Int(0);
        this->Attr("countflag")
            .AttrType(REQUIRED)
            .Int(0);
        this->AICore()
            .SetTiling(optiling::TilingFunc);
        this->AICore().AddConfig("ascendxxx");  // Replace xxx with the actual processor information.
    }
};
}
Result example:
The input and output data type is float. The one-dimensional vector contains eight elements.
Input data (srcLocal): [-np.inf, -2.0, -0.0, 0.0, np.nan, -np.nan, 2.0, np.inf]
Output data (dstLocal): [-1, -1, 0, 0, 0, 0, 1, 1]