Ascend C APIs

Ascend C provides a set of class library APIs that you use together with standard C++ syntax for programming. The Ascend C class library APIs are classified into the following types:

  • Kernel APIs: implement the operator kernel function, and include:
    • Basic data structures: data structures used throughout the kernel APIs, such as GlobalTensor and LocalTensor.
    • Basic APIs: abstract the hardware and expose chip capabilities while ensuring completeness and compatibility. APIs marked as Instruction Set Architecture Special Interface (ISASI, hardware-architecture-specific APIs) are not guaranteed to be compatible across hardware versions.
    • High-level APIs: built on the basic APIs, implement common computing algorithms (such as the math library, Matmul, and Softmax) to improve programming and development efficiency. Compatibility is guaranteed.
  • Host APIs:
    • Tiling APIs: provide the tiling parameters required for kernel computation.
    • Ascend C operator prototype registration and management APIs: define and register the Ascend C operator prototype.
    • Tiling data structure registration APIs: define and register the TilingData structure of an Ascend C operator.
    • Platform information obtaining APIs: obtain hardware platform information, such as the number of cores, to support tiling computation in the host-side tiling function.
  • Operator debugging APIs: support operator debugging, including twin debugging and performance debugging.

Basic data structures and APIs are also required for Ascend C operator programming on the host; for details, see Basic Data Structures and APIs. After operator development, runtime APIs are required to call the operator; for details, see "AscendCL API Reference" in the CANN AscendCL Application Software Development Guide (C&C++).

Kernel API - Basic APIs

Table 1 Scalar computation APIs

| API | Function |
| --- | --- |
| ScalarGetCountOfValue | Obtains the number of 0s or 1s in the binary representation of a uint64_t value. |
| ScalarCountLeadingZero | Counts the leading 0s of a uint64_t value (the number of 0s from the most significant bit down to the first 1). |
| ScalarCast | Converts a scalar to a specified type. |
| CountBitsCntSameAsSignBit | Counts the consecutive bits equal to the sign bit, starting from the most significant bit, in the binary representation of a uint64_t value. |
| ScalarGetSFFValue | Obtains the position of the first 0 or 1 in the binary representation of a uint64_t value. |
| ToBfloat16 | Converts a float scalar to a bfloat16_t scalar. |
| ToFloat | Converts a bfloat16_t scalar to a float scalar. |
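
The bit-counting scalar APIs have straightforward semantics that can be illustrated in plain host C++ (this is our sketch of the described behavior, not Ascend C code; the helper names are ours, and whether the sign bit itself is counted by CountBitsCntSameAsSignBit is our assumption):

```cpp
#include <cstdint>

// Number of 1 bits (ScalarGetCountOfValue-style semantics when counting 1s).
inline int count_ones(uint64_t x) {
    int n = 0;
    while (x) { n += static_cast<int>(x & 1u); x >>= 1; }
    return n;
}

// Leading zeros: 0s from the most significant bit down to the first 1
// (ScalarCountLeadingZero-style semantics; returns 64 for x == 0).
inline int leading_zeros(uint64_t x) {
    int n = 0;
    for (int i = 63; i >= 0 && !((x >> i) & 1u); --i) ++n;
    return n;
}

// Consecutive bits equal to the sign bit, counted from the MSB
// (CountBitsCntSameAsSignBit-style semantics; we assume the sign bit counts).
inline int bits_same_as_sign(uint64_t x) {
    uint64_t sign = (x >> 63) & 1u;
    int n = 0;
    for (int i = 63; i >= 0 && ((x >> i) & 1u) == sign; --i) ++n;
    return n;
}
```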

Table 2 Vector computation APIs

| Category | API | Function |
| --- | --- | --- |
| One-Operand Instructions | Exp | Computes the natural exponential function element-wise. |
| | Ln | Computes the natural logarithm element-wise. |
| | Abs | Computes the absolute value element-wise. |
| | Reciprocal | Computes the reciprocal element-wise. |
| | Sqrt | Computes the square root element-wise. |
| | Rsqrt | Computes the reciprocal of the square root element-wise. |
| | Not | Performs bitwise NOT element-wise. |
| | Relu | Performs a ReLU operation element-wise. |
| Two-Operand Instructions | Add | Performs addition element-wise. |
| | Sub | Performs subtraction element-wise. |
| | Mul | Performs multiplication element-wise. |
| | Div | Performs division element-wise. |
| | Max | Computes the maximum element-wise. |
| | Min | Computes the minimum element-wise. |
| | And | Performs bitwise AND element-wise. |
| | Or | Performs bitwise OR element-wise. |
| | AddRelu | Adds inputs element-wise and takes the larger of the result and 0. |
| | AddReluCast | Adds inputs element-wise, takes the larger of the result and 0, and converts precision based on the data types of the source and destination tensors. |
| | AddDeqRelu | Adds inputs element-wise, applies Deq quantization to the sum, and then applies ReLU (takes the larger of the result and 0). |
| | SubRelu | Subtracts inputs element-wise and takes the larger of the result and 0. |
| | SubReluCast | Subtracts inputs element-wise, takes the larger of the result and 0, and converts precision based on the data types of the source and destination tensors. |
| | MulAddDst | Multiplies src0Local and src1Local element-wise, adds the product to dstLocal, and stores the result in dstLocal. |
| | MulCast | Multiplies inputs element-wise and then converts precision. |
| | FusedMulAdd | Multiplies src0Local and dstLocal element-wise, adds src1Local, and stores the result in dstLocal. |
| | FusedMulAddRelu | Multiplies src0Local and dstLocal element-wise, adds src1Local, takes the larger of the result and 0, and stores the result in dstLocal. |
| Two-Operand Scalar Instructions | Adds | Adds a scalar to each element of a vector. |
| | Muls | Multiplies each element of a vector by a scalar. |
| | Maxs | Compares each element of the vector source operand with a scalar and takes the maximum. |
| | Mins | Compares each element of the vector source operand with a scalar and takes the minimum. |
| | ShiftLeft | Logically shifts each element of the source operand left; the shift distance is given by the scalar argument. |
| | ShiftRight | Shifts each element of the source operand right; the shift distance is given by the scalar argument. |
| | LeakyRelu | Computes Leaky ReLU element-wise on the source operand. |
| Triple-Operand Scalar Instructions | Axpy | Multiplies each element of the source operand (srcLocal) by a scalar and adds the product to the corresponding element of the destination operand (dstLocal). |
| Comparison Instructions | Compare | Compares two tensors element-wise. If the comparison holds, the corresponding output bit is 1; otherwise it is 0. |
| | Compare (Result Stored in a Register) | Compares two tensors element-wise, setting the corresponding output bit to 1 if the comparison holds and to 0 otherwise. Use this API when the mask parameter is required; the result is stored in a register. |
| | CompareScalar | Compares each tensor element with a scalar. If the comparison holds, the corresponding output bit is 1; otherwise it is 0. |
| Selection Instructions | Select | Selects between source operands src0 and src1 according to the bits of selMask (the selection mask) to produce the destination operand dst: where a selMask bit is 1, src0 is selected; where it is 0, src1 is selected. |
| | GatherMask | Selects elements from the source operand into the destination operand according to a gather mask, which is either a built-in fixed pattern or the binary of user-provided tensor values. |
| Precision Conversion Instructions | Cast | Converts precision based on the data types of the source and destination tensors. |
| | CastDeq | Quantizes the input and converts its precision. |
| Reduction Instructions | ReduceMax | Obtains the maximum value of the input data and its index. |
| | ReduceMin | Obtains the minimum value of the input data and its index. |
| | ReduceSum | Sums all input data. |
| | WholeReduceMax | Computes the maximum value and its index within each repeat. |
| | WholeReduceMin | Computes the minimum value and its index within each repeat. |
| | WholeReduceSum | Sums all data in each repeat. |
| | BlockReduceMax | Computes the maximum of all elements in each data block. |
| | BlockReduceMin | Computes the minimum of all elements in each data block. |
| | BlockReduceSum | Sums all elements in each data block; source operands are added in a binary-tree pattern. |
| | PairReduceSum | Sums adjacent (odd, even) element pairs. |
| | RepeatReduceSum | Sums all data in each repeat. Unlike WholeReduceSum, it does not support the bitwise mask mode; WholeReduceSum is recommended for its more complete functionality. |
| Data Conversion | Transpose | Transposes 16 x 16 2D matrix data blocks and converts between [N, C, H, W] and [N, H, W, C]. |
| | TransDataTo5HD | Converts NCHW format to NC1HWC0 format; can also transpose a 2D matrix data block. |
| Data Padding | Duplicate | Replicates a variable or an immediate value and fills it into the vector. |
| | Brcb | Each time, extracts eight elements from the input tensor and fills eight data blocks (32 bytes each) in the result tensor, one element per data block. |
| | CreateVecIndex | Creates a vector of indices starting from firstValue. |
| Data Scatter/Data Gather | Gather | Gathers elements of the input tensor into the result tensor according to the provided offset-address tensor. |
| Mask Operations | SetMaskCount | Sets the mask to counter mode. In this mode you pass in the amount of data to compute directly, without tracking the iteration count or handling unaligned tail blocks; the Vector unit infers the actual number of iterations. |
| | SetMaskNorm | Sets the mask to normal mode (the default), in which you configure the number of iterations. |
| | SetVectorMask | Sets the mask used during vector computation. |
| | ResetMask | Restores the mask to its default value (all 1s), meaning every element participates in each iteration of the vector computation. |
| Quantization Settings | SetDeqScale | Sets the value of the DEQSCALE register. |
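
Ignoring tiling, masks, and hardware details, the element-wise and reduction semantics of the vector APIs match this plain host C++ sketch (our illustration, not Ascend C code):

```cpp
#include <algorithm>
#include <numeric>
#include <utility>
#include <vector>

// Element-wise semantics of the two-operand Add instruction.
std::vector<float> add(const std::vector<float>& a, const std::vector<float>& b) {
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
    return out;
}

// Vector-scalar semantics of Adds: add one scalar to every element.
std::vector<float> adds(const std::vector<float>& a, float s) {
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + s;
    return out;
}

// Reduction semantics of ReduceSum: sum over all input data.
float reduce_sum(const std::vector<float>& a) {
    return std::accumulate(a.begin(), a.end(), 0.0f);
}

// Reduction semantics of ReduceMax: maximum value plus its index.
std::pair<float, std::size_t> reduce_max(const std::vector<float>& a) {
    auto it = std::max_element(a.begin(), a.end());
    return {*it, static_cast<std::size_t>(it - a.begin())};
}
```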

Table 3 Data movement APIs

| API | Function |
| --- | --- |
| DataCopy | Performs data movement, including common data movement, enhanced data movement, tiled data movement, and associated format conversion. |
| Copy | Moves data between VECIN, VECCALC, and VECOUT; supports the mask operation and the data-block interval operation. |

Table 4 Memory management and synchronization control APIs

| API | Function |
| --- | --- |
| TPipe | Globally allocates and manages resources such as memory. |
| GetTPipePtr | Obtains the pointer to the framework-managed global TPipe; with the pointer you can perform TPipe operations. |
| TBufPool | Manually manages and reuses Unified Buffer/L1 Buffer physical memory; mainly used when that memory is insufficient during multi-stage computation. |
| TQue | Performs EnQue and DeQue operations and implements inter-task communication and synchronization through queues. |
| TQueBind | Binds source and destination logical positions to determine where memory is allocated and to insert the corresponding synchronization events, handling memory allocation, management, and synchronization. |
| TBuf | Manages the memory occupied by temporary variables used during Ascend C programming. |
| InitSpmBuffer | Initializes the SPM buffer. |
| WriteSpmBuffer | Copies data that must be spilled and temporarily stored into the SPM buffer. |
| ReadSpmBuffer | Reads data from the SPM buffer back into local memory. |
| GetUserWorkspace | Obtains the user workspace pointer. |
| SetSysWorkSpace | Sets the system workspace pointer; the framework's communication mechanism uses the system workspace during fused-operator programming. |
| GetSysWorkSpacePtr | Obtains the system workspace pointer. |
| TQueSync | Provides synchronization control. |
| IBSet | Synchronizes AI Cores to avoid data hazards such as write-after-read, read-after-write, and write-after-write when different AI Cores operate on the same global memory block. |
| IBWait | Synchronizes AI Cores to avoid data hazards such as write-after-read, read-after-write, and write-after-write when different AI Cores operate on the same global memory block. |
| SyncAll | Synchronizes AI Cores to avoid data hazards such as write-after-read, read-after-write, and write-after-write when different AI Cores operate on the same global memory block. |
| InitDetermineComputeWorkspace | Initializes the value of the GM shared memory; WaitPreBlock and NotifyNextBlock can be called only after this initialization completes. |
| WaitPreBlock | Reads the value at the GM address to decide whether to keep waiting; once the value satisfies the current core's wait condition, the core proceeds to its next operation. |
| NotifyNextBlock | Writes to the GM address to notify the next core that the current core has finished and the next core may proceed. |
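
As a structural illustration of how TPipe, TQue, DataCopy, and the vector APIs fit together, the following is a hedged sketch modeled on the typical Ascend C CopyIn/Compute/CopyOut vector-kernel pattern. Treat it as pseudocode: buffer sizes, the tensor type, and the elided steps are placeholders, and it compiles only with the CANN toolchain.

```cpp
// Pseudocode sketch of the CopyIn -> Compute -> CopyOut pipeline (not buildable as-is).
class KernelAdd {
public:
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z) {
        xGm.SetGlobalBuffer((__gm__ half*)x, TILE_LEN);
        yGm.SetGlobalBuffer((__gm__ half*)y, TILE_LEN);
        zGm.SetGlobalBuffer((__gm__ half*)z, TILE_LEN);
        pipe.InitBuffer(inQueueX, 1, TILE_LEN * sizeof(half));  // TPipe allocates queue memory
        pipe.InitBuffer(inQueueY, 1, TILE_LEN * sizeof(half));
        pipe.InitBuffer(outQueueZ, 1, TILE_LEN * sizeof(half));
    }
    __aicore__ inline void Process() {
        AscendC::LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
        AscendC::DataCopy(xLocal, xGm, TILE_LEN);               // GM -> VECIN
        inQueueX.EnQue(xLocal);                                 // TQue inserts the sync events
        // ... same CopyIn for y, then DeQue both, and:
        // AscendC::Add(zLocal, xLocal, yLocal, TILE_LEN);      // element-wise vector Add
        // ... EnQue zLocal, DeQue it, DataCopy back to GM, free tensors
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueX, inQueueY;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueZ;
    AscendC::GlobalTensor<half> xGm, yGm, zGm;
};
```

The point of the pattern is that the programmer never issues raw pipeline synchronization: enqueueing and dequeueing through TQue is what inserts the required events between the data-movement and compute pipelines.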

Table 5 Cache processing APIs

| API | Function |
| --- | --- |
| DataCachePreload | Preloads data into the data cache from the DDR address where the source address resides. |
| DataCacheCleanAndInvalid | Flushes the cache to ensure cache coherence. |

Table 6 System variable access API

| API | Function |
| --- | --- |
| GetBlockNum | Obtains the number of blocks configured for the current task; used for multi-core logic control in the code. |
| GetBlockIdx | Obtains the index of the current core; used for multi-core logic control and multi-core offset computation in the code. |
| GetDataBlockSizeInBytes | Obtains the data block size (in bytes) for the current chip version; from it you can compute API parameters such as repeatTimes, dataBlockStride, and repeatStride. |
| GetArchVersion | Obtains the version number of the current AI processor architecture. |
| GetTaskRation | Obtains the ratio of AICs to AIVs; applies to the separated architecture. |

Table 7 Atomic operation APIs

| API | Function |
| --- | --- |
| SetAtomicAdd | Sets whether atomic addition is performed when data is moved from VECOUT to GM, from L0C to GM, or from L1 to GM; the addition data type is selected via parameters. |
| SetAtomicType | Sets the atomic-operation data type via template parameters. |
| SetAtomicNone | Clears the atomic-operation state. |
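
Atomic addition here means that concurrent writes to the same GM address are combined rather than overwritten. The effect is analogous to accumulating into a shared slot with std::atomic in host C++ (our analogy, not Ascend C code):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Several writers accumulate into one shared slot. With a plain store the
// last writer would win; with an atomic add every contribution is kept --
// the behavior SetAtomicAdd enables for VECOUT -> GM transfers.
int atomic_accumulate(int writers, int value_per_writer) {
    std::atomic<int> slot{0};
    std::vector<std::thread> pool;
    for (int i = 0; i < writers; ++i)
        pool.emplace_back([&] { slot.fetch_add(value_per_writer); });
    for (auto& t : pool) t.join();
    return slot.load();
}
```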

Table 8 Kernel tiling APIs

| API | Function |
| --- | --- |
| GET_TILING_DATA | Obtains the tiling information passed to the operator's kernel entry function and fills it into the registered tiling structure. Implemented via macro expansion. If multiple TilingData structures are registered, this API returns the default registered structure. |
| GET_TILING_DATA_WITH_STRUCT | Obtains the tiling information for a named structure and fills it into that tiling structure. Implemented via macro expansion. |
| GET_TILING_DATA_MEMBER | Obtains a member variable of a tiling structure. |
| TILING_KEY_IS | Checks whether the tiling_key of the current kernel invocation equals a specific key, identifying the kernel branch where tiling_key == key. |
| REGISTER_TILING_DEFAULT | Registers the user-defined default TilingData structure (written in standard C++) on the kernel side. |
| REGISTER_TILING_FOR_TILINGKEY | Registers a custom TilingData structure matched to a TilingKey on the kernel side. You provide a logical expression in which the string TILING_KEY_VAR stands for the actual TilingKey and describes the range the TilingKey must satisfy. |
| KERNEL_TASK_TYPE_DEFAULT | Sets the global default kernel type, applying to all tiling keys. |
| KERNEL_TASK_TYPE | Sets the kernel type for a specific tiling key. |
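
Host-side tiling typically partitions the element count across cores and handles the unaligned tail; the arithmetic a tiling function computes looks like this minimal host C++ sketch (the structure fields and function names are ours, not the Ascend C tiling API):

```cpp
#include <cstdint>

// Illustrative tiling result: how many cores to use and how much each handles.
struct TilingData {
    uint32_t blockNum;  // cores actually used
    uint32_t tileLen;   // elements per full tile
    uint32_t tailLen;   // elements handled by the last core
};

// Split totalLen elements across up to maxCores cores in units of `align`
// elements (e.g. one 32-byte data block of half values = 16 elements).
TilingData ComputeTiling(uint32_t totalLen, uint32_t maxCores, uint32_t align) {
    TilingData t{};
    uint32_t blocks = (totalLen + align - 1) / align;     // aligned units
    t.blockNum = blocks < maxCores ? blocks : maxCores;
    uint32_t perCore = blocks / t.blockNum;
    t.tileLen = perCore * align;
    t.tailLen = totalLen - t.tileLen * (t.blockNum - 1);  // last core takes the rest
    return t;
}
```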

Table 9 ISASI APIs

| Category | API | Function |
| --- | --- | --- |
| Vector Computation | VectorPadding | Pads the source operand by data block according to padMode and padSide. |
| | BilinearInterpolation | Performs bilinear interpolation, including vertical iteration and horizontal iteration. |
| | GetCmpMask | Obtains the comparison result of the Compare (Result Stored in a Register) instruction. |
| | SetCmpMask | Sets the comparison register for Select APIs that do not specify the mask parameter. |
| | GetAccVal | Obtains the ReduceSum result (over the first n tensor elements). |
| | GetReduceMaxMinCount | Obtains the maximum/minimum values and their indexes when ReduceMax and ReduceMin are issued consecutively. |
| | ProposalConcat | Inserts consecutive elements into the corresponding positions of Region Proposals; each iteration inserts 16 consecutive elements into 16 Region Proposals. |
| | ProposalExtract | Extracts elements from the corresponding positions of Region Proposals and lays them out consecutively; each iteration extracts 16 elements from 16 Region Proposals. This is the inverse of ProposalConcat. |
| | RpSort16 | Sorts Region Proposals by score field in descending order; 16 Region Proposals are sorted per iteration. |
| | MrgSort4 | Merges up to four sorted Region Proposal lists into one, sorted in descending order of the score fields. |
| | Sort32 | Sorts up to 32 elements per iteration. |
| | MrgSort | Merges up to four sorted lists into one, sorted in descending order of the score fields. |
| | GetMrgSortResult | Obtains the number of region proposals in each queue processed by MrgSort or MrgSort4 and stores the counts into the four list arguments in order. |
| | Gatherb | Gathers the input tensor into the result tensor according to the provided offset-address tensor. |
| | Scatter | Distributes a contiguous input tensor into a new result tensor according to a destination-address offset tensor and a base offset. |
| Cube Computation | InitConstValue | Initializes a LocalTensor (TPosition A1, A2, B1, or B2) to a given value. |
| | LoadData | Provides the Load2D and Load3D data-loading functions. |
| | LoadDataWithTranspose | Loads 2D data with transposition from A1/B1 to A2/B2. |
| | SetAippFunctions | Sets AI preprocessing (AIPP) parameters for images. |
| | LoadImageToLocal | Moves image data from GM to A1/B1. During the move, images can be preprocessed: flipping, resizing (clipping, cropping, scaling, and stretching), color space conversion (CSC), and type conversion. |
| | LoadUnZipIndex | Loads the compression index table from GM into internal registers. |
| | LoadDataUnzip | Decompresses data on GM and moves it to A1, B1, or B2. |
| | LoadDataWithSparse | Moves the 512-byte dense weight matrix stored in B1 to B2 and reads the 128-byte index matrix used to sparsify the dense matrix. |
| | SetFmatrix | Sets the feature-map attribute description used when Load3Dv1/Load3Dv2 is called. |
| | SetLoadDataBoundary | Sets the A1/B1 boundary value used when Load3D is called. |
| | SetLoadDataRepeat | Sets the repeat parameter of Load3Dv2 so that one Load3Dv2 call completes the data movement of multiple iterations. |
| | SetLoadDataPaddingValue | Sets padValue for Load3Dv1/Load3Dv2. |
| | Mmad | Performs matrix multiply-accumulate. |
| | MmadWithSparse | Performs matrix multiply-accumulate where the input left matrix A is sparse and the input right matrix B is dense. |
| | Fixpipe | Post-processes the result of a matrix computation, for example quantizing it and moving the data from CO1 to global memory. |
| | SetFixPipeConfig | Sets the source operands for the ReLU and quant stages of Fixpipe (ReLU when FixpipeParams.reluEn is true; quant when FixpipeParams.QuantParams is set to a value other than NoQuant). |
| | SetFixpipeNz2ndFlag | Configures the nz2nd stage of Fixpipe (enabled when FixpipeParams.Nz2NdParams.nz2ndEn is true). |
| | SetFixpipePreQuantFlag | Sets the deq scalar (quantization parameter) for the quantization stage of Fixpipe. |
| | SetFixPipeClipRelu | Sets the upper bound of the ClipReLU operation performed after real-time quantization during DataCopy (CO1 -> GM). |
| | SetFixPipeAddr | Sets the LocalTensor address for the element-wise operation performed after real-time quantization during DataCopy (CO1 -> GM). |
| | SetHF32Mode | Sets the HF32 mode of the MMAD unit (a register setting, like SetHF32TransMode and SetMMLayoutTransform). |
| | SetHF32TransMode | Sets the HF32 rounding mode of the MMAD unit (a register setting); valid only while the HF32 mode of the MMAD unit is in effect. |
| | SetMMLayoutTransform | Sets the M/N direction of the MMAD unit (a register setting). |
| | CheckLocalMemoryIA | Monitors UB reads and writes within a specified range; if such an access is detected, an EXCEPTION error is reported, otherwise no error is reported. |
| | Conv2D | Performs 2D convolution on an input tensor and a weight tensor and outputs a result tensor. Convolution layers are widely used in image recognition, where filters extract image features. |
| | Gemm | Multiplies matrix A by matrix B to obtain and output matrix C. |
| Data Movement | DataCopyPad | Enables non-aligned data movement. |
| | SetPadValue | Sets the fill value used by DataCopyPad. |
| Synchronization Control | SetFlag/WaitFlag | Synchronizes different pipelines within one core; insert this operation between instructions of different pipelines that have a data dependency. |
| | PipeBarrier | Blocks a pipeline; insert this operation between instructions of the same pipeline that have a data dependency. |
| | DataSyncBarrier | Blocks subsequent instructions until all previous memory-access instructions complete (the memory location waited on is controlled by parameters). |
| | CrossCoreSetFlag | Sets synchronization between AICs and AIVs in the separated architecture. |
| | CrossCoreWaitFlag | Waits for synchronization between AICs and AIVs in the separated architecture. |
| Cache Processing | ICachePreLoad | Preloads instructions into the iCache from the DDR address where they reside. |
| | GetICachePreloadStatus | Obtains the iCache preload status. |
| System Variable Access | GetProgramCounter | Obtains the pointer to the program counter, which records the current execution position. |
| | GetSubBlockNum | Obtains the number of Vector Cores on the AI Core. |
| | GetSubBlockIdx | Obtains the ID of the Vector Core within the AI Core. |
| | GetSystemCycle | Obtains the current system cycle count. To convert cycles to time, use the 50 MHz frequency: time (μs) = cycles / 50. |
| Atomic Operations | SetAtomicMax | Sets whether subsequent VECOUT -> GM transfers perform an atomic compare that writes the maximum of the copied content and the existing GM content to GM. |
| | SetAtomicMin | Sets whether subsequent VECOUT -> GM transfers perform an atomic compare that writes the minimum of the copied content and the existing GM content to GM. |
| | SetStoreAtomicConfig | Sets the atomic-operation enable flag and type. |
| | GetStoreAtomicConfig | Obtains the atomic-operation enable flag and type. |
| Resource Management | CubeResGroupHandle | Coordinates AIC-AIV communication through software synchronization in the separated architecture to group the compute resources of AI Cores. |
| | GroupBarrier | Synchronizes two mutually dependent AIV tasks within the same CubeResGroupHandle object. |
| | KfcWorkspace | Describes the communication workspace and manages the division of the message communication area among CubeResGroupHandle objects; used together with CubeResGroupHandle. The KfcWorkspace constructor creates a KfcWorkspace object. |

Kernel API - High-Level APIs

Table 10 Math library APIs

| API | Function |
| --- | --- |
| Acos | Computes the arc cosine element-wise. |
| Acosh | Computes the inverse hyperbolic cosine element-wise. |
| Asin | Computes the arc sine element-wise. |
| Asinh | Computes the inverse hyperbolic sine element-wise. |
| Atan | Computes the arc tangent element-wise. |
| Atanh | Computes the inverse hyperbolic tangent element-wise. |
| Axpy | Adds the product of each element of the source operand and a scalar to the corresponding element of the destination operand. |
| Ceil | Returns the smallest integer greater than or equal to x, that is, rounds towards positive infinity. |
| ClampMax | Replaces numbers greater than scalar with scalar in srcTensor and keeps numbers less than or equal to scalar, writing the result to dstTensor. |
| ClampMin | Replaces numbers less than scalar with scalar in srcTensor and keeps numbers greater than or equal to scalar, writing the result to dstTensor. |
| Cos | Computes the cosine element-wise. |
| Cosh | Computes the hyperbolic cosine element-wise. |
| CumSum | Accumulates data by row or column. |
| Digamma | Computes the logarithmic derivative of the gamma function of x element-wise. |
| Erf | Computes the error function (Gaussian error function) element-wise. |
| Erfc | Computes the complementary error function of input x (the integral from x to infinity) element-wise. |
| Exp | Computes the natural exponential function element-wise. |
| Floor | Returns the largest integer less than or equal to x, that is, rounds towards negative infinity. |
| Fmod | Computes the remainder of two floating-point numbers element-wise. |
| Frac | Returns the fractional part element-wise. |
| Lgamma | Computes the natural logarithm of the absolute value of the gamma function of x element-wise. |
| Log | Computes the logarithm with base e, 2, or 10 element-wise. |
| Power | Computes exponentiation element-wise. |
| Round | Rounds each element to the nearest integer. |
| Sign | Performs the Sign operation element-wise, returning the sign of the input data. |
| Sin | Computes the sine element-wise. |
| Sinh | Computes the hyperbolic sine element-wise. |
| Tan | Computes the tangent element-wise. |
| Tanh | Computes the hyperbolic tangent element-wise. |
| Trunc | Truncates floating-point numbers element-wise, that is, rounds towards zero. |
| Xor | Performs the XOR operation element-wise. |

Table 11 Quantization and dequantization APIs

| API | Function |
| --- | --- |
| AscendAntiQuant | Performs fake quantization by element, for example converting the int8_t data type to the half type. |
| AscendDequant | Performs dequantization by element, for example dequantizing the int32_t data type to the half/float data type. |
| AscendQuant | Performs quantization by element, for example quantizing the half/float data type to the int8_t data type. |

Table 12 Data normalization APIs

| API | Function |
| --- | --- |
| BatchNorm | Normalizes each input feature of the samples in each batch along the batch dimension. |
| DeepNorm | Serves as a replacement for LayerNorm normalization when training deep neural networks. |
| GroupNorm | Divides the input C dimension into groupNum groups and standardizes each group of data. |
| LayerNorm | Normalizes the inputs of a network layer so that the input and output distributions are standardized across network layers. |
| LayerNormGrad | Computes the backpropagation gradient of LayerNorm. |
| LayerNormGradBeta | Computes the backward beta/gamma values; used together with LayerNormGrad to output pdx, gamma, and beta. |
| Normalize | Given the mean and variance known from LayerNorm, computes rstd (the reciprocal of the standard deviation) of input data with shape [A, R] and the normalized output y. |
| RmsNorm | Normalizes input data of shape [B, S, H] using RmsNorm. |
| WelfordUpdate | Implements the preprocessing (update) step of the Welford algorithm. |
| WelfordFinalize | Implements the postprocessing (finalize) step of the Welford algorithm. |

Table 13 Activation function APIs

| API | Function |
| --- | --- |
| AdjustSoftMaxRes | Post-processes SoftMax compute results, adjusting them to specified values. |
| FasterGelu | Implements a simplified FastGelu activation function. |
| FasterGeluV2 | Implements the FastGeluV2 activation function. |
| GeGLU | Serves as a GLU variant that uses GeLU as its activation function. |
| Gelu | Serves as an important activation function, inspired by ReLU and dropout, that introduces stochastic regularization into activation. |
| LogSoftMax | Performs LogSoftmax computation on the input tensor. |
| ReGlu | Serves as a GLU variant that uses ReLU as its activation function. |
| Sigmoid | Computes the Sigmoid function element-wise. |
| Silu | Computes Silu element-wise. |
| SimpleSoftMax | Uses precomputed sum and max data to perform softmax computation on the input tensor. |
| SoftMax | Performs softmax computation on input tensors by row. |
| SoftmaxFlash | Serves as an enhanced version of SoftMax that not only performs softmax on the input tensor but also updates the current result using the sum and max values from the previous softmax computation. |
| SoftmaxFlashV2 | Serves as an enhanced version of SoftmaxFlash, corresponding to the FlashAttention-2 algorithm. |
| SoftmaxGrad | Performs softmax gradient backpropagation on input tensors. |
| SoftmaxGradFront | Performs softmax gradient backpropagation on input tensors. |
| SwiGLU | Serves as a GLU variant that uses Swish as its activation function. |
| Swish | Serves as the Swish activation function in neural networks. |
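
SoftmaxFlash and SoftmaxFlashV2 implement the online (FlashAttention-style) softmax update: a running max and running sum are rescaled as each new block of scores arrives, so the final statistics equal those of a softmax over all the data at once. A minimal host C++ sketch of the idea (our illustration, not the Ascend C API):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Running softmax statistics over the blocks processed so far.
struct OnlineSoftmax {
    double maxVal = -INFINITY;
    double sumExp = 0.0;

    // Fold one new block of scores into the running (max, sum) pair,
    // rescaling the old sum when a new maximum appears.
    void update(const std::vector<double>& block) {
        double blockMax = maxVal;
        for (double x : block) blockMax = std::max(blockMax, x);
        double scale = std::exp(maxVal - blockMax);  // rescales the old sum
        double blockSum = 0.0;
        for (double x : block) blockSum += std::exp(x - blockMax);
        sumExp = sumExp * scale + blockSum;
        maxVal = blockMax;
    }

    // Softmax weight of one score, given the final statistics.
    double weight(double x) const { return std::exp(x - maxVal) / sumExp; }
};
```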

Table 14 Reduction APIs

| API | Function |
| --- | --- |
| Mean | Computes the mean of elements along the last axis. |
| ReduceXorSum | Performs the bitwise XOR operation by element and sums the results using ReduceSum. |
| Sum | Obtains the sum of elements in the last dimension. |

Table 15 Sorting APIs

| API | Function |
| --- | --- |
| TopK | Obtains the first k maximum or minimum values of the last dimension and their corresponding indexes. |
| Concat | Preprocesses data for sorting by merging the source operand srcLocal into the target data concatLocal; after preprocessing, the data can be sorted. |
| Extract | Processes the sorting result data and outputs the sorted values and indexes. |
| Sort | Sorts data in descending order by value. |
| MrgSort | Merges up to four sorted lists into one, sorted in descending order of the score fields. |

Table 16 Data padding APIs

| API | Function |
| --- | --- |
| BroadCast | Broadcasts the input based on the output shape. |
| Pad | Pads a two-dimensional tensor (height x width) to 32-byte alignment in the width direction. |
| UnPad | Unpads a two-dimensional tensor (height x width) in the width direction. |

Table 17 Data filtering APIs

| API | Function |
| --- | --- |
| DropOut | Filters the source operand using a mask tensor to obtain the destination operand. |

Table 18 Comparing and selecting APIs

| API | Function |
| --- | --- |
| SelectWithBytesMask | Given two source operands src0 and src1, selects elements according to the (non-bit) values at the corresponding positions of maskTensor to produce the destination operand dst. |

Table 19 Deformation APIs

| API | Function |
| --- | --- |
| ConfusionTranspose | Transposes and reshapes the input data. |

Table 20 Index operation APIs

| API | Function |
| --- | --- |
| ArithProgression | Generates an arithmetic progression from a start value, a common difference, and a length. |

Table 21 Matmul APIs

| API | Function |
| --- | --- |
| Matmul | Performs matrix multiplication. |
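
As a reference for what the Matmul high-level API computes (C = A x B), ignoring tiling, data layout, and the cube unit, a plain row-major host C++ sketch (our illustration):

```cpp
#include <cstddef>
#include <vector>

// Row-major reference matmul: C[i][j] = sum_p A[i][p] * B[p][j],
// with A of shape m x k, B of shape k x n, C of shape m x n.
std::vector<float> matmul(const std::vector<float>& a, const std::vector<float>& b,
                          int m, int k, int n) {
    std::vector<float> c(static_cast<std::size_t>(m) * n, 0.0f);
    for (int i = 0; i < m; ++i)
        for (int p = 0; p < k; ++p) {
            float av = a[static_cast<std::size_t>(i) * k + p];
            for (int j = 0; j < n; ++j)
                c[static_cast<std::size_t>(i) * n + j] += av * b[static_cast<std::size_t>(p) * n + j];
        }
    return c;
}
```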

Table 22 HCCL APIs

| API | Function |
| --- | --- |
| Hccl | Flexibly orchestrates collective communication tasks on the AI Core. |

Table 23 Tool APIs

| API | Function |
| --- | --- |
| InitGlobalMemory | Initializes data in the global memory to a specified value. |

Host API

Table 24 Host APIs

Category

API

Function

Prototype registration and management

Prototype Registration API (OP_ADD)

Registers the prototype definition of an operator.

OpDef

Defines the operator prototype.

OpParamDef

Defines operator parameters.

OpAttrDef

Defines operator attributes.

OpAICoreDef

Defines the AI Core implementation information and associates the tiling implementation and shape inference functions.

OpAICoreConfig

Configures AI Core information.

OpMC2Def

Configures the communicator name of the MC2 operator on the host. After configuration, the context address corresponding to the communicator can be obtained on the kernel side.

Tiling data structure registration

TilingData Structure Definition

Defines a TilingData class and adds the member variables (TilingData fields) needed to store the tiling parameters. The defined class inherits from TilingDef, the base class for storing and processing user-defined tiling structure member variables, which provides APIs for setting, serializing, and saving the TilingData fields.

TilingData Structure Registration

Registers the defined TilingData structure and binds it with a custom operator.

ContextBuilder

Provides a series of APIs for manually building a TilingContext object to verify tiling functions, and a KernelContext object to verify TilingParse functions.

Template Argument Definition

Defines the template argument declaration (ASCENDC_TPL_ARGS_DECL) and the template argument selection (ASCENDC_TPL_ARGS_SEL, which specifies the available template combinations).

GET_TPL_TILING_KEY

Automatically generates a TilingKey during tiling template programming. This API converts the passed template arguments into binary values based on the defined bit widths, concatenates the binary values in sequence, and converts the result into a uint64_t value, that is, the TilingKey.

Platform information acquisition

PlatformAscendC

Obtains hardware platform information, such as the number of cores, required for tiling computation when implementing the tiling function on the host. The PlatformAscendC class provides functions for obtaining such platform information.

PlatformAscendCManager

Obtains hardware platform information, such as the number of cores, when calling operators in the basic mode (kernel launch) in a kernel-launch-based operator project. The PlatformAscendCManager class provides functions for obtaining such platform information.
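The TilingData definition and registration rows above correspond to a small set of host-side macros. The sketch below assumes a hypothetical AddCustom operator; the macro names follow the CANN documentation, while the field names are illustrative.

```cpp
#include "register/tilingdata_base.h"

namespace optiling {
// Define the TilingData structure: each field becomes a serialized
// tiling parameter with generated getter/setter APIs (via TilingDef).
BEGIN_TILING_DATA_DEF(AddCustomTilingData)
    TILING_DATA_FIELD_DEF(uint32_t, totalLength);  // total element count
    TILING_DATA_FIELD_DEF(uint32_t, tileNum);      // tiles per core
END_TILING_DATA_DEF;

// Bind the structure to the (hypothetical) AddCustom operator so the
// framework can pass it from the host tiling function to the kernel.
REGISTER_TILING_DATA_CLASS(AddCustom, AddCustomTilingData)
}  // namespace optiling
```

The kernel side then reads these fields from the tiling pointer passed at launch.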

Operator Debugging API

Table 25 Operator debugging APIs

API

Function

DumpTensor

Dumps the content of specified tensors for operators developed based on operator projects.

printf

Implements the formatted output function in CPU- or NPU-side debugging for operators developed based on operator projects.

assert

Implements the assert function in CPU- or NPU-side debugging for operators developed based on operator projects.

DumpAccChkPoint

Dumps the content of specified tensors for operators developed based on operator projects. This API can be used to print tensors at a specified offset position.

Trap

Stops the kernel when a software exception occurs.

GmAlloc

Creates shared memory during verification of the CPU-side execution of the kernel function. That is, creates a shared file in the /tmp directory and returns a pointer to the file mapping.

ICPU_RUN_KF

Serves as the CPU debugging entry point and invokes the CPU operator program during verification of the CPU-side execution of the kernel function.

ICPU_SET_TILING_KEY

Specifies the tilingKey used for the current CPU debugging session. During debugging, only the branch of the operator kernel function corresponding to that tilingKey is executed.

GmFree

Frees the shared memory allocated by GmAlloc during verification of the CPU-side operation of the kernel function.

SetKernelMode

Sets the kernel mode to single AIV mode, single AIC mode, or MIX mode to enable CPU debugging of single AIV (vector) operators, single AIC (cube) operators, or MIX operators, respectively.

TRACE_START

Inserts a trace marker in any running phase of the operator when the CAModel is used for operator performance simulation, so that the pipeline diagrams of different instructions can be analyzed for further performance tuning.

Marks the start point. This API is used together with TRACE_STOP.

TRACE_STOP

Inserts a trace marker in any running phase of the operator when the CAModel is used for operator performance simulation, so that the pipeline diagrams of different instructions can be analyzed for further performance tuning.

Marks the end point. This API is used together with TRACE_START.

MetricsProfStart

Starts profile data collection. This API is used together with MetricsProfStop. When using msProf for on-board operator tuning, call MetricsProfStart and MetricsProfStop before and after the target code segment in the kernel to specify the scope of the code to be tuned.

MetricsProfStop

Stops profile data collection. This API is used together with MetricsProfStart. When using msProf for on-board operator tuning, call MetricsProfStart and MetricsProfStop before and after the target code segment in the kernel to specify the scope of the code to be tuned.
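Taken together, the CPU-side debugging APIs above (GmAlloc, SetKernelMode, ICPU_RUN_KF, GmFree) are typically used in a harness like the following. The kernel name add_custom and the buffer sizes are placeholders; the flow follows the CPU debugging examples in the CANN documentation and requires the CPU debug framework to build.

```cpp
#include "tikicpulib.h"  // CPU-side debug framework header (assumed name)

// Hypothetical vector-add kernel under test.
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z);

int main()
{
    size_t byteSize = 8 * 2048 * sizeof(uint16_t);  // placeholder size
    // GmAlloc backs each buffer with a shared file in /tmp.
    uint8_t* x = (uint8_t*)AscendC::GmAlloc(byteSize);
    uint8_t* y = (uint8_t*)AscendC::GmAlloc(byteSize);
    uint8_t* z = (uint8_t*)AscendC::GmAlloc(byteSize);

    // Single-vector-core (AIV) operator, debugged on 8 blocks.
    AscendC::SetKernelMode(KernelMode::AIV_MODE);
    ICPU_RUN_KF(add_custom, 8, x, y, z);  // CPU debugging entry point

    AscendC::GmFree((void*)x);
    AscendC::GmFree((void*)y);
    AscendC::GmFree((void*)z);
    return 0;
}
```

For operators with multiple tiling branches, ICPU_SET_TILING_KEY can be called before ICPU_RUN_KF to select the branch to debug.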