Ascend C APIs

Ascend C provides a set of class library APIs that you use together with standard C++ syntax for programming. The Ascend C class library APIs are classified into the following types:

  • Kernel APIs: implement the operator kernel function, and include:
    • Basic data structures: data structures used throughout the kernel APIs, such as GlobalTensor and LocalTensor.
    • Basic APIs: abstract the hardware and expose chip capabilities while ensuring completeness and compatibility. APIs marked as Instruction Set Architecture Special Interface (ISASI, hardware-architecture-specific APIs) are not guaranteed to be compatible across hardware versions.
    • High-level APIs: built on the basic APIs, implement common computing algorithms (such as the math library, Matmul, and Softmax) to improve programming and development efficiency. Compatibility is guaranteed.
  • Host APIs:
    • Tiling APIs: provide the tiling parameters required for kernel computation.
    • Ascend C operator prototype registration and management APIs: define and register the Ascend C operator prototype.
    • Tiling data structure registration APIs: define and register the TilingData structure of an Ascend C operator.
    • Platform information obtaining APIs: obtain hardware platform information, such as the number of cores, to support tiling computation in the host-side tiling function.
  • Operator debugging APIs: support operator debugging, including twin debugging and performance debugging.

Basic data structures and APIs are also required for Ascend C operator programming on the host; for details, see Basic Data Structures and APIs. After operator development, runtime APIs are required to call the operator; for details, see "AscendCL API Reference" in the CANN AscendCL Application Software Development Guide (C&C++).

Kernel API - Basic APIs

Table 1 Scalar computation APIs

| API | Function |
| --- | --- |
| ScalarGetCountOfValue | Obtains the number of 0s or 1s in the binary representation of a uint64_t value. |
| ScalarCountLeadingZero | Counts the leading 0s of a uint64_t value (the number of 0s from the most significant bit down to the first 1). |
| ScalarCast | Converts a scalar to a specified type. |
| CountBitsCntSameAsSignBit | Counts the consecutive bits equal to the sign bit, starting from the most significant bit, in the binary representation of a uint64_t value. |
| ScalarGetSFFValue | Obtains the position of the first 0 or 1 in the binary representation of a uint64_t value. |
| ToBfloat16 | Converts a float scalar to a bfloat16_t scalar. |
| ToFloat | Converts a bfloat16_t scalar to a float scalar. |
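
The bit-counting scalar APIs have straightforward semantics that can be illustrated in plain host C++ (this is our sketch of the described behavior, not Ascend C code; the helper names are ours, and whether the sign bit itself is counted by CountBitsCntSameAsSignBit is our assumption):

```cpp
#include <cstdint>

// Number of 1 bits (ScalarGetCountOfValue-style semantics when counting 1s).
inline int count_ones(uint64_t x) {
    int n = 0;
    while (x) { n += static_cast<int>(x & 1u); x >>= 1; }
    return n;
}

// Leading zeros: 0s from the most significant bit down to the first 1
// (ScalarCountLeadingZero-style semantics; returns 64 for x == 0).
inline int leading_zeros(uint64_t x) {
    int n = 0;
    for (int i = 63; i >= 0 && !((x >> i) & 1u); --i) ++n;
    return n;
}

// Consecutive bits equal to the sign bit, counted from the MSB
// (CountBitsCntSameAsSignBit-style semantics; we assume the sign bit counts).
inline int bits_same_as_sign(uint64_t x) {
    uint64_t sign = (x >> 63) & 1u;
    int n = 0;
    for (int i = 63; i >= 0 && ((x >> i) & 1u) == sign; --i) ++n;
    return n;
}
```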

Table 2 Vector computation APIs

| Category | API | Function |
| --- | --- | --- |
| One-Operand Instructions | Exp | Computes the natural exponential function element-wise. |
| | Ln | Computes the natural logarithm element-wise. |
| | Abs | Computes the absolute value element-wise. |
| | Reciprocal | Computes the reciprocal element-wise. |
| | Sqrt | Computes the square root element-wise. |
| | Rsqrt | Computes the reciprocal of the square root element-wise. |
| | Not | Performs bitwise NOT element-wise. |
| | Relu | Performs a ReLU operation element-wise. |
| Two-Operand Instructions | Add | Performs addition element-wise. |
| | Sub | Performs subtraction element-wise. |
| | Mul | Performs multiplication element-wise. |
| | Div | Performs division element-wise. |
| | Max | Computes the maximum element-wise. |
| | Min | Computes the minimum element-wise. |
| | And | Performs bitwise AND element-wise. |
| | Or | Performs bitwise OR element-wise. |
| | AddRelu | Adds inputs element-wise and takes the larger of the result and 0. |
| | AddReluCast | Adds inputs element-wise, takes the larger of the result and 0, and converts precision based on the data types of the source and destination tensors. |
| | AddDeqRelu | Adds inputs element-wise, applies Deq quantization to the sum, and then applies ReLU (takes the larger of the result and 0). |
| | SubRelu | Subtracts inputs element-wise and takes the larger of the result and 0. |
| | SubReluCast | Subtracts inputs element-wise, takes the larger of the result and 0, and converts precision based on the data types of the source and destination tensors. |
| | MulAddDst | Multiplies src0Local and src1Local element-wise, adds the product to dstLocal, and stores the result in dstLocal. |
| | MulCast | Multiplies inputs element-wise and then converts precision. |
| | FusedMulAdd | Multiplies src0Local and dstLocal element-wise, adds src1Local, and stores the result in dstLocal. |
| | FusedMulAddRelu | Multiplies src0Local and dstLocal element-wise, adds src1Local, takes the larger of the result and 0, and stores the result in dstLocal. |
| Two-Operand Scalar Instructions | Adds | Adds a scalar to each element of a vector. |
| | Muls | Multiplies each element of a vector by a scalar. |
| | Maxs | Compares each element of the vector source operand with a scalar and takes the maximum. |
| | Mins | Compares each element of the vector source operand with a scalar and takes the minimum. |
| | ShiftLeft | Logically shifts each element of the source operand left; the shift distance is given by the scalar argument. |
| | ShiftRight | Shifts each element of the source operand right; the shift distance is given by the scalar argument. |
| | LeakyRelu | Computes Leaky ReLU element-wise on the source operand. |
| Triple-Operand Scalar Instructions | Axpy | Multiplies each element of the source operand (srcLocal) by a scalar and adds the product to the corresponding element of the destination operand (dstLocal). |
| Comparison Instructions | Compare | Compares two tensors element-wise. If the comparison holds, the corresponding output bit is 1; otherwise it is 0. |
| | Compare (Result Stored in a Register) | Compares two tensors element-wise, setting the corresponding output bit to 1 if the comparison holds and to 0 otherwise. Use this API when the mask parameter is required; the result is stored in a register. |
| | CompareScalar | Compares each tensor element with a scalar. If the comparison holds, the corresponding output bit is 1; otherwise it is 0. |
| Selection Instructions | Select | Selects between source operands src0 and src1 according to the bits of selMask (the selection mask) to produce the destination operand dst: where a selMask bit is 1, src0 is selected; where it is 0, src1 is selected. |
| | GatherMask | Selects elements from the source operand into the destination operand according to a gather mask, which is either a built-in fixed pattern or the binary of user-provided tensor values. |
| Precision Conversion Instructions | Cast | Converts precision based on the data types of the source and destination tensors. |
| | CastDeq | Quantizes the input and converts its precision. |
| Reduction Instructions | ReduceMax | Obtains the maximum value of the input data and its index. |
| | ReduceMin | Obtains the minimum value of the input data and its index. |
| | ReduceSum | Sums all input data. |
| | WholeReduceMax | Computes the maximum value and its index within each repeat. |
| | WholeReduceMin | Computes the minimum value and its index within each repeat. |
| | WholeReduceSum | Sums all data in each repeat. |
| | BlockReduceMax | Computes the maximum of all elements in each data block. |
| | BlockReduceMin | Computes the minimum of all elements in each data block. |
| | BlockReduceSum | Sums all elements in each data block; source operands are added in a binary-tree pattern. |
| | PairReduceSum | Sums adjacent (odd, even) element pairs. |
| | RepeatReduceSum | Sums all data in each repeat. Unlike WholeReduceSum, it does not support the bitwise mask mode; WholeReduceSum is recommended for its more complete functionality. |
| Data Conversion | Transpose | Transposes 16 x 16 2D matrix data blocks and converts between [N, C, H, W] and [N, H, W, C]. |
| | TransDataTo5HD | Converts NCHW format to NC1HWC0 format; can also transpose a 2D matrix data block. |
| Data Padding | Duplicate | Replicates a variable or an immediate value and fills it into the vector. |
| | Brcb | Each time, extracts eight elements from the input tensor and fills eight data blocks (32 bytes each) in the result tensor, one element per data block. |
| | CreateVecIndex | Creates a vector of indices starting from firstValue. |
| Data Scatter/Data Gather | Gather | Gathers elements of the input tensor into the result tensor according to the provided offset-address tensor. |
| Mask Operations | SetMaskCount | Sets the mask to counter mode. In this mode you pass in the amount of data to compute directly, without tracking the iteration count or handling unaligned tail blocks; the Vector unit infers the actual number of iterations. |
| | SetMaskNorm | Sets the mask to normal mode (the default), in which you configure the number of iterations. |
| | SetVectorMask | Sets the mask used during vector computation. |
| | ResetMask | Restores the mask to its default value (all 1s), meaning every element participates in each iteration of the vector computation. |
| Quantization Settings | SetDeqScale | Sets the value of the DEQSCALE register. |
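
Ignoring tiling, masks, and hardware details, the element-wise and reduction semantics of the vector APIs match this plain host C++ sketch (our illustration, not Ascend C code):

```cpp
#include <algorithm>
#include <numeric>
#include <utility>
#include <vector>

// Element-wise semantics of the two-operand Add instruction.
std::vector<float> add(const std::vector<float>& a, const std::vector<float>& b) {
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + b[i];
    return out;
}

// Vector-scalar semantics of Adds: add one scalar to every element.
std::vector<float> adds(const std::vector<float>& a, float s) {
    std::vector<float> out(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) out[i] = a[i] + s;
    return out;
}

// Reduction semantics of ReduceSum: sum over all input data.
float reduce_sum(const std::vector<float>& a) {
    return std::accumulate(a.begin(), a.end(), 0.0f);
}

// Reduction semantics of ReduceMax: maximum value plus its index.
std::pair<float, std::size_t> reduce_max(const std::vector<float>& a) {
    auto it = std::max_element(a.begin(), a.end());
    return {*it, static_cast<std::size_t>(it - a.begin())};
}
```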

Table 3 Data movement APIs

| API | Function |
| --- | --- |
| DataCopy | Performs data movement, including common data movement, enhanced data movement, tiled data movement, and associated format conversion. |
| Copy | Moves data between VECIN, VECCALC, and VECOUT; supports the mask operation and the data-block interval operation. |

Table 4 Memory management and synchronization control APIs

| API | Function |
| --- | --- |
| TPipe | Globally allocates and manages resources such as memory. |
| GetTPipePtr | Obtains the pointer to the framework-managed global TPipe; with the pointer you can perform TPipe operations. |
| TBufPool | Manually manages and reuses Unified Buffer/L1 Buffer physical memory; mainly used when that memory is insufficient during multi-stage computation. |
| TQue | Performs EnQue and DeQue operations and implements inter-task communication and synchronization through queues. |
| TQueBind | Binds source and destination logical positions to determine where memory is allocated and to insert the corresponding synchronization events, handling memory allocation, management, and synchronization. |
| TBuf | Manages the memory occupied by temporary variables used during Ascend C programming. |
| InitSpmBuffer | Initializes the SPM buffer. |
| WriteSpmBuffer | Copies data that must be spilled and temporarily stored into the SPM buffer. |
| ReadSpmBuffer | Reads data from the SPM buffer back into local memory. |
| GetUserWorkspace | Obtains the user workspace pointer. |
| SetSysWorkSpace | Sets the system workspace pointer; the framework's communication mechanism uses the system workspace during fused-operator programming. |
| GetSysWorkSpacePtr | Obtains the system workspace pointer. |
| TQueSync | Provides synchronization control. |
| IBSet | Synchronizes AI Cores to avoid data hazards such as write-after-read, read-after-write, and write-after-write when different AI Cores operate on the same global memory block. |
| IBWait | Synchronizes AI Cores to avoid data hazards such as write-after-read, read-after-write, and write-after-write when different AI Cores operate on the same global memory block. |
| SyncAll | Synchronizes AI Cores to avoid data hazards such as write-after-read, read-after-write, and write-after-write when different AI Cores operate on the same global memory block. |
| InitDetermineComputeWorkspace | Initializes the value of the GM shared memory; WaitPreBlock and NotifyNextBlock can be called only after this initialization completes. |
| WaitPreBlock | Reads the value at the GM address to decide whether to keep waiting; once the value satisfies the current core's wait condition, the core proceeds to its next operation. |
| NotifyNextBlock | Writes to the GM address to notify the next core that the current core has finished and the next core may proceed. |
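
As a structural illustration of how TPipe, TQue, DataCopy, and the vector APIs fit together, the following is a hedged sketch modeled on the typical Ascend C CopyIn/Compute/CopyOut vector-kernel pattern. Treat it as pseudocode: buffer sizes, the tensor type, and the elided steps are placeholders, and it compiles only with the CANN toolchain.

```cpp
// Pseudocode sketch of the CopyIn -> Compute -> CopyOut pipeline (not buildable as-is).
class KernelAdd {
public:
    __aicore__ inline void Init(GM_ADDR x, GM_ADDR y, GM_ADDR z) {
        xGm.SetGlobalBuffer((__gm__ half*)x, TILE_LEN);
        yGm.SetGlobalBuffer((__gm__ half*)y, TILE_LEN);
        zGm.SetGlobalBuffer((__gm__ half*)z, TILE_LEN);
        pipe.InitBuffer(inQueueX, 1, TILE_LEN * sizeof(half));  // TPipe allocates queue memory
        pipe.InitBuffer(inQueueY, 1, TILE_LEN * sizeof(half));
        pipe.InitBuffer(outQueueZ, 1, TILE_LEN * sizeof(half));
    }
    __aicore__ inline void Process() {
        AscendC::LocalTensor<half> xLocal = inQueueX.AllocTensor<half>();
        AscendC::DataCopy(xLocal, xGm, TILE_LEN);               // GM -> VECIN
        inQueueX.EnQue(xLocal);                                 // TQue inserts the sync events
        // ... same CopyIn for y, then DeQue both, and:
        // AscendC::Add(zLocal, xLocal, yLocal, TILE_LEN);      // element-wise vector Add
        // ... EnQue zLocal, DeQue it, DataCopy back to GM, free tensors
    }
private:
    AscendC::TPipe pipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueX, inQueueY;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueZ;
    AscendC::GlobalTensor<half> xGm, yGm, zGm;
};
```

The point of the pattern is that the programmer never issues raw pipeline synchronization: enqueueing and dequeueing through TQue is what inserts the required events between the data-movement and compute pipelines.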

Table 5 Cache processing APIs

| API | Function |
| --- | --- |
| DataCachePreload | Preloads data into the data cache from the DDR address where the source address resides. |
| DataCacheCleanAndInvalid | Flushes the cache to ensure cache coherence. |

Table 6 System variable access API

| API | Function |
| --- | --- |
| GetBlockNum | Obtains the number of blocks configured for the current task; used for multi-core logic control in the code. |
| GetBlockIdx | Obtains the index of the current core; used for multi-core logic control and multi-core offset computation in the code. |
| GetDataBlockSizeInBytes | Obtains the data block size (in bytes) for the current chip version; from it you can compute API parameters such as repeatTimes, dataBlockStride, and repeatStride. |
| GetArchVersion | Obtains the version number of the current AI processor architecture. |
| GetTaskRation | Obtains the ratio of AICs to AIVs; applies to the separated architecture. |

Table 7 Atomic operation APIs

| API | Function |
| --- | --- |
| SetAtomicAdd | Sets whether atomic addition is performed when data is moved from VECOUT to GM, from L0C to GM, or from L1 to GM; the addition data type is selected via parameters. |
| SetAtomicType | Sets the atomic-operation data type via template parameters. |
| SetAtomicNone | Clears the atomic-operation state. |
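
Atomic addition here means that concurrent writes to the same GM address are combined rather than overwritten. The effect is analogous to accumulating into a shared slot with std::atomic in host C++ (our analogy, not Ascend C code):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Several writers accumulate into one shared slot. With a plain store the
// last writer would win; with an atomic add every contribution is kept --
// the behavior SetAtomicAdd enables for VECOUT -> GM transfers.
int atomic_accumulate(int writers, int value_per_writer) {
    std::atomic<int> slot{0};
    std::vector<std::thread> pool;
    for (int i = 0; i < writers; ++i)
        pool.emplace_back([&] { slot.fetch_add(value_per_writer); });
    for (auto& t : pool) t.join();
    return slot.load();
}
```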

Table 8 Kernel tiling APIs

| API | Function |
| --- | --- |
| GET_TILING_DATA | Obtains the tiling information passed to the operator's kernel entry function and fills it into the registered tiling structure. Implemented via macro expansion. If multiple TilingData structures are registered, this API returns the default registered structure. |
| GET_TILING_DATA_WITH_STRUCT | Obtains the tiling information for a named structure and fills it into that tiling structure. Implemented via macro expansion. |
| GET_TILING_DATA_MEMBER | Obtains a member variable of a tiling structure. |
| TILING_KEY_IS | Checks whether the tiling_key of the current kernel invocation equals a specific key, identifying the kernel branch where tiling_key == key. |
| REGISTER_TILING_DEFAULT | Registers the user-defined default TilingData structure (written in standard C++) on the kernel side. |
| REGISTER_TILING_FOR_TILINGKEY | Registers a custom TilingData structure matched to a TilingKey on the kernel side. You provide a logical expression in which the string TILING_KEY_VAR stands for the actual TilingKey and describes the range the TilingKey must satisfy. |
| KERNEL_TASK_TYPE_DEFAULT | Sets the global default kernel type, applying to all tiling keys. |
| KERNEL_TASK_TYPE | Sets the kernel type for a specific tiling key. |
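
Host-side tiling typically partitions the element count across cores and handles the unaligned tail; the arithmetic a tiling function computes looks like this minimal host C++ sketch (the structure fields and function names are ours, not the Ascend C tiling API):

```cpp
#include <cstdint>

// Illustrative tiling result: how many cores to use and how much each handles.
struct TilingData {
    uint32_t blockNum;  // cores actually used
    uint32_t tileLen;   // elements per full tile
    uint32_t tailLen;   // elements handled by the last core
};

// Split totalLen elements across up to maxCores cores in units of `align`
// elements (e.g. one 32-byte data block of half values = 16 elements).
TilingData ComputeTiling(uint32_t totalLen, uint32_t maxCores, uint32_t align) {
    TilingData t{};
    uint32_t blocks = (totalLen + align - 1) / align;     // aligned units
    t.blockNum = blocks < maxCores ? blocks : maxCores;
    uint32_t perCore = blocks / t.blockNum;
    t.tileLen = perCore * align;
    t.tailLen = totalLen - t.tileLen * (t.blockNum - 1);  // last core takes the rest
    return t;
}
```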

Table 9 ISASI APIs

| Category | API | Function |
| --- | --- | --- |
| Vector Computation | VectorPadding | Pads the source operand by data block according to padMode and padSide. |
| | BilinearInterpolation | Performs bilinear interpolation, including vertical iteration and horizontal iteration. |
| | GetCmpMask | Obtains the comparison result of the Compare (Result Stored in a Register) instruction. |
| | SetCmpMask | Sets the comparison register for Select APIs that do not specify the mask parameter. |
| | GetAccVal | Obtains the ReduceSum result (over the first n tensor elements). |
| | GetReduceMaxMinCount | Obtains the maximum/minimum values and their indexes when ReduceMax and ReduceMin are issued consecutively. |
| | ProposalConcat | Inserts consecutive elements into the corresponding positions of Region Proposals; each iteration inserts 16 consecutive elements into 16 Region Proposals. |
| | ProposalExtract | Extracts elements from the corresponding positions of Region Proposals and lays them out consecutively; each iteration extracts 16 elements from 16 Region Proposals. This is the inverse of ProposalConcat. |
| | RpSort16 | Sorts Region Proposals by score field in descending order; 16 Region Proposals are sorted per iteration. |
| | MrgSort4 | Merges up to four sorted Region Proposal lists into one, sorted in descending order of the score fields. |
| | Sort32 | Sorts up to 32 elements per iteration. |
| | MrgSort | Merges up to four sorted lists into one, sorted in descending order of the score fields. |
| | GetMrgSortResult | Obtains the number of region proposals in each queue processed by MrgSort or MrgSort4 and stores the counts into the four list arguments in order. |
| | Gatherb | Gathers the input tensor into the result tensor according to the provided offset-address tensor. |
| | Scatter | Distributes a contiguous input tensor into a new result tensor according to a destination-address offset tensor and a base offset. |
| Cube Computation | InitConstValue | Initializes a LocalTensor (TPosition A1, A2, B1, or B2) to a given value. |
| | LoadData | Provides the Load2D and Load3D data-loading functions. |
| | LoadDataWithTranspose | Loads 2D data with transposition from A1/B1 to A2/B2. |
| | SetAippFunctions | Sets AI preprocessing (AIPP) parameters for images. |
| | LoadImageToLocal | Moves image data from GM to A1/B1. During the move, images can be preprocessed: flipping, resizing (clipping, cropping, scaling, and stretching), color space conversion (CSC), and type conversion. |
| | LoadUnZipIndex | Loads the compression index table from GM into internal registers. |
| | LoadDataUnzip | Decompresses data on GM and moves it to A1, B1, or B2. |
| | LoadDataWithSparse | Moves the 512-byte dense weight matrix stored in B1 to B2 and reads the 128-byte index matrix used to sparsify the dense matrix. |
| | SetFmatrix | Sets the feature-map attribute description used when Load3Dv1/Load3Dv2 is called. |
| | SetLoadDataBoundary | Sets the A1/B1 boundary value used when Load3D is called. |
| | SetLoadDataRepeat | Sets the repeat parameter of Load3Dv2 so that one Load3Dv2 call completes the data movement of multiple iterations. |
| | SetLoadDataPaddingValue | Sets padValue for Load3Dv1/Load3Dv2. |
| | Mmad | Performs matrix multiply-accumulate. |
| | MmadWithSparse | Performs matrix multiply-accumulate where the input left matrix A is sparse and the input right matrix B is dense. |
| | Fixpipe | Post-processes the result of a matrix computation, for example quantizing it and moving the data from CO1 to global memory. |
| | SetFixPipeConfig | Sets the source operands for the ReLU and quant stages of Fixpipe (ReLU when FixpipeParams.reluEn is true; quant when FixpipeParams.QuantParams is set to a value other than NoQuant). |
| | SetFixpipeNz2ndFlag | Configures the nz2nd stage of Fixpipe (enabled when FixpipeParams.Nz2NdParams.nz2ndEn is true). |
| | SetFixpipePreQuantFlag | Sets the deq scalar (quantization parameter) for the quantization stage of Fixpipe. |
| | SetFixPipeClipRelu | Sets the upper bound of the ClipReLU operation performed after real-time quantization during DataCopy (CO1 -> GM). |
| | SetFixPipeAddr | Sets the LocalTensor address for the element-wise operation performed after real-time quantization during DataCopy (CO1 -> GM). |
| | SetHF32Mode | Sets the HF32 mode of the MMAD unit (a register setting, like SetHF32TransMode and SetMMLayoutTransform). |
| | SetHF32TransMode | Sets the HF32 rounding mode of the MMAD unit (a register setting); valid only while the HF32 mode of the MMAD unit is in effect. |
| | SetMMLayoutTransform | Sets the M/N direction of the MMAD unit (a register setting). |
| | CheckLocalMemoryIA | Monitors UB reads and writes within a specified range; if such an access is detected, an EXCEPTION error is reported, otherwise no error is reported. |
| | Conv2D | Performs 2D convolution on an input tensor and a weight tensor and outputs a result tensor. Convolution layers are widely used in image recognition, where filters extract image features. |
| | Gemm | Multiplies matrix A by matrix B to obtain and output matrix C. |
| Data Movement | DataCopyPad | Enables non-aligned data movement. |
| | SetPadValue | Sets the fill value used by DataCopyPad. |
| Synchronization Control | SetFlag/WaitFlag | Synchronizes different pipelines within one core; insert this operation between instructions of different pipelines that have a data dependency. |
| | PipeBarrier | Blocks a pipeline; insert this operation between instructions of the same pipeline that have a data dependency. |
| | DataSyncBarrier | Blocks subsequent instructions until all previous memory-access instructions complete (the memory location waited on is controlled by parameters). |
| | CrossCoreSetFlag | Sets synchronization between AICs and AIVs in the separated architecture. |
| | CrossCoreWaitFlag | Waits for synchronization between AICs and AIVs in the separated architecture. |
| Cache Processing | ICachePreLoad | Preloads instructions into the iCache from the DDR address where they reside. |
| | GetICachePreloadStatus | Obtains the iCache preload status. |
| System Variable Access | GetProgramCounter | Obtains the pointer to the program counter, which records the current execution position. |
| | GetSubBlockNum | Obtains the number of Vector Cores on the AI Core. |
| | GetSubBlockIdx | Obtains the ID of the Vector Core within the AI Core. |
| | GetSystemCycle | Obtains the current system cycle count. To convert cycles to time, use the 50 MHz frequency: time (μs) = cycles / 50. |
| Atomic Operations | SetAtomicMax | Sets whether subsequent VECOUT -> GM transfers perform an atomic compare that writes the maximum of the copied content and the existing GM content to GM. |
| | SetAtomicMin | Sets whether subsequent VECOUT -> GM transfers perform an atomic compare that writes the minimum of the copied content and the existing GM content to GM. |
| | SetStoreAtomicConfig | Sets the atomic-operation enable flag and type. |
| | GetStoreAtomicConfig | Obtains the atomic-operation enable flag and type. |
| Resource Management | CubeResGroupHandle | Coordinates AIC-AIV communication through software synchronization in the separated architecture to group the compute resources of AI Cores. |
| | GroupBarrier | Synchronizes two mutually dependent AIV tasks within the same CubeResGroupHandle object. |
| | KfcWorkspace | Describes the communication workspace and manages the division of the message communication area among CubeResGroupHandle objects; used together with CubeResGroupHandle. The KfcWorkspace constructor creates a KfcWorkspace object. |

Kernel API - High-Level APIs

Table 10 Math library APIs

| API | Function |
| --- | --- |
| Acos | Computes the arc cosine element-wise. |
| Acosh | Computes the inverse hyperbolic cosine element-wise. |
| Asin | Computes the arc sine element-wise. |
| Asinh | Computes the inverse hyperbolic sine element-wise. |
| Atan | Computes the arc tangent element-wise. |
| Atanh | Computes the inverse hyperbolic tangent element-wise. |
| Axpy | Adds the product of each element of the source operand and a scalar to the corresponding element of the destination operand. |
| Ceil | Returns the smallest integer greater than or equal to x, that is, rounds towards positive infinity. |
| ClampMax | Replaces numbers greater than scalar with scalar in srcTensor and keeps numbers less than or equal to scalar, writing the result to dstTensor. |
| ClampMin | Replaces numbers less than scalar with scalar in srcTensor and keeps numbers greater than or equal to scalar, writing the result to dstTensor. |
| Cos | Computes the cosine element-wise. |
| Cosh | Computes the hyperbolic cosine element-wise. |
| CumSum | Accumulates data by row or column. |
| Digamma | Computes the logarithmic derivative of the gamma function of x element-wise. |
| Erf | Computes the error function (Gaussian error function) element-wise. |
| Erfc | Computes the complementary error function of input x (the integral from x to infinity) element-wise. |
| Exp | Computes the natural exponential function element-wise. |
| Floor | Returns the largest integer less than or equal to x, that is, rounds towards negative infinity. |
| Fmod | Computes the remainder of two floating-point numbers element-wise. |
| Frac | Returns the fractional part element-wise. |
| Lgamma | Computes the natural logarithm of the absolute value of the gamma function of x element-wise. |
| Log | Computes the logarithm with base e, 2, or 10 element-wise. |
| Power | Computes exponentiation element-wise. |
| Round | Rounds each element to the nearest integer. |
| Sign | Performs the Sign operation element-wise, returning the sign of the input data. |
| Sin | Computes the sine element-wise. |
| Sinh | Computes the hyperbolic sine element-wise. |
| Tan | Computes the tangent element-wise. |
| Tanh | Computes the hyperbolic tangent element-wise. |
| Trunc | Truncates floating-point numbers element-wise, that is, rounds towards zero. |
| Xor | Performs the XOR operation element-wise. |

Table 11 Quantization and dequantization APIs

| API | Function |
| --- | --- |
| AscendAntiQuant | Performs fake quantization by element, for example converting the int8_t data type to the half type. |
| AscendDequant | Performs dequantization by element, for example dequantizing the int32_t data type to the half/float data type. |
| AscendQuant | Performs quantization by element, for example quantizing the half/float data type to the int8_t data type. |

Table 12 Data normalization APIs

| API | Function |
| --- | --- |
| BatchNorm | Normalizes each input feature of the samples in each batch along the batch dimension. |
| DeepNorm | Serves as a replacement for LayerNorm normalization when training deep neural networks. |
| GroupNorm | Divides the input C dimension into groupNum groups and standardizes each group of data. |
| LayerNorm | Normalizes the inputs of a network layer so that the input and output distributions are standardized across network layers. |
| LayerNormGrad | Computes the backpropagation gradient of LayerNorm. |
| LayerNormGradBeta | Computes the backward beta/gamma values; used together with LayerNormGrad to output pdx, gamma, and beta. |
| Normalize | Given the mean and variance known from LayerNorm, computes rstd (the reciprocal of the standard deviation) of input data with shape [A, R] and the normalized output y. |
| RmsNorm | Normalizes input data of shape [B, S, H] using RmsNorm. |
| WelfordUpdate | Implements the preprocessing (update) step of the Welford algorithm. |
| WelfordFinalize | Implements the postprocessing (finalize) step of the Welford algorithm. |

Table 13 Activation function APIs

| API | Function |
| --- | --- |
| AdjustSoftMaxRes | Post-processes SoftMax compute results, adjusting them to specified values. |
| FasterGelu | Implements a simplified FastGelu activation function. |
| FasterGeluV2 | Implements the FastGeluV2 activation function. |
| GeGLU | Serves as a GLU variant that uses GeLU as its activation function. |
| Gelu | Serves as an important activation function, inspired by ReLU and dropout, that introduces stochastic regularization into activation. |
| LogSoftMax | Performs LogSoftmax computation on the input tensor. |
| ReGlu | Serves as a GLU variant that uses ReLU as its activation function. |
| Sigmoid | Computes the Sigmoid function element-wise. |
| Silu | Computes Silu element-wise. |
| SimpleSoftMax | Uses precomputed sum and max data to perform softmax computation on the input tensor. |
| SoftMax | Performs softmax computation on input tensors by row. |
| SoftmaxFlash | Serves as an enhanced version of SoftMax that not only performs softmax on the input tensor but also updates the current result using the sum and max values from the previous softmax computation. |
| SoftmaxFlashV2 | Serves as an enhanced version of SoftmaxFlash, corresponding to the FlashAttention-2 algorithm. |
| SoftmaxGrad | Performs softmax gradient backpropagation on input tensors. |
| SoftmaxGradFront | Performs softmax gradient backpropagation on input tensors. |
| SwiGLU | Serves as a GLU variant that uses Swish as its activation function. |
| Swish | Serves as the Swish activation function in neural networks. |
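
SoftmaxFlash and SoftmaxFlashV2 implement the online (FlashAttention-style) softmax update: a running max and running sum are rescaled as each new block of scores arrives, so the final statistics equal those of a softmax over all the data at once. A minimal host C++ sketch of the idea (our illustration, not the Ascend C API):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Running softmax statistics over the blocks processed so far.
struct OnlineSoftmax {
    double maxVal = -INFINITY;
    double sumExp = 0.0;

    // Fold one new block of scores into the running (max, sum) pair,
    // rescaling the old sum when a new maximum appears.
    void update(const std::vector<double>& block) {
        double blockMax = maxVal;
        for (double x : block) blockMax = std::max(blockMax, x);
        double scale = std::exp(maxVal - blockMax);  // rescales the old sum
        double blockSum = 0.0;
        for (double x : block) blockSum += std::exp(x - blockMax);
        sumExp = sumExp * scale + blockSum;
        maxVal = blockMax;
    }

    // Softmax weight of one score, given the final statistics.
    double weight(double x) const { return std::exp(x - maxVal) / sumExp; }
};
```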

Table 14 Reduction APIs

| API | Function |
| --- | --- |
| Mean | Computes the mean of elements along the last axis. |
| ReduceXorSum | Performs the bitwise XOR operation by element and sums the results using ReduceSum. |
| Sum | Obtains the sum of elements in the last dimension. |

Table 15 Sorting APIs

| API | Function |
| --- | --- |
| TopK | Obtains the first k maximum or minimum values of the last dimension and their corresponding indexes. |
| Concat | Preprocesses data for sorting by merging the source operand srcLocal into the target data concatLocal; after preprocessing, the data can be sorted. |
| Extract | Processes the sorting result data and outputs the sorted values and indexes. |
| Sort | Sorts data in descending order by value. |
| MrgSort | Merges up to four sorted lists into one, sorted in descending order of the score fields. |

Table 16 Data padding APIs

| API | Function |
| --- | --- |
| BroadCast | Broadcasts the input based on the output shape. |
| Pad | Pads a two-dimensional tensor (height x width) to 32-byte alignment in the width direction. |
| UnPad | Unpads a two-dimensional tensor (height x width) in the width direction. |

Table 17 Data filtering APIs

| API | Function |
| --- | --- |
| DropOut | Filters the source operand using a mask tensor to obtain the destination operand. |

Table 18 Comparing and selecting APIs

| API | Function |
| --- | --- |
| SelectWithBytesMask | Given two source operands src0 and src1, selects elements according to the (non-bit) values at the corresponding positions of maskTensor to produce the destination operand dst. |

Table 19 Deformation APIs

| API | Function |
| --- | --- |
| ConfusionTranspose | Transposes and reshapes the input data. |

Table 20 Index operation APIs

| API | Function |
| --- | --- |
| ArithProgression | Generates an arithmetic progression from a start value, a common difference, and a length. |

Table 21 Matmul APIs

| API | Function |
| --- | --- |
| Matmul | Performs matrix multiplication. |
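
As a reference for what the Matmul high-level API computes (C = A x B), ignoring tiling, data layout, and the cube unit, a plain row-major host C++ sketch (our illustration):

```cpp
#include <cstddef>
#include <vector>

// Row-major reference matmul: C[i][j] = sum_p A[i][p] * B[p][j],
// with A of shape m x k, B of shape k x n, C of shape m x n.
std::vector<float> matmul(const std::vector<float>& a, const std::vector<float>& b,
                          int m, int k, int n) {
    std::vector<float> c(static_cast<std::size_t>(m) * n, 0.0f);
    for (int i = 0; i < m; ++i)
        for (int p = 0; p < k; ++p) {
            float av = a[static_cast<std::size_t>(i) * k + p];
            for (int j = 0; j < n; ++j)
                c[static_cast<std::size_t>(i) * n + j] += av * b[static_cast<std::size_t>(p) * n + j];
        }
    return c;
}
```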

Table 22 HCCL APIs

| API | Function |
| --- | --- |
| Hccl | Flexibly orchestrates collective communication tasks on the AI Core. |

Table 23 Tool APIs

| API | Function |
| --- | --- |
| InitGlobalMemory | Initializes data in the global memory to a specified value. |

Host API

Table 24 Host APIs

Category

API

Function

Prototype registration and management

Prototype Registration API (OP_ADD)

Registers the prototype definition of an operator.

OpDef

Defines the operator prototype.

OpParamDef

Defines operator parameters.

OpAttrDef

Defines operator attributes.

OpAICoreDef

Defines the AI Core implementation information and associates the tiling implementation and shape inference functions.

OpAICoreConfig

Configures AI Core information.

OpMC2Def

Configures the communicator name of the MC2 operator on the host. After configuration, the context address corresponding to the communicator can be obtained on the kernel side.

Tiling data structure registration

TilingData Structure Definition

Defines a TilingData class and adds the member variables (TilingData fields) needed to store the tiling parameters. The defined class inherits from TilingDef, the base class for storing and processing user-defined tiling structure member variables, which provides APIs for setting, serializing, and saving the TilingData fields.

TilingData Structure Registration

Registers the defined TilingData structure and binds it with a custom operator.

ContextBuilder

Provides a series of APIs for manually building a TilingContext object to verify tiling functions, and a KernelContext object to verify TilingParse functions.

Template Argument Definition

Defines the template argument declaration (ASCENDC_TPL_ARGS_DECL) and the template argument selection (ASCENDC_TPL_ARGS_SEL, which specifies the available template combinations).

GET_TPL_TILING_KEY

Automatically generates a TilingKey during tiling template programming. This API converts the passed template arguments into binary values based on the defined bit widths, concatenates the binary values in sequence, and converts the result into a uint64_t value, that is, the TilingKey.

Platform information acquisition

PlatformAscendC

Obtains hardware platform information, such as the number of cores, required for tiling computation when implementing the tiling function on the host. The PlatformAscendC class provides functions for obtaining such platform information.

PlatformAscendCManager

Obtains hardware platform information, such as the number of cores, when calling operators in the basic mode (kernel launch) in a kernel-launch-based operator project. The PlatformAscendCManager class provides functions for obtaining such platform information.
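The TilingData definition and registration rows above correspond to a small set of host-side macros. The sketch below assumes a hypothetical AddCustom operator; the macro names follow the CANN documentation, while the field names are illustrative.

```cpp
#include "register/tilingdata_base.h"

namespace optiling {
// Define the TilingData structure: each field becomes a serialized
// tiling parameter with generated getter/setter APIs (via TilingDef).
BEGIN_TILING_DATA_DEF(AddCustomTilingData)
    TILING_DATA_FIELD_DEF(uint32_t, totalLength);  // total element count
    TILING_DATA_FIELD_DEF(uint32_t, tileNum);      // tiles per core
END_TILING_DATA_DEF;

// Bind the structure to the (hypothetical) AddCustom operator so the
// framework can pass it from the host tiling function to the kernel.
REGISTER_TILING_DATA_CLASS(AddCustom, AddCustomTilingData)
}  // namespace optiling
```

The kernel side then reads these fields from the tiling pointer passed at launch.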

Operator Debugging API

Table 25 Operator debugging APIs

API

Function

DumpTensor

Dumps the content of specified tensors for operators developed based on operator projects.

printf

Implements the formatted output function in CPU- or NPU-side debugging for operators developed based on operator projects.

assert

Implements the assert function in CPU- or NPU-side debugging for operators developed based on operator projects.

DumpAccChkPoint

Dumps the content of specified tensors for operators developed based on operator projects. This API can be used to print tensors at a specified offset position.

Trap

Stops the kernel when a software exception occurs.

GmAlloc

Creates shared memory during verification of the CPU-side execution of the kernel function. That is, creates a shared file in the /tmp directory and returns a pointer to the file mapping.

ICPU_RUN_KF

Serves as the CPU debugging entry point and invokes the CPU operator program during verification of the CPU-side execution of the kernel function.

ICPU_SET_TILING_KEY

Specifies the tilingKey used for the current CPU debugging session. During debugging, only the branch of the operator kernel function corresponding to that tilingKey is executed.

GmFree

Frees the shared memory allocated by GmAlloc during verification of the CPU-side operation of the kernel function.

SetKernelMode

Sets the kernel mode to single AIV mode, single AIC mode, or MIX mode to enable CPU debugging of single AIV (vector) operators, single AIC (cube) operators, or MIX operators, respectively.

TRACE_START

Inserts a trace marker in any running phase of the operator when the CAModel is used for operator performance simulation, so that the pipeline diagrams of different instructions can be analyzed for further performance tuning.

Marks the start point. This API is used together with TRACE_STOP.

TRACE_STOP

Inserts a trace marker in any running phase of the operator when the CAModel is used for operator performance simulation, so that the pipeline diagrams of different instructions can be analyzed for further performance tuning.

Marks the end point. This API is used together with TRACE_START.

MetricsProfStart

Starts profile data collection. This API is used together with MetricsProfStop. When using msProf for on-board operator tuning, call MetricsProfStart and MetricsProfStop before and after the target code segment in the kernel to specify the scope of the code to be tuned.

MetricsProfStop

Stops profile data collection. This API is used together with MetricsProfStart. When using msProf for on-board operator tuning, call MetricsProfStart and MetricsProfStop before and after the target code segment in the kernel to specify the scope of the code to be tuned.
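Taken together, the CPU-side debugging APIs above (GmAlloc, SetKernelMode, ICPU_RUN_KF, GmFree) are typically used in a harness like the following. The kernel name add_custom and the buffer sizes are placeholders; the flow follows the CPU debugging examples in the CANN documentation and requires the CPU debug framework to build.

```cpp
#include "tikicpulib.h"  // CPU-side debug framework header (assumed name)

// Hypothetical vector-add kernel under test.
extern "C" __global__ __aicore__ void add_custom(GM_ADDR x, GM_ADDR y, GM_ADDR z);

int main()
{
    size_t byteSize = 8 * 2048 * sizeof(uint16_t);  // placeholder size
    // GmAlloc backs each buffer with a shared file in /tmp.
    uint8_t* x = (uint8_t*)AscendC::GmAlloc(byteSize);
    uint8_t* y = (uint8_t*)AscendC::GmAlloc(byteSize);
    uint8_t* z = (uint8_t*)AscendC::GmAlloc(byteSize);

    // Single-vector-core (AIV) operator, debugged on 8 blocks.
    AscendC::SetKernelMode(KernelMode::AIV_MODE);
    ICPU_RUN_KF(add_custom, 8, x, y, z);  // CPU debugging entry point

    AscendC::GmFree((void*)x);
    AscendC::GmFree((void*)y);
    AscendC::GmFree((void*)z);
    return 0;
}
```

For operators with multiple tiling branches, ICPU_SET_TILING_KEY can be called before ICPU_RUN_KF to select the branch to debug.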