Ascend C APIs

Ascend C provides a set of class library APIs that you use together with standard C++ syntax for programming. The class library APIs are classified into the following types:

Basic data structures: data structures used in kernel APIs, such as GlobalTensor and LocalTensor.
Basic APIs: abstract the hardware capabilities and expose chip capabilities to ensure completeness and compatibility. APIs marked as Instruction Set Architecture Special Interface (ISASI, hardware architecture-related APIs) do not guarantee compatibility across hardware versions.
High-level APIs: implement common computing algorithms on top of the basic APIs to improve development efficiency. High-level APIs include the math library, Matmul, Softmax, and others, and ensure compatibility.
Utils APIs (common auxiliary functions): provide common tool classes covering functions such as the standard library, platform information acquisition, runtime compilation, and log output, helping developers develop operators efficiently and optimize performance.

Basic Data Structure

**Table 1** Basic data structure list
API	Description
LocalTensor	LocalTensor is used to store data in the local memory of the AI Core. It supports the logical positions VECIN, VECOUT, VECCALC, A1, A2, B1, B2, CO1, and CO2.
GlobalTensor	GlobalTensor is used to store data in the Global Memory (external storage).
Coordinate	Coordinate is essentially a tuple that indicates the position of a tensor element in each dimension, that is, a coordinate value.
Layout	The Layout<Shape, Stride> data structure is a basic template class that describes the memory layout of multi-dimensional tensors. It maps the logical coordinate space to the one-dimensional memory address space based on the shape and stride information at compile time, providing basic support for complex tensor operations and hardware optimization.
TensorTrait	The TensorTrait data structure is a basic template class that describes tensor information, including the data type, logical position, and memory layout of the tensor.
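
The coordinate-to-address mapping described for Layout can be illustrated with a small host-side sketch. This is plain C++, not the Ascend C Layout class: the helper names `RowMajorStride` and `OffsetOf` are hypothetical, and the sketch assumes the usual convention that the linear offset is the dot product of the coordinate and the stride.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not an Ascend C API): builds a row-major stride
// for a given shape, so the last dimension is contiguous.
template <std::size_t Rank>
std::array<std::size_t, Rank> RowMajorStride(const std::array<std::size_t, Rank>& shape)
{
    std::array<std::size_t, Rank> stride{};
    std::size_t running = 1;
    for (std::size_t i = Rank; i-- > 0;) {
        stride[i] = running;
        running *= shape[i];
    }
    return stride;
}

// Hypothetical helper (not an Ascend C API): maps a logical coordinate to a
// one-dimensional offset as sum(coord[i] * stride[i]).
template <std::size_t Rank>
std::size_t OffsetOf(const std::array<std::size_t, Rank>& coord,
                     const std::array<std::size_t, Rank>& shape,
                     const std::array<std::size_t, Rank>& stride)
{
    std::size_t offset = 0;
    for (std::size_t i = 0; i < Rank; ++i) {
        assert(coord[i] < shape[i]);  // the coordinate must lie inside the shape
        offset += coord[i] * stride[i];
    }
    return offset;
}
```

For a shape of (2, 3, 4), the row-major stride is (12, 4, 1), so the coordinate (1, 0, 2) maps to offset 14. Passing a different stride, such as (1, 2, 6), maps the same coordinate to offset 13, which is how one shape can describe several memory layouts.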

Basic APIs

**Table 2** Scalar computation APIs
API	Function
ScalarGetCountOfValue	Obtains the number of 0s or 1s in a binary number of the uint64_t type.
ScalarCountLeadingZero	Computes the number of leading 0s of a uint64_t number (number of 0s from the most significant bit to the first 1 in the binary number).
ScalarCast	Converts the type of a scalar to a specified type.
CountBitsCntSameAsSignBit	Computes the number of consecutive bits that are the same as the sign bit from the most significant bit in the binary number of the uint64_t type.
ScalarGetSFFValue	Obtains the location where the first 0 or 1 appears in a binary number of the uint64_t type.
ToBfloat16	Converts scalar data of the float type to scalar data of the bfloat16_t type.
ToFloat	Converts scalar data of the bfloat16_t type to scalar data of the float type.
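
The bit-level semantics of the counting APIs above can be checked on the host with a portable reference model. The sketch below is plain C++ and does not call Ascend C. The function names are hypothetical. For the first-occurrence search, scanning from the least significant bit and returning -1 when nothing is found are assumptions, because the table does not state the scan direction or the not-found value.

```cpp
#include <cassert>
#include <cstdint>

// Reference model for "number of 0s or 1s in a uint64_t".
// `bit` selects which value to count (0 or 1).
inline int CountOfValue(std::uint64_t value, int bit)
{
    assert(bit == 0 || bit == 1);
    int ones = 0;
    for (std::uint64_t v = value; v != 0; v &= v - 1) {
        ++ones;  // each iteration clears the lowest set bit
    }
    return bit == 1 ? ones : 64 - ones;
}

// Reference model for "number of 0s from the most significant bit to the
// first 1". Returns 64 when value is 0.
inline int CountLeadingZero(std::uint64_t value)
{
    int count = 0;
    for (int pos = 63; pos >= 0 && ((value >> pos) & 1u) == 0; --pos) {
        ++count;
    }
    return count;
}

// Reference model for "location where the first 0 or 1 appears".
// ASSUMPTION: the scan starts at bit 0 (least significant bit) and the
// function returns -1 if the requested bit value never occurs.
inline int FirstPositionOf(std::uint64_t value, int bit)
{
    assert(bit == 0 || bit == 1);
    for (int pos = 0; pos < 64; ++pos) {
        if (static_cast<int>((value >> pos) & 1u) == bit) {
            return pos;
        }
    }
    return -1;
}
```

Such a model is mainly useful for generating expected values when verifying a kernel on the CPU side.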

**Table 3** Vector computation APIs
Category	API	Function
Basic arithmetic	Exp	Computes the natural exponent element-wise.
	Ln	Computes the natural logarithm element-wise.
	Abs	Computes the absolute value element-wise.
	Reciprocal	Computes the reciprocal element-wise.
	Sqrt	Computes the square root element-wise.
	Rsqrt	Computes the reciprocal of the square root element-wise.
	Relu	Performs a ReLU operation element-wise.
	Add	Performs addition element-wise.
	Sub	Performs subtraction element-wise.
	Mul	Performs multiplication element-wise.
	Div	Performs division element-wise.
	Max	Computes the maximum value element-wise.
	Min	Computes the minimum value element-wise.
	Adds	Performs addition between a scalar and a vector element-wise.
	Muls	Performs multiplication between a scalar and a vector element-wise.
	Maxs	Compares the vector source operand and a scalar element-wise and chooses the maximum.
	Mins	Compares the vector source operand and a scalar element-wise and chooses the minimum.
	LeakyRelu	Computes Leaky ReLU on the source operand element-wise.
Logic-based computation	Not	Performs a bitwise NOT operation element-wise.
	And	Performs a bitwise AND operation element-wise.
	Or	Performs a bitwise OR operation element-wise.
	ShiftLeft	Performs left shift on the source operand element-wise. The shift distance is determined by scalarValue.
	ShiftRight	Performs right shift on the source operand element-wise. The shift distance is determined by scalarValue.
Compound computation	Axpy	Adds the product of each element in the source operand and a scalar to the corresponding element in the destination operand.
	CastDeq	Quantizes the input and converts the precision.
	AddRelu	Adds inputs element-wise and chooses the larger between the result and 0.
	AddReluCast	Adds inputs element-wise and chooses the larger between the result and 0, and converts precision based on the data types of the source and destination operand tensors.
	AddDeqRelu	Adds inputs element-wise, performs Deq quantization on the result, and then performs ReLU calculation on the result (obtains the larger between the result and 0).
	SubRelu	Computes the difference element-wise and chooses the larger between the result and 0.
	SubReluCast	Computes the difference element-wise and chooses the larger between the result and 0, and converts precision based on the data types of the source and destination operand tensors.
	MulAddDst	Multiplies src0Local and src1Local element-wise, adds the product to dstLocal, and saves the final result to dstLocal.
	MulCast	Performs multiplication element-wise and converts precision based on the data types of the source and destination operand tensors.
	FusedMulAdd	Multiplies src0Local and dstLocal element-wise, adds src1Local, and saves the result to dstLocal.
	FusedMulAddRelu	Multiplies src0Local and dstLocal element-wise, adds src1Local, chooses the larger between the result and 0, and saves the final result to dstLocal.
Comparison and selection	Compare	Compares two tensors element by element. If the comparison result is true, the corresponding bit of the output result is 1. Otherwise, the bit is 0.
	Compare (Result Stored in a Register)	Compares two tensors element by element. If the comparison result is true, the corresponding bit of the output result is 1. Otherwise, the bit is 0. This interface can be used when the mask parameter is required. The result is stored in a register.
	CompareScalar	Compares each element of a tensor with a scalar. If the comparison result is true, the corresponding bit of the output result is 1. Otherwise, the bit is 0.
	Select	Selects the source operand src0 or src1 based on the bit value of selMask (mask used for selection) to obtain the destination operand dst. When the bit value of selMask is 1, src0 is selected. When the bit value of selMask is 0, src1 is selected.
	GatherMask	Selects elements from the source operand and writes them to the destination operand based on a gather mask (for data collection) that corresponds to either the binary of the built-in fixed mode or the binary of the user-defined input tensor values.
Precision Conversion Instructions	Cast	Converts precision based on the data types of the source and destination operand tensors.
Reduction computation	ReduceMax	Obtains the maximum value and its corresponding index position among the input data.
	ReduceMin	Obtains the minimum value and its corresponding index position among the input data.
	ReduceSum	Sums up all input data.
	WholeReduceMax	Computes the maximum value and index of all data in each repeat.
	WholeReduceMin	Computes the minimum value and index of all data in each repeat.
	WholeReduceSum	Sums all data in each repeat.
	BlockReduceMax	Computes the maximum value of all elements in each data block.
	BlockReduceMin	Computes the minimum value of all elements in each data block.
	BlockReduceSum	Sums up all elements in each data block. Source operands are added in binary tree mode.
	PairReduceSum	Sums two adjacent (odd and even) elements.
	RepeatReduceSum	Sums all data in each repeat. Compared with WholeReduceSum, it does not support the bitwise mask mode. You are advised to use WholeReduceSum with more comprehensive functions.
Data Conversion	Transpose	Transposes a 16 x 16 2D matrix data block, and converts between the [N,C,H,W] and [N,H,W,C] formats.
	TransDataTo5HD	Converts the NCHW format to the NC1HWC0 format. It can also be used to transpose a 2D matrix data block.
Data Padding	Duplicate	Copies a variable or an immediate multiple times and fills the vector with the copies.
	Brcb	Extracts eight elements from a given input tensor each time and fills them in eight data blocks (32 bytes) in the result tensor. Each element corresponds to a data block.
	CreateVecIndex	Creates the vector index with firstValue as the start value.
Data Scatter/Data Gather	Gather	Gathers given input tensors by element to the result tensor based on the offset address tensor provided.
Mask Operations	SetMaskCount	Sets mask to counter mode. In this mode, you do not need to perceive the number of iterations or process unaligned tail blocks. You can directly pass in the amount of data to be computed. The actual number of iterations is automatically inferred by the Vector Unit.
	SetMaskNorm	Sets mask to normal mode. This mode is the default mode. You can configure the number of iterations.
	SetVectorMask	Sets mask during Vector computation.
	ResetMask	Restores the mask value to the default (all 1s), indicating that all elements in each iteration participate in the Vector computation.
Quantization Settings	SetDeqScale	Sets the value of the DEQSCALE register.
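
The compound computation APIs differ mainly in which operand is multiplied and which is accumulated, which is easy to mix up. The following host-side sketch restates the descriptions of Axpy, MulAddDst, FusedMulAdd, and FusedMulAddRelu as plain C++ loops. It is a reference model only: the `...Ref` function names are hypothetical, it uses `std::vector<float>` instead of LocalTensor, and it ignores masks, repeats, and data types.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

using Vec = std::vector<float>;

// Axpy as described: dst[i] = dst[i] + src[i] * scalar
inline void AxpyRef(Vec& dst, const Vec& src, float scalar)
{
    assert(dst.size() == src.size());
    for (std::size_t i = 0; i < dst.size(); ++i) {
        dst[i] += src[i] * scalar;
    }
}

// MulAddDst as described: dst[i] = src0[i] * src1[i] + dst[i]
inline void MulAddDstRef(Vec& dst, const Vec& src0, const Vec& src1)
{
    assert(dst.size() == src0.size() && dst.size() == src1.size());
    for (std::size_t i = 0; i < dst.size(); ++i) {
        dst[i] = src0[i] * src1[i] + dst[i];
    }
}

// FusedMulAdd as described: dst[i] = src0[i] * dst[i] + src1[i]
inline void FusedMulAddRef(Vec& dst, const Vec& src0, const Vec& src1)
{
    assert(dst.size() == src0.size() && dst.size() == src1.size());
    for (std::size_t i = 0; i < dst.size(); ++i) {
        dst[i] = src0[i] * dst[i] + src1[i];
    }
}

// FusedMulAddRelu as described: dst[i] = max(src0[i] * dst[i] + src1[i], 0)
inline void FusedMulAddReluRef(Vec& dst, const Vec& src0, const Vec& src1)
{
    FusedMulAddRef(dst, src0, src1);
    for (float& v : dst) {
        v = std::max(v, 0.0f);
    }
}
```

With src0 = {1, 2}, src1 = {3, -10}, and dst = {5, 4}, MulAddDstRef yields {8, -16}, whereas FusedMulAddRef yields {8, -2}: in the first case dst is the addend, in the second it is a multiplicand.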

**Table 4** Data movement APIs
API	Function
DataCopy	Performs data movement, including common data movement, enhanced data movement, tiled data movement, and associated format conversion.
Copy	Moves data between VECIN, VECCALC, and VECOUT, and supports the mask operation and data block interval operation.

**Table 5** Resource management APIs
API	Function
TPipe	Allocates and manages resources such as memory, including the Global Memory.
GetTPipePtr	Obtains the TPipe pointer for the Global Memory managed by the framework. After obtaining the pointer, you can perform TPipe-related operations.
TBufPool	Manually manages or reuses the Unified Buffer/L1 Buffer physical memory. It is mainly used when the Unified Buffer/L1 Buffer physical memory is insufficient in multi-stage computing.
TQue	Performs EnQue and DeQue operations, and implements inter-task synchronization through queues.
TQueBind	Binds the source and destination logical locations to determine the memory allocation location and insert the corresponding synchronization event, solving problems such as memory allocation, management, and synchronization.
TBuf	Manages the memory occupied by some temporary variables used during Ascend C programming.
InitSpmBuffer	Initializes the SPM buffer.
WriteSpmBuffer	Copies data that must be spilled for temporary storage to the SPM buffer.
ReadSpmBuffer	Reads data from the SPM buffer back to local memory.
GetUserWorkspace	Obtains the workspace pointer used by the user.
SetSysWorkSpace	Sets the pointer to the system workspace, as the system workspace is used by the framework communication mechanism during fused operator programming.
GetSysWorkSpacePtr	Obtains the pointer to the system workspace.

**Table 6** Synchronization control APIs
API	Description
TQueSync	Provides synchronization control APIs that you can use to implement synchronization yourself.
IBSet	Synchronizes AI Cores to avoid data dependency problems such as write-after-read, read-after-write, and write-after-write when different AI Cores operate on the same global memory block. IBSet sets the flag bit of a core. IBSet and IBWait are used in pairs as an inter-core synchronization instruction that waits for the operation of a core to complete.
IBWait	Synchronizes AI Cores to avoid data dependency problems such as write-after-read, read-after-write, and write-after-write when different AI Cores operate on the same global memory block. IBWait waits for the flag bit set by IBSet. IBWait and IBSet are used in pairs as an inter-core synchronization instruction that waits for the operation of a core to complete.
SyncAll	Synchronizes AI Cores to avoid data dependency problems such as write-after-read, read-after-write, and write-after-write when different AI Cores operate on the same global memory block. Currently, multi-core synchronization is classified into hardware synchronization and software synchronization. Hardware synchronization uses the full-core synchronization instruction of the hardware to ensure multi-core synchronization. Software synchronization is implemented through software algorithm simulation.
InitDetermineComputeWorkspace	Initializes the value of the GM shared memory. WaitPreBlock and NotifyNextBlock can be called only after the initialization is complete.
WaitPreBlock	Reads the value in the GM address to determine whether to continue to wait. When the GM value meets the waiting condition of the current core, the core can proceed to the next operation.
NotifyNextBlock	Writes the GM address to notify the next core that the operation of the current core is completed and the next core can perform the operation.
SetNextTaskStart	Called in a sub-kernel of a SuperKernel. Instructions after the call can run in parallel with other sub-kernels, improving the overall performance.
WaitPreTaskEnd	Called in a sub-kernel of a SuperKernel. Instructions before the call can run in parallel with other sub-kernels, improving the overall performance.

**Table 7** Cache processing APIs
API	Function
DataCachePreload	Preloads data from the specific DDR address where the source address is located to the data cache.
DataCacheCleanAndInvalid	Refreshes the cache to ensure cache consistency.

**Table 8** System variable access APIs
API	Function
GetBlockNum	Obtains the number of blocks configured for the current task, which is used for multi-core logic control in the code.
GetBlockIdx	Obtains the index of the current core, which is used for multi-core logic control and multi-core offset computation in the code.
GetDataBlockSizeInBytes	Obtains the size (in bytes) of a data block for the current chip version. Based on this size, you can calculate the repeatTime, DataBlock Stride, and Repeat Stride parameters to pass to API instructions.
GetArchVersion	Obtains the version number of the current AI processor architecture.
InitSocState	Initializes the AI Core state. The AI Core has some global states, such as the atomic accumulation state and mask mode. At run time, these values may have been modified by previously executed operators, resulting in unexpected computation results. In the static tensor programming scenario, you must call this function at the kernel entry.
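
The "multi-core offset computation" mentioned for GetBlockNum and GetBlockIdx usually means deriving, for each core, which slice of the global data it should process. The sketch below shows one common way to do this on the host in plain C++. `BlockRange` and `RangeForBlock` are hypothetical names, and the even split with the remainder given to the lowest-indexed blocks is an assumed policy rather than something prescribed by Ascend C. In a kernel, the values returned by GetBlockNum and GetBlockIdx would take the place of `blockNum` and `blockIdx`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical result type: the slice of the global data owned by one block.
struct BlockRange {
    std::uint32_t offset;  // index of the first element for this block
    std::uint32_t length;  // number of elements for this block
};

// Splits totalLength elements across blockNum blocks as evenly as possible.
// ASSUMED policy: the first (totalLength % blockNum) blocks get one extra
// element each.
inline BlockRange RangeForBlock(std::uint32_t totalLength,
                                std::uint32_t blockNum,
                                std::uint32_t blockIdx)
{
    assert(blockNum > 0 && blockIdx < blockNum);
    const std::uint32_t base = totalLength / blockNum;
    const std::uint32_t remainder = totalLength % blockNum;
    const std::uint32_t length = base + (blockIdx < remainder ? 1u : 0u);
    const std::uint32_t extraBefore = blockIdx < remainder ? blockIdx : remainder;
    const std::uint32_t offset = blockIdx * base + extraBefore;
    return {offset, length};
}
```

For 10 elements on 4 blocks, this gives lengths 3, 3, 2, 2 at offsets 0, 3, 6, 8, so the ranges are contiguous and cover the data exactly once.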

**Table 9** Atomic operation APIs
API	Function
SetAtomicAdd	Sets whether to perform atomic addition for data transfer from VECOUT to GM, from L0C to GM, or from L1 to GM. The addition data type can be set based on different parameters.
SetAtomicType	Sets different atomic operation data types using template parameters.
SetAtomicNone	Clears the status of an atomic operation.

**Table 10** List of debugging APIs
API	Function
DumpTensor	Dumps the content of specified tensors for operators developed based on operator projects.
printf	Implements the formatted output function in CPU- or NPU-side debugging for operators developed based on operator projects.
ascendc_assert	ascendc_assert provides an API for implementing the assertion function in the CPU or NPU domain. When the assertion condition is not met, the system outputs the assertion information and prints it in a formatted manner on the screen.
assert	Implements the assert function in CPU/NPU for operators developed based on operator projects.
DumpAccChkPoint	Dumps the content of specified tensors for operators developed based on operator projects. This API can be used to print tensors at a specified offset position.
PrintTimeStamp
Trap	Stops the kernel when a software exception occurs.
GmAlloc	Creates shared memory during verification of the CPU-side operation of the kernel function. That is, creates a shared file in the /tmp directory and returns the mapping pointer to the file.
ICPU_RUN_KF	Functions as the CPU commissioning entry and completes calls to CPU operator programs during verification of the CPU-side operation of the kernel function.
ICPU_SET_TILING_KEY	Specifies tilingKey used for the current CPU debugging. During debugging, only the branch to which tilingKey corresponds in the operator kernel function is executed.
GmFree	Frees the shared memory allocated by GmAlloc during verification of the CPU-side operation of the kernel function.
SetKernelMode	Sets the kernel mode to the single AIV mode, single AIC mode, or MIX mode to enable CPU commissioning of single AIV (vector) operators, single AIC (cube) operators, or MIX operators, respectively.
TRACE_START	Marks the start point of a trace region. When the CAModel is used for operator performance simulation, trace points can be placed in any running phase of the operator to analyze the pipeline diagrams of different instructions for further performance tuning. This API is used together with TRACE_STOP.
TRACE_STOP	Marks the end point of a trace region. When the CAModel is used for operator performance simulation, trace points can be placed in any running phase of the operator to analyze the pipeline diagrams of different instructions for further performance tuning. This API is used together with TRACE_START.
MetricsProfStart	Starts the profile data collection. This API is used together with MetricsProfStop. When using msProf for operator on-board tuning, you can call MetricsProfStart and MetricsProfStop before and after the code segment on the kernel to specify the scope of the code segment to be tuned.
MetricsProfStop	Stops the profile data collection. This API is used together with MetricsProfStart. When using msProf for operator on-board tuning, you can call MetricsProfStart and MetricsProfStop before and after the code segment on the kernel to specify the scope of the code segment to be tuned.

**Table 11** Tool function APIs
API	Description
Async	Async provides a unified API for executing specific functions in different modes (AIC or AIV), thereby avoiding direct hardware condition judgment in code (such as using ASCEND_IS_AIV or ASCEND_IS_AIC).
GetTaskRatio	Obtains the Cube/Vector ratio, which is applicable to the Cube/Vector separation mode.

**Table 12** Kernel tiling APIs
API	Function
GET_TILING_DATA	Obtains the tiling information input by the kernel entry point function of the operator and fills the information in the registered tiling structure. This function is compiled in macro expansion mode. If a user has registered multiple TilingData structures, this API is used to return the default registered structure.
GET_TILING_DATA_WITH_STRUCT	Specifies a structure name to obtain the specified tiling information and fill the information in the corresponding tiling structure. This function is built in macro expansion mode.
GET_TILING_DATA_MEMBER	Obtains the member variables of a tiling structure.
TILING_KEY_IS	Checks whether the tiling_key in the current kernel function execution is equal to a specific key, so as to identify a kernel branch with tiling_key==key.
REGISTER_TILING_DEFAULT	Registers the default TilingData structure defined by the user using the standard C++ syntax on the kernel.
REGISTER_TILING_FOR_TILINGKEY	Registers a custom TilingData structure that matches the TilingKey on the kernel. You must provide a logical expression in which the string TILING_KEY_VAR stands for the actual TilingKey and which describes the range that the TilingKey must meet.
REGISTER_NONE_TILING	Notifies the framework that TilingData is defined using the standard C++ syntax without being registered. Use this API when the kernel uses TilingData structures customized using the standard C++ syntax and you are not sure which structures need to be registered. The corresponding TilingData is then obtained using GET_TILING_DATA_WITH_STRUCT, GET_TILING_DATA_MEMBER, or GET_TILING_DATA_PTR_WITH_STRUCT.
KERNEL_TASK_TYPE_DEFAULT	Sets the global default kernel type, which applies to all tiling keys.
KERNEL_TASK_TYPE	Sets the kernel type corresponding to a specific tiling key.

**Table 13** ISASI APIs
Category	API	Function
Vector Computation	VectorPadding	Performs the padding operation on the source operand by the data block based on padMode and padSide.
	BilinearInterpolation	Performs bilinear interpolation operations, including vertical iteration and horizontal iteration.
	GetCmpMask	Obtains the comparison result of the Compare (Result Stored in a Register) instruction.
	SetCmpMask	Sets the comparison register for the APIs where Select does not specify the mask parameter.
	GetAccVal	Obtains the computation result of ReduceSum (based on the first n data elements of a tensor).
	GetReduceMaxMinCount	Obtains the maximum/minimum values and the corresponding index values in the scenario where ReduceMax and ReduceMin are consecutive.
	ProposalConcat	Inserts consecutive elements into the corresponding positions in Region Proposals. In each iteration, 16 consecutive elements are inserted into the corresponding positions in 16 Region Proposals.
	ProposalExtract	Extracts elements from corresponding positions in Region Proposals and rearranges them. In each iteration, 16 elements are extracted from 16 Region Proposals and arranged consecutively. The functionality of this API is the opposite of that of ProposalConcat.
	RpSort16	Sorts the Region Proposals based on their score fields in descending order. 16 Region Proposals are sorted in each iteration.
	MrgSort4	Merges at most four sorted Region Proposal lists into one. The results are sorted in descending order of the score fields.
	Sort32	Serves as a sorting function that can sort a maximum of 32 elements in each iteration.
	MrgSort	Merges at most four sorted lists into one. The results are sorted in descending order of the score fields.
	GetMrgSortResult	Obtains the number of region proposals in the queue processed by MrgSort or MrgSort4 and stores the number in the four List arguments in sequence.
	Gatherb	Gathers a given input tensor to the result tensor based on the offset address tensor provided.
	Scatter	Generates a new result tensor based on a given continuous input tensor, a destination address offset tensor, and the offset address, and distributes the input tensor to the result tensor.
Data Movement	DataCopyPad	Supports movement of non-aligned data.
	SetPadValue	Sets the padding value used by DataCopyPad.
Cube Computation	Mmad	Performs matrix multiplication and addition.
	MmadWithSparse	Performs matrix multiplication and addition operations. The input left matrix A is a sparse matrix, and the input right matrix B is a dense matrix.
	SetHF32Mode	Sets register values, which is similar to SetHF32TransMode and SetMMLayoutTransform. SetHF32Mode is used to set the HF32 mode of the MMAD.
	SetHF32TransMode	Sets register values, which is similar to SetHF32Mode and SetMMLayoutTransform. SetHF32TransMode is used to set the HF32 rounding mode of the MMAD. It is valid only when the HF32 mode of the MMAD takes effect.
	SetMMLayoutTransform	Sets register values, which is similar to SetHF32Mode and SetHF32TransMode. SetMMLayoutTransform is used to set the M/N direction of the MMAD.
	Conv2D	Performs 2D convolution on a given input tensor and a weight tensor and outputs a result tensor. The Conv2D convolution layer is mostly used for image recognition, where filters extract features from an image.
	Gemm	Multiplies matrix A by matrix B and outputs the result matrix C.
	SetFixPipeConfig	Sets the tensor quantization parameters in the real-time quantization during DataCopy (CO1 -> GM or CO1 -> A1).
	SetFixpipeNz2ndFlag	Sets the NZ2ND configuration in the real-time format conversion (NZ2ND) during DataCopy (CO1 -> GM or CO1 -> A1).
	SetFixpipePreQuantFlag	Sets the scalar quantization parameters in the real-time quantization during DataCopy (CO1 -> GM or CO1 -> A1).
	SetFixPipeClipRelu	Sets the maximum value of the ClipReLU operation after real-time quantization is performed during DataCopy (CO1 -> GM).
	SetFixPipeAddr	Sets the address of LocalTensor during the element-wise operation after real-time quantization is performed during DataCopy (CO1 -> GM).
	InitConstValue	Initializes LocalTensor (TPosition: A1, A2, B1, or B2) to a specific value.
	LoadData	Provides the Load2D and Load3D data loading functions.
	LoadDataWithTranspose	Loads 2D data with transposing from A1/B1 to A2/B2.
	SetAippFunctions	Sets AI preprocessing (AIPP) parameters for images.
	LoadImageToLocal	Transfers image data from the GM to A1/B1. During the transfer, you can preprocess images, including image flipping, image resizing (clipping, cropping, scaling, and stretching), color space conversion (CSC), and type conversion.
	LoadUnZipIndex	Loads the compression index table on the GM to the internal register.
	LoadDataUnzip	Decompresses data on the GM and transfers it to A1, B1, or B2.
	LoadDataWithSparse	Moves the 512-byte dense weight matrix stored in B1 to B2, and reads the 128-byte index matrix for sparseness of the dense matrix.
	SetFmatrix	Sets the attribute description of the feature map when Load3Dv1/Load3Dv2 is called.
	SetLoadDataBoundary	Sets A1/B1 boundary value when Load3D is called.
	SetLoadDataRepeat	Sets the repeat parameter of the Load3Dv2 API. After the repeat parameter is set, the Load3Dv2 API can be called once to complete the data movement for multiple iterations.
	SetLoadDataPaddingValue	Sets padValue for Load3Dv1/Load3Dv2.
	Fixpipe	Processes the result after the matrix computation is complete. For example, the computation result is quantized and the data is moved from CO1 to the Global Memory.
Synchronization Control	SetFlag/WaitFlag	Synchronizes different pipelines in the same core. This synchronization operation needs to be inserted between different pipeline instructions with data dependency.
	PipeBarrier	Blocks a pipeline. This synchronization operation needs to be inserted between the same pipelines with data dependency.
	DataSyncBarrier	Blocks the execution of subsequent instructions until all previous memory access instructions (the memory location to be waited for can be controlled by parameters) are executed.
	CrossCoreSetFlag	Serves as the synchronization set instruction between the Cube Unit (AIC) and Vector Unit (AIV) on the AI Core in separated mode.
	CrossCoreWaitFlag	Serves as the synchronization wait instruction between the Cube Unit (AIC) and Vector Unit (AIV) on the AI Core in separated mode.
Cache Processing	ICachePreLoad	Preloads instructions to the iCache from the DDR address where the instructions are located.
	GetICachePreloadStatus	Obtains the PreLoad status of the iCache.
System Variable Access	GetProgramCounter	Obtains the pointer to the program counter, which is used to record the current program execution position.
	GetSubBlockNum	Obtains the number of Vector Cores on the AI Core.
	GetSubBlockIdx	Obtains the ID of the Vector Core on the AI Core.
	GetSystemCycle	Obtains the current system cycle count. To convert the cycle count to time (unit: μs), use a fixed frequency of 50 MHz: Time = (Number of cycles / 50) μs.
Atomic Operations	SetAtomicMax	Sets whether to perform atomic comparison for subsequent data transferred from VECOUT to GM, which compares the content to be copied with the existing content in GM and writes the maximum value to GM.
	SetAtomicMin	Sets whether to perform atomic comparison for subsequent data transferred from VECOUT to GM, which compares the content to be copied with the existing content in GM and writes the minimum value to GM.
	SetStoreAtomicConfig	Sets the atomic operation enabling flag and type.
	GetStoreAtomicConfig	Obtains the value of the enabling flag and type of the atomic operation.
Debug Ports	CheckLocalMemoryIA	Monitors UB read and write operations within a specified range. If a UB read or write operation within the specified range is detected, an EXCEPTION error is reported. Otherwise, no error is reported.
Cube group management	CubeResGroupHandle	CubeResGroupHandle is used to control communication between the AIC and AIV in separated mode through software synchronization, implementing AI Core computing resource grouping.
	GroupBarrier	Controls synchronization when two AIV tasks in the same CubeResGroupHandle object depend on each other.
	KfcWorkspace	Manages the message communication area division of different CubeResGroupHandle objects. It is a communication workspace descriptor and is used together with CubeResGroupHandle. The KfcWorkspace constructor is used to create a KfcWorkspace object.
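
The cycle-to-time conversion given for GetSystemCycle can be written as a small helper. The sketch below is plain host-side C++. `ElapsedMicroseconds` is a hypothetical name, and the two arguments stand for cycle counts sampled before and after a code segment.

```cpp
#include <cassert>
#include <cstdint>

// The table states that the conversion uses 50 MHz, that is,
// 50 cycles per microsecond.
constexpr double kCyclesPerMicrosecond = 50.0;

// Converts the difference between two cycle counts to microseconds:
// Time = (Number of cycles / 50) us.
inline double ElapsedMicroseconds(std::uint64_t startCycle, std::uint64_t endCycle)
{
    assert(endCycle >= startCycle);
    return static_cast<double>(endCycle - startCycle) / kCyclesPerMicrosecond;
}
```

For example, a difference of 5000 cycles corresponds to 100 μs.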

High-Level APIs

**Table 14** Mathematical computation APIs
API	Function
Acos	Computes arc cosine element-wise.
Acosh	Computes inverse hyperbolic cosine element-wise.
Asin	Computes arc sine element-wise.
Asinh	Computes inverse hyperbolic sine element-wise.
Atan	Computes arc tangent element-wise.
Atanh	Computes inverse hyperbolic tangent element-wise.
Axpy	Adds the product of each element of the source operand and a scalar to the corresponding element in the destination operand.
Ceil	Obtains the minimum integer value greater than or equal to x, that is, rounding towards positive infinity.
ClampMax	Replaces each element of srcTensor that is greater than scalar with scalar, keeps elements that are less than or equal to scalar, and outputs the result to dstTensor.
ClampMin	Replaces each element of srcTensor that is less than scalar with scalar, keeps elements that are greater than or equal to scalar, and outputs the result to dstTensor.
Cos	Computes cosine element-wise.
Cosh	Computes hyperbolic cosine element-wise.
CumSum	Accumulates data by row or column.
Digamma	Computes the logarithmic derivative of the gamma function of x element-wise.
Erf	Computes the error function (also called the Gaussian error function) element-wise.
Erfc	Returns the complementary error function computing result of input x. The integral ranges from x to infinity.
Exp	Computes the natural exponent element-wise.
Floor	Obtains the maximum integer value less than or equal to x, that is, rounding towards negative infinity.
Fmod	Computes the remainder of two floating-point numbers element-wise.
Frac	Returns the fractional part element-wise.
Lgamma	Computes the natural logarithm of the absolute value of the gamma function of x element-wise.
Log	Computes the logarithm to base e, 2, or 10 element-wise.
Power	Computes exponentiation element-wise.
Round	Rounds the input element to the nearest integer.
Sign	Performs the Sign operation element-wise. Sign returns the sign of the input data.
Sin	Computes sine element-wise.
Sinh	Computes hyperbolic sine element-wise.
Tan	Computes tangent element-wise.
Tanh	Computes hyperbolic tangent element-wise.
Trunc	Truncates floating point numbers element-wise, that is, rounding towards zero.
Xor	Performs the XOR operation element-wise.
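
The rounding and clamping descriptions above are easiest to tell apart on negative inputs. The following host-side sketch uses the C++ standard library as a reference model. It does not call the Ascend C math library: `ClampMaxRef`, `ClampMinRef`, and `RoundAll` are hypothetical names, and `std::vector<float>` stands in for the source and destination tensors.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// ClampMax as described: values greater than scalar become scalar.
inline std::vector<float> ClampMaxRef(const std::vector<float>& src, float scalar)
{
    assert(!std::isnan(scalar));
    std::vector<float> dst(src.size());
    std::transform(src.begin(), src.end(), dst.begin(),
                   [scalar](float x) { return std::min(x, scalar); });
    return dst;
}

// ClampMin as described: values less than scalar become scalar.
inline std::vector<float> ClampMinRef(const std::vector<float>& src, float scalar)
{
    assert(!std::isnan(scalar));
    std::vector<float> dst(src.size());
    std::transform(src.begin(), src.end(), dst.begin(),
                   [scalar](float x) { return std::max(x, scalar); });
    return dst;
}

// The three rounding directions named in the table, side by side.
struct Rounded {
    float up;          // Ceil: towards positive infinity
    float down;        // Floor: towards negative infinity
    float towardZero;  // Trunc: towards zero
};

inline Rounded RoundAll(float x)
{
    return {std::ceil(x), std::floor(x), std::trunc(x)};
}
```

For x = -1.5, the three directions give -1, -2, and -1 respectively, while for x = 1.5 they give 2, 1, and 1: Trunc matches Ceil for negative inputs and Floor for positive inputs.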

**Table 15** Quantization operation APIs
API	Function
AscendAntiQuant	Performs fake quantization by element. For example, apply fake quantization to convert the int8_t data type to the half type.
AscendDequant	Performs dequantization by element. For example, dequantize the int32_t data type to the half/float data type.
AscendQuant	Performs quantization by element. For example, quantize the half/float data type to the int8_t data type.
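
To make the direction of each conversion concrete, the sketch below shows a generic affine quantization scheme in plain C++. It is not the exact formula of AscendQuant or AscendDequant: the scale and offset parameters, the round-to-nearest step, and the saturation to the int8_t range are all assumptions chosen for illustration, and `QuantizeRef` and `DequantizeRef` are hypothetical names.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Generic affine quantization, float -> int8_t.
// ASSUMPTIONS: q = round(x * scale + offset), saturated to [-128, 127].
inline std::int8_t QuantizeRef(float x, float scale, float offset)
{
    assert(std::isfinite(x) && std::isfinite(scale) && std::isfinite(offset));
    const float rounded = std::nearbyint(x * scale + offset);
    const float clamped = std::min(127.0f, std::max(-128.0f, rounded));
    return static_cast<std::int8_t>(clamped);
}

// Generic dequantization, int32_t accumulator -> float.
// ASSUMPTION: a single multiplicative scale, y = acc * deqScale.
inline float DequantizeRef(std::int32_t acc, float deqScale)
{
    return static_cast<float>(acc) * deqScale;
}
```

With scale = 10 and offset = 0, the input 1.0 quantizes to 10, and inputs of 100.0 and -100.0 saturate to 127 and -128.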

**Table 16** Normalization operation APIs
API	Function
BatchNorm	Normalizes each input feature of samples in each batch along the batch dimension.
DeepNorm	Serves as a replacement for LayerNorm normalization during the training process of a deep neural network.
GroupNorm	Divides the input C dimension into groups (groupNum) and standardizes each group of data.
LayerNorm	Normalizes the input data of network layers to zero mean and unit variance to standardize the distributions of both input and output data across network layers.
LayerNormGrad	Computes the backpropagation gradient of LayerNorm.
LayerNormGradBeta	Obtains the reverse beta/gamma value and outputs pdx, gamma, and beta when used in conjunction with LayerNormGrad.
Normalize	Computes the reciprocal rstd of the standard deviation of the input data with shape [A, R] and the normalized output y based on the known mean value and variance in LayerNorm.
RmsNorm	Normalizes input data whose shape is [B, S, H] using RmsNorm.
WelfordUpdate	Implements preprocessing of the Welford algorithm.
WelfordFinalize	Implements postprocessing of the Welford algorithm.
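
The Welford algorithm referenced by WelfordUpdate and WelfordFinalize computes mean and variance in a single streaming pass. The following is a minimal textbook sketch in standard C++, not the Ascend C implementation: the `Ref*` names and the per-element update granularity are assumptions. `RefRstd` shows the reciprocal standard deviation mentioned in the Normalize row.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Host-side reference only. Names are hypothetical, not Ascend C APIs.
struct WelfordState {
    std::size_t count = 0;
    double mean = 0.0;
    double m2 = 0.0;  // running sum of squared deviations from the mean
};

// Update step: fold one new sample into the running statistics.
void RefWelfordUpdate(WelfordState& s, double x) {
    ++s.count;
    const double delta = x - s.mean;
    s.mean += delta / static_cast<double>(s.count);
    s.m2 += delta * (x - s.mean);  // uses the already updated mean
}

// Finalize step: turn the accumulated state into a population variance.
double RefWelfordVariance(const WelfordState& s) {
    assert(s.count > 0);
    return s.m2 / static_cast<double>(s.count);
}

// Reciprocal standard deviation: rstd = 1 / sqrt(variance + epsilon).
double RefRstd(double variance, double epsilon) {
    assert(variance + epsilon > 0.0);
    return 1.0 / std::sqrt(variance + epsilon);
}
```

Splitting the work into an update step and a finalize step lets the statistics be accumulated tile by tile before the variance is produced once at the end.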

**Table 17** Activation function APIs
API	Function
AdjustSoftMaxRes	Performs postprocessing on SoftMax compute results and adjusts SoftMax compute results to specified values.
FasterGelu	Implements an activation function of the simplified FastGelu version.
FasterGeluV2	Implements an activation function of the FastGeluV2 version.
GeGLU	Serves as a GLU variant that uses GeLU as the activation function.
Gelu	Serves as an important activation function that is inspired by ReLU and dropout. It introduces the idea of stochastic regularization into the activation.
LogSoftMax	Performs LogSoftmax computation on the input tensor.
ReGlu	Serves as a GLU variant that uses ReLU as the activation function.
Sigmoid	Computes the Sigmoid (logistic) function element-wise.
Silu	Computes Silu element-wise.
SimpleSoftMax	Uses the computed sum and max data to perform softmax computation on the input tensor.
SoftMax	Performs softmax computation on input tensors by row.
SoftmaxFlash	Serves as the enhanced version of SoftMax, which not only performs softmax computation on the input tensor, but also updates the result of the current softmax computation based on the sum and max values obtained in the previous softmax computation.
SoftmaxFlashV2	Serves as the enhanced version of SoftmaxFlash, corresponding to the FlashAttention-2 algorithm.
SoftmaxFlashV3	Serves as the enhanced version of SoftmaxFlash, corresponding to the Softmax PASA algorithm.
SoftmaxGrad	Performs gradient backpropagation on input tensors.
SoftmaxGradFront	Performs gradient backpropagation on input tensors.
SwiGLU	Serves as a GLU variant that uses Swish as the activation function.
Swish	Serves as a Swish activation function in neural networks.
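
To make the relationship between row-wise softmax and the "update with previous sum and max" variants concrete, here is a small host-side sketch in standard C++. It shows a numerically stable softmax over one row and the usual online rescaling of a running max and sum when a new chunk arrives. The `Ref*` names and the chunking scheme are assumptions for illustration; this is not the Ascend C implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Host-side reference only. Names are hypothetical, not Ascend C APIs.

// Numerically stable softmax over a single row.
std::vector<float> RefSoftmaxRow(const std::vector<float>& row) {
    assert(!row.empty());
    const float maxVal = *std::max_element(row.begin(), row.end());
    std::vector<float> out(row.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < row.size(); ++i) {
        out[i] = std::exp(row[i] - maxVal);
        sum += out[i];
    }
    for (float& v : out) {
        v /= sum;
    }
    return out;
}

// Running statistics carried from one chunk of a row to the next.
struct RunningSoftmax {
    float max;
    float sum;  // sum of exp(x - max) over everything seen so far
};

RunningSoftmax RefInitialState() {
    return {-std::numeric_limits<float>::infinity(), 0.0f};
}

// Online update: rescale the previous sum to the new max, then add the chunk.
RunningSoftmax RefMergeChunk(RunningSoftmax prev, const std::vector<float>& chunk) {
    assert(!chunk.empty());
    const float chunkMax = *std::max_element(chunk.begin(), chunk.end());
    const float newMax = std::max(prev.max, chunkMax);
    float sum = prev.sum * std::exp(prev.max - newMax);
    for (float v : chunk) {
        sum += std::exp(v - newMax);
    }
    return {newMax, sum};
}
```

Processing a row in chunks with `RefMergeChunk` yields the same max and sum as a single pass over the whole row, which is why the previous sum and max are sufficient state to carry forward.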

**Table 18** Reduction APIs
API	Function
Sum	Obtains the sum of elements in the last dimension.
Mean	Computes the mean of elements according to the direction of the last axis.
ReduceXorSum	Performs the XOR (bitwise XOR) operation by element and computes the sum of the results using ReduceSum.
ReduceSum	Accumulates data of a multi-dimensional vector based on a specified dimension.
ReduceMean	Calculates the average value of a multi-dimensional vector along a specified dimension.
ReduceMax	Returns the maximum value of a multi-dimensional vector in a specified dimension.
ReduceMin	Returns the minimum value of a multi-dimensional vector in a specified dimension.
ReduceAny	Calculates the logical OR of a multi-dimensional vector along a specified dimension.
ReduceAll	Calculates the logical AND of a multi-dimensional vector along a specified dimension.
ReduceProd	Calculates the product of a multi-dimensional vector along a specified dimension.
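
The "along a specified dimension" wording can be checked against a small reference on a row-major two-dimensional array. The sketch below is standard C++ for host-side comparison only; the `Ref*` names, the restriction to two dimensions, and the accumulator type used for the XOR sum are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side reference only. Names are hypothetical, not Ascend C APIs.

// Sums a row-major [rows, cols] array along `axis`.
// axis == 0 collapses rows (result has `cols` elements);
// axis == 1 collapses columns (result has `rows` elements).
std::vector<float> RefReduceSum(const std::vector<float>& data,
                                std::size_t rows, std::size_t cols, int axis) {
    assert(data.size() == rows * cols);
    assert(axis == 0 || axis == 1);
    std::vector<float> out(axis == 0 ? cols : rows, 0.0f);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            out[axis == 0 ? c : r] += data[r * cols + c];
        }
    }
    return out;
}

// Mean along `axis`: the sum divided by the length of the reduced dimension.
std::vector<float> RefReduceMean(const std::vector<float>& data,
                                 std::size_t rows, std::size_t cols, int axis) {
    std::vector<float> out = RefReduceSum(data, rows, cols, axis);
    const float n = static_cast<float>(axis == 0 ? rows : cols);
    for (float& v : out) {
        v /= n;
    }
    return out;
}

// XOR two inputs element by element, then sum the results.
std::int32_t RefReduceXorSum(const std::vector<std::int16_t>& a,
                             const std::vector<std::int16_t>& b) {
    assert(a.size() == b.size());
    std::int32_t sum = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += static_cast<std::int16_t>(a[i] ^ b[i]);
    }
    return sum;
}
```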

**Table 19** Sorting operation APIs
API	Function
TopK	Obtains the k largest or smallest values of the last dimension and their corresponding indexes.
Concat	Preprocesses the data and merges the source operand srcLocal to be sorted into the target data concatLocal. After the data is preprocessed, you can sort the data.
Extract	Processes the sorting result data and outputs the sorted values and indexes.
Sort	Sorts data in descending order by value.
MrgSort	Merges at most four sorted lists into one. The results are sorted in descending order of the score fields.
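
As a point of comparison for TopK results, a host-side reference can be written with the standard library. The sketch below returns value and index pairs for one row; the `RefTopK` name and the tie-breaking rule (lower index first) are assumptions, and the Ascend C API may order ties differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Host-side reference only. The name and tie-breaking rule are assumptions.
// Returns (value, index) pairs for the k largest (or smallest) elements,
// ordered from best to worst.
std::vector<std::pair<float, std::size_t>> RefTopK(const std::vector<float>& row,
                                                   std::size_t k, bool largest) {
    assert(k <= row.size());
    std::vector<std::size_t> idx(row.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    auto better = [&row, largest](std::size_t a, std::size_t b) {
        if (row[a] != row[b]) {
            return largest ? row[a] > row[b] : row[a] < row[b];
        }
        return a < b;  // assumed tie-break: lower index first
    };
    std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(k),
                      idx.end(), better);
    std::vector<std::pair<float, std::size_t>> out;
    out.reserve(k);
    for (std::size_t i = 0; i < k; ++i) {
        out.emplace_back(row[idx[i]], idx[i]);
    }
    return out;
}
```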

**Table 20** Data filtering APIs
API	Function
Select	Given two source operands src0 and src1, selects elements based on the values (non-bit) of corresponding positions of maskTensor to obtain the destination operand dst.
DropOut	Provides the function of filtering the source operand based on the mask tensor to obtain the destination operand.
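
A per-element view of the two filtering operations is sketched below in standard C++. One mask entry per element is assumed (the "non-bit" case), and the `RefDropOut` formula, in which kept elements are divided by a keep probability and dropped elements become zero, is an assumption for illustration rather than a statement of the API's exact behavior. The `Ref*` names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side reference only. Names and the DropOut formula are assumptions.

// Select: dst[i] = mask[i] ? src0[i] : src1[i], with one mask value per element.
std::vector<float> RefSelect(const std::vector<std::uint8_t>& mask,
                             const std::vector<float>& src0,
                             const std::vector<float>& src1) {
    assert(mask.size() == src0.size() && mask.size() == src1.size());
    std::vector<float> dst(mask.size());
    for (std::size_t i = 0; i < mask.size(); ++i) {
        dst[i] = mask[i] != 0 ? src0[i] : src1[i];
    }
    return dst;
}

// DropOut: kept elements are rescaled by 1 / keepProb, dropped elements are zero.
std::vector<float> RefDropOut(const std::vector<std::uint8_t>& mask,
                              const std::vector<float>& src, float keepProb) {
    assert(mask.size() == src.size());
    assert(keepProb > 0.0f && keepProb <= 1.0f);
    std::vector<float> dst(src.size());
    for (std::size_t i = 0; i < src.size(); ++i) {
        dst[i] = mask[i] != 0 ? src[i] / keepProb : 0.0f;
    }
    return dst;
}
```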

**Table 21** Tensor transformation APIs
API	Function
Transpose	Performs data layout conversion and reshape operations on the input data.
TransData	Converts the layout format of the input data to the target layout format.
Broadcast	Broadcasts the input based on the output shape.
Pad	Pads the two-dimensional tensor (height x width) to 32-byte alignment in the width direction.
UnPad	Unpads a two-dimensional tensor (height x width) in the width direction.
Fill	Initializes data in the global memory to a specified value.
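
The 32-byte alignment in the Pad row translates into a padded width that depends on the element size. The sketch below computes that width and pads a row-major tensor on the host in standard C++. Padding only on the right-hand side and the `Ref*` names are assumptions for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference only. Names are hypothetical, not Ascend C APIs.

// Smallest width (in elements) >= `width` whose row size is a multiple of 32 bytes.
std::size_t RefPaddedWidth(std::size_t width, std::size_t elemBytes) {
    constexpr std::size_t kAlignBytes = 32;
    assert(elemBytes > 0 && kAlignBytes % elemBytes == 0);
    const std::size_t elemsPerBlock = kAlignBytes / elemBytes;
    return (width + elemsPerBlock - 1) / elemsPerBlock * elemsPerBlock;
}

// Pads each row of a row-major [height, width] float tensor on the right
// with `padValue` (assumption: right-side padding only).
std::vector<float> RefPadWidth(const std::vector<float>& src, std::size_t height,
                               std::size_t width, float padValue) {
    assert(src.size() == height * width);
    const std::size_t padded = RefPaddedWidth(width, sizeof(float));
    std::vector<float> dst(height * padded, padValue);
    for (std::size_t r = 0; r < height; ++r) {
        for (std::size_t c = 0; c < width; ++c) {
            dst[r * padded + c] = src[r * width + c];
        }
    }
    return dst;
}
```

With 4-byte elements a 32-byte block holds 8 elements, so a width of 10 is padded to 16; with 2-byte elements a block holds 16 elements, so the same width is also padded to 16.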

**Table 22** Index calculation APIs
API	Function
Arange	Returns an arithmetic progression given the start value, common difference, and length.
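
The sequence produced from a start value, a step between consecutive elements, and a length can be written out as a host-side reference in standard C++. The `RefArange` name and parameter names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side reference only. The name and parameters are hypothetical.
// Element i of the result is firstValue + i * diffValue.
std::vector<float> RefArange(float firstValue, float diffValue, std::int32_t count) {
    assert(count >= 0);
    std::vector<float> out(static_cast<std::size_t>(count));
    for (std::size_t i = 0; i < out.size(); ++i) {
        out[i] = firstValue + diffValue * static_cast<float>(i);
    }
    return out;
}
```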

**Table 23** Matrix calculation APIs
API	Function
Matmul	Performs matrix multiplications.

**Table 24** HCCL communication APIs
API	Function
HCCL communication	Orchestrates collective communication tasks on the AI Core.

**Table 25** Convolution calculation APIs
API	Description
Conv3D	Performs forward 3D convolution matrix operation.
Conv3DBackpropInput	Performs the backward convolution operation to calculate the backpropagation error of the feature matrix.
Conv3DBackpropFilter	Performs the backward convolution operation to calculate the backpropagation error of the weight.

Utils API

**Table 26** C++ standard library APIs
API	Description
max	Compares two operands of the same data type and returns the larger value.
min	Compares two operands of the same data type and returns the smaller value.
integer_sequence	Generates an integer sequence.
tuple	Stores multiple elements of different types as a container.
get	Extracts elements from the tuple container at a specified position.
make_tuple	Creates a tuple object conveniently.
is_convertible	Determines at compile time whether one type can be implicitly converted to another.
is_base_of	Determines at compile time whether a type is a base class of another type.
is_same	Determines at compile time whether two types are the same.
enable_if	Enables or disables a specific function template, class template, or template specialization based on a compile-time condition.
conditional	Selects one of two types based on a compile-time Boolean condition.
integral_constant	Encapsulates a compile-time constant integer value, which is the fundamental component of many type traits and compile-time computations in the standard library.
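
The utilities in this table mirror well-known C++ standard library facilities. The sketch below uses the `std::` versions from a host compiler (C++14 or later) to show how they are typically combined. It assumes, without confirming, that the Ascend C counterparts behave the same way; the namespace they live in is not shown here, and `Base`, `Derived`, `Tag`, and `SumTuple` are names invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <tuple>
#include <type_traits>
#include <utility>

// Illustration with the C++ standard library; all type and function names
// below are hypothetical.
struct Base {};
struct Derived : Base {};

// Type traits are evaluated entirely at compile time.
static_assert(std::is_base_of<Base, Derived>::value, "Derived inherits Base");
static_assert(std::is_convertible<int, double>::value, "int converts to double");
static_assert(std::is_same<std::conditional<true, int, float>::type, int>::value,
              "conditional picks the first type when the condition is true");
static_assert(std::integral_constant<int, 8>::value == 8, "wraps a constant");

// enable_if: exactly one overload takes part in overload resolution for a given T.
template <typename T>
typename std::enable_if<std::is_same<T, float>::value, int>::type Tag(T) { return 1; }

template <typename T>
typename std::enable_if<!std::is_same<T, float>::value, int>::type Tag(T) { return 0; }

// integer_sequence + get: visit every tuple element with compile-time indexes.
template <typename Tuple, std::size_t... I>
int SumImpl(const Tuple& t, std::index_sequence<I...>) {
    int sum = 0;
    int unused[] = {0, (sum += std::get<I>(t), 0)...};
    (void)unused;
    return sum;
}

template <typename... Ts>
int SumTuple(const std::tuple<Ts...>& t) {
    return SumImpl(t, std::index_sequence_for<Ts...>{});
}

// Usage example.
inline void RefTraitsUsage() {
    assert(Tag(1.0f) == 1);
    assert(Tag(1) == 0);
    assert(SumTuple(std::make_tuple(1, 2, 3)) == 6);
    assert(std::max(2, 5) == 5 && std::min(2, 5) == 2);
}
```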

**Table 27** APIs for obtaining platform information
API	Description
PlatformAscendC	To implement the Tiling function on the host, certain hardware platform information, such as the number of cores on a hardware platform, may be required for Tiling calculation. The PlatformAscendC class provides a function for obtaining such platform information.
PlatformAscendCManager	Obtains hardware platform information, such as the number of cores on the hardware platform, when operators are called in basic mode (kernel launch) from a kernel launch operator project. The PlatformAscendCManager class provides the function of obtaining platform information.

**Table 28** APIs for prototype registration and management
API	Description
Prototype Registration API (OP_ADD)	Registers the prototype definition of an operator.
OpDef	Defines the operator prototype.
OpParamDef	Defines operator parameters.
OpAttrDef	Defines operator attributes.
OpAICoreDef	Defines the implementation information of the AI Processor and associates the tiling implementation and shape inference functions.
OpAICoreConfig	Configures AI Core information.
OpMC2Def	Configures the communicator name of the MC2 operator on the host. After the configuration, the context address corresponding to the communicator can be obtained on the kernel.

**Table 29** APIs for registering the tiling data structure
API	Description
TilingData Structure Definition	Defines a TilingData class and adds the member variables (TilingData fields) needed to store TilingData parameters. The defined class inherits from the TilingDef class (the base class for storing and processing user-defined Tiling structure member variables), which provides APIs for setting, serializing, and saving TilingData fields.
TilingData Structure Registration	Registers the defined TilingData structure and binds it with a custom operator.

**Table 30** Tiling debugging APIs
API	Description
OpTilingRegistry	The OpTilingRegistry class belongs to the context_ascendc namespace. It is used to load the dynamic library of tiling implementation and obtain the tiling function pointer of an operator for debugging and verification.
ContextBuilder	Provides a series of APIs for you to manually build the TilingContext class to verify the tiling functions and the KernelContext class to verify the TilingParse functions.

**Table 31** APIs for tiling template programming
API	Description
Template Argument Definition	Defines the template argument declaration ASCENDC_TPL_ARGS_DECL and template argument selection ASCENDC_TPL_ARGS_SEL (available template).
GET_TPL_TILING_KEY	Automatically generates a TilingKey during tiling template programming. This API converts the passed template arguments into binary values based on the defined bit width, combines the binary values in sequence, and then converts the values into uint64, that is, TilingKey.
ASCENDC_TPL_SEL_PARAM
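
The GET_TPL_TILING_KEY description (convert each template argument to a fixed bit width, concatenate in sequence, produce a uint64) amounts to bit-field packing. The sketch below shows that idea in standard C++. It is not the macro's implementation: the `TplField` and `RefPackTilingKey` names are hypothetical, and placing the first argument in the lowest bits is an assumption about ordering.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Host-side illustration only. Names and bit ordering are assumptions.
struct TplField {
    std::uint64_t value;  // template argument value
    unsigned bitWidth;    // number of bits reserved for this argument
};

// Packs the fields in sequence into one 64-bit key.
// Assumption: the first field occupies the least significant bits.
std::uint64_t RefPackTilingKey(const std::vector<TplField>& fields) {
    std::uint64_t key = 0;
    unsigned shift = 0;
    for (const TplField& f : fields) {
        assert(f.bitWidth > 0 && f.bitWidth < 64);
        assert(shift + f.bitWidth <= 64);                    // must fit in uint64
        assert(f.value < (std::uint64_t{1} << f.bitWidth));  // must fit its width
        key |= f.value << shift;
        shift += f.bitWidth;
    }
    return key;
}
```

For example, three arguments with values 1, 3, and 0 and bit widths 2, 4, and 1 pack to binary `0 0011 01`, that is, 13, under the ordering assumed here.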

**Table 32** APIs for offloading tiling
API	Description
DEVICE_IMPL_OP_OPTILING

**Table 33** RTC API List
API	Description
aclrtcCompileProg	Compiles the specified program.
aclrtcCreateProg	Creates an instance of the compiler based on the given parameters.
aclrtcDestroyProg	Destroys the instance of a compiler.
aclrtcGetBinData	Obtains the compiled binary data.
aclrtcGetBinDataSize	Obtains the size of the compiled binary data. This function is used to allocate memory space of the corresponding size when aclrtcGetBinData is called to obtain the binary data.
aclrtcGetCompileLogSize	Obtains the size of the compilation log, which is used to allocate the memory space of the corresponding size when the log content is obtained in aclrtcGetCompileLog.
aclrtcGetCompileLog	Obtains the content of the compilation log and saves it as a string.

**Table 34** Log API list
API	Description
ASC_CPU_LOG	Provides the function of printing logs on the host. You can use the ASC_CPU_LOG_XXX API in the TilingFunc code of the operator to output related content.

AI CPU API

**Table 35** AI CPU APIs
API	Description
printf	Formats output in the AI CPU operator kernel debugging scenario. By default, the output is parsed and printed on the screen.
assert	Implements the assert function in the AI CPU operator kernel debugging scenario.