Ascend C APIs
- Basic data structures: data structures used in kernel APIs, such as GlobalTensor and LocalTensor.
- Basic APIs: abstract hardware capabilities and expose chip capabilities to ensure completeness and compatibility. APIs marked as ISASI (Instruction Set Architecture Special Interface, that is, hardware architecture-related APIs) do not guarantee compatibility across hardware versions.
- High-level APIs: build on basic APIs to implement common computing algorithms and improve development efficiency. High-level APIs include the math library, Matmul, Softmax, and others, and ensure compatibility.
- Utils APIs (common auxiliary functions): provide common utility classes covering the standard library, platform information acquisition, runtime compilation, and log output, helping developers develop operators efficiently and optimize performance.

Basic Data Structure
| API | Description |
|---|---|
| LocalTensor | Stores data in the local memory of the AI Core. It supports the logical positions VECIN, VECOUT, VECCALC, A1, A2, B1, B2, CO1, and CO2. |
| Coordinate | A tuple that indicates the position of a tensor element in each dimension, that is, a coordinate value. |
| Layout | The Layout<Shape, Stride> data structure is a basic template class that describes the memory layout of multi-dimensional tensors. It maps the logical coordinate space to the one-dimensional memory address space based on the shape and stride information at compile time, providing basic support for complex tensor operations and hardware optimization. |
| TensorTrait | A basic template class that describes tensor information, including the data type, logical position, and memory layout of the tensor. |
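The coordinate-to-address mapping that Layout<Shape, Stride> describes can be illustrated in plain C++. This is a minimal sketch that does not use the Ascend C headers: `LinearOffset` is a hypothetical helper introduced here for illustration, and it evaluates at run time, whereas the table states that the real template class works from compile-time shape and stride information.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper, not part of Ascend C: maps a logical coordinate to a
// one-dimensional element offset as the dot product of coordinate and stride.
template <std::size_t Rank>
std::size_t LinearOffset(const std::array<std::size_t, Rank>& coord,
                         const std::array<std::size_t, Rank>& stride) {
    std::size_t offset = 0;
    for (std::size_t i = 0; i < Rank; ++i) {
        offset += coord[i] * stride[i];
    }
    return offset;
}
```

For a 3 x 4 tensor, a stride of (4, 1) gives a row-major layout and a stride of (1, 3) gives a column-major layout; the same coordinate then maps to different offsets.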
Basic APIs
|
API |
Function |
|---|---|
|
Counts the 0 bits or 1 bits in the binary representation of a uint64_t value. |
|
|
Computes the number of leading 0s of a uint64_t number (number of 0s from the most significant bit to the first 1 in the binary number). |
|
|
Converts the type of a scalar to a specified type. |
|
|
Counts the consecutive bits, starting from the most significant bit, that are the same as the sign bit in the binary representation of a uint64_t value. |
|
|
Obtains the position of the first 0 or 1 in the binary representation of a uint64_t value. |
|
|
Converts scalar data of the float type to scalar data of the bfloat16_t type. |
|
|
Converts scalar data of the bfloat16_t type to scalar data of the float type. |
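The bit-counting and bfloat16 conversions described in this table can be sketched in portable C++. The function names below are hypothetical and are not the Ascend C API names. The behavior for an input of 0 and the rounding mode used for float-to-bfloat16 conversion (round to nearest, ties to even) are assumptions, because the table does not specify them.

```cpp
#include <cstdint>
#include <cstring>

// Counts the 1 bits of a 64-bit value.
inline int CountOnes(std::uint64_t value) {
    int count = 0;
    while (value != 0) {
        value &= value - 1;  // clear the lowest set bit
        ++count;
    }
    return count;
}

// Counts the 0 bits between the most significant bit and the first 1 bit.
// Returns 64 for an input of 0 (a choice made for this sketch).
inline int CountLeadingZeros(std::uint64_t value) {
    int count = 0;
    for (std::uint64_t probe = 1ULL << 63; probe != 0 && (value & probe) == 0; probe >>= 1) {
        ++count;
    }
    return count;
}

// Converts a float to the 16-bit pattern of a bfloat16 value.
// Assumption: round to nearest, ties to even. NaN inputs are not handled.
inline std::uint16_t FloatToBf16Bits(float value) {
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof(bits));
    const std::uint32_t lsb = (bits >> 16) & 1u;
    bits += 0x7FFFu + lsb;
    return static_cast<std::uint16_t>(bits >> 16);
}

// Converts a bfloat16 bit pattern back to float. This direction is exact,
// because bfloat16 is the upper 16 bits of an IEEE 754 single-precision value.
inline float Bf16BitsToFloat(std::uint16_t bf16) {
    const std::uint32_t bits = static_cast<std::uint32_t>(bf16) << 16;
    float value = 0.0f;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}
```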
|
Category |
API |
Function |
|---|---|---|
|
Basic arithmetic |
Computes the natural exponent element-wise. |
|
|
Computes the natural logarithm element-wise. |
||
|
Computes the absolute value element-wise. |
||
|
Computes the reciprocal element-wise. |
||
|
Computes the square root element-wise. |
||
|
Computes the reciprocal of the square root element-wise. |
||
|
Performs a ReLU operation element-wise. |
||
|
Performs addition element-wise. |
||
|
Performs subtraction element-wise. |
||
|
Performs multiplication element-wise. |
||
|
Performs division element-wise. |
||
|
Computes the maximum value element-wise. |
||
|
Computes the minimum value element-wise. |
||
|
Performs addition between a scalar and a vector element-wise. |
||
|
Performs multiplication between a scalar and a vector element-wise. |
||
|
Compares the vector source operand and a scalar element-wise and chooses the maximum. |
||
|
Compares the vector source operand and a scalar element-wise and chooses the minimum. |
||
|
Computes Leaky ReLU on the source operand element-wise. |
||
|
Logic-based computation |
Performs a bitwise NOT operation element-wise. |
|
|
Performs a bitwise AND operation element-wise. |
||
|
Performs a bitwise OR operation element-wise. |
||
|
Performs left shift on the source operand element-wise. The shift distance is determined by scalarValue. |
||
|
Performs right shift on the source operand element-wise. The shift distance is determined by scalarValue. |
||
|
Compound computation |
Adds the product of each element in the source operand and a scalar to the corresponding element in the destination operand. |
|
|
Quantizes the input and converts the precision. |
||
|
Adds inputs element-wise and chooses the larger between the result and 0. |
||
|
Adds inputs element-wise and chooses the larger between the result and 0, and converts precision based on the data types of the source and destination operand tensors. |
||
|
Adds inputs element-wise, performs Deq quantization on the result, and then performs ReLU calculation on the result (obtains the larger between the result and 0). |
||
|
Computes the difference element-wise and chooses the larger between the result and 0. |
||
|
Computes the difference element-wise and chooses the larger between the result and 0, and converts precision based on the data types of the source and destination operand tensors. |
||
|
Multiplies src0Local and src1Local element-wise, adds them to dstLocal, and saves the final result to dstLocal. |
||
|
Performs multiplication element-wise and converts precision based on the data types of the source and destination operand tensors. |
||
|
Multiplies src0Local and dstLocal element-wise, adds src1Local, and saves the result to dstLocal. |
||
|
Multiplies src0Local and dstLocal element-wise, adds them to src1Local, chooses the larger between the result and 0, and saves the final result to dstLocal. |
||
|
Comparison and selection |
Compares two tensors element-wise. If the comparison result is true, the corresponding bit of the output is 1; otherwise, the bit is 0. |
|
|
Compares two tensors element-wise. If the comparison result is true, the corresponding bit of the output is 1; otherwise, the bit is 0. This interface can be used when the mask parameter is required. The result is stored in a register. |
||
|
Compares each element of a tensor with a scalar. If the comparison result is true, the corresponding bit of the output is 1; otherwise, the bit is 0. |
||
|
Selects the source operand src0 or src1 based on the bit value of selMask (mask used for selection) to obtain the destination operand dst. When the bit value of selMask is 1, src0 is selected. When the bit value of selMask is 0, src1 is selected. |
||
|
Selects elements from the source operand and writes them to the destination operand based on a gather mask (for data collection) that corresponds to either the binary of the built-in fixed mode or the binary of the user-defined input tensor values. |
||
|
Precision Conversion Instructions |
Converts precision based on the data types of the source and destination operand tensors. |
|
|
Reduction computation |
Obtains the maximum value and its corresponding index position among the input data. |
|
|
Obtains the minimum value and its corresponding index position among the input data. |
||
|
Sums up all input data. |
||
|
Computes the maximum value and index of all data in each repeat. |
||
|
Computes the minimum value and index of all data in each repeat. |
||
|
Sums all data in each repeat. |
||
|
Calculates the maximum value of all elements in each repeat. |
||
|
Calculates the minimum value of all elements in each repeat. |
||
|
Sums up all elements in each repeat. Source operands are added in binary tree mode. |
||
|
Sums two adjacent (odd and even) elements. |
||
|
Sums all data in each repeat. Unlike WholeReduceSum, it does not support the bitwise mask mode. WholeReduceSum, which offers more complete functionality, is recommended instead. |
||
|
Data Conversion |
Transposes the data blocks of a 16 x 16 2D matrix and converts between the [N,C,H,W] and [N,H,W,C] formats. |
|
|
Converts the NCHW format to the NC1HWC0 format. It can also be used for transposing a two-dimensional matrix data block. |
||
|
Data Padding |
Copies a variable or an immediate multiple times to fill a vector. |
|
|
Extracts eight elements from a given input tensor each time and fills them into eight data blocks (32 bytes each) in the result tensor. Each element corresponds to one data block. |
||
|
Creates the vector index with firstValue as the start value. |
||
|
Data Scatter/Data Gather |
Gathers given input tensors by element to the result tensor based on the offset address tensor provided. |
|
|
Mask Operations |
Sets mask to counter mode. In this mode, you do not need to track the number of iterations or handle unaligned tail blocks. You pass in the amount of data to be computed, and the Vector Unit automatically infers the actual number of iterations. |
|
|
Sets mask to normal mode. This mode is the default mode. You can configure the number of iterations. |
||
|
Sets mask during Vector computation. |
||
|
Restores the mask value to the default (all 1s), indicating that all elements in each iteration participate in the Vector computation. |
||
|
Quantization Settings |
Sets the value of the DEQSCALE register. |
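The bit-mask convention shared by the comparison and selection rows above (a 1 bit where the comparison is true, and a 1 bit in selMask choosing src0) can be sketched on the host in plain C++. The names `CompareLessThan` and `SelectByMask` are hypothetical, and packing bit i into byte i / 8 with the least significant bit first is an assumption of this sketch, not a statement about the hardware layout.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical reference for a "less than" compare: bit i of the packed
// result is 1 when src0[i] < src1[i], otherwise 0.
// Both inputs are expected to have the same length.
inline std::vector<std::uint8_t> CompareLessThan(const std::vector<float>& src0,
                                                 const std::vector<float>& src1) {
    std::vector<std::uint8_t> mask((src0.size() + 7) / 8, 0);
    for (std::size_t i = 0; i < src0.size(); ++i) {
        if (src0[i] < src1[i]) {
            mask[i / 8] |= static_cast<std::uint8_t>(1u << (i % 8));
        }
    }
    return mask;
}

// Selects src0[i] where bit i of selMask is 1 and src1[i] where it is 0.
inline std::vector<float> SelectByMask(const std::vector<std::uint8_t>& selMask,
                                       const std::vector<float>& src0,
                                       const std::vector<float>& src1) {
    std::vector<float> dst(src0.size());
    for (std::size_t i = 0; i < src0.size(); ++i) {
        const bool takeSrc0 = ((selMask[i / 8] >> (i % 8)) & 1u) != 0;
        dst[i] = takeSrc0 ? src0[i] : src1[i];
    }
    return dst;
}
```

Chaining the two functions yields an element-wise minimum, which is a convenient way to check that the mask is interpreted consistently on both sides.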
|
API |
Function |
|---|---|
|
Performs data movement, including common data movement, enhanced data movement, tiled data movement, and associated format conversion. |
|
|
Moves data between VECIN, VECCALC, and VECOUT, with support for mask operations and data block intervals. |
|
API |
Function |
|---|---|
|
Allocates and manages resources such as the Global Memory. |
|
|
Obtains the TPipe pointer for the Global Memory managed by the framework. After obtaining the pointer, you can perform TPipe-related operations. |
|
|
Manually manages or reuses the Unified Buffer/L1 Buffer physical memory. It is mainly used when the Unified Buffer/L1 Buffer physical memory is insufficient in multi-stage computing. |
|
|
Performs EnQue and DeQue operations, and implements inter-task synchronization through queues. |
|
|
Binds the source and destination logical locations to determine the memory allocation location and insert the corresponding synchronization event, solving problems such as memory allocation, management, and synchronization. |
|
|
Manages the memory occupied by some temporary variables used during Ascend C programming. |
|
|
Initializes the SPM buffer. |
|
|
Copies data that needs to be spilled and temporarily stored to the SPM buffer. |
|
|
Reads data from the SPM buffer back to the local data. |
|
|
Obtains the pointer to the user workspace. |
|
|
Sets the pointer to the system workspace, as the system workspace is used by the framework communication mechanism during fused operator programming. |
|
|
Obtains the pointer to the system workspace. |
|
API |
Description |
|---|---|
|
Provides APIs for implementing synchronization control. |
|
|
Synchronizes AI Cores to avoid data dependency problems such as write-after-read, read-after-write, and write-after-write when different AI Cores operate on the same global memory block. IBSet sets the flag bit of a core. IBSet and IBWait are used in pairs to implement inter-core synchronous waiting, in which one core waits for another core to complete an operation. |
|
|
Synchronizes AI Cores to avoid data dependency problems such as write-after-read, read-after-write, and write-after-write when different AI Cores operate on the same global memory block. IBWait and IBSet are used in pairs to implement inter-core synchronous waiting, in which one core waits for another core to complete an operation. |
|
|
Synchronizes AI Cores to avoid data dependency problems such as write-after-read, read-after-write, and write-after-write when different AI Cores operate on the same global memory block. Multi-core synchronization is currently classified into hardware synchronization and software synchronization. Hardware synchronization uses the full-core synchronization instruction of the hardware, whereas software synchronization is emulated by a software algorithm. |
|
|
Initializes the value of the GM shared memory. WaitPreBlock and NotifyNextBlock can be called only after the initialization is complete. |
|
|
Reads the value in the GM address to determine whether to continue to wait. When the GM value meets the waiting condition of the current core, the core can proceed to the next operation. |
|
|
Writes the GM address to notify the next core that the operation of the current core is completed and the next core can perform the operation. |
|
|
Called in a sub-kernel of SuperKernel. The instructions after the call can run in parallel with other sub-kernels, improving overall performance. |
|
|
Called in a sub-kernel of SuperKernel. The instructions before the call can run in parallel with other sub-kernels, improving overall performance. |
|
API |
Function |
|---|---|
|
Preloads data from the specific DDR address where the source address is located to the data cache. |
|
|
Refreshes the cache to ensure cache consistency. |
|
API |
Function |
|---|---|
|
Sets whether to perform atomic addition for data transfer from VECOUT to GM, from L0C to GM, or from L1 to GM. The addition data type can be set based on different parameters. |
|
|
Sets different atomic operation data types using template parameters. |
|
|
Clears the status of an atomic operation. |
|
API |
Function |
|---|---|
|
Dumps the content of specified tensors for operators developed based on operator projects. |
|
|
Implements the formatted output function in CPU- or NPU-side debugging for operators developed based on operator projects. |
|
|
ascendc_assert provides an API for implementing the assertion function in the CPU or NPU domain. When the assertion condition is not met, the system outputs the assertion information and prints it in a formatted manner on the screen. |
|
|
Implements the assert function in CPU/NPU for operators developed based on operator projects. |
|
|
Dumps the content of specified tensors for operators developed based on operator projects. This API can be used to print tensors at a specified offset position. |
|
|
Stops the kernel when a software exception occurs. |
|
|
Creates shared memory during verification of the CPU-side operation of the kernel function. That is, creates a shared file in the /tmp directory and returns the mapping pointer to the file. |
|
|
Serves as the CPU debugging entry and calls the CPU operator program during verification of the CPU-side operation of the kernel function. |
|
|
Specifies tilingKey used for the current CPU debugging. During debugging, only the branch to which tilingKey corresponds in the operator kernel function is executed. |
|
|
Frees the shared memory allocated by GmAlloc during verification of the CPU-side operation of the kernel function. |
|
|
Sets the kernel mode to the single AIV mode, single AIC mode, or MIX mode to enable CPU debugging of single AIV (vector) operators, single AIC (cube) operators, or MIX operators, respectively. |
|
|
Inserts a trace marker in any running phase of the operator when the CAModel is used for operator performance simulation, so that the pipeline diagrams of different instructions can be analyzed for further performance tuning. This API marks the start point and is used together with TRACE_STOP. |
|
|
Inserts a trace marker in any running phase of the operator when the CAModel is used for operator performance simulation, so that the pipeline diagrams of different instructions can be analyzed for further performance tuning. This API marks the end point and is used together with TRACE_START. |
|
|
Starts the profile data collection. This API is used together with MetricsProfStop. When using msProf for operator on-board tuning, you can call MetricsProfStart and MetricsProfStop before and after the code segment on the kernel to specify the scope of the code segment to be tuned. |
|
|
Stops the profile data collection. This API is used together with MetricsProfStart. When using msProf for operator on-board tuning, you can call MetricsProfStart and MetricsProfStop before and after the code segment on the kernel to specify the scope of the code segment to be tuned. |
|
API |
Description |
|---|---|
|
Async provides a unified API for executing specific functions in different modes (AIC or AIV), thereby avoiding direct hardware condition judgment in code (such as using ASCEND_IS_AIV or ASCEND_IS_AIC). |
|
|
Obtains the Cube/Vector ratio, which is applicable to the Cube/Vector separation mode. |
|
API |
Function |
|---|---|
|
Obtains the tiling information input by the kernel entry point function of the operator and fills the information in the registered tiling structure. This function is compiled in macro expansion mode. If a user has registered multiple TilingData structures, this API is used to return the default registered structure. |
|
|
Specifies a structure name to obtain the specified tiling information and fill the information in the corresponding tiling structure. This function is built in macro expansion mode. |
|
|
Obtains the member variables of a tiling structure. |
|
|
Checks whether the tiling_key in the current kernel function execution is equal to a specific key, so as to identify a kernel branch with tiling_key==key. |
|
|
Registers the default TilingData structure defined by the user using the standard C++ syntax on the kernel. |
|
|
Registers a custom TilingData structure that matches the TilingKey on the kernel. This API needs to provide a logical expression, which uses the string TILING_KEY_VAR to indicate the actual TilingKey and the range that the TilingKey meets. |
|
|
When a kernel uses TilingData structures defined with the standard C++ syntax and you are not sure which structures need to be registered, use this API to notify the framework that unregistered TilingData structures defined with the standard C++ syntax are in use. You can then use GET_TILING_DATA_WITH_STRUCT, GET_TILING_DATA_MEMBER, and GET_TILING_DATA_PTR_WITH_STRUCT to obtain the corresponding TilingData. |
|
|
Sets the global default kernel type, which applies to all tiling keys. |
|
|
Sets the kernel type corresponding to a specific tiling key. |
|
Category |
API |
Function |
|---|---|---|
|
Vector Computation |
Performs the padding operation on the source operand by the data block based on padMode and padSide. |
|
|
Performs bilinear interpolation operations, including vertical iteration and horizontal iteration. |
||
|
Obtains the comparison result of the Compare (Result Stored in a Register) instruction. |
||
|
Sets the comparison register for the APIs where Select does not specify the mask parameter. |
||
|
Obtains the computation result of ReduceSum (based on the first n data elements of a tensor). |
||
|
Obtains the maximum/minimum values and the corresponding index values in the scenario where ReduceMax and ReduceMin are consecutive. |
||
|
Inserts consecutive elements into the corresponding positions in Region Proposals. In each iteration, 16 consecutive elements are inserted into the corresponding positions in 16 Region Proposals. |
||
|
Extracts elements from corresponding positions in Region Proposals and rearranges them. In each iteration, 16 elements are extracted from 16 Region Proposals and arranged consecutively. The functionality of this API is the opposite of that of ProposalConcat. |
||
|
Sorts the Region Proposals based on their score fields in descending order. 16 Region Proposals are sorted in each iteration. |
||
|
Merges at most four sorted Region Proposal lists into one. The results are sorted in descending order of the score fields. |
||
|
Sorts a maximum of 32 elements in each iteration. |
||
|
Merges at most four sorted lists into one. The results are sorted in descending order of the score fields. |
||
|
Obtains the number of region proposals in the queue processed by MrgSort or MrgSort4 and stores the number in the four List arguments in sequence. |
||
|
Gathers a given input tensor to the result tensor based on the offset address tensor provided. |
||
|
Generates a new result tensor based on a given continuous input tensor, a destination address offset tensor, and the offset address, and distributes the input tensor to the result tensor. |
||
|
Data Movement |
Supports the movement of non-aligned data. |
|
|
Sets the padding value used by DataCopyPad. |
||
|
Cube Computation |
Performs matrix multiplication and addition. |
|
|
Performs matrix multiplication and addition operations. The input left matrix A is a sparse matrix, and the input right matrix B is a dense matrix. |
||
|
Sets register values, which is similar to SetHF32TransMode and SetMMLayoutTransform. SetHF32Mode is used to set the HF32 mode of the MMAD. |
||
|
Sets register values, which is similar to SetHF32Mode and SetMMLayoutTransform. SetHF32TransMode is used to set the HF32 rounding mode of the MMAD. It is valid only when the HF32 mode of the MMAD takes effect. |
||
|
Sets register values, which is similar to SetHF32Mode and SetHF32TransMode. SetMMLayoutTransform is used to set the M/N direction of the MMAD. |
||
|
Performs 2D convolution on a given input tensor and a weight tensor and outputs a result tensor. The Conv2d convolution layer is mostly used for image recognition, and a filter is used to extract features in an image. |
||
|
Multiplies matrix A by matrix B and outputs the result as matrix C. |
||
|
Sets the tensor quantization parameters in the real-time quantization during DataCopy (CO1->GM or CO1->A1). |
||
|
Sets the NZ2ND configuration in the real-time format conversion (NZ2ND) during DataCopy (CO1 -> GM or CO1 -> A1). |
||
|
Sets the scalar quantization parameters in the real-time quantization during DataCopy (CO1 -> GM or CO1 -> A1). |
||
|
Sets the maximum value of the ClipReLU operation after real-time quantization is performed during DataCopy (CO1 -> GM). |
||
|
Sets the address of LocalTensor during the element-wise operation after real-time quantization is performed during DataCopy (CO1 -> GM). |
||
|
Initializes LocalTensor (TPosition: A1, A2, B1, or B2) to a specific value. |
||
|
Provides the Load2D and Load3D data loading functions. |
||
|
Loads 2D data with transposing from A1/B1 to A2/B2. |
||
|
Sets AI preprocessing (AIPP) parameters for images. |
||
|
Transfers image data from the GM to A1/B1. During the transfer, you can preprocess images, including image flipping, image resizing (clipping, cropping, scaling, and stretching), color space conversion (CSC), and type conversion. |
||
|
Loads the compression index table on the GM to the internal register. |
||
|
Decompresses data on the GM and transfers it to A1, B1, and B2. |
||
|
Moves the 512-byte dense weight matrix stored in B1 to B2, and reads the 128-byte index matrix used to sparsify the dense matrix. |
||
|
Sets the attribute description of the feature map when Load3Dv1/Load3Dv2 is called. |
||
|
Sets the A1/B1 boundary value when Load3D is called. |
||
|
Sets the repeat parameter of the Load3Dv2 API. After the repeat parameter is set, the Load3Dv2 API can be called once to complete the data movement for multiple iterations. |
||
|
Sets padValue for Load3Dv1/Load3Dv2. |
||
|
Processes the result after the matrix computation is complete. For example, the computation result is quantized and the data is moved from CO1 to the Global Memory. |
||
|
Synchronization Control |
Synchronizes different pipelines in the same core. This synchronization operation needs to be inserted between different pipeline instructions with data dependency. |
|
|
Blocks a pipeline. This synchronization operation needs to be inserted between the same pipelines with data dependency. |
||
|
Blocks the execution of subsequent instructions until all previous memory access instructions (the memory location to be waited for can be controlled by parameters) are executed. |
||
|
Synchronization instruction between the Cube Unit (AIC) and Vector Unit (AIV) on the AI Core in separated mode. |
||
|
Synchronization wait instruction between the Cube Unit (AIC) and Vector Unit (AIV) on the AI Core in separated mode. |
||
|
Cache Processing |
Preloads instructions to the iCache from the DDR address where the instructions are located. |
|
|
Obtains the PreLoad status of the iCache. |
||
|
System Variable Access |
Obtains the pointer to the program counter, which is used to record the current program execution position. |
|
|
Obtains the number of Vector Cores on the AI Core. |
||
|
Obtains the ID of the Vector Core on the AI Core. |
||
|
Obtains the current system cycle count. To convert the cycle count to time in μs, use a frequency of 50 MHz: Time = (Number of cycles / 50) μs. |
||
|
Atomic Operations |
Sets whether to perform atomic comparison for subsequent data transferred from VECOUT to GM, which compares the content to be copied with the existing content in GM and writes the maximum value to GM. |
|
|
Sets whether to perform atomic comparison for subsequent data transferred from VECOUT to GM, which compares the content to be copied with the existing content in GM and writes the minimum value to GM. |
||
|
Sets the atomic operation enabling flag and type. |
||
|
Obtains the value of the enabling flag and type of the atomic operation. |
||
|
Debug Ports |
Monitors UB read and write operations within a specified range. If a read or write operation within the range is detected, an EXCEPTION error is reported; otherwise, no error is reported. |
|
|
Cube group management |
CubeResGroupHandle is used to control the communication between the AI Core and AI Vector in split mode through software synchronization, implementing AI Core computing resource grouping. |
|
|
Controls synchronization when two AIV tasks in the same CubeResGroupHandle object depend on each other. |
||
|
Manages the message communication area division of different CubeResGroupHandle objects. It is a communication workspace descriptor and is used together with CubeResGroupHandle. The KfcWorkspace constructor is used to create a KfcWorkspace object. |
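The cycle-to-time conversion given in the System Variable Access rows, Time = (Number of cycles / 50) μs, can be written as a small helper. This is a sketch in plain C++: `CyclesToMicroseconds` and `kCyclesPerMicrosecond` are hypothetical names, and the two cycle counts are assumed to be values read before and after the code segment being timed.

```cpp
#include <cstdint>

// Frequency stated in the table: 50 MHz, that is, 50 cycles per microsecond.
constexpr double kCyclesPerMicrosecond = 50.0;

// Converts the difference between two cycle counts to microseconds.
inline double CyclesToMicroseconds(std::uint64_t startCycle, std::uint64_t endCycle) {
    return static_cast<double>(endCycle - startCycle) / kCyclesPerMicrosecond;
}
```

For example, a difference of 5000 cycles corresponds to 100 μs.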
High-Level APIs
|
API |
Function |
|---|---|
|
Computes arc cosine element-wise. |
|
|
Computes inverse hyperbolic cosine element-wise. |
|
|
Computes arcsine element-wise. |
|
|
Computes inverse hyperbolic sine element-wise. |
|
|
Computes arc tangent element-wise. |
|
|
Computes inverse hyperbolic tangent element-wise. |
|
|
Adds the product of each element of the source operand and a scalar to the corresponding element in the destination operand. |
|
|
Obtains the smallest integer greater than or equal to x, that is, rounding towards positive infinity. |
|
|
Replaces the elements of srcTensor that are greater than scalar with scalar, keeps the elements that are less than or equal to scalar, and outputs the result as dstTensor. |
|
|
Replaces the elements of srcTensor that are less than scalar with scalar, keeps the elements that are greater than or equal to scalar, and outputs the result as dstTensor. |
|
|
Computes cosine element-wise. |
|
|
Computes hyperbolic cosine element-wise. |
|
|
Accumulates data by row or column. |
|
|
Computes the logarithmic derivative of the gamma function of x element-wise. |
|
|
Computes the error function (also called the Gaussian error function) element-wise. |
|
|
Returns the complementary error function computing result of input x. The integral ranges from x to infinity. |
|
|
Computes the natural exponent element-wise. |
|
|
Obtains the largest integer less than or equal to x, that is, rounding towards negative infinity. |
|
|
Computes the remainder of two floating-point numbers element-wise. |
|
|
Returns the fractional part element-wise. |
|
|
Computes the natural logarithm of the absolute value of the gamma function of x element-wise. |
|
|
Computes the logarithm to base e, 2, or 10 element-wise. |
|
|
Computes exponentiation element-wise. |
|
|
Rounds the input element to the nearest integer. |
|
|
Performs the Sign operation element-wise. Sign returns the sign of the input data. |
|
|
Computes sine element-wise. |
|
|
Computes hyperbolic sine element-wise. |
|
|
Computes tangent element-wise. |
|
|
Computes the hyperbolic tangent (Tanh) element-wise. |
|
|
Truncates floating point numbers element-wise, that is, rounding towards zero. |
|
|
Performs the XOR operation element-wise. |
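The rounding-related rows in this table differ only in rounding direction, which is easiest to see on negative inputs. The sketch below uses the C++ standard library rather than the Ascend C APIs. `RoundingResults` and `RoundElementWise` are hypothetical names, and defining the fractional part as x minus its truncated value is an assumption about the sign convention.

```cpp
#include <cmath>
#include <vector>

// Results of four element-wise rounding-style operations on one input.
struct RoundingResults {
    std::vector<float> ceilValues;   // smallest integer >= x
    std::vector<float> floorValues;  // largest integer <= x
    std::vector<float> truncValues;  // integer part, rounding towards zero
    std::vector<float> fracValues;   // x - trunc(x); sign convention assumed
};

inline RoundingResults RoundElementWise(const std::vector<float>& src) {
    RoundingResults out;
    for (float x : src) {
        out.ceilValues.push_back(std::ceil(x));
        out.floorValues.push_back(std::floor(x));
        out.truncValues.push_back(std::trunc(x));
        out.fracValues.push_back(x - std::trunc(x));
    }
    return out;
}
```

For an input of -1.5, rounding towards positive infinity gives -1, rounding towards negative infinity gives -2, and rounding towards zero gives -1.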
|
API |
Function |
|---|---|
|
Performs fake quantization element-wise, for example, converting int8_t data to the half type. |
|
|
Performs dequantization element-wise, for example, dequantizing int32_t data to the half or float type. |
|
|
Performs quantization element-wise, for example, quantizing half or float data to the int8_t type. |
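As a rough illustration of the quantization and dequantization rows above, the sketch below quantizes a float to int8_t and dequantizes an int32_t to float on the host. The formulas are assumptions made for this example (an affine mapping with a scale and an offset, saturated to the int8_t range, and a single multiplicative dequantization scale); the table does not state the formulas used by the actual APIs, and the function names are hypothetical.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Assumed formula for this sketch: q = clamp(round(x * scale + offset), -128, 127).
// std::nearbyint rounds according to the current rounding mode
// (round to nearest, ties to even, by default).
inline std::int8_t QuantizeToInt8(float x, float scale, float offset) {
    const float rounded = std::nearbyint(x * scale + offset);
    const float saturated = std::min(127.0f, std::max(-128.0f, rounded));
    return static_cast<std::int8_t>(saturated);
}

// Assumed formula for this sketch: y = q * deqScale.
inline float DequantizeFromInt32(std::int32_t q, float deqScale) {
    return static_cast<float>(q) * deqScale;
}
```

Saturating before the cast avoids the undefined behavior of converting an out-of-range float to an integer type.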
|
API |
Function |
|---|---|
|
Normalizes each input feature of samples in each batch along the batch dimension. |
|
|
Serves as a replacement for LayerNorm normalization during the training process of a deep neural network. |
|
|
Divides the input C dimension into groups (groupNum) and standardizes each group of data. |
|
|
Normalizes the input data of network layers to the [0, 1] range to standardize the distributions of both input and output data across network layers. |
|
|
Computes the backpropagation gradient of LayerNorm. |
|
|
Obtains the backward beta/gamma values and, when used in conjunction with LayerNormGrad, outputs pdx, gamma, and beta. |
|
|
Computes the reciprocal rstd of the standard deviation of the input data with shape [A, R] and the normalized output y based on the known mean value and variance in LayerNorm. |
|
|
Normalizes input data whose shape is [B, S, H] using RmsNorm. |
|
|
Implements preprocessing of the Welford algorithm. |
|
|
Implements postprocessing of the Welford algorithm. |
|
API |
Function |
|---|---|
|
Performs postprocessing on SoftMax compute results and adjusts SoftMax compute results to specified values. |
|
|
Implements an activation function of the simplified FastGelu version. |
|
|
Implements an activation function of the FastGeluV2 version. |
|
|
Serves as a GLU variant that uses GeLU as the activation function. |
|
|
Serves as an important activation function that is inspired by ReLU and dropout. It introduces the idea of stochastic regularization into the activation. |
|
|
Performs LogSoftmax computation on the input tensor. |
|
|
Serves as a GLU variant that uses ReLU as the activation function. |
|
|
Computes the logistic Sigmoid function element-wise. |
|
|
Computes Silu element-wise. |
|
|
Uses the computed sum and max data to perform softmax computation on the input tensor. |
|
|
Performs softmax computation on input tensors by row. |
|
|
Serves as the enhanced version of SoftMax, which not only performs softmax computation on the input tensor, but also updates the result of the current softmax computation based on the sum and max values obtained in the previous softmax computation. |
|
|
Serves as the enhanced version of SoftmaxFlash, corresponding to the FlashAttention-2 algorithm. |
|
|
Serves as the enhanced version of SoftmaxFlash, corresponding to the Softmax PASA algorithm. |
|
|
Performs gradient backpropagation on input tensors. |
|
|
Performs gradient backpropagation on input tensors. |
|
|
Serves as a GLU variant that uses Swish as the activation function. |
|
|
Serves as a Swish activation function in neural networks. |
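The idea of updating a softmax result from the sum and max values of a previous computation, mentioned in the SoftmaxFlash row, is commonly implemented with a running-max and running-sum recurrence. The sketch below shows that general recurrence in plain C++. It is an assumption that the Ascend C APIs follow the same recurrence, and `SoftmaxState` and `UpdateSoftmaxState` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>
#include <vector>

// Running statistics carried from one block of scores to the next.
struct SoftmaxState {
    float runningMax = -std::numeric_limits<float>::infinity();
    float runningSum = 0.0f;  // sum of exp(x - runningMax) over all blocks seen
};

// Folds one block into the state, rescaling the previous sum to the new max.
inline void UpdateSoftmaxState(SoftmaxState& state, const std::vector<float>& block) {
    if (block.empty()) {
        return;
    }
    float newMax = state.runningMax;
    for (float x : block) {
        newMax = std::max(newMax, x);
    }
    float blockSum = 0.0f;
    for (float x : block) {
        blockSum += std::exp(x - newMax);
    }
    state.runningSum = state.runningSum * std::exp(state.runningMax - newMax) + blockSum;
    state.runningMax = newMax;
}
```

After all blocks are processed, the softmax value of an element x is exp(x - runningMax) / runningSum, which matches a softmax computed over the whole row at once.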
|
API |
Function |
|---|---|
|
Obtains the sum of elements in the last dimension. |
|
|
Computes the mean of elements along the last axis. |
|
|
Performs the XOR (bitwise XOR) operation by element and computes the sum of the results using ReduceSum. |
|
|
Sums the data of a multi-dimensional vector along a specified dimension. |
|
|
Calculates the average value of a multi-dimensional vector along a specified dimension. |
|
|
Returns the maximum value of a multi-dimensional vector in a specified dimension. |
|
|
Returns the minimum value of a multi-dimensional vector in a specified dimension. |
|
|
Calculates the logical OR of a multi-dimensional vector along a specified dimension. |
|
|
Calculates the logical AND of a multi-dimensional vector along a specified dimension. |
|
|
Calculates the product of a multi-dimensional vector along a specified dimension. |
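
The reductions above all collapse one dimension of a multi-dimensional input with a binary operation. As a minimal sketch of that pattern for the two-dimensional, row-major case, the following host-side C++ helper takes the operation as a parameter. `Reduce2D` and its parameters are hypothetical names for illustration and do not reflect the Ascend C signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reduces a row-major rows x cols matrix along one axis.
// axis == 0 collapses the rows (output length: cols).
// axis == 1 collapses the columns (output length: rows).
// init is the identity of op, for example 0 for a sum or 1 for a product.
template <typename Op>
std::vector<float> Reduce2D(const std::vector<float>& data, std::size_t rows,
                            std::size_t cols, int axis, float init, Op op) {
    assert(data.size() == rows * cols);
    assert(axis == 0 || axis == 1);
    std::vector<float> out(axis == 0 ? cols : rows, init);
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            float& acc = out[axis == 0 ? c : r];
            acc = op(acc, data[r * cols + c]);
        }
    }
    return out;
}
```

Passing addition, multiplication, or a max/min lambda as `op` yields the sum, product, and extremum variants from the same loop structure.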
|
API |
Function |
|---|---|
|
Obtains the first k maximum or minimum values of the last dimension and their corresponding indexes. |
|
|
Preprocesses data for sorting by merging the source operand srcLocal into the destination concatLocal. The data can be sorted after this preprocessing. |
|
|
Processes the sorting result data and outputs the sorted values and indexes. |
|
|
Sorts data in descending order by value. |
|
|
Merges at most four sorted lists into one. The results are sorted in descending order of the score fields. |
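
The first entry in the table above returns the k largest or smallest values together with their indexes. The following host-side C++ sketch shows one way to express that result with the standard library. `TopKSketch` is a hypothetical name, and the tie-break rule (lower index first) is an assumption made for the example, not documented behavior of the Ascend C API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>
#include <vector>

// Returns the k largest (or smallest) values of v with their indexes,
// ordered from best to worst. Ties are broken by the lower index,
// which is an assumption made for this sketch.
inline std::vector<std::pair<float, std::size_t>> TopKSketch(
    const std::vector<float>& v, std::size_t k, bool largest) {
    assert(k <= v.size());
    std::vector<std::size_t> idx(v.size());
    std::iota(idx.begin(), idx.end(), std::size_t{0});
    auto better = [&v, largest](std::size_t a, std::size_t b) {
        if (v[a] != v[b]) {
            return largest ? v[a] > v[b] : v[a] < v[b];
        }
        return a < b;
    };
    // Only the first k positions need to be fully ordered.
    std::partial_sort(idx.begin(), idx.begin() + static_cast<std::ptrdiff_t>(k),
                      idx.end(), better);
    std::vector<std::pair<float, std::size_t>> out;
    out.reserve(k);
    for (std::size_t i = 0; i < k; ++i) {
        out.emplace_back(v[idx[i]], idx[i]);
    }
    return out;
}
```

Sorting an index array instead of the values keeps the original positions available, which is what allows the values and indexes to be returned together.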
|
API |
Function |
|---|---|
|
Given two source operands src0 and src1, selects elements according to the value (not the individual bits) at the corresponding position of maskTensor to obtain the destination operand dst. |
|
|
Filters the source operand based on the mask tensor to obtain the destination operand. |
|
API |
Function |
|---|---|
|
Performs data layout conversion and reshape operations on the input data. |
|
|
Converts the layout format of the input data to the target layout format. |
|
|
Broadcasts the input based on the output shape. |
|
|
Pads a two-dimensional tensor (height x width) to 32-byte alignment in the width direction. |
|
|
Unpads a two-dimensional tensor (height x width) in the width direction. |
|
|
Initializes data in the global memory to a specified value. |
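
The pad and unpad entries above revolve around 32-byte alignment in the width direction. The following host-side C++ sketch works through the arithmetic for a row-major float matrix. `AlignedWidth`, `PadWidth`, and `UnPadWidth` are hypothetical names for illustration, and the sketch assumes the element size divides 32 evenly; the Ascend C APIs take tensors and their own parameter structures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

constexpr std::size_t kAlignBytes = 32;

// Smallest width (in elements) that is >= width and whose byte size
// is a multiple of 32. Assumes elemBytes divides 32 evenly.
inline std::size_t AlignedWidth(std::size_t width, std::size_t elemBytes) {
    assert(elemBytes > 0 && kAlignBytes % elemBytes == 0);
    const std::size_t elemsPerBlock = kAlignBytes / elemBytes;
    return (width + elemsPerBlock - 1) / elemsPerBlock * elemsPerBlock;
}

// Copies a height x width matrix into a height x alignedWidth matrix,
// filling the extra columns with padValue.
inline std::vector<float> PadWidth(const std::vector<float>& src, std::size_t height,
                                   std::size_t width, float padValue) {
    assert(src.size() == height * width);
    const std::size_t padded = AlignedWidth(width, sizeof(float));
    std::vector<float> dst(height * padded, padValue);
    for (std::size_t h = 0; h < height; ++h) {
        for (std::size_t w = 0; w < width; ++w) {
            dst[h * padded + w] = src[h * width + w];
        }
    }
    return dst;
}

// Inverse operation: drops the padding columns again.
inline std::vector<float> UnPadWidth(const std::vector<float>& src, std::size_t height,
                                     std::size_t paddedWidth, std::size_t width) {
    assert(src.size() == height * paddedWidth && width <= paddedWidth);
    std::vector<float> dst(height * width);
    for (std::size_t h = 0; h < height; ++h) {
        for (std::size_t w = 0; w < width; ++w) {
            dst[h * width + w] = src[h * paddedWidth + w];
        }
    }
    return dst;
}
```

For example, a row of 7 two-byte elements occupies 14 bytes and is padded to 16 elements (32 bytes), while a row of 8 four-byte elements is already aligned.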
|
API |
Function |
|---|---|
|
Returns an arithmetic progression given the start value, common difference, and length. |
|
API |
Function |
|---|---|
|
Performs matrix multiplications. |
|
API |
Function |
|---|---|
|
Orchestrates collective communication tasks on the AI Core. |
|
API |
Description |
|---|---|
|
Performs the forward 3D convolution operation. |
|
|
Performs the backward convolution operation to calculate the backpropagation error of the feature matrix. |
|
|
Performs the backward convolution operation to calculate the backpropagation error of the weight. |
Utils API
|
API |
Description |
|---|---|
|
Compares two operands of the same data type and returns the larger value. |
|
|
Compares two operands of the same data type and returns the smaller value. |
|
|
Generates an integer sequence. |
|
|
Stores multiple elements of different types as a container. |
|
|
Extracts elements from the tuple container at a specified position. |
|
|
Creates a tuple object conveniently. |
|
|
Determines at compile time whether one type can be implicitly converted to another. |
|
|
Determines at compile time whether a type is a base class of another type. |
|
|
Determines at compile time whether two types are the same. |
|
|
Enables or disables a specific function template, class template, or template specialization at compile time based on a condition. |
|
|
Selects one of two types at compile time based on a Boolean condition. |
|
|
Encapsulates a compile-time constant integer value, which is the fundamental component of many type traits and compile-time computations in the standard library. |
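
The utilities in the table above mirror well-known facilities of the C++ standard library. The following sketch uses the standard `<type_traits>` and `<tuple>` headers to show the kind of compile-time checks and selections they enable. It deliberately uses `std::` names, which are known to exist; whether the Ascend C counterparts share the exact names and namespaces is not stated here and should be checked against the API reference. `Base`, `Derived`, `AlignUp`, `Accum`, `BlockBytes`, and `SecondOfTuple` are hypothetical names for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <tuple>
#include <type_traits>

struct Base {};
struct Derived : Base {};

// enable_if: this overload only participates for integral types.
template <typename T,
          typename std::enable_if<std::is_integral<T>::value, int>::type = 0>
constexpr T AlignUp(T value, T alignment) {
    return (value + alignment - 1) / alignment * alignment;
}

// conditional: pick an accumulator type from a Boolean template argument.
template <bool UseWide>
using Accum = typename std::conditional<UseWide, double, float>::type;

// integral_constant: a compile-time integer wrapped in a type.
using BlockBytes = std::integral_constant<std::uint32_t, 32>;

// Type relationships checked at compile time.
static_assert(std::is_same<Accum<true>, double>::value, "wide accumulator is double");
static_assert(std::is_same<Accum<false>, float>::value, "narrow accumulator is float");
static_assert(std::is_base_of<Base, Derived>::value, "Derived inherits from Base");
static_assert(std::is_convertible<int, float>::value, "int converts to float");

// make_tuple and get: build a heterogeneous container and read one element.
inline int SecondOfTuple() {
    auto t = std::make_tuple(1.5f, 42, 'x');
    return std::get<1>(t);
}
```

All of the `static_assert` checks are resolved by the compiler, so they add no runtime cost to a kernel or tiling function that uses them.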
|
API |
Description |
|---|---|
|
Implementing the Tiling function on the host may require hardware platform information, such as the number of cores, for Tiling calculation. The PlatformAscendC class provides functions for obtaining this information. |
|
|
Obtains hardware platform information, such as the number of cores, when operators are called in basic mode (kernel launch) from a kernel launch operator project. The PlatformAscendCManager class provides functions for obtaining this information. |
|
API |
Description |
|---|---|
|
Registers the prototype definition of an operator. |
|
|
Defines the operator prototype. |
|
|
Defines operator parameters. |
|
|
Defines operator attributes. |
|
|
Defines the implementation information of the AI Processor and associates the tiling implementation and shape inference functions. |
|
|
Configures AI Core information. |
|
|
Configures the communicator name of the MC2 operator on the host. After the configuration, the context address corresponding to the communicator can be obtained on the kernel. |
|
API |
Description |
|---|---|
|
Defines a TilingData class and adds member variables (TilingData fields) to store the required TilingData parameters. The defined class inherits from the TilingDef class (the base class for storing and processing user-defined Tiling structure member variables), which provides APIs for setting, serializing, and saving TilingData fields. |
|
|
Registers the defined TilingData structure and binds it with a custom operator. |
|
API |
Description |
|---|---|
|
Defines the template argument declaration ASCENDC_TPL_ARGS_DECL and the template argument selection ASCENDC_TPL_ARGS_SEL (the available template combinations). |
|
|
Automatically generates a TilingKey during tiling template programming. This API converts the passed template arguments into binary values based on the defined bit widths, combines the binary values in sequence, and converts the result into a uint64 value, which is the TilingKey. |
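
The TilingKey description above amounts to packing several small integers into one 64-bit value, each in its own bit field. The following host-side C++ sketch shows that packing under explicit assumptions: `TplArg` and `PackTilingKey` are hypothetical names, and placing the first argument in the lowest bits is an assumption made for the example, since the actual field order used by the API is not stated here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One template argument and the number of bits reserved for it.
// Hypothetical helper type for illustration only.
struct TplArg {
    std::uint64_t value;
    unsigned bitWidth;
};

// Packs the arguments into one 64-bit key, one bit field per argument.
// Assumption for this sketch: the first argument occupies the lowest bits.
inline std::uint64_t PackTilingKey(const std::vector<TplArg>& args) {
    std::uint64_t key = 0;
    unsigned shift = 0;
    for (const TplArg& a : args) {
        assert(a.bitWidth > 0 && a.bitWidth < 64);
        assert(shift + a.bitWidth <= 64);                 // fields must fit in 64 bits
        assert(a.value < (std::uint64_t{1} << a.bitWidth));  // value must fit its field
        key |= a.value << shift;
        shift += a.bitWidth;
    }
    return key;
}
```

Because every combination of argument values maps to a distinct key, the key can identify which template instantiation a given tiling result belongs to.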
|
API |
Description |
|---|---|
|
Compiles a specified program. |
|
|
Creates an instance of the compiler based on the given parameters. |
|
|
Destroys the instance of a compiler. |
|
|
Obtains the compiled binary data. |
|
|
Obtains the size of the compiled binary data. This function is used to allocate memory space of the corresponding size when aclrtcGetBinData is called to obtain the binary data. |
|
|
Obtains the size of the compilation log. This function is used to allocate memory space of the corresponding size when aclrtcGetCompileLog is called to obtain the log content. |
|
|
Obtains the content of the compilation log and saves it as a string. |
|
API |
Description |
|---|---|
|
Prints logs on the host. You can use the ASC_CPU_LOG_XXX APIs in the TilingFunc code of the operator to output related content. |