Ascend C APIs
Ascend C provides a set of class library APIs. You can program with standard C++ syntax plus these APIs. The Ascend C programming class library APIs fall into the following types:
- Kernel APIs: implement the operator kernel function, including:
  - Basic data structures: data structures used throughout the kernel APIs, such as GlobalTensor and LocalTensor.
  - Basic APIs: abstract hardware capabilities and expose chip capabilities, with completeness and compatibility guarantees. APIs marked as Instruction Set Architecture Special Interface (ISASI, hardware-architecture-related APIs) do not guarantee compatibility across hardware versions.
  - High-level APIs: implement common computing algorithms on top of the basic APIs to improve programming and development efficiency, and guarantee compatibility. They include the math library, Matmul, Softmax, and others.
- Host APIs:
  - Tiling APIs: provide the tiling parameters required for kernel computation.
  - Ascend C operator prototype registration and management APIs: define and register the Ascend C operator prototype.
  - Tiling data structure registration APIs: define and register the TilingData structure of an Ascend C operator.
  - Platform information obtaining APIs: obtain hardware platform information, such as the number of cores, to support tiling computation in the host-side tiling function.
- Operator debugging APIs: used for operator debugging, including twin (CPU/NPU) debugging and performance debugging.

Basic data structures and APIs are required for Ascend C operator programming. For details, see Basic Data Structures and APIs. Runtime APIs are required to call the operator after development. For details, see "AscendCL API Reference" in the CANN AscendCL Application Software Development Guide (C&C++).

Kernel API - Basic APIs
| API | Function |
|---|---|
| | Obtains the number of 0s or 1s in a binary number of the uint64_t type. |
| | Computes the number of leading 0s of a uint64_t number (the number of 0s from the most significant bit to the first 1). |
| | Converts a scalar to a specified type. |
| | Computes the number of consecutive bits that match the sign bit, counting from the most significant bit of a uint64_t number. |
| | Obtains the position of the first 0 or 1 in a binary number of the uint64_t type. |
| | Converts scalar data of the float type to the bfloat16_t type. |
| | Converts scalar data of the bfloat16_t type to the float type. |
| Category | API | Function |
|---|---|---|
| One-Operand Instructions | | Computes the natural exponent element-wise. |
| | | Computes the natural logarithm element-wise. |
| | | Computes the absolute value element-wise. |
| | | Computes the reciprocal element-wise. |
| | | Computes the square root element-wise. |
| | | Computes the reciprocal of the square root element-wise. |
| | | Performs bitwise NOT element-wise. |
| | | Performs a ReLU operation element-wise. |
| Two-Operand Instructions | | Performs addition element-wise. |
| | | Performs subtraction element-wise. |
| | | Performs multiplication element-wise. |
| | | Performs division element-wise. |
| | | Computes the maximum element-wise. |
| | | Computes the minimum element-wise. |
| | | Performs a bitwise AND operation element-wise. |
| | | Performs a bitwise OR operation element-wise. |
| | | Adds inputs element-wise and takes the larger of the result and 0. |
| | | Adds inputs element-wise, takes the larger of the result and 0, and converts precision based on the data types of the source and destination tensors. |
| | | Adds inputs element-wise, performs Deq quantization on the result, and then applies ReLU (takes the larger of the result and 0). |
| | | Subtracts inputs element-wise and takes the larger of the result and 0. |
| | | Subtracts inputs element-wise, takes the larger of the result and 0, and converts precision based on the data types of the source and destination tensors. |
| | | Multiplies src0Local and src1Local element-wise, adds the product to dstLocal, and saves the result to dstLocal. |
| | | Converts precision after computing the element-wise product. |
| | | Multiplies src0Local and dstLocal element-wise, adds src1Local, and saves the result to dstLocal. |
| | | Multiplies src0Local and dstLocal element-wise, adds src1Local, takes the larger of the result and 0, and saves the result to dstLocal. |
| Two-Operand Scalar Instructions | | Performs element-wise addition between a scalar and a vector. |
| | | Performs element-wise multiplication between a scalar and a vector. |
| | | Compares the vector source operand with a scalar element-wise and takes the maximum. |
| | | Compares the vector source operand with a scalar element-wise and takes the minimum. |
| | | Performs a logical left shift on the source operand element-wise; the shift distance is given by the scalar argument. |
| | | Performs a right shift on the source operand element-wise; the shift distance is given by the scalar argument. |
| | | Computes Leaky ReLU on the source operand element-wise. |
| Triple-Operand Scalar Instructions | | Multiplies each element of the source operand (srcLocal) by a scalar and adds the product to the corresponding element of the destination operand (dstLocal). |
| Comparison Instructions | | Compares two tensors element by element. If the comparison result is true, the corresponding bit of the output is 1; otherwise, it is 0. |
| | | Compares two tensors element by element. If the comparison result is true, the corresponding bit of the output is 1; otherwise, it is 0. This API can be used when the mask parameter is required; the result is stored in a register. |
| | | Compares each element of a tensor with a scalar. If the comparison result is true, the corresponding bit of the output is 1; otherwise, it is 0. |
| Selection Instructions | | Selects src0 or src1 based on each bit of selMask (the selection mask) to produce the destination operand dst: where the bit is 1, src0 is selected; where it is 0, src1 is selected. |
| | | Selects elements from the source operand into the destination operand based on a gather mask, which is either a built-in fixed pattern or the binary of a user-defined input tensor. |
| Precision Conversion Instructions | | Converts precision based on the data types of the source and destination tensors. |
| | | Quantizes the input and converts its precision. |
| Reduction Instructions | | Obtains the maximum value of the input data and its index. |
| | | Obtains the minimum value of the input data and its index. |
| | | Sums all input data. |
| | | Computes the maximum value and its index within each repeat. |
| | | Computes the minimum value and its index within each repeat. |
| | | Sums all data within each repeat. |
| | | Computes the maximum of all elements in each data block. |
| | | Computes the minimum of all elements in each data block. |
| | | Sums all elements in each data block; source operands are added in binary-tree order. |
| | | Sums pairs of adjacent (odd and even) elements. |
| | | Sums all data within each repeat. Unlike WholeReduceSum, it does not support the bitwise mask mode; WholeReduceSum is recommended for its more comprehensive functionality. |
| Data Conversion | | Transposes 16 x 16 2D matrix data blocks and converts between [N,C,H,W] and [N,H,W,C]. |
| | | Converts the NCHW format to NC1HWC0. It can also transpose a 2D matrix data block. |
| Data Padding | | Copies a variable or an immediate multiple times to fill the vector. |
| | | Extracts eight elements from the input tensor at a time and fills them into eight data blocks (32 bytes each) of the result tensor, one element per data block. |
| | | Creates a vector index starting from firstValue. |
| Data Scatter/Data Gather | | Gathers elements of the input tensor into the result tensor based on the provided offset address tensor. |
| Mask Operations | | Sets the mask to counter mode. In this mode you do not need to track the number of iterations or handle unaligned tail blocks; you pass in the amount of data to be computed, and the Vector unit infers the actual number of iterations. |
| | | Sets the mask to normal mode (the default). In this mode you configure the number of iterations. |
| | | Sets the mask during Vector computation. |
| | | Restores the mask to the default value (all 1s), meaning all elements in each iteration participate in the Vector computation. |
| Quantization Settings | | Sets the value of the DEQSCALE register. |
| API | Function |
|---|---|
| | Performs data movement, including common data movement, enhanced data movement, tiled data movement, and the associated format conversion. |
| | Performs movement between VECIN, VECCALC, and VECOUT; supports the mask operation and the data-block interval operation. |
| API | Function |
|---|---|
| | Manages and allocates resources such as global memory. |
| | Obtains the TPipe pointer for the global memory managed by the framework; after obtaining the pointer, you can perform TPipe-related operations. |
| | Manually manages or reuses the Unified Buffer/L1 Buffer physical memory; mainly used when that memory is insufficient in multi-stage computing. |
| | Performs EnQue and DeQue operations, and implements inter-task communication and synchronization through queues. |
| | Binds the source and destination logical positions to determine where memory is allocated and to insert the corresponding synchronization events, handling memory allocation, management, and synchronization. |
| | Manages the memory occupied by temporary variables used during Ascend C programming. |
| | Initializes the SPM buffer. |
| | Copies data to be spilled to the SPM buffer for temporary storage. |
| | Reads data from the SPM buffer back to local memory. |
| | Obtains the workspace pointer used by the user. |
| | Sets the pointer to the system workspace, which is used by the framework's communication mechanism during fused-operator programming. |
| | Obtains the pointer to the system workspace. |
| | Provides synchronization control. |
| | Synchronizes the AI Cores to avoid data-dependency hazards (write-after-read, read-after-write, and write-after-write) when different AI Cores operate on the same global memory block. |
| | Synchronizes the AI Cores to avoid data-dependency hazards (write-after-read, read-after-write, and write-after-write) when different AI Cores operate on the same global memory block. |
| | Synchronizes the AI Cores to avoid data-dependency hazards (write-after-read, read-after-write, and write-after-write) when different AI Cores operate on the same global memory block. |
| | Initializes the value of the GM shared memory. WaitPreBlock and NotifyNextBlock can be called only after initialization is complete. |
| | Reads the value at the GM address to determine whether to keep waiting; when the value meets the current core's wait condition, the core proceeds to the next operation. |
| | Writes to the GM address to notify the next core that the current core has finished and the next core can proceed. |
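The EnQue/DeQue pattern above hands buffers between pipeline stages in FIFO order. The toy single-threaded model below shows only that ordering contract; on the real hardware the queue also inserts synchronization events between pipelines, and the type name here is invented:

```cpp
#include <queue>
#include <utility>
#include <vector>

// A minimal FIFO handoff between a producer stage (e.g. data-in) and a
// consumer stage (e.g. compute): buffers come out in the order they went in.
struct ToyQueue {
    std::queue<std::vector<float>> q;
    void EnQue(std::vector<float> buf) { q.push(std::move(buf)); }
    std::vector<float> DeQue() {
        std::vector<float> buf = std::move(q.front());
        q.pop();
        return buf;
    }
};
```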
| API | Function |
|---|---|
| | Preloads data into the data cache from the DDR address where the source address resides. |
| | Flushes the cache to ensure cache coherence. |
| API | Function |
|---|---|
| | Obtains the number of blocks configured for the current task, used for multi-core logic control in the code. |
| | Obtains the index of the current core, used for multi-core logic control and multi-core offset computation in the code. |
| | Obtains the size (in bytes) of a data block for the current chip version. From the data block size you can compute parameters such as repeatTimes, dataBlockStride, and repeatStride to be passed to the API instructions. |
| | Obtains the version number of the current AI processor architecture. |
| | Obtains the ratio of AICs to AIVs; applies to the separated architecture. |
| API | Function |
|---|---|
| | Sets whether to perform atomic addition for data transfers from VECOUT to GM, from L0C to GM, or from L1 to GM. The addition data type can be set through parameters. |
| | Sets the atomic-operation data type using template parameters. |
| | Clears the atomic-operation status. |
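Atomic addition matters when several cores accumulate into the same GM destination: each update must be an indivisible read-modify-write. The sketch below illustrates the same semantics on the CPU with `std::atomic`; it is not Ascend C code, and the function name is invented:

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Several workers each add into one shared accumulator. With fetch_add the
// final value is exact regardless of interleaving; with a plain += it could
// lose updates.
long AtomicAccumulate(int workers, int addsPerWorker) {
    std::atomic<long> sum{0};
    std::vector<std::thread> pool;
    for (int w = 0; w < workers; ++w)
        pool.emplace_back([&] {
            for (int i = 0; i < addsPerWorker; ++i)
                sum.fetch_add(1, std::memory_order_relaxed);
        });
    for (auto& t : pool) t.join();
    return sum.load();
}
```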
| API | Function |
|---|---|
| | Obtains the tiling information passed to the operator's kernel entry function and fills it into the registered tiling structure. This function is compiled in macro expansion mode. If multiple TilingData structures are registered, this API returns the default registered structure. |
| | Obtains the tiling information for a named structure and fills it into the corresponding tiling structure. This function is compiled in macro expansion mode. |
| | Obtains the member variables of a tiling structure. |
| | Checks whether the tiling_key of the current kernel execution equals a specific key, so as to identify the kernel branch with tiling_key == key. |
| | Registers the default user-defined TilingData structure (standard C++ syntax) on the kernel side. |
| | Registers a custom TilingData structure matched by TilingKey on the kernel side. This API requires a logical expression in which the string TILING_KEY_VAR stands for the actual TilingKey and the range it must satisfy. |
| | Sets the global default kernel type, applied to all tiling keys. |
| | Sets the kernel type for a specific tiling key. |
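To make the host/kernel split concrete: the host-side tiling function fills a TilingData structure, and the kernel reads it back through the GET_TILING_DATA-style macros. The sketch below uses an invented, hypothetical structure (real structures are declared through the registration macros, not as plain structs):

```cpp
#include <cstdint>

// Hypothetical tiling layout: total element count plus the per-core tile
// size. Field names are illustrative only.
struct ToyTilingData {
    uint32_t totalLength; // number of elements to process
    uint32_t tilePerCore; // elements handled by each core (ceil split)
};

// Host-side sketch: split totalLength evenly across blockDim cores,
// rounding up so the last core handles the remainder.
ToyTilingData ComputeTiling(uint32_t totalLength, uint32_t blockDim) {
    ToyTilingData t{};
    t.totalLength = totalLength;
    t.tilePerCore = (totalLength + blockDim - 1) / blockDim;
    return t;
}
```

On the kernel side, each core would then combine its block index with `tilePerCore` to compute its GM offset.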
| Category | API | Function |
|---|---|---|
| Vector Computation | | Pads the source operand by data block based on padMode and padSide. |
| | | Performs bilinear interpolation, including vertical iteration and horizontal iteration. |
| | | Obtains the comparison result of the Compare (result stored in a register) instruction. |
| | | Sets the comparison register for Select APIs that do not specify the mask parameter. |
| | | Obtains the computation result of ReduceSum (over the first n tensor elements). |
| | | Obtains the maximum/minimum values and their indexes when ReduceMax and ReduceMin are called consecutively. |
| | | Inserts consecutive elements into the corresponding positions in Region Proposals; each iteration inserts 16 consecutive elements into 16 Region Proposals. |
| | | Extracts elements from the corresponding positions in Region Proposals and arranges them consecutively; each iteration extracts 16 elements from 16 Region Proposals. The inverse of ProposalConcat. |
| | | Sorts Region Proposals by their score fields in descending order, 16 Region Proposals per iteration. |
| | | Merges up to four sorted Region Proposal lists into one, sorted in descending order of the score fields. |
| | | Sorting function that sorts up to 32 elements per iteration. |
| | | Merges up to four sorted lists into one, sorted in descending order of the score fields. |
| | | Obtains the number of region proposals processed by MrgSort or MrgSort4 and stores the counts in the four List arguments in sequence. |
| | | Gathers the input tensor into the result tensor based on the provided offset address tensor. |
| | | Distributes a contiguous input tensor into a new result tensor based on a destination address-offset tensor and a base offset. |
| Cube Computation | | Initializes a LocalTensor (TPosition: A1, A2, B1, or B2) to a specific value. |
| | | Provides the Load2D and Load3D data loading functions. |
| | | Loads 2D data with transposition from A1/B1 to A2/B2. |
| | | Sets AI preprocessing (AIPP) parameters for images. |
| | | Moves image data from GM to A1/B1. During the move, images can be preprocessed: flipping, resizing (clipping, cropping, scaling, and stretching), color space conversion (CSC), and type conversion. |
| | | Loads the compression index table in GM into internal registers. |
| | | Decompresses data in GM and moves it to A1, B1, or B2. |
| | | Moves a 512-byte dense weight matrix from B1 to B2 and reads the 128-byte index matrix used to sparsify the dense matrix. |
| | | Sets the feature-map attribute description used when Load3Dv1/Load3Dv2 is called. |
| | | Sets the A1/B1 boundary value used when Load3D is called. |
| | | Sets the repeat parameter of the Load3Dv2 API; once set, a single Load3Dv2 call completes the data movement of multiple iterations. |
| | | Sets padValue for Load3Dv1/Load3Dv2. |
| | | Performs matrix multiply-add. |
| | | Performs matrix multiply-add where the input left matrix A is sparse and the input right matrix B is dense. |
| | | Processes the result after matrix computation is complete, for example quantizing the result and moving the data from CO1 to global memory. |
| | | Sets the source operands of ReLU and quant. In the Fixpipe process, ReLU (FixpipeParams.reluEn set to true) and quant (FixpipeParams.QuantParams set to a value other than NoQuant) computations may be involved. |
| | | Configures FixpipeNz2nd for the nz2nd step of the Fixpipe process (FixpipeParams.Nz2NdParams.nz2ndEn set to true). |
| | | Sets the deq scalar (quantization parameter) for the quantization step of Fixpipe. |
| | | Sets the maximum value of the ClipReLU operation after real-time quantization during DataCopy (CO1 -> GM). |
| | | Sets the LocalTensor address for the element-wise operation after real-time quantization during DataCopy (CO1 -> GM). |
| | | Sets the HF32 mode of the MMAD (a register-setting API similar to SetHF32TransMode and SetMMLayoutTransform). |
| | | Sets the HF32 rounding mode of the MMAD; valid only when the HF32 mode of the MMAD is in effect (similar to SetHF32Mode and SetMMLayoutTransform). |
| | | Sets the M/N direction of the MMAD (similar to SetHF32Mode and SetHF32TransMode). |
| | | Monitors UB read and write operations within a specified range; if such an access is detected, an EXCEPTION error is reported, otherwise no error is reported. |
| | | Performs 2D convolution on an input tensor and a weight tensor and outputs a result tensor. Conv2d is mostly used in image recognition, where a filter extracts features from an image. |
| | | Multiplies matrix A by matrix B to obtain and output matrix C. |
| Data Movement | | Enables non-aligned data movement. |
| | | Sets the value used for filling by DataCopyPad. |
| Synchronization Control | | Synchronizes different pipelines within the same core; inserted between instructions of different pipelines that have data dependencies. |
| | | Blocks a pipeline; inserted between instructions of the same pipeline that have data dependencies. |
| | | Blocks subsequent instructions until all previous memory access instructions complete (the memory location to wait for is controlled by parameters). |
| | | Sets synchronization between AICs and AIVs in the separated architecture. |
| | | Waits for synchronization between AICs and AIVs in the separated architecture. |
| Cache Processing | | Preloads instructions into the iCache from the DDR address where they reside. |
| | | Obtains the preload status of the iCache. |
| System Variable Access | | Obtains the pointer to the program counter, which records the current execution position. |
| | | Obtains the number of Vector cores on the AI Core. |
| | | Obtains the ID of the Vector core on the AI Core. |
| | | Obtains the current system cycle count. To convert cycles to time (in μs), use a frequency of 50 MHz: Time = (cycles / 50) μs. |
| Atomic Operations | | Sets whether subsequent data transfers from VECOUT to GM perform an atomic comparison: the content to be copied is compared with the existing GM content and the maximum is written to GM. |
| | | Sets whether subsequent data transfers from VECOUT to GM perform an atomic comparison: the content to be copied is compared with the existing GM content and the minimum is written to GM. |
| | | Sets the atomic-operation enable flag and type. |
| | | Obtains the atomic-operation enable flag and type. |
| Resource Management | | Controls communication between the AIC and the AIV through software synchronization in the separated architecture, to group AI Core compute resources. |
| | | Controls synchronization when two AIV tasks in the same CubeResGroupHandle object depend on each other. |
| | | Serves as the communication-workspace descriptor that manages the division of the message communication area among CubeResGroupHandle objects; used together with CubeResGroupHandle. The KfcWorkspace constructor creates a KfcWorkspace object. |
Kernel API - High-Level APIs
| API | Function |
|---|---|
| | Computes arc cosine element-wise. |
| | Computes inverse hyperbolic cosine element-wise. |
| | Computes arc sine element-wise. |
| | Computes inverse hyperbolic sine element-wise. |
| | Computes arc tangent element-wise. |
| | Computes inverse hyperbolic tangent element-wise. |
| | Multiplies each element of the source operand by a scalar and adds the product to the corresponding element of the destination operand. |
| | Obtains the smallest integer greater than or equal to x, that is, rounds towards positive infinity. |
| | Replaces numbers greater than a scalar in srcTensor with the scalar and keeps numbers less than or equal to the scalar, producing the dstTensor output. |
| | Replaces numbers less than a scalar in srcTensor with the scalar and keeps numbers greater than or equal to the scalar, producing the dstTensor output. |
| | Computes cosine element-wise. |
| | Computes hyperbolic cosine element-wise. |
| | Accumulates data by row or column. |
| | Computes the logarithmic derivative of the gamma function of x element-wise. |
| | Computes the error function (Gaussian error function) element-wise. |
| | Returns the complementary error function of input x; the integral ranges from x to infinity. |
| | Computes the natural exponent element-wise. |
| | Obtains the largest integer less than or equal to x, that is, rounds towards negative infinity. |
| | Computes the remainder of two floating-point numbers element-wise. |
| | Returns the fractional part element-wise. |
| | Computes the natural logarithm of the absolute value of the gamma function of x element-wise. |
| | Computes logarithms with base e, 2, or 10 element-wise. |
| | Computes exponentiation element-wise. |
| | Rounds each input element to the nearest integer. |
| | Performs the Sign operation element-wise, returning the sign of the input data. |
| | Computes sine element-wise. |
| | Computes hyperbolic sine element-wise. |
| | Computes tangent element-wise. |
| | Computes hyperbolic tangent element-wise. |
| | Truncates floating-point numbers element-wise, that is, rounds towards zero. |
| | Performs the XOR operation element-wise. |
| API | Function |
|---|---|
| | Performs fake quantization by element, for example converting the int8_t data type to the half type. |
| | Performs dequantization by element, for example dequantizing the int32_t data type to the half/float data type. |
| | Performs quantization by element, for example quantizing the half/float data type to the int8_t data type. |
| API | Function |
|---|---|
| | Normalizes each input feature along the batch dimension over the samples in a batch. |
| | Serves as a replacement for LayerNorm normalization during the training of a deep neural network. |
| | Divides the input C dimension into groupNum groups and standardizes each group of data. |
| | Normalizes the input data of network layers to the [0, 1] range to standardize the distributions of input and output data across network layers. |
| | Computes the backpropagation gradient of LayerNorm. |
| | Obtains the backward beta/gamma values; used together with LayerNormGrad to output pdx, gamma, and beta. |
| | Computes, from the known mean and variance in LayerNorm, the reciprocal rstd of the standard deviation of input data with shape [A, R] and the normalized output y. |
| | Normalizes input data of shape [B, S, H] using RmsNorm. |
| | Implements the preprocessing step of the Welford algorithm. |
| | Implements the postprocessing step of the Welford algorithm. |
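The Welford pre/postprocessing pair refers to the classical one-pass algorithm for running mean and variance, which the normalization APIs apply tile by tile. A scalar sketch of that algorithm (not the Ascend C API itself):

```cpp
#include <cstddef>

// Welford's one-pass running mean/variance.
struct Welford {
    std::size_t n = 0;
    double mean = 0.0, m2 = 0.0;
    // Preprocessing step: fold one sample into the running statistics.
    void Update(double x) {
        ++n;
        double delta = x - mean;
        mean += delta / static_cast<double>(n);
        m2 += delta * (x - mean);
    }
    // Postprocessing step: finalize the (population) variance.
    double Variance() const {
        return n > 0 ? m2 / static_cast<double>(n) : 0.0;
    }
};
```

Because each update only needs the previous running statistics, partial results from tiles can be combined without a second pass over the data.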
| API | Function |
|---|---|
| | Performs postprocessing on SoftMax compute results, adjusting them to specified values. |
| | Implements the simplified FastGelu activation function. |
| | Implements the FastGeluV2 activation function. |
| | Serves as a GLU variant that uses GeLU as the activation function. |
| | Serves as an activation function inspired by ReLU and dropout that introduces the idea of stochastic regularization into activation. |
| | Performs LogSoftmax computation on the input tensor. |
| | Serves as a GLU variant that uses ReLU as the activation function. |
| | Applies the Sigmoid (logistic) function element-wise. |
| | Computes Silu element-wise. |
| | Uses precomputed sum and max data to perform softmax computation on the input tensor. |
| | Performs softmax computation on input tensors by row. |
| | Serves as an enhanced version of SoftMax: performs softmaxflash computation on the input tensor and updates the current softmax result using the sum and max values obtained in the previous softmax computation. |
| | Serves as an enhanced version of SoftmaxFlash, corresponding to the FlashAttention-2 algorithm. |
| | Performs gradient backpropagation on input tensors. |
| | Performs gradient backpropagation on input tensors. |
| | Serves as a GLU variant that uses Swish as the activation function. |
| | Serves as the Swish activation function used in neural networks. |
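The flash-style softmax entries above process a row in chunks, keeping a running max and a running sum that is rescaled whenever a larger max appears. The sketch below shows that update rule in plain C++ (a semantic reference under the stated chunking assumption, not the Ascend C API):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

// Online softmax: one pass over chunks maintaining (runMax, runSum), then a
// final normalization pass. Equivalent to the direct two-pass softmax.
std::vector<double> OnlineSoftmax(const std::vector<double>& x,
                                  std::size_t chunk) {
    double runMax = -INFINITY, runSum = 0.0;
    for (std::size_t i = 0; i < x.size(); i += chunk) {
        std::size_t end = std::min(x.size(), i + chunk);
        double blockMax = runMax;
        for (std::size_t j = i; j < end; ++j)
            blockMax = std::max(blockMax, x[j]);
        runSum *= std::exp(runMax - blockMax);  // rescale the previous sum
        for (std::size_t j = i; j < end; ++j)
            runSum += std::exp(x[j] - blockMax);
        runMax = blockMax;
    }
    std::vector<double> y(x.size());
    for (std::size_t j = 0; j < x.size(); ++j)
        y[j] = std::exp(x[j] - runMax) / runSum;
    return y;
}
```

The rescaling step is what lets the sum and max from a previous call be reused and corrected, which is the behavior the SoftmaxFlash-style APIs expose.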
| API | Function |
|---|---|
| | Computes the mean of elements along the last axis. |
| | Performs the XOR (bitwise XOR) operation by element and sums the results using ReduceSum. |
| | Obtains the sum of elements along the last dimension. |
| API | Function |
|---|---|
| | Obtains the first k maximum or minimum values of the last dimension and their corresponding indexes. |
| | Preprocesses the data by merging the source operand srcLocal to be sorted into the target data concatLocal; after preprocessing, the data can be sorted. |
| | Processes the sorting result and outputs the sorted values and indexes. |
| | Sorts data in descending order by value. |
| | Merges up to four sorted lists into one, sorted in descending order of the score fields. |
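The merge step above combines lists that are each already sorted in descending order. The standard-library sketch below shows the same operation (chained two-way merges; not the Ascend C API, which merges up to four lists in hardware):

```cpp
#include <algorithm>
#include <functional>
#include <iterator>
#include <vector>

// Merge any number of descending-sorted lists into one descending list by
// repeatedly applying a two-way std::merge with std::greater.
std::vector<float> MergeDescending(
        const std::vector<std::vector<float>>& lists) {
    std::vector<float> out;
    for (const auto& l : lists) {
        std::vector<float> next;
        std::merge(out.begin(), out.end(), l.begin(), l.end(),
                   std::back_inserter(next), std::greater<float>());
        out = std::move(next);
    }
    return out;
}
```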
| API | Function |
|---|---|
| | Broadcasts the input based on the output shape. |
| | Pads a two-dimensional tensor (height x width) to 32-byte alignment in the width direction. |
| | Unpads a two-dimensional tensor (height x width) in the width direction. |
| API | Function |
|---|---|
| | Filters the source operand based on a mask tensor to obtain the destination operand. |
| API | Function |
|---|---|
| | Given two source operands src0 and src1, selects elements based on the values (not bits) at the corresponding positions of maskTensor to obtain the destination operand dst. |
| API | Function |
|---|---|
| | Performs data format and reshape operations on the input data. |
| API | Function |
|---|---|
| | Returns an arithmetic progression given the start value, the common difference, and the length. |
| API | Function |
|---|---|
| | Performs matrix multiplications. |
| API | Function |
|---|---|
| | Flexibly orchestrates collective communication tasks on the AI Core. |
| API | Function |
|---|---|
| | Initializes data in global memory to a specified value. |
Host API
Operator Debugging API
| API | Function |
|---|---|
| | Dumps the content of specified tensors, for operators developed based on operator projects. |
| | Implements formatted output in CPU- or NPU-side debugging, for operators developed based on operator projects. |
| | Implements the assert function on the CPU/NPU, for operators developed based on operator projects. |
| | Dumps the content of specified tensors at a specified offset position, for operators developed based on operator projects. |
| | Stops the kernel when a software exception occurs. |
| | Creates shared memory during CPU-side verification of the kernel function: creates a shared file in the /tmp directory and returns the pointer mapped to the file. |
| | Serves as the CPU debugging entry point and invokes the CPU operator programs during CPU-side verification of the kernel function. |
| | Specifies the tilingKey used for the current CPU debugging; during debugging, only the branch corresponding to that tilingKey in the operator kernel function is executed. |
| | Frees the shared memory allocated by GmAlloc during CPU-side verification of the kernel function. |
| | Sets the kernel mode to single-AIV, single-AIC, or MIX mode, to enable CPU debugging of single-AIV (vector) operators, single-AIC (cube) operators, or MIX operators, respectively. |
| | Inserts a trace marker at the start point in any running phase of the operator when the CAModel is used for performance simulation, to analyze the pipeline diagrams of different instructions for performance tuning. Used together with TRACE_STOP. |
| | Inserts the matching trace marker at the end point when the CAModel is used for performance simulation. Used together with TRACE_START. |
| | Starts profile data collection. Used together with MetricsProfStop: when using msProf for on-board operator tuning, call MetricsProfStart and MetricsProfStop before and after the kernel code segment to delimit the scope to be tuned. |
| | Stops profile data collection. Used together with MetricsProfStart. |