Glossary

**Table 1** Glossary
Term/Acronym/Abbreviation	Definition
A1	AscendC::TPosition::A1 represents the logical memory used for cube computation on the device. It stores the left cube, and its corresponding physical memory is L1 Buffer of the AI Core.
A2	AscendC::TPosition::A2 represents the logical memory used for cube computation on the device. It stores left cube blocks (such as blocks that are split and adapted to the L0A Buffer capacity). The corresponding physical memory is L0A Buffer of the AI Core.
AI Core	A compute core of an Ascend AI Processor, which executes tasks with intensive cube and vector compute.
AIC	In the separation mode of the AI Core, a Cube Core in a group of Cube Cores and Vector Cores.
AIV	In the separation mode of the AI Core, a Vector Core in a group of Cube Cores and Vector Cores.
Ascend IR	Short for Ascend Intermediate Representation. It is an abstract data structure dedicated to an Ascend AI Processor and is used to represent the computation process. Unless otherwise specified, IR refers to Ascend IR by default.
B1	AscendC::TPosition::B1 represents the logical memory used for cube computation on the device. It stores the right cube, and its corresponding physical memory is L1 Buffer of the AI Core.
B2	AscendC::TPosition::B2 represents the logical memory used for cube computation on the device. It stores right cube blocks (such as blocks that are split and adapted to the L0B Buffer capacity). The corresponding physical memory is L0B Buffer of the AI Core.
Block	A block has different meanings in different scenarios. Generally, it refers to a logical core of the AI Core. Typical meanings are as follows: (1) Logical core of the AI Core: a block indicates a logical core of the AI Core, and its ID is a logical number starting from 0. (2) DataBlock: a data unit that can be processed by an NPU vector computing instruction. Its size is usually 32 bytes, and one instruction can process multiple data blocks concurrently. (3) Base block: the size of a typical data block required for one computation.
BlockID	A logical number starting from 0 used to represent an AI Core, which can be larger than the actual number of hardware cores.
BlockDim	The number of logical AI Cores used in computation, which is specified by you when a kernel function is called. Its value is generally equal to or greater than the actual number of physical cores.
BiasTable Buffer	BiasTable Buffer is a physical storage unit in the AI Core and is used to store bias data required for cube computation. It corresponds to the logical memory AscendC::TPosition::C2.
Broadcast	Broadcast is a tensor operation mechanism. Through broadcast, a smaller tensor can be automatically expanded to match the shape of a larger tensor.
C1	AscendC::TPosition::C1 represents the logical memory used for cube computation on the device. It stores the bias data, and its corresponding physical memory is L1 Buffer or Unified Buffer of the AI Core.
C2	AscendC::TPosition::C2 represents the logical memory used for cube computation on the device. It stores block-wise bias data (such as blocks that are split and adapted to the BT Buffer capacity). The corresponding physical memory is BT Buffer or L0C Buffer of the AI Core.
C2PIPE2GM	AscendC::TPosition::C2PIPE2GM represents the logical memory used for cube computation on the device. It stores quantization parameters, and its corresponding physical memory is Fixpipe Buffer of the AI Core.
Cache line	Smallest unit of data in cache (such as DCache, ICache, and L2 cache).
Core	A compute core with an independent Scalar Unit. The Scalar Unit is responsible for functions such as instruction transmission within the core and is also referred to as the scheduling unit within the core.
CO1	AscendC::TPosition::CO1 represents the logical memory used for cube computation on the device. It stores the block-wise cube computation result (such as cube computation result of blocks that are split). The corresponding physical memory is L0C Buffer of the AI Core.
CO2	AscendC::TPosition::CO2 represents the logical memory used for cube computation on the device. It stores the cube computation result (such as the final computation result of the original cube). The corresponding physical memory is the global memory or the Unified Buffer of the AI Core.
Compute	One of the three typical stages in the Ascend C operator programming paradigm, which is responsible for executing computation tasks.
CopyIn	One of the three typical stages in the Ascend C operator programming paradigm, which is responsible for moving data to be computed from the global memory to the local memory.
CopyOut	One of the three typical stages in the Ascend C operator programming paradigm, which is responsible for moving computation results from the local memory to the global memory.
Core ID	A physical number used to represent an AI Core, which corresponds one-to-one with the actual hardware cores.
Cube	Cube Unit on the AI Core, which executes cube operations. Taking the float16 data type as an example, the Cube Unit can multiply two 16×16 float16 matrices in each execution.
Cube Core	A core dedicated to cube computation. It consists of the Scalar Unit, Cube Unit, and DMA Unit, but does not include the Vector Unit.
DataBlock	A data block is a data unit processed by a vector computing instruction. Its size is usually 32 bytes. A single instruction can process multiple data blocks concurrently.
DataBlock Stride	The stride of data blocks in a single repeat of a vector computing instruction, specifying the number of data blocks between the starting address of the next processing and that of the current processing.
DCache	Data cache. It is used to cache the data segments that may be repeatedly accessed by the Scalar Unit in the near term, thereby improving access efficiency.
Device	A device refers to the hardware device equipped with an Ascend AI processor. It connects to the host over the PCIe interface and provides the neural network (NN) compute capability for the host. Memory sharing among devices is not supported.
DMA	Short for direct memory access. The DMA Unit, which includes MTE2 and MTE3, moves data between the global memory and local memory and between local memories at different levels.
DoubleBuffer/DB	A common optimization method in the parallel computing field. It improves the parallelism of data processing by creating multiple buffers that hold data.
Elementwise	An element-wise operation is an operation performed independently on each element of a tensor, with each result depending solely on its corresponding input element.
Fixpipe	A unit in the AI Core, which moves the cube computation results from L0C Buffer to the global memory or L1 Buffer. During the movement, operations such as quantization and activation are performed.
Fixpipe Buffer	A physical storage unit in the AI Core, which is used to store data such as quantization parameters required during data movement by Fixpipe. It corresponds to the logical memory AscendC::TPosition::C2PIPE2GM.
Global memory/GM	Main memory of the device, which is the external storage of the AI Core. It is used to store large-scale data. However, the access approach needs to be optimized to improve performance.
GlobalTensor	A tensor that stores global data in the global memory.
Host	A host refers to the x86 or Arm server connected to the device. The host utilizes the NN compute capabilities provided by the device to implement services.
ICache	Instruction cache. It is used to cache recently or frequently used instructions. For ultimate performance optimization, it is necessary to reduce ICache misses.
InferShape	Operator shape inference, which is used only in GE graph mode. During the actual network model generation process, tensor shapes and data types are inferred first. In this way, the data type and shape of each tensor can be known before graph execution, enabling early validation of correctness. In addition, the output tensor description of the operator is inferred in advance, including the tensor shape, data type, and data format. In the preparation phase of operator graph construction, memory can be statically allocated to all tensors to avoid overhead caused by dynamic memory allocation.
Kernel	Kernel function. It is a function executed in parallel on the device and is marked with the __global__ qualifier. Multiple cores execute the same kernel function in parallel. The main difference is that each instance runs with a different block ID.
Kernel launch	The process of submitting a kernel program to the hardware for execution.
L0A Buffer	A physical storage unit in the AI Core, which is used to store the left cube for cube computation. It corresponds to the logical memory AscendC::TPosition::A2.
L0B Buffer	A physical storage unit in the AI Core, which is used to store the right cube for cube computation. It corresponds to the logical memory AscendC::TPosition::B2.
L0C Buffer	A physical storage unit in the AI Core, which is used to store the results of cube computation. It corresponds to the logical memory AscendC::TPosition::CO1.
L1 Buffer	A physical storage unit in the AI Core, which has a relatively large space and is usually used to cache the input data for cube computation. Generally, the input data for cube computation needs to be moved from the GM to L1 Buffer, and then to L0A Buffer and L0B Buffer. L1 Buffer corresponds to the logical memories AscendC::TPosition::A1 and AscendC::TPosition::B1.
L2 cache	A level-2 cache, which is used to store frequently accessed data to reduce the read/write operations on the global memory.
LCM	Short for local cache memory. AscendC::TPosition::LCM represents the temporarily shared Unified Buffer space, which implements the same function as VECCALC.
Local memory	Internal storage of the AI Core, including storage units such as L1 Buffer, L0A Buffer, L0B Buffer, L0C Buffer, and Unified Buffer.
LocalTensor	A tensor that stores the local data in the local memory of the AI Core.
Mask	It is used to control which elements participate in computation in each repeat of a vector computing instruction. It can be set in either contiguous mode or bitwise mode.
MTE1	Short for memory transfer engine 1, which is a data transfer engine of the AI Core. It transfers data from L1 Buffer to L0A Buffer, L0B Buffer, or other storage units. Note: The hardware capabilities may vary.
MTE2	Short for memory transfer engine 2, which is a data transfer engine of the AI Core. It transfers data from the GM to L1 Buffer, L0A Buffer, L0B Buffer, Unified Buffer, or other storage units. Note: The hardware capabilities may vary.
MTE3	Short for memory transfer engine 3, which is a data transfer engine of the AI Core. It transfers data from the Unified Buffer to the global memory, L1 Buffer, or other storage units. Note: The hardware capabilities may vary.
NC1HWC0	A 5D data format, where C0 is closely related to the hardware architecture. This format can improve the computational efficiency of matrix multiplication.
NCHW	Feature map data stored in the layout of [Batch, Channels, Height, Width].
ND	A common format representing an N-dimensional tensor.
NHWC	Feature map data stored in the layout of [Batch, Height, Width, Channels].
NPU	Short for neural-network processing unit. It uses the data-driven parallel computing architecture and is dedicated to processing a large number of computing tasks in artificial intelligence (AI) applications.
OP	An operator (OP) is the fundamental unit for executing specific mathematical computations or operations in deep learning algorithms. Common examples include activation functions (such as ReLU), convolution (Conv), pooling, and normalization (such as Softmax). These operators can be combined to build neural network models.
OpType	A general term for operators of the same type. For example, a network may contain multiple Add operators, named Add1 and Add2 respectively. However, their type is simply Add.
Pipe	A core concept in the Ascend C programming paradigm, which is used to manage resources such as device memory in a unified manner. One kernel function can initialize only one pipe object.
Preload	Before a computing task starts, the necessary instructions or data are preloaded to the cache to reduce instruction or data access latency and improve computational efficiency.
Reduce	Dimension reduction operation, which is used to reduce the dimensions of a multi-dimensional tensor. Common dimension reduction operations include summation, averaging, and computing maximum and minimum values.
Repeat	Each time a vector computing instruction is executed, eight data blocks are read for computation. This is called a repeat. In most cases, multiple repeats are required to complete the reading and computation of all data.
Repeat stride	Number of data blocks between the starting data address of the next repeat and that of the current repeat when the vector computing instruction is executed iteratively.
Repeat times	Number of times that a vector computing instruction is executed iteratively.
Scalar	Scalar Unit in the AI Core. It is primarily responsible for scalar computation and issuing instructions to other units (such as the MTE, Vector Unit, and Cube Unit).
SPMD	Short for Single-Program Multiple-Data, which is a parallel programming model that executes the same program on multiple cores, with each core processing different data.
Tensor	A tensor is a container for operator computation data. It is an N-dimensional data structure, most commonly represented as a scalar, vector, or matrix. The elements of a tensor can include integer values, floating point values, or string values.
Tiling	Tiling refers to data partitioning and blocking. For large-scale data computation, the data is first split across multiple cores (multi-core tiling), and the data on each core is then partitioned into multiple blocks that are computed over multiple iterations.
TilingData	TilingData refers to the parameters related to data partitioning and blocking (such as the size of the block to be moved each time and the number of iterations). Due to the limited scalar computing capability of the device, tiling parameters are generally computed on the host and then transferred to the device for the kernel functions to use.
TilingFunc	Default function provided by an operator project for tiling computation on the host.
TilingKey	It is used to distinguish specialized implementations of different versions of a kernel function. Different tiling keys will generate different binaries.
TPosition	When managing physical memory at different levels, Ascend C uses an abstract logical position (TPosition) to express storage at each level, replacing on-chip physical storage and hiding the hardware architecture. TPosition types include VECIN, VECOUT, VECCALC, A1, A2, B1, B2, CO1, and CO2. VECIN, VECCALC, and VECOUT are used for vector programming, while A1, A2, B1, B2, CO1, and CO2 are used for cube programming.
TSCM	AscendC::TPosition::TSCM represents the logical memory corresponding to L1 Buffer. You need to manage it to efficiently utilize hardware resources. It is mainly used for Matmul computation. For example, you can cache a copy of TSCM data and flexibly configure it as cube A, cube B, or a bias cube for Matmul operations in different scenarios, thereby implementing memory reuse and optimizing computational efficiency.
Unified Buffer/UB	Internal storage unit of the AI Core, which is mainly used for vector computation and corresponds to logical memories AscendC::TPosition::VECIN, AscendC::TPosition::VECOUT, and AscendC::TPosition::VECCALC.
VECCALC	Vector calculation. AscendC::TPosition::VECCALC represents the logical memory used for vector computation on the device. It is used to store temporary variables, and the corresponding physical storage is Unified Buffer of the AI Core.
VECIN	Vector input. AscendC::TPosition::VECIN represents the logical memory used for vector computation on the device. It is used to store the input data for vector computation, and the corresponding physical storage is Unified Buffer of the AI Core.
VECOUT	Vector output. AscendC::TPosition::VECOUT represents the logical memory used for vector computation on the device. It is used to store the output data of vector computation, and the corresponding physical storage is Unified Buffer of the AI Core.
Vector	Vector Unit in the AI Core, which performs vector computation. Compared with the Cube Unit, the Vector Unit offers lower compute power but supports more flexible computations (such as the reciprocal and square root calculations in mathematics).
Vector Core	A core dedicated to vector computation. It consists of the Scalar Unit, Vector Unit, and DMA Unit, but does not include the Cube Unit.
Workspace	Generally, it refers to a pre-allocated section of global memory used temporarily to store intermediate results or temporary data.
Debugging on the CPU	A twin debugging method provided by Ascend C. This method simulates the execution and debugging of kernel functions on the device using the CPU, focusing only on operator functionality and accuracy.
Base block	Typical data block size required for one calculation.
Kernel launch mode	A simple and direct way to call a kernel. After implementing the kernel-side operator and the host-side tiling, you can use runtime APIs to launch the operator kernel. This method enables quick verification of operator functionality, and tiling development is not restricted by the CANN framework.
Debugging on the NPU	A twin debugging method provided by Ascend C, which refers to debugging based on the NPU simulation software or NPU hardware.
Tiling offload	Tiling offload refers to offloading tiling computation to the AI CPU on the device, so that the entire computation can be efficiently completed on the device.
Separation mode	A working mode of the AI Core. In this mode, the Cube Unit and Vector Unit are separately scheduled by independent Scalar Units and are deployed on the Cube Core and Vector Core, respectively. Cube Cores and Vector Cores are combined in a certain ratio (1:N), and each such combination is regarded as one AI Core. The number of AI Cores is determined by the number of Cube Cores.
Twin debugging	An operator debugging method provided by Ascend C. It supports accuracy debugging on the CPU and accuracy/performance debugging on the NPU.
Pipeline task	Ascend C adopts a pipeline programming paradigm, which divides the processing program within an operator into multiple pipeline tasks. Pipeline tasks are tasks that a main program schedules in parallel within a single-core processing program. Inside a kernel function, pipeline tasks can be used to implement parallel data processing, further improving performance.
Continuous mode	A mode that can be selected when a mask is used to control which elements participate in the vector computation in each repeat. It specifies the number of consecutive elements involved in the computation.
Coupling mode	A working mode of the AI Core. In this mode, the same Scalar Unit is used to schedule the Cube Unit and the Vector Unit simultaneously, and all units are deployed on the same AI Core.
Fused operator	A fused operator combines multiple independent small operators. It is functionally equivalent to these operators but usually delivers better performance. Vector and Cube operators can be fused based on specific algorithms to achieve performance benefits.
Integrating operators into a graph	It refers to running operators in GE graph mode. In this mode, all operators are first constructed into a graph, and then the graph is delivered to the Ascend AI Processor for execution through GE.
Operator prototype	An operator prototype is an abstract description of an operator, defining the inputs, outputs, attributes, and other information of the operator.
Merged compute and communication operator (MC2 operator)	An MC2 operator fuses collective communication and computing tasks that support pipeline parallelism, to improve performance during operator execution.
Bitwise mode	A mode that can be selected when a mask is used to control which elements participate in the vector computation in each repeat. In bitwise mode, elements are controlled by bit, where a bit value of 1 indicates participation in the computation, and a bit value of 0 indicates exclusion.
Custom operator project	An operator project generated by msOpGen and provided by Ascend C.
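
The Tiling, TilingData, BlockDim, and BlockID entries in the table can be illustrated with a minimal host-side sketch. It assumes a 1D tensor that splits evenly across logical cores and then into fixed-size tiles on each core. The names `TilingData`, `compute_tiling`, and `core_offset`, as well as the field names, are hypothetical and are not part of any Ascend C API.

```python
from dataclasses import dataclass


@dataclass
class TilingData:
    """Hypothetical tiling parameters computed on the host."""
    block_dim: int     # number of logical cores (BlockDim)
    block_length: int  # elements handled by each logical core
    tile_length: int   # elements moved and computed per iteration
    tile_num: int      # iterations needed on each logical core


def compute_tiling(total_length: int, block_dim: int, tile_length: int) -> TilingData:
    """Split total_length elements across cores, then into tiles per core.

    Assumption: both splits are even. A real operator must also handle tails.
    """
    if total_length % block_dim != 0:
        raise ValueError("sketch assumes an even split across cores")
    block_length = total_length // block_dim
    if block_length % tile_length != 0:
        raise ValueError("sketch assumes an even split into tiles")
    return TilingData(block_dim, block_length, tile_length,
                      block_length // tile_length)


def core_offset(tiling: TilingData, block_id: int) -> int:
    """Start offset in global memory for the logical core with this block ID."""
    if not 0 <= block_id < tiling.block_dim:
        raise ValueError("block ID out of range")
    return block_id * tiling.block_length
```

For example, `compute_tiling(8 * 2048, 8, 128)` gives each of the 8 logical cores 2048 elements, processed in 16 tiles of 128 elements, and `core_offset(..., 3)` places the fourth core at element 6144. This mirrors the SPMD idea that every core runs the same program and uses its block ID to select its own slice of data.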

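The DataBlock, Repeat, Repeat times, and Mask entries involve some simple arithmetic, sketched below. The sketch takes the figures stated in the glossary (a 32-byte data block and eight data blocks per repeat) as fixed constants, and it assumes that continuous mode selects elements starting from the first element of the repeat. All function names are hypothetical, and the integer bit pattern is only an illustration of the idea, not the parameter layout of any real instruction.

```python
from typing import Iterable, Tuple

DATA_BLOCK_BYTES = 32   # usual data block size, per the glossary
BLOCKS_PER_REPEAT = 8   # data blocks read in one repeat, per the glossary


def elements_per_repeat(dtype_bytes: int) -> int:
    """Number of elements covered by one repeat for a given element size."""
    return DATA_BLOCK_BYTES * BLOCKS_PER_REPEAT // dtype_bytes


def repeat_plan(num_elements: int, dtype_bytes: int) -> Tuple[int, int]:
    """Return (full repeats, leftover elements that need a masked repeat)."""
    return divmod(num_elements, elements_per_repeat(dtype_bytes))


def bitwise_mask(active: Iterable[int], dtype_bytes: int) -> int:
    """Bitwise mode: bit i set to 1 means element i takes part in the repeat."""
    limit = elements_per_repeat(dtype_bytes)
    mask = 0
    for index in active:
        if not 0 <= index < limit:
            raise ValueError("element index outside one repeat")
        mask |= 1 << index
    return mask


def continuous_as_bitwise(count: int, dtype_bytes: int) -> int:
    """Continuous mode: the first `count` elements take part (assumption)."""
    if not 0 <= count <= elements_per_repeat(dtype_bytes):
        raise ValueError("count exceeds one repeat")
    return (1 << count) - 1
```

With float16 (2 bytes per element), one repeat covers 128 elements, so 1000 elements need 7 full repeats plus one more repeat whose mask covers the remaining 104 elements.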