Glossary
Term/Acronym/Abbreviation |
Definition |
|---|---|
A1 |
AscendC::TPosition::A1 represents the logical memory used for cube computation on the device. It stores the left matrix, and its corresponding physical memory is the L1 buffer of the AI Core. |
A2 |
AscendC::TPosition::A2 represents the logical memory used for cube computation on the device. It stores left matrix blocks (such as blocks that are split and adapted to the L0A buffer capacity). The corresponding physical memory is the L0A buffer of the AI Core. |
AI Core |
A compute core of an Ascend AI Processor, which executes compute-intensive cube and vector tasks. |
AIC |
In the separation mode of the AI Core, a Cube Core within a combined group of Cube Cores and Vector Cores. |
AIV |
In the separation mode of the AI Core, a Vector Core within a combined group of Cube Cores and Vector Cores. |
Ascend IR |
Short for Ascend Intermediate Representation. It is an abstract data structure dedicated to an Ascend AI Processor and is used to represent the computation process. Unless otherwise specified, IR refers to Ascend IR by default. |
B1 |
AscendC::TPosition::B1 represents the logical memory used for cube computation on the device. It stores the right matrix, and its corresponding physical memory is the L1 buffer of the AI Core. |
B2 |
AscendC::TPosition::B2 represents the logical memory used for cube computation on the device. It stores right matrix blocks (such as blocks that are split and adapted to the L0B buffer capacity). The corresponding physical memory is the L0B buffer of the AI Core. |
Block |
A block has multiple meanings in different scenarios. Generally, it refers to the logical core of the AI Core. Typical scenarios are as follows:
|
BlockID |
A logical number starting from 0 used to represent an AI Core, which can be larger than the actual number of hardware cores. |
BlockDim |
The number of logical AI Cores used in computation, which is specified by a developer when a kernel function is called. Its value is generally equal to or greater than the actual number of physical cores. |
BiasTable Buffer |
BiasTable Buffer is a physical storage unit in the AI Core and is used to store bias data required for cube computation. It corresponds to the logical memory AscendC::TPosition::C2. |
Broadcast |
Broadcast is a tensor operation mechanism. Through broadcast, a smaller tensor can be automatically expanded to match the shape of a larger tensor. |
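
The shape matching behind broadcast can be sketched in plain Python. This is only an illustration: it assumes the widely used NumPy-style rule (align shapes from the trailing dimension; two dimensions are compatible when they are equal or one of them is 1), which the glossary itself does not spell out. The function name `broadcast_shape` is hypothetical.

```python
def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise ValueError if incompatible.

    Assumed rule (NumPy-style, not taken from the glossary): align both
    shapes from the trailing dimension; two dimensions are compatible
    when they are equal or one of them is 1.
    """
    result = []
    # Walk both shapes from the last dimension backwards.
    for i in range(1, max(len(shape_a), len(shape_b)) + 1):
        dim_a = shape_a[-i] if i <= len(shape_a) else 1  # pad missing dims with 1
        dim_b = shape_b[-i] if i <= len(shape_b) else 1
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
        # A size-1 dimension is expanded to the other operand's size.
        result.append(dim_b if dim_a == 1 else dim_a)
    return tuple(reversed(result))
```

For example, under this rule a tensor of shape `(3,)` can be expanded to match a tensor of shape `(4, 3)`, while shapes `(2, 3)` and `(4,)` are rejected.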
C1 |
AscendC::TPosition::C1 represents the logical memory used for cube computation on the device. It stores the bias data, and its corresponding physical memory is the L1 buffer or Unified Buffer of the AI Core. |
C2 |
AscendC::TPosition::C2 represents the logical memory used for cube computation on the device. It stores block-wise bias data (such as blocks that are split and adapted to the BT buffer capacity). The corresponding physical memory is the BT buffer or L0C buffer of the AI Core. |
C2PIPE2GM |
AscendC::TPosition::C2PIPE2GM represents the logical memory used for cube computation on the device. It stores quantization parameters, and its corresponding physical memory is the Fixpipe Buffer of the AI Core. |
Cache line |
Smallest unit of data in cache (such as DCache, ICache, and L2 cache). |
Core |
A compute core with an independent Scalar Unit. The Scalar Unit is responsible for functions such as instruction transmission within the core and is also referred to as the scheduling unit within the core. |
CO1 |
AscendC::TPosition::CO1 represents the logical memory used for cube computation on the device. It stores the block-wise cube computation result (such as the computation result of split blocks). The corresponding physical memory is the L0C buffer of the AI Core. |
CO2 |
AscendC::TPosition::CO2 represents the logical memory used for cube computation on the device. It stores the cube computation result (such as the final computation result of the original matrix). The corresponding physical memory is the global memory or the Unified Buffer of the AI Core. |
Compute |
One of the three typical stages in the Ascend C operator programming paradigm, which is responsible for executing computation tasks. |
CopyIn |
One of the three typical stages in the Ascend C operator programming paradigm, which is responsible for moving data to be computed from the global memory to the local memory. |
CopyOut |
One of the three typical stages in the Ascend C operator programming paradigm, which is responsible for moving computation results from the local memory to the global memory. |
Core ID |
A physical number used to represent an AI Core, which corresponds one-to-one with the actual hardware cores. |
Cube |
Cube Unit on the AI Core, which executes matrix operations. For example, with the float16 data type, the Cube Unit can multiply two 16×16 float16 matrices in each execution. |
Cube Core |
A core dedicated to cube computation. It consists of the Scalar Unit, Cube Unit, and DMA Unit, but does not include the Vector Unit. |
DataBlock |
A data block is a data unit processed by a vector computing instruction. Its size is usually 32 bytes. A single instruction can process multiple data blocks concurrently. |
DataBlock Stride |
The stride of data blocks in a single repeat of a vector computing instruction, specifying the number of data blocks between the starting address of the next processing and that of the current processing. |
DCache |
Data cache. It is used to cache the data segments that may be repeatedly accessed by the Scalar Unit in the near term, thereby improving access efficiency. |
Device |
A device refers to the hardware device equipped with an Ascend AI Processor. It connects to the host over the PCIe interface and provides the NN computing capability for the host. Memory sharing among devices is not supported. |
DMA |
Short for direct memory access. The DMA Unit, which includes MTE2 and MTE3, moves data between the global memory and local memory and between local memories at different levels. |
DoubleBuffer/DB |
A common optimization method in the parallel field. It improves the parallelism of data processing by creating multiple buffers that hold data. |
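
The benefit of double buffering can be illustrated with a small timing model in plain Python. It is a sketch under stated assumptions, none of which come from the glossary: one copy engine, one compute unit, fixed per-tile costs, a buffer that becomes reusable as soon as its tile has been computed, and no CopyOut stage. The function name `pipeline_time` is hypothetical.

```python
def pipeline_time(num_tiles, copy_time, compute_time, num_buffers):
    """Simulate when the last tile finishes computing.

    Assumptions (illustrative only): one copy engine and one compute
    unit, fixed per-tile costs, and a buffer is reusable as soon as the
    tile it held has been computed. Copy-out is ignored.
    """
    copy_end = []
    compute_end = []
    for i in range(num_tiles):
        # The copy engine is busy until the previous copy finishes.
        engine_free = copy_end[i - 1] if i >= 1 else 0
        # The target buffer is busy until the tile that last used it is computed.
        buffer_free = compute_end[i - num_buffers] if i >= num_buffers else 0
        copy_end.append(max(engine_free, buffer_free) + copy_time)
        # Compute needs this tile's data and a free compute unit.
        unit_free = compute_end[i - 1] if i >= 1 else 0
        compute_end.append(max(copy_end[i], unit_free) + compute_time)
    return compute_end[-1] if compute_end else 0
```

With `num_buffers=1` every copy waits for the previous compute, so the two stages run back to back. With `num_buffers=2` the copy of the next tile overlaps the compute of the current one, which shortens the total time in this model.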
Elementwise |
An element-wise operation is an operation performed independently on each element of a tensor. The result of each element depends only on the corresponding input element. |
Fixpipe |
A unit in the AI Core, which is responsible for moving the matrix computation result from the L0C Buffer to the global memory or L1 Buffer. During the movement, operations such as quantization and activation are performed. |
Fixpipe Buffer |
A physical storage unit in the AI Core, which is used to store data such as quantization parameters required during Fixpipe movement. It corresponds to the logical memory AscendC::TPosition::C2PIPE2GM. |
Global Memory/GM |
Main memory of the device, which is the external storage of the AI Core. It is used to store large-scale data. However, the access approach needs to be optimized to improve performance. |
GlobalTensor |
A tensor that stores global data in the global memory. |
Host |
A host refers to the x86 or Arm server connected to the device. The host utilizes the NN compute capabilities provided by the device to implement services. |
ICache |
Instruction cache, which is used to cache recently or frequently used instructions. When optimizing performance to the extreme, you need to pay attention to how to reduce instruction cache misses. |
InferShape |
Operator shape inference, which is used only in GE graph mode. During network model generation, the shape and data type of each tensor are inferred first, so that they are known before the graph is run and the correctness of each tensor can be verified in advance. The output tensor description of the operator, including the tensor shape, data type, and data format, is also inferred in advance. In the preparation phase of operator graph construction, memory can then be statically allocated to all tensors, which avoids the overhead of dynamic memory allocation. |
Kernel |
Kernel function. It is a parallel function executed on the device and is modified by __global__. Multiple cores execute the same kernel function in parallel. The main difference is that the instances running on different cores have different block IDs during execution. |
Kernel Launch |
The process of submitting a kernel program to the hardware for execution. |
L0A Buffer |
A physical storage unit inside the AI Core, which is used to store the left matrix for matrix computation. It corresponds to the logical memory AscendC::TPosition::A2. |
L0B Buffer |
A physical storage unit inside the AI Core, which is used to store the right matrix for matrix computation. It corresponds to the logical memory AscendC::TPosition::B2. |
L0C Buffer |
A physical storage unit inside the AI Core, which is used to store the result of matrix computation. It corresponds to the logical memory AscendC::TPosition::CO1. |
L1 Buffer |
A physical storage unit inside the AI Core, which has a relatively large space and is used to cache the input data for matrix computation. Generally, the input data for matrix computation needs to be moved from the GM to the L1 Buffer, and then to the L0A Buffer and L0B Buffer. The L1 Buffer corresponds to the logical memory AscendC::TPosition::A1 and AscendC::TPosition::B1. |
L2 Cache |
A level-2 cache, which is used to store frequently accessed data to reduce the read/write operations on the global memory. |
LCM |
Local cache memory. AscendC::TPosition::LCM represents the temporary shared unified buffer space, which implements the same function as VECCALC. |
Local Memory |
Internal storage of the AI Core, including storage units such as L1 Buffer, L0A Buffer, L0B Buffer, L0C Buffer and Unified Buffer. |
LocalTensor |
A tensor that stores the local data in the local memory of the AI Core. |
Mask |
It is used to control the number of elements involved in vector computation in each repeat. The elements can be set in continuous mode or bit-by-bit mode. |
MTE1 |
Memory transfer engine 1, which is the data transfer engine of the AI Core and is responsible for transferring data from the L1 Buffer to the L0A Buffer or L0B Buffer. Note: The hardware capabilities may vary. |
MTE2 |
Memory transfer engine 2, which is the data transfer engine of the AI Core and is responsible for transferring data from the GM to the L1 Buffer, L0A Buffer, L0B Buffer, or Unified Buffer. Note: The hardware capabilities may vary. |
MTE3 |
Memory transfer engine 3, which is the data transfer engine of the AI Core and is responsible for transferring data from the Unified Buffer to the Global Memory or L1 Buffer. Note: The hardware capabilities may vary. |
NC1HWC0 |
A five-dimensional data format, where C0 is closely related to the hardware architecture. This format can improve the computation efficiency of matrix multiplication. |
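
As a rough illustration of how the five dimensions relate to NCHW, the sketch below assumes that the channel axis C is split into C1 groups of C0 channels each, with C1 rounded up so that padded channels fit. The default `c0=16` is an assumption chosen for illustration (the actual value depends on the hardware and data type), and both function names are hypothetical.

```python
def nc1hwc0_shape(shape_nchw, c0=16):
    """Return the NC1HWC0 shape for an NCHW shape.

    Assumption: C is split into C1 groups of C0 channels, and C1 is
    rounded up so that a partially filled last group still fits.
    """
    n, c, h, w = shape_nchw
    c1 = (c + c0 - 1) // c0  # ceiling division
    return (n, c1, h, w, c0)


def nchw_to_nc1hwc0_index(n, c, h, w, c0=16):
    """Map an NCHW element index to its NC1HWC0 index.

    Under the same assumption, channel c lands in group c // c0 at
    position c % c0 inside that group.
    """
    return (n, c // c0, h, w, c % c0)
```

For instance, with `c0=16` a tensor with 35 channels needs three channel groups, and channel 34 is stored at position 2 of group 2.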
NCHW |
Feature map data is stored in the order of [Batch, Channels, Height, Width]. |
ND |
A common format, which is an N-dimensional tensor. |
NHWC |
Feature map data is stored in the order of [Batch, Height, Width, Channels]. |
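
The difference between the NCHW and NHWC orders shows up in how a logical index (n, c, h, w) maps to a linear memory offset. The sketch below assumes a densely packed, row-major tensor; the function names are hypothetical.

```python
def offset_nchw(n, c, h, w, C, H, W):
    """Linear offset of element (n, c, h, w) in a dense NCHW tensor.

    Width varies fastest, then height, then channels, then batch.
    """
    return ((n * C + c) * H + h) * W + w


def offset_nhwc(n, c, h, w, C, H, W):
    """Linear offset of the same element in a dense NHWC tensor.

    Channels vary fastest, then width, then height, then batch.
    """
    return ((n * H + h) * W + w) * C + c
```

In NHWC, all channels of one pixel are adjacent in memory. In NCHW, all pixels of one channel are adjacent.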
NPU |
Neural-network processing unit. It uses the data-driven parallel computing architecture and is dedicated to processing a large number of computing tasks in AI applications. |
OP |
An operator (OP) is a basic unit that performs a specific mathematical operation in a deep learning algorithm, such as activation (ReLU), convolution (Conv), pooling, or normalization (Softmax). Operators can be combined to build a neural network model. |
OpType |
A general term for a type of operators. For example, there may be multiple Add operators on a network, named Add1 and Add2 respectively. However, the OpType of these operators is Add. |
Pipe |
A core concept of the Ascend C programming paradigm, which is used to manage resources such as device memory in a unified manner. One kernel function can initialize only one pipe object. |
Preload |
Before a computing task starts, necessary instructions or data are loaded to the cache to reduce the instruction or data access latency and improve the computing efficiency. |
Reduce |
A dimension reduction operation is used to reduce the dimensions of a multi-dimensional tensor. Common dimension reduction operations include summation, averaging, maximum value calculation, and minimum value calculation. |
Repeat |
Each time a vectorized computation instruction is executed, eight data blocks are read for computation. This is called a repeat. Generally, the data needs to be read and computed multiple times in a loop. |
Repeat Stride |
Number of data blocks between the start address of the next repeat and the start address of the current repeat when the vectorized computation instruction is executed in a loop. |
Repeat Times |
Number of times that the vectorized computation instruction is executed in a loop. |
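
The arithmetic that connects DataBlock, Repeat, Repeat Stride, and Repeat Times can be sketched as follows. The constants come from the entries above (a data block is usually 32 bytes, and one repeat reads eight data blocks). The function names are hypothetical, and the sketch ignores any hardware limits on the number of repeats.

```python
DATA_BLOCK_BYTES = 32   # "usually 32 bytes", per the DataBlock entry
BLOCKS_PER_REPEAT = 8   # eight data blocks per repeat, per the Repeat entry


def elements_per_repeat(dtype_bytes):
    """Number of elements of the given width covered by one repeat."""
    return DATA_BLOCK_BYTES * BLOCKS_PER_REPEAT // dtype_bytes


def repeat_times(num_elements, dtype_bytes):
    """Repeats needed to cover num_elements, rounding up for a partial repeat."""
    per_repeat = elements_per_repeat(dtype_bytes)
    return (num_elements + per_repeat - 1) // per_repeat


def repeat_start_block(repeat_index, repeat_stride):
    """Index of the data block at which a given repeat starts.

    The repeat stride is counted in data blocks between consecutive
    repeat start addresses, as in the Repeat Stride entry.
    """
    return repeat_index * repeat_stride
```

Under these numbers one repeat covers 256 bytes, that is, 128 float16 elements or 64 float32 elements. With a repeat stride of 8, consecutive repeats read back-to-back data.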
Scalar |
The scalar unit is a scalar computation unit on the AI Core. It is responsible for scalar data computation and instruction transmission to other units (such as the MTE, Vector Unit, and Cube Unit). |
SPMD |
Single-Program Multiple-Data (SPMD) is a parallel programming model. It uses the same program to execute on multiple cores in parallel, but each core processes different data. |
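
A minimal plain-Python model of SPMD: every logical block runs the same function and uses its block index to select its own slice of the data. The contiguous split, with the remainder spread over the first blocks, is an assumption for illustration, as are the names `block_slice` and `kernel`. On the device the block index would be supplied by the runtime rather than passed as an argument.

```python
def block_slice(total_length, block_dim, block_idx):
    """Return the [start, end) range handled by one logical block.

    Assumption: a contiguous split in which the first
    (total_length % block_dim) blocks each take one extra element.
    """
    base, rem = divmod(total_length, block_dim)
    start = block_idx * base + min(block_idx, rem)
    length = base + (1 if block_idx < rem else 0)
    return start, start + length


def kernel(data, block_dim, block_idx):
    """The single program: identical code, different data per block."""
    start, end = block_slice(len(data), block_dim, block_idx)
    return [x * 2 for x in data[start:end]]
```

Calling `kernel` once for each block index from 0 to `block_dim - 1` and concatenating the results gives the same output as processing the whole input in one place.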
Tensor |
A tensor is an N-dimensional data structure that serves as a container for operator computation data. The most common tensors are scalars, vectors, and matrices. The elements of a tensor can include integer values, floating point values, or string values. |
Tiling |
Tiling refers to data partitioning and blocking. For large-scale data computation, the data is first partitioned across multiple cores, and the data on each core is then partitioned into multiple blocks that are computed over multiple iterations. |
TilingData |
TilingData refers to the parameters related to data tiling and blocking (such as the size of the block to be moved each time and the number of loops). Due to the limited scalar computing capability of the device, tiling parameters are generally computed on the host and then transferred to the device for the kernel functions to use. |
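
The kind of parameters carried in TilingData can be sketched as a host-side calculation. Everything named here is hypothetical (the field names, `compute_tiling`, and the requirement that the total length divides evenly across cores); real operators define their own tiling structure and rules.

```python
from dataclasses import dataclass


@dataclass
class TilingData:
    # Hypothetical fields for illustration only.
    block_length: int  # elements handled by each core
    tile_length: int   # elements moved per loop iteration
    tile_num: int      # number of full-size loop iterations per core
    tail_length: int   # leftover elements after the full tiles


def compute_tiling(total_length, block_dim, tile_length):
    """Compute per-core tiling parameters on the host.

    Assumption: total_length divides evenly by block_dim; uneven
    splits are rejected to keep the sketch short.
    """
    if block_dim <= 0 or tile_length <= 0:
        raise ValueError("block_dim and tile_length must be positive")
    if total_length % block_dim != 0:
        raise ValueError("total_length must be a multiple of block_dim")
    block_length = total_length // block_dim
    tile_num, tail_length = divmod(block_length, tile_length)
    return TilingData(block_length, tile_length, tile_num, tail_length)
```

For 8192 elements on 8 cores with a tile length of 300, each core handles 1024 elements as three full tiles plus a tail of 124 elements. The kernel would then loop `tile_num` times and handle the tail separately.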
TilingFunc |
Default function provided by an operator project for tiling computation on the host. |
TilingKey |
It is used to distinguish different specialized implementations of the kernel function. Different tiling keys generate different binaries. |
TPosition |
When managing physical memory at different levels, Ascend C uses an abstract logical position (TPosition) to represent the storage at each level, instead of using the concept of on-chip physical storage, to hide the hardware architecture. TPosition types include VECIN, VECOUT, VECCALC, A1, A2, B1, B2, CO1, and CO2. VECIN, VECCALC, and VECOUT are used for vector programming, while A1, A2, B1, B2, CO1, and CO2 are used for matrix programming.
AscendC::TPosition::TSCM represents the logical memory corresponding to the L1 buffer space. It needs to be managed by developers to efficiently use hardware resources and is mainly used for Matmul computation. For example, developers can cache a copy of TSCM data and flexibly configure it as matrix A, matrix B, or bias matrix for Matmul operations in different scenarios, thereby implementing memory reuse and optimizing the computation efficiency. |
|
Unified Buffer/UB |
Internal storage unit of the AI Core, which is mainly used for vector computation and corresponds to the logical memory AscendC::TPosition::VECIN, AscendC::TPosition::VECOUT, and AscendC::TPosition::VECCALC. |
VECCALC |
Vector calculation. AscendC::TPosition::VECCALC represents the logical memory used for vector computation on the device. It is used to store temporary variables, and the physical storage corresponds to the unified buffer of the AI Core. |
VECIN |
Vector input. AscendC::TPosition::VECIN represents the logical memory used for vector computation on the device. It is used to store the input data of vector computation, and the physical storage corresponds to the unified buffer of the AI Core. |
VECOUT |
Vector output. AscendC::TPosition::VECOUT represents the logical memory used for vector computation on the device. It is used to store the output data of vector computation, and the physical storage corresponds to the unified buffer of the AI Core. |
Vector |
Vector Unit on the AI Core, which is responsible for performing vector operations. Compared with the Cube Unit, the Vector Unit offers lower compute power but supports more flexible computations (such as the reciprocal and square root in mathematics). |
Vector Core |
A core dedicated to vector computation. It consists of the Scalar Unit (the scheduling unit), Vector Unit, and data transfer unit, but does not include the Cube Unit. |
Workspace |
Generally, it refers to a pre-allocated and temporarily used global memory, which is used to store intermediate results or temporary data. |
Debugging on the CPU |
A twin debugging method provided by Ascend C. This method simulates, on the CPU, the execution of device-side kernel functions, and is used to debug only the operator functionality and accuracy. |
Basic block |
Typical size of a data block required for a computation. |
Kernel Launch |
A simple and direct way to call a kernel. After the operator implementation on the kernel side and the tiling implementation on the host side are completed, you can use the runtime API to directly debug the operator kernel. This method enables quick verification of operator functions, and tiling development is not restricted by the CANN framework. |
NPU domain debugging |
A twin debugging method provided by Ascend C, which refers to debugging based on the NPU simulation software or NPU hardware. |
Tiling offload |
Tiling offload refers to offloading tiling computation to the AI CPU on the device side, so that the entire computation can be efficiently completed on the device side. |
Separation mode |
A working mode of the AI Core. In this mode, the matrix and vector units are separately scheduled by independent scalar scheduling units and are separately deployed on the Cube Core and Vector Core. The Cube Core and Vector Core are combined in a certain ratio (1:N). This combination is considered as one AI Core, and the number of cores of the AI Core is subject to that of the Cube Core. |
Twin Debugging |
An operator debugging method provided by Ascend C. It supports debugging of the precision in the CPU domain and the precision and performance in the NPU domain. |
Pipeline task |
The Ascend C programming paradigm is a pipeline programming paradigm. The processing program in the operator core is divided into multiple pipeline tasks. A pipeline task refers to a parallel task scheduled by the main program in a single-core processing program. Inside a kernel function, pipeline tasks can be used to implement parallel data processing, further improving performance. |
Continuous mode |
A mode that can be selected when the mask is used to control the elements involved in the vector computation in each repeat. It indicates the number of consecutive elements involved in the computation. |
Coupling mode |
A working mode of the AI Core. In this mode, the same scalar scheduling unit is used to schedule the matrix and vector units at the same time, and all units are deployed on the same AI Core. |
Fused Operator |
A fused operator is formed from multiple independent small operators. It is functionally equivalent to these operators but usually performs better. Vector and Cube operators can be fused based on specific algorithms to achieve performance benefits. |
Operator Graph Integration |
Operator graph integration refers to running operators in GE graph mode. In this mode, all operators are first constructed into a graph, and then the graph is delivered to the Ascend AI Processor for execution through GE. |
Prototype |
An operator prototype is an abstract description of an operator, defining the inputs, outputs, and attributes of the operator. |
MC2 Operator |
A fused computing and communication operator, which fuses collective communication tasks and computing tasks. During operator execution, computing and communication tasks can be partially pipelined in parallel, thereby improving performance. |
Bitwise mode |
A mode that can be selected when the mask is used to control the elements involved in the vector computation in each repeat. It can control the elements involved in the computation by bit. The value 1 of a bit indicates that the element is involved in the computation, and the value 0 indicates that the element is not involved in the computation. |
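
The two mask modes can be contrasted with a small plain-Python model that lists which element positions take part in one repeat. This is a sketch only: the function names are hypothetical, the mask is held in a single Python integer rather than in whatever representation the actual API uses, and the mapping of bit 0 (the least significant bit) to element 0 is an assumption.

```python
def active_elements_continuous(mask, elements_per_repeat):
    """Continuous mode: the first `mask` consecutive elements take part."""
    if not 0 <= mask <= elements_per_repeat:
        raise ValueError("mask must be between 0 and elements_per_repeat")
    return list(range(mask))


def active_elements_bitwise(mask_bits, elements_per_repeat):
    """Bitwise mode: element i takes part when bit i of mask_bits is 1.

    Assumption: bit 0 (least significant) corresponds to element 0.
    """
    return [i for i in range(elements_per_repeat) if (mask_bits >> i) & 1]
```

In this model, a continuous mask of 3 selects elements 0, 1, and 2, while the bit pattern `0b1010` selects elements 1 and 3 and skips the others.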
Custom operator project |
An operator project generated by the msOpGen tool and provided by Ascend C. |