Ascend Glossary
A
Term/Acronym/Abbreviation |
Definition |
|---|---|
AccDECS |
Accelerator for Device-Edge-Cloud Synergy |
accumulated relative error |
Accumulated relative error algorithm. An accuracy comparison algorithm with a result ranging from 0 to infinity. A value closer to 0 indicates higher similarity, while a larger value indicates a greater discrepancy. |
accuracy comparison |
Accuracy comparison. The process of comparing dump data generated on an NPU with the Ground Truth (.npy data generated on a GPU/CPU). It is used to analyze the output differences between proprietary operators and industry-standard operators. |
ACP |
async checkpoint persistence |
AI |
artificial intelligence A technical science that researches and develops theories, methods, technologies, and application systems to simulate, extend, and expand human intelligence. |
AI Core |
The computing core of the Ascend AI Processor, responsible for executing compute-intensive matrix, vector, and scalar tasks. Operators developed using the Ascend C programming language run on AI Cores. |
AI CPU |
A general-purpose CPU provided on the Ascend AI Processor, primarily responsible for executing AI CPU operators and scheduling deterministic tasks. |
AIPP |
artificial intelligence pre-processing A feature used to perform image preprocessing on AI Cores, including image resizing, color space conversion (CSC), and mean subtraction/multiplication (pixel adjustment), prior to model inference. |
AMCT |
Ascend Model Compression Toolkit A deep learning model compression toolkit optimized for Ascend AI Processors. It provides features such as quantization and tensor decomposition to reduce model size. Once deployed on Ascend AI Processors, compressed models enable low-bit operations, improving computing efficiency and overall performance. |
AMP |
asymmetric multiprocessing A multiprocessing architecture where multiple processors exist, and each CPU is assigned a specific task at any given time. Before symmetric multiprocessing matured, it was used as a software workaround to enable multiple processors to run simultaneously. Even with the advent of symmetric multiprocessing, asymmetric multiprocessing remains a simpler and more cost-effective software option for certain applications. |
AMP |
automatic mixed precision A technique in deep learning to accelerate training and improve efficiency. It achieves this by combining different numerical precisions, typically low-precision floating-point formats (for example, FP16) and high-precision formats (for example, FP32). |
AOE |
Ascend Optimization Engine A tool that encapsulates Ascend Tensor Compiler (ATC) compilation and AscendCL runtime service interfaces to provide parallel tuning capabilities. |
AOL |
Ascend Operator Library |
ARM |
Advanced RISC Machine The first RISC microprocessor designed by Acorn Computers Ltd. for the budget market. While the ARM processor features a 32-bit design, it also supports a 16-bit instruction set, which typically reduces code size by up to 35% compared to equivalent 32-bit code while retaining all 32-bit system advantages. |
ARP |
Address Resolution Protocol An internet protocol used to map IP addresses to MAC addresses, allowing hosts and routers to determine link-layer addresses through ARP requests and responses. |
AscendCL |
Ascend Computing Language. A set of APIs for runtime management, single-operator calling, model inference, and media data processing. It enables developers to utilize underlying hardware resources on the CANN platform for deep learning inference, image/video preprocessing, and accelerated single-operator computation. |
Ascend EP |
Ascend Endpoint. The Ascend AI Processor operating in PCIe endpoint mode. In this setup, the host acts as the root complex and the device as the endpoint (EP). AI applications run on the host system, while the Ascend AI Processor is connected as a PCIe endpoint device. The host interacts with the device through PCIe to load and run AI tasks. The device provides neural network (NN) computing power to the host (x86, Arm, etc.), and its CPU resources are accessible only through the host. |
Ascend IR |
Ascend Intermediate Representation. An abstract data structure specific to Ascend AI Processors used to represent computation flows. In Ascend documentation, "IR" refers to Ascend IR unless otherwise specified. |
Ascend RC |
Ascend Root Complex. The Ascend AI Processor operating in PCIe root complex mode. In this mode, the product's CPU directly runs AI service software, and external peripherals such as IP cameras, I2C sensors, and SPI displays are connected as endpoint devices. |
ASLR |
address space layout randomization |
ATB |
Ascend Transformer Boost. An acceleration library based on the Ascend AI Processor, specifically designed for the training and inference of Transformer models. |
ATC |
Ascend Tensor Compiler A model conversion tool within the CANN heterogeneous computing architecture. It converts network models from open-source frameworks and Ascend IR-defined operator descriptions (in JSON format) into offline models (.om format) supported by Ascend AI Processors. During conversion, ATC optimizes operator scheduling, weight data rearrangement, and memory usage to ensure high-performance execution in deployment scenarios. |
AVI |
Ascend Virtual Instance. A resource virtualization technology that partitions a single NPU into multiple virtual NPU (vNPU) instances. These instances can be mounted to virtual machines or containers, allowing one NPU to support multiple concurrent compute tasks. By partitioning compute resources, AVI enables virtualized reuse and ensures secure isolation, significantly reducing the cost and entry barrier for NPU utilization while supporting on-demand multi-tenant resource management. |
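
The automatic mixed precision entry above combines FP16 and FP32. A minimal sketch of why both are needed, using only the Python standard library: a small update that survives in FP32 is rounded away in FP16. The helper names `to_fp16` and `to_fp32` and the example values are illustrative assumptions, not part of any Ascend API.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

weight = 1.0
update = 1e-4  # hypothetical small gradient step

# Near 1.0, adjacent FP16 values are about 0.001 apart, so the update is lost.
fp16_result = to_fp16(to_fp16(weight) + to_fp16(update))

# FP32 has enough mantissa bits to keep the same update.
fp32_result = to_fp32(to_fp32(weight) + to_fp32(update))

print(fp16_result, fp32_result)
```

This is one reason mixed precision schemes commonly keep a higher-precision copy of values that receive many small updates, while running the bulk of the arithmetic in the low-precision format.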
B
Term/Acronym/Abbreviation |
Definition |
|---|---|
backend |
A module that interfaces the backend of the inference serving framework with the model inference layer. |
batch |
A set of samples used in a single iteration of model training (that is, one gradient update). |
batch size |
The number of samples processed in a single batch. |
BIU |
bus interface unit The interface through which the AI Core interacts with the system bus. |
BIOS |
basic input/output system Firmware stored on a computer motherboard that includes basic I/O control programs, power-on self-test (POST) routines, bootstrap loaders, and system configuration settings. It provides low-level hardware configuration and control functions. |
BLAS |
Basic Linear Algebra Subprograms. A set of software building blocks that provides optimized routines for performing basic vector and matrix operations in high-performance computing (HPC). |
BOM |
bill of materials A comprehensive document used in manufacturing that lists the raw materials, primary/secondary processing flows, component breakdowns, and quantities of semi-finished and finished goods. It serves as a key reference for communication between OEMs and partners or across internal departments. |
BP Point |
Back propagation point, the endpoint of the backward operators within a training network iteration trajectory. |
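
The batch and batch size entries above can be illustrated with a short sketch that splits a dataset into batches and counts the iterations needed for one full pass. The function name `iter_batches` and the toy dataset are assumptions made for illustration.

```python
import math

def iter_batches(samples, batch_size):
    """Yield consecutive slices of at most batch_size samples."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

samples = list(range(10))  # toy dataset of 10 samples
batch_size = 4

batches = list(iter_batches(samples, batch_size))

# One iteration (one gradient update) per batch; the last batch may be smaller.
iterations_per_pass = math.ceil(len(samples) / batch_size)

print(batches, iterations_per_pass)
```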
C
Term/Acronym/Abbreviation |
Definition |
|---|---|
CA |
Certificate Authority |
CANN |
Compute Architecture for Neural Networks. A heterogeneous compute architecture developed by Ascend for AI scenarios. It serves as a bridge, supporting various AI frameworks at the upper layer and managing AI processors and programming at the lower layer. As a key platform for enhancing the computing efficiency of Ascend AI Processors, CANN provides efficient and easy-to-use programming interfaces for diverse application scenarios, enabling developers to rapidly build AI applications and services on the Ascend platform. |
CC |
cluster computing |
CCAE |
Cluster Computing Autonomous Engine |
CNN |
convolutional neural network A type of feedforward neural network in which artificial neurons respond to surrounding units, making it highly effective for large-scale image processing. |
Cosine Similarity |
Cosine similarity algorithm. An accuracy comparison algorithm with a result range of [-1, 1]. A value closer to 1 indicates higher similarity between the two sets of data, while a value closer to -1 indicates that they are diametrically opposed. |
Cube |
A computing unit within the AI Core responsible for matrix operations. In a single execution, the Cube unit can complete the multiplication of two 16 × 16 matrices of the FP16 data type. |
container |
A form of operating system virtualization. It is used to run everything from small microservices or software processes to large-scale applications. A container includes all necessary executables, binary code, libraries, and configuration files required for operation. |
CPU |
central processing unit |
CRI |
container runtime interface |
CRD |
custom resource definition |
controller |
The management core and decision-making "brain" of the cluster. It manages the operational status of all "Server" services within the cluster, handles PD identity management and decision-making, and governs resource management policies. |
coordinator |
The entry point for user inference requests. It receives high-concurrency inference requests and performs request scheduling, management, and forwarding, serving as the data request gateway for the entire cluster. |
CP |
context parallelism A technique that employs data partitioning to split a long input sequence into multiple sub-sequences based on cp_size. Attention is calculated in blocks in parallel, and Key-Value (KV) data is exchanged between adjacent ranks using a ring topology. This optimizes first-token performance for long sequences and is ideal for accelerating P-node processing in long-sequence input scenarios. |
CertTools |
A set of tools used to generate, configure, encrypt, and manage certificates and keys for MindIE serving, including certificate generation, certificate/key importation, and key encryption operations. |
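
The cosine similarity entry above can be sketched as follows. This is the standard mathematical definition in plain Python; the exact implementation in any accuracy comparison tool may differ, and the function name is an assumption.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Same direction, opposite direction, and orthogonal vectors.
same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
print(same, opposite, orthogonal)
```

Because the result depends only on direction, two outputs that differ by a constant scale factor still score close to 1, so the metric is usually read together with an error-magnitude metric.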
D
Term/Acronym/Abbreviation |
Definition |
|---|---|
daemon |
In Linux/Unix systems, a daemon is a background system service process that runs independently of a controlling terminal. It typically starts during system boot and terminates when the system shuts down. |
DataFlow |
A complete computational flow consisting of one or more processing points organized through data queues in a data-driven manner. |
DCMI |
DaVinci Card Management Interface |
DDP |
distributed data parallel |
DDR |
double data rate. Strictly speaking, double data rate synchronous dynamic random access memory (DDR SDRAM). DDR is developed from the SDRAM architecture and still uses the SDRAM production system, so manufacturers can produce it with minimal modifications to existing equipment, which reduces costs. Unlike traditional single data rate memory, DDR performs two data transfers per clock cycle, one on the rising edge and one on the falling edge of the clock signal. |
DECS |
device-edge-cloud synergy |
DL |
deep learning A branch of machine learning that utilizes algorithms consisting of multiple processing layers with complex structures or multiple non-linear transformations to create high-level abstractions of data. |
DMA |
direct memory access A critical feature of modern computing that allows hardware devices of varying speeds to communicate directly with memory, bypassing the CPU to reduce heavy interrupt overhead. |
DP |
data parallelism A common parallel strategy in large-scale deep learning training where each process (device) maintains a complete copy of the model and its parameters but processes a different subset of the data. |
DPC |
Distributed Parallel Client |
DRAM |
dynamic random access memory A type of primary computer memory used to temporarily store data and instructions required by the CPU for processing. |
DSL |
domain specific language An operator development method where developers express the computational logic through DSL interfaces. Subsequent tasks such as operator scheduling, optimization, and compilation are automatically handled by existing interfaces. |
DSCP |
differentiated services code point Based on the Diff-Serv QoS classification standard, DSCP uses 6 bits of the Type of Service (ToS) byte in the IP header to differentiate traffic priorities. It combines the IP Precedence and Type of Service fields to maintain backward compatibility with older routers. Each DSCP value maps to a defined Per-Hop Behavior (PHB), allowing end devices to mark and classify traffic. |
bandwidth |
The range of frequencies that a transmission line or channel in a network can carry. It is the difference between the highest and lowest frequencies of the channel. Greater bandwidth typically results in faster data transmission rates. |
single-operator comparison |
A method of tensor comparison within accuracy comparison tools. It involves selecting one or more specific operators in a network model to analyze their computational accuracy. |
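
The DSCP entry above describes a 6-bit field inside the ToS byte. A minimal sketch of the bit layout, assuming the common arrangement in which DSCP occupies the upper 6 bits and the remaining 2 bits carry ECN. The function names are illustrative.

```python
def dscp_from_tos(tos_byte: int) -> int:
    """Extract the 6-bit DSCP value from the 8-bit ToS / Traffic Class byte."""
    return (tos_byte >> 2) & 0x3F

def tos_from_dscp(dscp: int, ecn: int = 0) -> int:
    """Build the 8-bit byte from a DSCP value and the 2 low-order ECN bits."""
    if not 0 <= dscp <= 63:
        raise ValueError("DSCP must fit in 6 bits")
    if not 0 <= ecn <= 3:
        raise ValueError("ECN must fit in 2 bits")
    return (dscp << 2) | ecn

def ip_precedence(dscp: int) -> int:
    """The top 3 DSCP bits line up with the legacy IP Precedence field."""
    return dscp >> 3

ef = 46  # Expedited Forwarding, a commonly used DSCP value
print(hex(tos_from_dscp(ef)), dscp_from_tos(0xB8), ip_precedence(ef))
```

The last helper shows the backward compatibility mentioned in the entry: a router that only reads the legacy 3-bit precedence still sees a meaningful priority.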
E
Term/Acronym/Abbreviation |
Definition |
|---|---|
ECC |
error checking and correction A technology that adds check bits to the original data bits to detect and correct data errors. |
eMMC |
embedded MultiMediaCard A managed flash memory storage system. It features an external interface similar to an SD card, internal flash storage media, and an integrated bad block management system. |
epoch |
One complete pass of the entire dataset through the training algorithm. |
EULA |
End User License Agreement |
ESN |
equipment serial number A unique string that identifies a device. It is a critical key for binding licenses to specific hardware, also known as a "device fingerprint." |
EndPoint |
An inference serving protocol and API wrapper compatible with third-party framework interfaces such as Triton, OpenAI, TGI, and vLLM. |
EP |
Expert parallelism. A model parallelism technique that partitions parameters by assigning different experts in a Mixture of Experts (MoE) model to different devices. A gating mechanism routes inputs to specific experts, activating the corresponding devices. |
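
The ECC entry above describes adding check bits so that errors can be detected and corrected. One classic instance of that idea is the Hamming(7,4) code, sketched here; it is chosen purely for illustration, and real memory ECC typically uses wider codes. Function names are assumptions.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7: p1 p2 d1 p3 d2 d3 d4)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(codeword):
    """Correct up to one flipped bit; return (data_bits, error_position or 0)."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of the flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome

data = [1, 0, 1, 1]
codeword = hamming74_encode(data)

corrupted = list(codeword)
corrupted[4] ^= 1  # flip the bit at position 5

recovered, position = hamming74_decode(corrupted)
print(codeword, recovered, position)
```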
F
Term/Acronym/Abbreviation |
Definition |
|---|---|
Faiss |
An open-source library developed by Meta (formerly Facebook) for efficient similarity search and clustering of dense vectors. |
FEC |
forward error correction A digital signal processing technique used to enhance data reliability by introducing redundant data, allowing the receiver to detect and correct errors during data transmissions. |
FFT |
fast Fourier transform An algorithm that computes the discrete Fourier transform (DFT) or its inverse (IDFT). It converts signals from their original domain (often time or space) to the frequency domain, and vice versa. |
FFTS |
function flow task scheduler A data-flow-driven parallel scheduling mechanism. It utilizes a subgraph data management unit (DMU) mechanism to eliminate unnecessary direct memory access (DMA) copy overhead and provides sub-task threading and inter-thread scheduling to maximize hardware parallelism across AI Cores or AI Vectors, achieving effective operator fusion. |
Flash Attention |
An IO-aware, exact attention algorithm used for model acceleration. It speeds up attention computations and reduces memory footprint without approximation. It is widely implemented in LLMs such as Llama and GPT-3. |
FLOPS |
floating-point operations per second A measure of computer performance, particularly in scientific computing fields involving heavy floating-point calculations. Note that the "S" stands for "second" and is not a plural indicator, thus it should never be omitted. |
FP Point |
Forward-propagation point. The starting position of forward operators within a training network's iteration trajectory. |
FUSE |
Filesystem in Userspace An operating system mechanism that allows non-privileged users to create their own file systems without editing kernel code. It is supported in Linux through a kernel module and utilized by file systems like ZFS, GlusterFS, and Lustre. |
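
The Flash Attention entry above describes an exact algorithm, not an approximation. The sketch below shows only the blockwise softmax rescaling idea for a single query in plain Python, under the assumption that keys and values are visited in blocks. It is not the actual kernel, which also tiles queries and keeps blocks in on-chip memory; all names are illustrative.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_reference(q, keys, values):
    """Ordinary softmax attention for one query, materializing all scores."""
    scores = [_dot(q, k) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]

def attention_blockwise(q, keys, values, block_size):
    """Same result, visiting keys/values block by block with a running max."""
    dim = len(values[0])
    running_max = float("-inf")
    denom = 0.0
    acc = [0.0] * dim
    for start in range(0, len(keys), block_size):
        scores = [_dot(q, k) for k in keys[start:start + block_size]]
        new_max = max(running_max, max(scores))
        # Rescale what was accumulated under the old maximum (0.0 on the first block).
        scale = math.exp(running_max - new_max)
        denom *= scale
        acc = [a * scale for a in acc]
        for s, v in zip(scores, values[start:start + block_size]):
            w = math.exp(s - new_max)
            denom += w
            acc = [a + w * x for a, x in zip(acc, v)]
        running_max = new_max
    return [a / denom for a in acc]
```

Only one block of scores exists at any time, which is where the memory saving comes from, while the rescaling step keeps the final output equal to ordinary softmax attention up to floating-point rounding.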
G
Term/Acronym/Abbreviation |
Definition |
|---|---|
GDAT |
gradient auto tuning An optimization tool that minimizes communication tail latency by maximizing the parallelism between backward computation and gradient aggregation. In distributed training, fusion strategies for gradient aggregation operators impact the communication overhead after backward passes, thereby affecting overall cluster performance and scaling linearity. |
GDB |
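GNU debugger. A standard debugging tool for monitoring the internal execution of programs or analyzing crashes. GDB supports the following four main operations to help locate defects: starting a program and specifying anything that might affect its behavior; stopping the program on specified conditions; examining what happened when the program stopped; and changing things in the program to experiment with correcting the effects of a defect. |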
GE |
graph engine A core component providing graph/operator intermediate representation (IR) as a secure and intuitive interface for model building. It allows users to build network models, define computational graphs and operators, and configure associated attributes. |
GM |
Global memory. The main memory on the device side. It serves as the external storage for the AI Core and is used for large-scale data, requiring optimized access patterns to maximize throughput. |
gRPC |
Google Remote Procedure Call |
GRPO |
Group Relative Policy Optimization A reinforcement learning (RL) algorithm designed to enhance reasoning capabilities in LLMs. Unlike traditional RL methods that rely on external value functions, GRPO optimizes models by evaluating relative performance within groups of generated responses, significantly improving training efficiency. |
management plane |
The architectural layer or network segment where health status and monitoring interfaces reside. |
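
The GRPO entry above mentions evaluating relative performance within a group of generated responses. A minimal sketch of that step, assuming the commonly described form in which each reward is normalized by the group mean and standard deviation. The function name, the `eps` guard, and the use of the population standard deviation are assumptions.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical rewards for four responses sampled from the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Responses that beat the group average get a positive advantage and the rest a negative one, so no separately trained value function is needed to produce a baseline.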
H
Term/Acronym/Abbreviation |
Definition |
|---|---|
HCC |
Huawei Compiler Collection |
HCCL |
Huawei Collective Communication Library A library providing high-performance collective communication functions for distributed deep learning across multiple servers. |
HCCP |
Huawei Collective Communication Adaptive Protocol A protocol layer providing cross-NPU communication capabilities while abstracting away differences in underlying transport protocols for upper-layer applications. |
HCCS |
Huawei Cache Coherence System A system designed for high-speed interconnect between CPUs and NPUs. |
HDC |
host-device communication A communication module deployed on both the host and device sides to facilitate data exchange between them. |
HDK |
hardware developer kit |
HDR |
high dynamic range A technique used in imaging and audio to reproduce a greater range of luminosity or signal levels than standard digital techniques. |
HPA |
HorizontalPodAutoscaler A Kubernetes feature that automatically scales the number of Pods in a workload (such as a Deployment or StatefulSet) based on observed CPU utilization or other selected metrics. |
HPO |
hyperparameter optimization The process of automating the search for the optimal or near-optimal hyperparameters of a machine learning model, replacing manual tuning with algorithmic search strategies. |
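
The HPA entry above scales Pods from observed metrics. The core ratio rule documented for Kubernetes can be sketched as below. This sketch deliberately omits the tolerance band, min/max replica bounds, and stabilization behavior of the real controller, and the function name is an assumption.

```python
import math

def desired_replicas(current_replicas: int, current_metric: float, target_metric: float) -> int:
    """desired = ceil(current_replicas * current_metric / target_metric)."""
    if target_metric <= 0:
        raise ValueError("target metric must be positive")
    return math.ceil(current_replicas * (current_metric / target_metric))

# Load is double the target: scale out.
scale_out = desired_replicas(3, current_metric=200, target_metric=100)
# Load is half the target: scale in.
scale_in = desired_replicas(4, current_metric=50, target_metric=100)
# Fractional results round up.
rounded_up = desired_replicas(5, current_metric=90, target_metric=60)
print(scale_out, scale_in, rounded_up)
```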
I
Term/Acronym/Abbreviation |
Definition |
|---|---|
ICS |
Intellectual Collaborative Service |
IOPS |
input/output operations per second An input/output performance measurement used to characterize computer storage devices. |
IPC |
IP camera |
ISP |
image signal processing A method or specialized hardware unit used to process raw data from image sensors to render a high-quality digital image, ensuring compatibility across different sensor manufacturers. |
ISV |
independent software vendor |
J
Term/Acronym/Abbreviation |
Definition |
|---|---|
JDK |
Java Development Kit A software development environment used for developing Java applications, containing a collection of tools and libraries. |
JPEGD |
JPEG decoder A specialized hardware or software module that provides the capability to decode images from the JPEG format. |
JPEGE |
JPEG encoder A specialized hardware or software module that provides the capability to encode images into the JPEG format. |
K
Term/Acronym/Abbreviation |
Definition |
|---|---|
KMC |
Key Management CBB A module designed to facilitate code sharing and simplify development. It implements core functions such as encrypted key storage and encryption/decryption to enable rapid product integration. |
KMC |
Key Management Center A centralized system used to manage and protect cryptographic keys. It provides secure key storage, distribution, rotation, backup, and recovery. The KMC keystore ensures key security and reliability while supporting multiple encryption algorithms and key lengths across various application scenarios. |
KL divergence |
Kullback-Leibler divergence An accuracy comparison algorithm used to measure the difference between two probability distributions. Values range from 0 to infinity. A lower KL divergence indicates a closer match between the true and approximate distributions. |
Kubernetes |
An open-source system for automating the deployment, scaling, and management of containerized applications. It provides a platform for automated deployment, scaling, and operation of application containers across clusters of hosts. |
KASLR |
Kernel address space layout randomization. A security mechanism that randomizes the memory address layout of the kernel, increasing the difficulty of exploiting kernel vulnerabilities. |
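
The KL divergence entry above can be sketched with the standard discrete definition, the sum of p·ln(p/q) over all outcomes. The function name and the handling of zero probabilities are choices made for this sketch.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in nats for two discrete distributions of equal length."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        if pi == 0.0:
            continue  # 0 * log(0 / q) is taken as 0
        if qi == 0.0:
            return math.inf  # p puts mass where q has none
        total += pi * math.log(pi / qi)
    return total

true_dist = [0.5, 0.5]
approx_dist = [0.9, 0.1]

forward = kl_divergence(true_dist, approx_dist)
reverse = kl_divergence(approx_dist, true_dist)
print(forward, reverse)
```

The two directions give different values, so KL divergence is not a symmetric distance; which distribution is treated as the reference matters when comparing outputs.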
L
Term/Acronym/Abbreviation |
Definition |
|---|---|
L0A buffer |
An internal physical storage unit within the AI Core, typically used to store the left matrix for matrix multiplication. It corresponds to the logical memory AscendC::TPosition::A2. |
L0B buffer |
An internal physical storage unit within the AI Core, typically used to store the right matrix for matrix multiplication. It corresponds to the logical memory AscendC::TPosition::B2. |
L0C buffer |
An internal physical storage unit within the AI Core, typically used to store the results of matrix computation. It corresponds to the logical memory AscendC::TPosition::CO1. |
L1 buffer |
An internal physical storage unit within the AI Core with a relatively large capacity, typically used to cache input data for matrix multiplication. Input data is generally moved from global memory (GM) to the L1 buffer, and then to the L0A and L0B buffers. It corresponds to logical memory AscendC::TPosition::A1 and AscendC::TPosition::B1. |
L2 cache |
level 2 cache A secondary CPU cache used to provide faster access to frequently used data and instructions before accessing the main memory. |
LLDP |
Link Layer Discovery Protocol. A layer 2 discovery protocol defined in IEEE 802.1AB. It enables network management systems to quickly acquire layer 2 network topology and change information as the network scales. |
LLM |
large language model A type of language model consisting of artificial neural networks with a massive number of parameters (typically billions or more), trained on large datasets of unlabeled text using self-supervised or semi-supervised learning. |
local memory |
The internal storage of the AI Core, including storage units such as the L1 buffer, L0A buffer, L0B buffer, L0C buffer, and unified buffer. |
loss |
The deviation between predicted values and actual values, serving as a primary metric in deep learning to evaluate model performance. |
LTO |
link time optimization A type of program optimization performed by a compiler during the linking stage. |
adjacency list |
A common data structure in graph theory and computer science used to represent a graph, where each vertex stores a list or array of all other vertices to which it is connected. |
LoRA |
low-rank adaptation A parameter-efficient fine-tuning (PEFT) method for large-scale models. |
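
The adjacency list entry above can be sketched as a dictionary that maps each vertex to the list of its neighbors. The example graph is undirected, and the function name and vertex labels are illustrative.

```python
from collections import defaultdict

def build_adjacency_list(edges):
    """Build an undirected adjacency list from (u, v) edge pairs."""
    adjacency = defaultdict(list)
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)
    return dict(adjacency)

edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D")]
graph = build_adjacency_list(edges)
print(graph)
```

Storage grows with the number of vertices plus edges rather than with the square of the vertex count, which is why this layout suits sparse graphs.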
M
Term/Acronym/Abbreviation |
Definition |
|---|---|
MAC |
media access control A data link layer protocol that manages how multiple devices share a common transmission medium to prevent data collisions. |
Max Absolute Error |
maximum absolute error An accuracy comparison algorithm with a range from 0 to infinity. A value closer to 0 indicates higher similarity, while a larger value indicates a greater discrepancy. |
Max Relative Error |
maximum relative error An accuracy comparison algorithm with a range from 0 to infinity. A value closer to 0 indicates higher similarity, while a larger value indicates a greater discrepancy. |
MCU |
microcontroller unit An integrated circuit that integrates multiple functional modules such as the processor, memory, and input/output interfaces. |
Mean Absolute Error |
mean absolute error An accuracy comparison algorithm with the result ranging from 0 to infinity. |
Mean Relative Error |
mean relative error An accuracy comparison algorithm with results ranging from 0 to infinity. A result closer to 0 indicates higher similarity. |
MemFS |
memory file system |
MindIE |
Mind Inference Engine A high-performance deep learning inference framework optimized for Ascend hardware, supporting acceleration, debugging, tuning, and rapid deployment. |
MindFormers |
MindSpore Transformers. An end-to-end suite based on the MindSpore framework. It supports the entire lifecycle of LLMs, including training, fine-tuning, evaluation, and deployment. |
MindIO |
A memory-based caching system designed to accelerate the read/write speeds of training checkpoints. |
MinIO |
An object storage service component. |
MLP |
multilayer perceptron A feed-forward artificial neural network consisting of an input layer, one or more hidden layers, and an output layer. MLP can be used to solve a variety of problems, such as classification and regression. Due to its powerful representational capabilities, MLP is widely applied in many fields, including image recognition, natural language processing, and more. |
MoE |
Mixture of experts. It is a technology used to train models with trillions of parameters. MoE decomposes predictive modeling tasks into several sub-tasks, training an expert model for each sub-task and developing a gating model. This gating model assigns one or more experts based on the input data, and finally integrates the calculation results from multiple experts to produce the prediction result. |
msDebug |
An operator debugging tool. It provides native environment debugging on Ascend processors, featuring flexible variable inspection and step-by-step execution. |
msKPP |
A performance modeling and tuning tool designed for operator theoretical performance and template libraries. In the performance modeling phase, the tool utilizes built-in operator API performance data, enabling users to express implementation algorithms and evaluate performance during the initial design stage. In the template library tuning phase, it provides capabilities for the generation, compilation, and execution of template library kernel dispatch code. Additionally, it supports code replacement within the kernel combined with automatic performance tuning. |
msProf |
An operator profiling tool. It collects performance data from both hardware and simulation, visualized through MindStudio Insight to identify performance bottlenecks. |
msproftx |
msProf tool extension. An extension for the MindStudio system profiling tools. |
msSanitizer |
An operator anomaly detection tool. It provides memory detection and contention detection capabilities, supporting precise localization of memory issues in multi-core programs. |
MTE |
memory transfer engine. Also known as the load-store unit (LSU), it manages data read/write between different buffers within the AI Core and handles format conversions. |
MTE1 |
Memory transfer engine 1. Tiered memory transfer engine responsible for data movement from the L1 buffer to the L0A buffer or L0B buffer based on hardware capabilities. |
MTE2 |
Memory transfer engine 2. Tiered memory transfer engine responsible for data movement from the global memory to the L1 buffer, L0A buffer, L0B buffer, or unified buffer based on hardware capabilities. |
MTE3 |
Memory transfer engine 3. Tiered memory transfer engine responsible for data movement from the unified buffer to the global memory or L1 buffer based on hardware capabilities. |
MTU |
maximum transmission unit. The maximum data packet size that can be transmitted over a network. The size varies depending on the network type: for example, 576 bytes in X.25 networks, 1500 bytes in Ethernet, and 17,914 bytes in 16 Mbit/s Token Ring. The MTU is determined by the link layer of the network. When a packet crosses several networks, the path MTU (PMTU) is the smallest MTU among all networks on the path, that is, the largest packet size that can be transmitted across the entire path without fragmentation. |
MindIE SD |
Mind Inference Engine Stable Diffusion, a suite of visual generation inference models within the MindIE ecosystem. |
MindIE Turbo |
Mind Inference Engine Turbo, an acceleration plugin library developed for LLM inference on Ascend hardware. |
MindIE Motor |
Mind Inference Engine Motor, a request scheduling framework specifically designed for LLM PD (prefill-decode) disaggregation inference. It provides inference serving capabilities through an open and extensible platform architecture, and interfaces downstream with MindIE LLM to meet the high-performance inference requirements of large language models. |
MindIE LLM |
Mind Inference Engine Large Language Model, the dedicated inference component for large language models within the MindIE framework. |
MLA |
Multi-head Latent Attention, an efficient attention mechanism that uses low-rank KV joint compression to eliminate KV cache bottlenecks during inference. |
MTP |
Multi-token prediction, a parallel decoding method introduced by DeepSeek to generate multiple tokens in a single step. The core logic is that the model does not limit itself to predicting only the next single token. Instead, it predicts multiple subsequent tokens simultaneously, significantly accelerating model generation speeds. |
MindIE Service Tools |
A toolset for Ascend inference services, featuring performance/accuracy testing, visualization, automated optimization, and configurable throughput. |
MindIE Simulator |
An automated service performance tuning tool that simulates various strategies to find optimal parameters under latency constraints. |
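
The Max Absolute Error, Max Relative Error, Mean Absolute Error, and Mean Relative Error entries above can be sketched with their textbook definitions. The exact formulas used by a given accuracy comparison tool may differ, in particular in how a zero reference value is handled; the `eps` guard and all function names here are assumptions.

```python
def absolute_errors(actual, expected):
    """Element-wise |actual - expected|."""
    return [abs(a - e) for a, e in zip(actual, expected)]

def relative_errors(actual, expected, eps=1e-12):
    """Element-wise |actual - expected| / |expected|, guarded against division by zero."""
    return [abs(a - e) / (abs(e) + eps) for a, e in zip(actual, expected)]

def max_absolute_error(actual, expected):
    return max(absolute_errors(actual, expected))

def mean_absolute_error(actual, expected):
    errors = absolute_errors(actual, expected)
    return sum(errors) / len(errors)

def max_relative_error(actual, expected):
    return max(relative_errors(actual, expected))

def mean_relative_error(actual, expected):
    errors = relative_errors(actual, expected)
    return sum(errors) / len(errors)

# Hypothetical device output compared with reference output.
actual = [1.0, 2.5, 2.0]
expected = [1.0, 2.0, 4.0]
print(
    max_absolute_error(actual, expected),
    mean_absolute_error(actual, expected),
    max_relative_error(actual, expected),
    mean_relative_error(actual, expected),
)
```

The max variants expose a single badly wrong element, while the mean variants summarize overall drift, so the two are usually read together.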
N
Term/Acronym/Abbreviation |
Definition |
|---|---|
NCS |
Neural Compute Server. It encapsulates AscendCL runtime service interfaces to accept remote hardware execution requests and return corresponding performance data. |
NIC |
network interface controller. Also known as a network interface card, network adapter, or LAN adapter. A hardware component that connects a computer to a computer network. |
NLP |
natural language processing A subdiscipline of artificial intelligence and linguistics that explores how to process and utilize natural language. NLP involves various aspects and stages, primarily including cognition, understanding, and generation. |
NN |
neural network. In machine learning and cognitive science, a mathematical or computing model that emulates the structure and functions of a biological neural network. |
NPU |
Neural-Network Processing Unit. Utilizing a "data-driven parallel computing" architecture, it is specifically designed to handle massive computational tasks in AI applications. |
NUMA |
non-uniform memory access. A distributed memory access architecture in which processors can access different memory addresses simultaneously to significantly enhance parallelism. In this mode, processors are divided into multiple nodes, with each node allocated its own local memory space. While processors in any node can access all physical memory, the latency for accessing local memory is much lower than that for accessing the memory of remote nodes. |
NVMe |
Non-Volatile Memory Express A logical device interface specification. It is a bus transport protocol based on a logical device interface (equivalent to the application layer in communication protocols), used to access non-volatile storage media (such as flash-based SSDs) attached through the PCI Express (PCIe) bus. |
O
Term/Acronym/Abbreviation |
Definition |
|---|---|
OM |
offline model |
ONNX |
Open Neural Network Exchange ONNX is an open-source file format designed for machine learning to store trained models. It enables different AI frameworks to share and exchange model data using a unified format. |
OOM |
out of memory |
OP |
Operator. An operator is the fundamental unit for executing specific mathematical calculations or operations within deep learning algorithms, such as activation functions (for example, ReLU), convolution, pooling, and normalization (for example, Softmax). Neural network models are constructed by combining these operators. |
OPAT |
operator auto-tuning OPAT is an optimizer designed to enhance operator performance. When AOE inputs a full graph into OPAT, OPAT performs operator fusion internally and partitions the fused graph at the operator level. It generates different tuning strategies for each fused operator subgraph to achieve optimal performance, subsequently saving these strategies in the operator knowledge base. |
OpenPGP |
Open Pretty Good Privacy Pretty Good Privacy (PGP) is an encryption program that provides cryptographic privacy and authentication for data communication. It is commonly used for signing, encrypting, and decrypting texts, emails, and files. OpenPGP is a non-proprietary protocol that defines a unified standard for encrypted messages, signatures, private keys, and certificates used for public key exchange. |
OPP |
operator package |
OS |
operating system |
OS |
optimizer state |
OCI |
Open Container Initiative Established by the Linux Foundation in June 2015, the OCI aims to create open industry standards for container formats and runtimes. |
OM Adapter |
Reports MindIE heartbeats, alarms, resource information, and logs to external alarm and management platforms, enabling service status monitoring and integration with management systems. |
P
Term/Acronym/Abbreviation |
Definition |
|---|---|
PCIe |
Peripheral Component Interconnect Express A high-speed serial expansion bus standard commonly used for peripheral expansion in computer systems. |
PCB |
printed circuit board |
PFC |
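priority-based flow control A link-level flow control mechanism (IEEE 802.1Qbb) that can pause the traffic of each priority class independently instead of pausing all traffic on the link. |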
PMU |
performance monitor unit A hardware unit provided by the CPU that enables the reading of CPU performance data by accessing relevant registers. |
PNGD |
PNG decoder A component that provides the capability to decode images in PNG format. |
Pod |
The smallest deployable unit that can be created in Kubernetes and a top-level resource type in the Kubernetes REST API. |
PP |
pipeline parallelism A technique that distributes different layers of a model across various computing devices to reduce individual device memory consumption, enabling the training of ultra-large-scale models. |
PWM |
pulse width modulation A modulation technique where the pulse duration (width) of a pulse carrier varies according to the sample values of the modulating wave. |
on-chip memory |
Memory integrated directly onto a microprocessor chip. |
Q
Term/Acronym/Abbreviation |
Definition |
|---|---|
QAT |
quantization-aware training A quantization method that introduces quantization during the retraining process, enhancing the model's robustness to quantization effects through retraining to achieve higher accuracy in the quantized model. |
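The idea behind the QAT entry above can be sketched with a "fake quantization" step: during retraining, values are rounded to the integer grid and immediately converted back to floating point, so the model sees the quantization error while it trains. The sketch below assumes symmetric int8 quantization with a single scale; the function name `fake_quantize` and its parameters are hypothetical and do not represent the API of AMCT or any other toolkit.

```python
def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Quantize x to an integer grid, then dequantize it back to float.

    Assumptions (illustrative only): symmetric quantization, one scale
    value for all inputs, int8 range by default.
    """
    q = round(x / scale)            # map to the nearest integer level
    q = max(qmin, min(qmax, q))     # clamp to the representable range
    return q * scale                # back to float, now carrying the error


weights = [0.26, -0.04, 100.0]
simulated = [fake_quantize(w, scale=0.1) for w in weights]
print(simulated)  # the last value saturates at qmax * scale
```

In a real QAT flow, the rounding step is usually paired with a gradient approximation so that training can proceed through it; that part is omitted here.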
R
Term/Acronym/Abbreviation |
Definition |
|---|---|
RDMA |
Remote direct memory access. A technology that transfers data directly from the memory of one machine to the memory of another without involving the operating system of either machine. It generally refers to a memory access method that works across a network. |
RED |
relative Euclidean distance An accuracy comparison algorithm. The computation result ranges from 0 to infinity. A result value closer to 0 indicates higher similarity, while a larger result value indicates a greater discrepancy. |
RoCE |
RDMA over Converged Ethernet A network protocol that enables remote direct memory access (RDMA) over Ethernet. There are currently two versions: RoCE v1 and RoCE v2. RoCE v1 is a data link layer protocol that allows communication between any two hosts within the same Ethernet broadcast domain. RoCE v2 is a network layer protocol and its packets can be routed. |
RMSE |
root mean square error An accuracy comparison algorithm. The result ranges from 0 to infinity. |
Runtime |
Provides applications with functions such as memory management, device management, stream management, event management, and kernel loading and execution specifically for Ascend AI Processors. |
RAM |
random access memory A type of semiconductor-based memory that can be read and written by the CPU or other hardware devices. The storage locations can be accessed in any order. |
runC |
A client tool for creating and running containers according to the OCI (Open Container Initiative) specification. |
RoPE |
rotary position embedding A position encoding method that integrates relative position dependencies into self-attention, enhancing the performance of the Transformer architecture. |
RAS |
Reliability, availability, and serviceability. It refers to capabilities that enhance the reliability, availability, and serviceability of prefill-decode (PD) disaggregation services. |
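The RMSE and RED entries in this table can be illustrated with a short sketch. The RMSE formula is the standard one. The RED formula used here (Euclidean norm of the difference divided by the Euclidean norm of the reference data) is an assumption about how the metric is defined, and the function names are hypothetical rather than taken from any Ascend tool.

```python
import math


def rmse(actual, golden):
    """Root mean square error between two equal-length sequences."""
    if len(actual) != len(golden) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(a - g) ** 2 for a, g in zip(actual, golden)]
    return math.sqrt(sum(squared) / len(squared))


def relative_euclidean_distance(actual, golden):
    """Assumed definition: ||actual - golden||_2 / ||golden||_2."""
    if len(actual) != len(golden) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    diff_norm = math.sqrt(sum((a - g) ** 2 for a, g in zip(actual, golden)))
    golden_norm = math.sqrt(sum(g * g for g in golden))
    if golden_norm == 0.0:
        # Assumption: an all-zero reference matches only an all-zero output.
        return 0.0 if diff_norm == 0.0 else math.inf
    return diff_norm / golden_norm


npu_output = [1.0, 2.0, 3.0, 5.0]      # stand-in for NPU dump data
ground_truth = [1.0, 2.0, 3.0, 4.0]    # stand-in for GPU/CPU reference data
print(rmse(npu_output, ground_truth))  # 0.5
print(relative_euclidean_distance(npu_output, ground_truth))
```

Both functions return 0 for identical inputs and grow without bound as the outputs diverge, which matches the 0-to-infinity range given in the definitions.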
S
Term/Acronym/Abbreviation |
Definition |
|---|---|
scalar |
The scalar computing unit within the AI Core. It is primarily responsible for scalar data operations and issuing instructions to other units, such as the memory transfer engine (MTE), vector unit, and cube unit. |
SDMA |
System direct memory access, also known as direct memory access (DMA). This technology allows peripheral devices to access system memory directly without CPU intervention. |
SiP |
Ascend Signal Processing Boost A signal processing acceleration library that provides a series of high-performance operators for AI models (supporting PyTorch calls) and signal processing (supporting direct C++ calls). |
SGAT |
subgraph auto-tuning SGAT is an optimizer that improves the performance of subgraphs. A complete network can be partitioned into multiple subgraphs. SGAT can be used to generate different tiling policies for those subgraphs. By acquiring performance data for each iteration, the SGAT algorithm identifies the optimal tuning strategy to achieve peak performance for the corresponding subgraph. |
SPI |
serial peripheral interface A synchronous serial communication interface that enables information exchange between the microcontroller unit (MCU) and peripherals. |
SP |
sequence parallelism A parallel computing method that performs column partitioning on input sequences to further improve efficiency on top of tensor parallelism (TP). Since it does not introduce additional communication overhead, it is recommended to enable SP concurrently with TP. |
SRAM |
static random access memory A type of computer memory that is faster and more reliable than common DRAM. It is typically used for caches, registers, and other applications requiring high-speed access. |
SwiGLU |
Swish-Gated Linear Units An activation function variant of gated linear units (GLU) that incorporates the Swish activation function. |
SIMD |
single instruction multiple data A parallel computing architecture where a single instruction processor fetches an instruction and distributes it to multiple processing elements to operate on different data points simultaneously. |
SSL |
Secure Sockets Layer A security protocol operating at the socket layer, situated between the TCP and application layers. It is used for data encryption/decryption and entity authentication. |
standard deviation |
Standard deviation. An accuracy comparison algorithm with a result ranging from 0 to infinity. A smaller standard deviation indicates lower dispersion, meaning values are closer to the mean. |
STARS |
system task and resource scheduler |
spine-leaf |
A two-tier network topology consisting of leaf switches and spine switches. Spine switches act as the core, typically using high-port-density switches rather than traditional large chassis switches. Leaf switches serve as the access layer, providing connectivity to endpoints and servers while up-linking to the spine switches. This topology is designed to handle rapid traffic growth and large-scale data center expansion, overcoming the limitations of traditional three-tier architectures in high-speed internal interconnection. |
sample-based |
A profiling method where AI Core performance data is collected at fixed time intervals (AI Core-sampling interval). |
step trace |
Iteration trace. It captures start and end times for forward and backward passes, gradient updates, and data augmentation trailing phases. |
ST |
system test A black-box testing phase based on the system requirement specifications, covering all integrated components. It evaluates the complete product system to verify compliance with requirement specifications and identify discrepancies. The scope includes not only the software but also the underlying hardware, peripherals, data, support software, and interfaces, requiring testing within the system's actual operating environment. |
Ascend Virtual Instance (AVI) |
It refers to the use of resource virtualization technology to partition a single NPU into multiple virtual NPU (vNPU) instances. These instances can be mounted to VMs or containers, allowing one NPU to support multiple concurrent tasks. By partitioning compute resources, AVI enables virtualized reuse and ensures secure isolation, significantly reducing the cost and entry barrier for NPU utilization while supporting on-demand multi-tenant resource management. |
SLO |
service-level objective A target value or range of values for a specific service level that is measured over a predefined period. |
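The SwiGLU entry in this table can be made concrete with a minimal sketch. It assumes the commonly cited bias-free form, SwiGLU(x) = Swish(xW) ⊙ (xV) with Swish(z) = z · sigmoid(z). The names `w_gate` and `w_value` are hypothetical, and plain Python lists stand in for tensors.

```python
import math


def swish(z):
    """Swish activation with beta = 1: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(x, weight):
    """Multiply a length-n vector by an n-by-m matrix (list of rows)."""
    return [
        sum(x[i] * weight[i][j] for i in range(len(x)))
        for j in range(len(weight[0]))
    ]


def swiglu(x, w_gate, w_value):
    """Gate one linear projection of x with the Swish of another."""
    gate = [swish(z) for z in matvec(x, w_gate)]
    value = matvec(x, w_value)
    return [g * v for g, v in zip(gate, value)]


x = [1.0, -1.0]
w_gate = [[1.0, 0.0], [0.0, 1.0]]    # illustrative 2x2 weights
w_value = [[2.0, 0.0], [0.0, 2.0]]
print(swiglu(x, w_gate, w_value))
```

The gate branch decides how much of each value-branch element passes through, which is what distinguishes GLU variants from a plain activation applied to a single projection.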
T
Term/Acronym/Abbreviation |
Definition |
|---|---|
task-based |
A profiling method where AI Core performance data is collected at the task level. |
TCP |
Transmission Control Protocol |
TDP |
thermal design power The maximum amount of heat generated by a computer chip or component that the cooling system in a computer is designed to dissipate under any workload. |
tensor |
A container for data used in operator computations. It is an N-dimensional data structure, most commonly represented as a scalar, vector, or matrix. Tensor elements can include integers, floating-point values, or strings. |
TFT |
training fault tolerance |
TIK |
Tensor Iterator Kernel An operator development method that allows developers to write custom operators using Python-based APIs provided by TIK. The TIK compiler converts these into binary files compatible with Ascend AI Processor applications. |
TGI |
Text Generation Inference A toolkit for deploying and serving LLMs. TGI enables high-performance text generation for popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, and GPT-NeoX. |
TLS |
Transport Layer Security |
TP |
tensor parallelism A technique that partitions tensors within a network across different devices to reduce memory consumption per device, enabling the training of ultra-large-scale models. |
Triton |
Triton Inference Server Open-source inference serving software that lets teams deploy, run, and scale trained AI models from any framework on any GPU- or CPU-based infrastructure (cloud, data center, or edge). |
TTP |
try to persist |
TPOT |
Time per output token. The latency for each output token, excluding the first token. In offline batch processing applications, TPOT is a critical metric as it determines the overall duration of the inference process. |
TTFT |
Time to first token. A key metric in LLM inference representing the latency from the initial input to the generation of the first output token. |
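The TTFT and TPOT entries above can be sketched as simple arithmetic over token timestamps. The function names and the timestamp layout are hypothetical, and reporting TPOT as the mean over all tokens after the first is an assumption; a serving framework may aggregate it differently.

```python
def ttft(request_time, token_times):
    """Time to first token: delay between the request and the first token."""
    if not token_times:
        raise ValueError("at least one output token is required")
    return token_times[0] - request_time


def tpot(token_times):
    """Mean time per output token, excluding the first token."""
    if len(token_times) < 2:
        raise ValueError("at least two output tokens are required")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


request_time = 0.0                     # seconds, when the request arrived
token_times = [0.5, 0.75, 1.0, 1.25]   # seconds, when each token was emitted
print(ttft(request_time, token_times))  # 0.5
print(tpot(token_times))                # 0.25
```

With these numbers, the total generation time is TTFT plus TPOT multiplied by the number of remaining tokens: 0.5 + 0.25 × 3 = 1.25 seconds.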
U
Term/Acronym/Abbreviation |
Definition |
|---|---|
UCE |
uncorrectable memory error |
Unified Buffer |
An internal storage unit within the AI Core, primarily used for vector computations. |
UDF |
user-defined function |
UUID |
universally unique identifier A 128-bit identifier standard used in software construction, defined by the Open Software Foundation (OSF) as part of the Distributed Computing Environment (DCE). |
UT |
unit test The lowest level of testing performed during software development, where individual units of software are tested in isolation from the rest of the application. |
UDP |
User Datagram Protocol A protocol in the TCP/IP suite that provides a simple interface between the network and application layers. UDP provides unreliable data transfer: once data is handed to the network layer, no copy is retained for retransmission. On top of IP, it adds only port-based multiplexing and a checksum. |
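As a small illustration of the UUID entry above, the sketch below uses Python's standard `uuid` module to generate a random (version 4) identifier and round-trip it through its text form. The variable name `request_id` is only an example of how such an identifier might be used.

```python
import uuid

# Generate a random (version 4) UUID.
request_id = uuid.uuid4()

print(request_id)                  # canonical text form, 36 characters
print(request_id.version)          # 4
print(len(request_id.bytes) * 8)   # 128 (bits)

# The text form can be parsed back into an equal UUID object.
parsed = uuid.UUID(str(request_id))
print(parsed == request_id)        # True
```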
V
Term/Acronym/Abbreviation |
Definition |
|---|---|
vcjob |
VolcanoJob, a job type managed by Volcano, a batch scheduling system for Kubernetes. |
VDEC |
video decoder A component that provides the capability to decode video streams of specific formats. |
VENC |
video encoder A component that provides the capability to encode images into video streams of specific formats. |
vector |
The vector computing unit within the AI Core, responsible for performing vector operations. It offers lower computing power than the cube unit but higher flexibility (for example, supporting reciprocal and square root operations). |
vLLM |
An open-source, high-throughput, memory-efficient inference and serving engine for LLMs. |
VPC |
vision preprocessing core A hardware unit for processing images in formats such as YUV and RGB, supporting functions like resizing, cropping, image pyramid generation, and color space conversion. |
W
Term/Acronym/Abbreviation |
Definition |
|---|---|
watchdog |
watchdog A hardware device (typically a timer), accessed by system software through a dedicated driver, that monitors whether a continuously running system is functioning correctly. The timer starts counting automatically after the program launches, and the program must periodically reset the counter (known as "feeding the dog"). If the counter overflows because the program fails to reset it in time, a watchdog interrupt is triggered, causing a system reset so that the system recovers from infinite loops or hangs. |
Y
Term/Acronym/Abbreviation |
Definition |
|---|---|
service plane |
The plane where inference and other service interfaces reside. |
Z
Term/Acronym/Abbreviation |
Definition |
|---|---|
ZeRO |
Zero Redundancy Optimizer An optimizer designed to address memory bottlenecks in large-scale distributed training. It optimizes memory usage by eliminating redundant data, enabling the training of larger models. Compared with traditional data parallelism, ZeRO significantly improves memory efficiency while maintaining computation granularity and communication efficiency. |
network-wide comparison |
A method of tensor comparison within accuracy comparison tools. It involves performing accuracy comparisons across all operators involved in computations within a network model. |