Ascend Glossary

A

Term/Acronym/Abbreviation	Definition
AccDECS	Accelerator for Device-Edge-Cloud Synergy
accumulated relative error	Accumulated relative error algorithm. An accuracy comparison algorithm with a result ranging from 0 to infinity. A value closer to 0 indicates higher similarity, while a larger value indicates a greater discrepancy.
accuracy comparison	Accuracy comparison. The process of comparing dump data generated on an NPU with the Ground Truth (.npy data generated on a GPU/CPU). It is used to analyze the output differences between proprietary operators and industry-standard operators.
ACP	async checkpoint persistence
AI	artificial intelligence. A technical science that researches and develops theories, methods, technologies, and application systems to simulate, extend, and expand human intelligence.
AI Core	The computing core of the Ascend AI Processor, responsible for executing compute-intensive matrix, vector, and scalar tasks. Operators developed using the Ascend C programming language run on AI Cores.
AI CPU	A general-purpose CPU provided on the Ascend AI Processor, primarily responsible for executing AI CPU operators and scheduling deterministic tasks.
AIPP	artificial intelligence pre-processing. A feature used to perform image preprocessing on AI Cores, including image resizing, color space conversion (CSC), and mean subtraction/multiplication (pixel adjustment), prior to model inference.
AMCT	Ascend Model Compression Toolkit. A deep learning model compression toolkit optimized for Ascend AI Processors. It provides features such as quantization and tensor decomposition to reduce model size. Once deployed on Ascend AI Processors, compressed models enable low-bit operations, improving computing efficiency and overall performance.
AMP	asymmetric multiprocessing. A multiprocessor architecture in which each CPU is assigned a specific task at any given time. Before symmetric multiprocessing matured, it was used as a software workaround to enable multiple processors to run simultaneously. Even with the advent of symmetric multiprocessing, asymmetric multiprocessing remains a simpler and more cost-effective software option for certain applications.
AMP	automatic mixed precision. A technique in deep learning to accelerate training and improve efficiency. It achieves this by combining different numerical precisions, typically low-precision floating-point formats (for example, FP16) and high-precision formats (for example, FP32).
AOE	Ascend Optimization Engine. A tool that encapsulates Ascend Tensor Compiler (ATC) compilation and AscendCL runtime service interfaces to provide parallel tuning capabilities.
AOL	Ascend Operator Library
ARM	Advanced RISC Machine. The first RISC microprocessor designed by Acorn Computers Ltd. for the budget market. While the ARM processor features a 32-bit design, it also supports a 16-bit instruction set, which typically reduces code size by up to 35% compared to equivalent 32-bit code while retaining all 32-bit system advantages.
ARP	Address Resolution Protocol. An internet protocol used to map IP addresses to MAC addresses, allowing hosts and routers to determine link-layer addresses through ARP requests and responses.
AscendCL	Ascend Computing Language. It provides APIs for runtime management, single-operator calling, model inference, and media data processing. It enables developers to utilize underlying hardware resources on the CANN platform for deep learning inference, image/video preprocessing, and accelerated single-operator computation.
Ascend EP	Ascend Endpoint. The Ascend AI Processor operating in PCIe endpoint mode. In this setup, the host acts as the root complex and the device as the endpoint (EP). AI applications run on the host system, while the Ascend AI Processor is connected as a PCIe endpoint device. The host interacts with the device through PCIe to load and run AI tasks. The device provides neural network (NN) computing power to the host (x86, Arm, etc.), and its CPU resources are accessible only through the host.
Ascend IR	Ascend Intermediate Representation. An abstract data structure specific to Ascend AI Processors used to represent computation flows. In Ascend documentation, "IR" refers to Ascend IR unless otherwise specified.
Ascend RC	Ascend Root Complex. The Ascend AI Processor operating in PCIe root complex mode. In this mode, the product's CPU directly runs AI service software, and external peripherals such as IP cameras, I²C sensors, and SPI displays are connected as endpoint devices.
ASLR	address space layout randomization
ATB	Ascend Transformer Boost. An acceleration library based on the Ascend AI Processor, specifically designed for the training and inference of Transformer models.
ATC	Ascend Tensor Compiler. A model conversion tool within the CANN heterogeneous computing architecture. It converts network models from open-source frameworks and Ascend IR-defined operator descriptions (in JSON format) into offline models (.om format) supported by Ascend AI Processors. During conversion, ATC optimizes operator scheduling, weight data rearrangement, and memory usage to ensure high-performance execution in deployment scenarios.
AVI	Ascend Virtual Instance. The use of resource virtualization technology to partition a single NPU into multiple virtual NPU (vNPU) instances. These instances can be mounted to virtual machines or containers, allowing one NPU to support multiple concurrent compute tasks. By partitioning compute resources, AVI enables virtualized reuse and ensures secure isolation, significantly reducing the cost and entry barrier for NPU utilization while supporting on-demand multi-tenant resource management.
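
The AMP (automatic mixed precision) entry above mentions combining FP16 with a higher-precision format. The following is a minimal sketch of why the higher-precision copy matters, not a description of any Ascend implementation. It uses the Python standard library's half-precision pack format to round values to FP16. The function names are hypothetical, and a Python float (FP64) stands in for the higher-precision format.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def accumulate(start: float, update: float, steps: int, low_precision: bool) -> float:
    """Add `update` to `start` `steps` times, optionally rounding to FP16 after each step."""
    value = start
    for _ in range(steps):
        value = value + update
        if low_precision:
            value = to_fp16(value)
    return value


# Near 1.0 the spacing between FP16 values is 2**-10, so an update of 1e-4
# is smaller than half a step and is rounded away every time.
fp16_result = accumulate(1.0, 1e-4, 1000, low_precision=True)
high_precision_result = accumulate(1.0, 1e-4, 1000, low_precision=False)

print(fp16_result)            # stays at the starting value
print(high_precision_result)  # close to 1.1
```

Small updates that vanish in FP16 survive in the higher-precision accumulator, which is one reason mixed-precision schemes keep part of the computation in a wider format.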

B

Term/Acronym/Abbreviation	Definition
backend	A module that interfaces the backend of the inference serving framework with the model inference layer.
batch	A set of samples used in a single iteration of model training (that is, one gradient update).
batch size	The number of samples processed in a single batch.
BIU	bus interface unit. The interface through which the AI Core interacts with the system bus.
BIOS	basic input/output system. Firmware stored on a computer motherboard that includes basic I/O control programs, power-on self-test (POST) routines, bootstrap loaders, and system configuration settings. It provides low-level hardware configuration and control functions.
BLAS	Basic Linear Algebra Subprograms. A set of software building blocks that provides optimized routines for performing basic vector and matrix operations in high-performance computing (HPC).
BOM	bill of materials. A comprehensive document used in manufacturing that lists the raw materials, primary/secondary processing flows, component breakdowns, and quantities of semi-finished and finished goods. It serves as a key reference for communication between OEMs and partners or across internal departments.
BP Point	Backpropagation point. The end position of backward operators within a training network's iteration trajectory.
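
The batch and batch size entries above can be illustrated with a short sketch. It assumes a dataset held in a plain Python list; the helper name iter_batches and the sample values are hypothetical.

```python
import math


def iter_batches(samples, batch_size):
    """Yield consecutive slices of `samples`, each holding at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]


samples = list(range(10))   # ten toy samples
batch_size = 4

batches = list(iter_batches(samples, batch_size))
# One full pass over the data needs ceil(N / batch_size) iterations;
# the last batch is smaller when N is not a multiple of batch_size.
iterations_per_pass = math.ceil(len(samples) / batch_size)

print(batches)
print(iterations_per_pass)
```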

C

Term/Acronym/Abbreviation	Definition
CA	Certificate Authority
CANN	Compute Architecture for Neural Networks. CANN is a heterogeneous compute architecture developed by Ascend for AI scenarios. It serves as a critical bridge by supporting various AI frameworks at the upper layer and managing AI processors and programming at the lower layer. As a key platform for enhancing the computing efficiency of Ascend AI Processors, CANN provides efficient and easy-to-use programming interfaces for diverse application scenarios, enabling developers to rapidly build AI applications and services on the Ascend platform.
CC	cluster computing
CCAE	Cluster Computing Autonomous Engine
CNN	convolutional neural network. A type of feedforward neural network in which artificial neurons respond to surrounding units, making it highly effective for large-scale image processing.
Cosine Similarity	Cosine similarity algorithm. An accuracy comparison algorithm with a result range of [-1, 1]. A value closer to 1 indicates higher similarity between the two sets of data, while a value closer to -1 indicates that they are diametrically opposed.
Cube	A computing unit within the AI Core responsible for matrix operations. In a single execution, the Cube unit can complete the multiplication of two 16 × 16 matrices of the FP16 data type.
container	A form of operating system virtualization. It is used to run everything from small microservices or software processes to large-scale applications. A container includes all necessary executables, binary code, libraries, and configuration files required for operation.
CPU	central processing unit
CRI	container runtime interface
CRD	custom resource definition
controller	The management core and decision-making "brain" of the cluster. It manages the operational status of all "Server" services within the cluster, handles PD identity management and decision-making, and governs resource management policies.
coordinator	The entry point for user inference requests. It receives high-concurrency inference requests and performs request scheduling, management, and forwarding, serving as the data request gateway for the entire cluster.
CP	context parallelism. A technique that employs data partitioning to split a long input sequence into multiple sub-sequences based on cp_size. Attention is calculated in blocks in parallel, and Key-Value (KV) data is exchanged between adjacent ranks using a ring topology. This optimizes first-token performance for long sequences and is ideal for accelerating P-node processing in long-sequence input scenarios.
CertTools	A set of tools used to generate, configure, encrypt, and manage certificates and keys for MindIE serving, including certificate generation, certificate/key import, and key encryption operations.
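
The Cosine Similarity entry above describes a result in the range [-1, 1]. A minimal sketch of the standard formula (dot product divided by the product of the vector norms) follows. It assumes both data sets are flattened into equal-length lists; the function name is hypothetical and is not the API of any Ascend comparison tool.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for an all-zero vector")
    return dot / (norm_a * norm_b)


same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
opposite = cosine_similarity([1.0, 0.0], [-1.0, 0.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])

print(same_direction, opposite, orthogonal)
```

Because the result depends only on direction, two tensors that differ by a constant positive scale factor still score close to 1, so this metric is usually read alongside an error-based metric.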

D

Term/Acronym/Abbreviation	Definition
daemon	In Linux/Unix systems, a daemon is a background system service process that runs independently of a controlling terminal. It typically starts during system boot and terminates when the system shuts down.
DataFlow	A complete computational flow consisting of one or more processing points organized through data queues in a data-driven manner.
DCMI	Davinci Card Management Interface
DDP	distributed data parallel
DDR	double data rate. Strictly speaking, double data rate synchronous dynamic random access memory (DDR SDRAM). DDR is developed from the SDRAM architecture and still uses the SDRAM production system, so manufacturers can produce it with minimal modifications to existing equipment, which reduces costs. Unlike traditional single data rate memory, DDR performs two read/write operations per clock cycle, one on the rising edge and one on the falling edge of the clock signal.
DECS	device-edge-cloud synergy
DL	deep learning. A branch of machine learning that utilizes algorithms consisting of multiple processing layers with complex structures or multiple non-linear transformations to create high-level abstractions of data.
DMA	direct memory access. A critical feature of modern computing that allows hardware devices of varying speeds to communicate directly with memory, bypassing the CPU to reduce heavy interrupt overhead.
DP	data parallelism. A common parallel strategy in large-scale deep learning training where each process (device) maintains a complete copy of the model and its parameters but processes a different subset of the data.
DPC	Distributed Parallel Client
DRAM	dynamic random access memory. A type of primary computer memory used to temporarily store data and instructions required by the CPU for processing.
DSL	domain specific language. An operator development method where developers express the computational logic through DSL interfaces. Subsequent tasks such as operator scheduling, optimization, and compilation are automatically handled by existing interfaces.
DSCP	differentiated services code point. Based on the Diff-Serv QoS classification standard, DSCP uses 6 bits of the Type of Service (ToS) byte in the IP header to differentiate traffic priorities. It combines the IP Precedence and Type of Service fields to maintain backward compatibility with older routers. Each DSCP value maps to a defined Per-Hop Behavior (PHB), allowing end devices to mark and classify traffic.
bandwidth	The range of frequencies that a transmission line or channel in a network can carry. It is the difference between the highest and lowest frequencies of the channel. Greater bandwidth typically results in faster data transmission rates.
single-operator comparison	A method of tensor comparison within accuracy comparison tools. It involves selecting one or more specific operators in a network model to analyze their computational accuracy.
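
The DSCP entry above states that the code point occupies 6 bits of the ToS byte. The sketch below shows the bit arithmetic, assuming the standard layout in which those are the six high-order bits of the byte. The function names are hypothetical.

```python
def dscp_from_tos(tos_byte: int) -> int:
    """Extract the 6-bit DSCP value from the 8-bit ToS / DS field."""
    if not 0 <= tos_byte <= 0xFF:
        raise ValueError("the ToS field is a single byte")
    return tos_byte >> 2


def tos_from_dscp(dscp: int, low_bits: int = 0) -> int:
    """Build the 8-bit field from a DSCP value and the two remaining low-order bits."""
    if not 0 <= dscp <= 0x3F:
        raise ValueError("DSCP is a 6-bit value (0 to 63)")
    if not 0 <= low_bits <= 0x3:
        raise ValueError("only two low-order bits remain")
    return (dscp << 2) | low_bits


# DSCP 46 (the Expedited Forwarding code point) corresponds to a field value of 0xB8.
print(hex(tos_from_dscp(46)))
print(dscp_from_tos(0xB8))
```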

E

Term/Acronym/Abbreviation	Definition
ECC	error checking and correction. A technology that adds check bits to the original data bits to detect and correct data errors.
eMMC	embedded MultiMediaCard. A managed flash memory storage system. It features an external interface similar to an SD card, internal flash storage media, and an integrated bad block management system.
epoch	One complete pass of the entire dataset through the training algorithm.
EULA	End User License Agreement
ESN	equipment serial number. A unique string that identifies a device. It is a critical key for binding licenses to specific hardware, also known as a "device fingerprint."
EndPoint	An inference serving protocol and API wrapper compatible with third-party framework interfaces such as Triton, OpenAI, TGI, and vLLM.
EP	Expert parallelism. A model parallelism technique that partitions parameters by assigning different experts in a Mixture of Experts (MoE) model to different devices. A gating mechanism routes inputs to specific experts, activating the corresponding devices.
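
The EP (expert parallelism) entry above describes experts placed on different devices and a gating mechanism that routes each input. The following toy sketch illustrates that idea only. The round-robin placement, the top-1 routing rule, and all names are assumptions made for illustration and are not taken from MindIE or any other framework.

```python
def place_experts(num_experts: int, num_devices: int) -> dict:
    """Map each expert index to a device index (hypothetical round-robin policy)."""
    return {expert: expert % num_devices for expert in range(num_experts)}


def route(gate_scores, placement):
    """Pick the expert with the highest gate score and return (expert, device)."""
    expert = max(range(len(gate_scores)), key=lambda e: gate_scores[e])
    return expert, placement[expert]


placement = place_experts(num_experts=8, num_devices=4)

# Gate scores for one token; in a real model they come from a learned gating network.
gate_scores = [0.10, 0.05, 0.60, 0.02, 0.03, 0.10, 0.05, 0.05]
expert, device = route(gate_scores, placement)

print(placement)
print(expert, device)
```

Only the device that holds the selected expert does work for this token, which is how the parameters of a large MoE model can be spread across devices.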

F

Term/Acronym/Abbreviation	Definition
Faiss	An open-source library developed by Meta (formerly Facebook) for efficient similarity search and clustering of dense vectors.
FEC	forward error correction. A digital signal processing technique used to enhance data reliability by introducing redundant data, allowing the receiver to detect and correct errors during data transmissions.
FFT	fast Fourier transform. An algorithm that computes the discrete Fourier transform (DFT) or its inverse (IDFT). It converts signals from their original domain (often time or space) to the frequency domain, and vice versa.
FFTS	function flow task scheduler. A data-flow-driven parallel scheduling mechanism. It utilizes a subgraph data management unit (DMU) mechanism to eliminate unnecessary direct memory access (DMA) copy overhead and provides sub-task threading and inter-thread scheduling to maximize hardware parallelism across AI Cores or AI Vectors, achieving effective operator fusion.
Flash Attention	An IO-aware, exact attention algorithm used for model acceleration. It speeds up attention computations and reduces memory footprint without approximation. It is widely implemented in LLMs such as Llama and GPT-3.
FLOPS	floating-point operations per second. A measure of computer performance, particularly in scientific computing fields involving heavy floating-point calculations. Note that the "S" stands for "second" and is not a plural indicator, thus it should never be omitted.
FP Point	Forward-propagation point. The starting position of forward operators within a training network's iteration trajectory.
FUSE	Filesystem in Userspace. An operating system mechanism that allows non-privileged users to create their own file systems without editing kernel code. It is supported in Linux through a kernel module and utilized by file systems like ZFS, GlusterFS, and Lustre.
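
The FFT entry above says the algorithm computes the discrete Fourier transform. As a minimal sketch, the recursive radix-2 variant below handles inputs whose length is a power of two and is compared against a direct DFT. It is one textbook formulation, not the implementation used by any particular library.

```python
import cmath


def fft(signal):
    """Recursive radix-2 FFT; the input length must be a power of two."""
    n = len(signal)
    if n == 0:
        raise ValueError("signal must not be empty")
    if n == 1:
        return [complex(signal[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(signal[0::2])   # transform of even-indexed samples
    odd = fft(signal[1::2])    # transform of odd-indexed samples
    result = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        result[k] = even[k] + twiddle
        result[k + n // 2] = even[k] - twiddle
    return result


def dft(signal):
    """Direct O(n^2) discrete Fourier transform, used here as a reference."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


signal = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
fast = fft(signal)
slow = dft(signal)
max_difference = max(abs(a - b) for a, b in zip(fast, slow))
print(max_difference)  # differs from the direct DFT only by rounding error
```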

G

Term/Acronym/Abbreviation	Definition
GDAT	gradient auto tuning. An optimization tool that minimizes communication tail latency by maximizing the parallelism between backward computation and gradient aggregation. In distributed training, fusion strategies for gradient aggregation operators impact the communication overhead after backward passes, thereby affecting overall cluster performance and scaling linearity.
GDB	GNU debugger. A standard debugging tool for monitoring the internal execution of programs or analyzing crashes. GDB supports four main operations to help locate defects: starting programs with specific parameters; pausing execution under defined conditions; inspecting program state upon termination or pause; and modifying program content to test fixes.
GE	graph engine. A core component providing graph/operator intermediate representation (IR) as a secure and intuitive interface for model building. It allows users to build network models, define computational graphs and operators, and configure associated attributes.
GM	Global memory. The main memory on the device side. It serves as the external storage for the AI Core and is used for large-scale data, requiring optimized access patterns to maximize throughput.
gRPC	Google Remote Procedure Call
GRPO	Group Relative Policy Optimization. A reinforcement learning (RL) algorithm designed to enhance reasoning capabilities in LLMs. Unlike traditional RL methods that rely on external value functions, GRPO optimizes models by evaluating relative performance within groups of generated responses, significantly improving training efficiency.
management plane	The architectural layer or network segment where health status and monitoring interfaces reside.
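
The GRPO entry above refers to evaluating relative performance within a group of generated responses. One common formulation standardizes each response's reward against the mean and standard deviation of its group. The sketch below assumes that formulation; the population standard deviation, the small eps term, and the function name are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against the other responses in the same group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every response receives the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four responses sampled for the same prompt (toy values).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

Responses that score above the group average get a positive advantage and the rest get a negative one, so no separate value function is needed to provide a baseline.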

H

Term/Acronym/Abbreviation	Definition
HCC	Huawei Compiler Collection
HCCL	Huawei Collective Communication Library. A library providing high-performance collective communication functions for distributed deep learning across multiple servers.
HCCP	Huawei Collective Communication Adaptive Protocol. A protocol layer providing cross-NPU communication capabilities while abstracting away differences in underlying transport protocols for upper-layer applications.
HCCS	Huawei Cache Coherence System. A system designed for high-speed interconnect between CPUs and NPUs.
HDC	host-device communication. A communication module deployed on both the host and device sides to facilitate data exchange between them.
HDK	hardware developer kit
HDR	high dynamic range. A technique used in imaging and audio to reproduce a greater range of luminosity or signal levels than standard digital techniques.
HPA	HorizontalPodAutoscaler. A Kubernetes feature that automatically scales the number of Pods in a workload (such as a Deployment or StatefulSet) based on observed CPU utilization or other selected metrics.
HPO	hyperparameter optimization. The process of automating the search for the optimal or near-optimal hyperparameters of a machine learning model, replacing manual tuning with algorithmic search strategies.
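
The HPA entry above describes scaling based on observed metrics. The Kubernetes documentation gives the core rule as desiredReplicas = ceil(currentReplicas × currentMetricValue / desiredMetricValue). The sketch below applies only that rule. The function name and the clamping arguments are hypothetical, and the real controller additionally applies a tolerance band and stabilization behavior that are omitted here.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Apply the basic HPA scaling rule, then clamp to the allowed replica range."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * (current_metric / target_metric))
    return max(min_replicas, min(max_replicas, raw))


# Four Pods averaging 90% CPU against a 60% target scale out to six Pods.
scaled_out = desired_replicas(4, current_metric=90, target_metric=60)
# Two Pods averaging 50% against a 100% target scale in to one Pod.
scaled_in = desired_replicas(2, current_metric=50, target_metric=100)

print(scaled_out, scaled_in)
```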

I

Term/Acronym/Abbreviation	Definition
ICS	Intellectual Collaborative Service
IOPS	input/output operations per second. An input/output performance measurement used to characterize computer storage devices.
IPC	IP camera
ISP	image signal processing. A method or specialized hardware unit used to process raw data from image sensors to render a high-quality digital image, ensuring compatibility across different sensor manufacturers.
ISV	independent software vendor

J

Term/Acronym/Abbreviation	Definition
JDK	Java Development Kit. A software development environment used for developing Java applications, containing a collection of tools and libraries.
JPEGD	JPEG decoder. A specialized hardware or software module that provides the capability to decode images from the JPEG format.
JPEGE	JPEG encoder. A specialized hardware or software module that provides the capability to encode images into the JPEG format.

K

Term/Acronym/Abbreviation	Definition
KMC	Key Management CBB. A module designed to facilitate code sharing and simplify development. It implements core functions such as encrypted key storage and encryption/decryption to enable rapid product integration.
KMC	Key Management Center. A centralized system used to manage and protect cryptographic keys. It provides secure key storage, distribution, rotation, backup, and recovery. The KMC keystore ensures key security and reliability while supporting multiple encryption algorithms and key lengths across various application scenarios.
KL divergence	Kullback-Leibler divergence. An accuracy comparison algorithm used to measure the difference between two probability distributions. Values range from 0 to infinity. A lower KL divergence indicates a closer match between the true and approximate distributions.
Kubernetes	An open-source system for automating the deployment, scaling, and management of containerized applications. It provides a platform for automated deployment, scaling, and operation of application containers across clusters of hosts.
KASLR	Kernel address space layout randomization. A security mechanism that randomizes the memory address layout of the kernel, increasing the difficulty of exploiting kernel vulnerabilities.
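
The KL divergence entry above gives a range of 0 to infinity. A minimal sketch of the discrete form, the sum of p·ln(p/q) over all outcomes, follows. It assumes both inputs are already normalized probability distributions of equal length; the function name is hypothetical.

```python
import math


def kl_divergence(p, q):
    """Return D_KL(p || q) in nats for two discrete probability distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue          # a zero-probability outcome contributes nothing
        if q_i == 0.0:
            return math.inf   # q rules out an outcome that p allows
        total += p_i * math.log(p_i / q_i)
    return total


identical = kl_divergence([0.5, 0.5], [0.5, 0.5])
different = kl_divergence([0.5, 0.5], [0.25, 0.75])
disjoint = kl_divergence([1.0, 0.0], [0.0, 1.0])

print(identical, different, disjoint)
```

The measure is not symmetric: swapping the two arguments generally gives a different value, so the order of the true and approximate distributions matters.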

L

Term/Acronym/Abbreviation	Definition
L0A buffer	An internal physical storage unit within the AI Core, typically used to store the left matrix for matrix multiplication. It corresponds to the logical memory AscendC::TPosition::A2.
L0B buffer	An internal physical storage unit within the AI Core, typically used to store the right matrix for matrix multiplication. It corresponds to the logical memory AscendC::TPosition::B2.
L0C buffer	An internal physical storage unit within the AI Core, typically used to store the results of matrix computation. It corresponds to the logical memory AscendC::TPosition::CO1.
L1 buffer	An internal physical storage unit within the AI Core with a relatively large capacity, typically used to cache input data for matrix multiplication. Input data is generally moved from global memory (GM) to the L1 buffer, and then to the L0A and L0B buffers. It corresponds to logical memory AscendC::TPosition::A1 and AscendC::TPosition::B1.
L2 cache	level 2 cache. A secondary CPU cache used to provide faster access to frequently used data and instructions before accessing the main memory.
LLDP	Link Layer Discovery Protocol. A layer 2 discovery protocol defined in IEEE 802.1AB. It enables network management systems to quickly acquire layer 2 network topology and change information as the network scales.
LLM	large language model. A type of language model consisting of artificial neural networks with a massive number of parameters (typically billions or more), trained on large datasets of unlabeled text using self-supervised or semi-supervised learning.
local memory	The internal storage of the AI Core, including storage units such as the L1 buffer, L0A buffer, L0B buffer, L0C buffer, and unified buffer.
loss	The deviation between predicted values and actual values, serving as a primary metric in deep learning to evaluate model performance.
LTO	link time optimization. A type of program optimization performed by a compiler during the linking stage.
adjacency list	A common data structure in graph theory and computer science used to represent a graph, where each vertex stores a list or array of all other vertices to which it is connected.
LoRA	low-rank adaptation. A parameter-efficient fine-tuning (PEFT) method for large-scale models.
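
The adjacency list entry above describes each vertex storing a list of the vertices it connects to. A minimal sketch for an undirected graph follows; the helper name and the edge data are hypothetical.

```python
from collections import defaultdict


def build_adjacency_list(edges):
    """Build an undirected adjacency list from (u, v) edge pairs."""
    graph = defaultdict(list)
    for u, v in edges:
        graph[u].append(v)   # v is a neighbor of u
        graph[v].append(u)   # and u is a neighbor of v
    return dict(graph)


edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
graph = build_adjacency_list(edges)

print(graph)
print(len(graph["c"]))  # degree of vertex "c"
```

Storage grows with the number of vertices plus the number of edges, which is why this representation is commonly preferred over an adjacency matrix for sparse graphs.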

M

Term/Acronym/Abbreviation	Definition
MAC	media access control. A data link layer protocol that manages how multiple devices share a common transmission medium to prevent data collisions.
Max Absolute Error	maximum absolute error. An accuracy comparison algorithm with a range from 0 to infinity. A value closer to 0 indicates higher similarity, while a larger value indicates a greater discrepancy.
Max Relative Error	maximum relative error. An accuracy comparison algorithm with a range from 0 to infinity. A value closer to 0 indicates higher similarity, while a larger value indicates a greater discrepancy.
MCU	microcontroller unit. An integrated circuit that integrates multiple functional modules such as the processor, memory, and input/output interfaces.
Mean Absolute Error	mean absolute error. An accuracy comparison algorithm with a result ranging from 0 to infinity. It is interpreted together with the root mean square error (RMSE). If both the mean absolute error (MAE) and the RMSE tend to 0, the measured values are close to the actual values. If the MAE tends to 0 while the RMSE is large, there are local outliers with excessively large values. If the MAE is large and the RMSE is equal or close to the MAE, the overall deviation is highly concentrated. If the MAE is large and the RMSE is significantly larger than the MAE, there is an overall deviation and it is scattered. These cases are exhaustive because RMSE ≥ MAE always holds.
Mean Relative Error	mean relative error. An accuracy comparison algorithm with a result ranging from 0 to infinity. A value closer to 0 indicates higher similarity.
MemFS	memory file system
MindIE	Mind Inference Engine. A high-performance deep learning inference framework optimized for Ascend hardware, supporting acceleration, debugging, tuning, and rapid deployment.
MindFormers	MindSpore Transformers. An end-to-end suite based on the MindSpore framework. It supports the entire lifecycle of LLMs, including training, fine-tuning, evaluation, and deployment.
MindIO	A memory-based caching system designed to accelerate the read/write speeds of training checkpoints.
MinIO	An object storage service component.
MLP	multilayer perceptron. A feed-forward artificial neural network consisting of an input layer, one or more hidden layers, and an output layer. MLP can be used to solve a variety of problems, such as classification and regression. Due to its powerful representational capabilities, MLP is widely applied in many fields, including image recognition, natural language processing, and more.
MoE	Mixture of experts. It is a technology used to train models with trillions of parameters. MoE decomposes predictive modeling tasks into several sub-tasks, training an expert model for each sub-task and developing a gating model. This gating model assigns one or more experts based on the input data, and finally integrates the calculation results from multiple experts to produce the prediction result.
msDebug	An operator debugging tool. It provides native environment debugging on Ascend processors, featuring flexible variable inspection and step-by-step execution.
msKPP	A performance modeling and tuning tool designed for operator theoretical performance and template libraries. In the performance modeling phase, the tool utilizes built-in operator API performance data, enabling users to express implementation algorithms and evaluate performance during the initial design stage. In the template library tuning phase, it provides capabilities for the generation, compilation, and execution of template library kernel dispatch code. Additionally, it supports code replacement within the kernel combined with automatic performance tuning.
msProf	An operator profiling tool. It collects performance data from both hardware and simulation, visualized through MindStudio Insight to identify performance bottlenecks.
msproftx	msProf tool extension. An extension for the MindStudio system profiling tools.
msSanitizer	An operator anomaly detection tool. It provides memory detection and contention detection capabilities, supporting precise localization of memory issues in multi-core programs.
MTE	memory transfer engine. Also known as the load-store unit (LSU), it manages data read/write between different buffers within the AI Core and handles format conversions.
MTE1	Memory transfer engine 1. Tiered memory transfer engine responsible for data movement from the L1 buffer to the L0A buffer or L0B buffer based on hardware capabilities.
MTE2	Memory transfer engine 2. Tiered memory transfer engine responsible for data movement from the global memory to the L1 buffer, L0A buffer, L0B buffer, or unified buffer based on hardware capabilities.
MTE3	Memory transfer engine 3. Tiered memory transfer engine responsible for data movement from the unified buffer to the global memory or L1 buffer based on hardware capabilities.
MTU	maximum transmission unit. The maximum data packet size that can be transmitted over a network. The size varies depending on the network type. For example, it is 576 bytes in X.25 networks, 1500 bytes in Ethernet, and 17,914 bytes in 16 Mbit/s Token Ring. The MTU size is determined by the link layer of the network. When a packet is transmitted across multiple networks, the path MTU (PMTU) is the smallest MTU among all networks on the path, that is, the maximum packet size that can be transmitted across the entire path without fragmentation.
MindIE SD	Mind Inference Engine Stable Diffusion, a suite of visual generation inference models within the MindIE ecosystem.
MindIE Turbo	Mind Inference Engine Turbo, an acceleration plugin library developed for LLM inference on Ascend hardware.
MindIE Motor	Mind Inference Engine Motor, a request scheduling framework specifically designed for LLM PD (prefill-decode) disaggregation inference. It provides inference serving capabilities through an open and extensible platform architecture, and interfaces downstream with MindIE LLM to meet the high-performance inference requirements of large language models.
MindIE LLM	Mind Inference Engine Large Language Model, the dedicated inference component for large language models within the MindIE framework.
MLA	Multi-head Latent Attention, an efficient attention mechanism that uses low-rank KV joint compression to eliminate KV cache bottlenecks during inference.
MTP	Multi-token prediction, a parallel decoding method introduced by DeepSeek to generate multiple tokens in a single step. The core logic is that the model does not limit itself to predicting only the next single token. Instead, it predicts multiple subsequent tokens simultaneously, significantly accelerating model generation speeds.
MindIE Service Tools	A toolset for Ascend inference services, featuring performance/accuracy testing, visualization, automated optimization, and configurable throughput.
MindIE Simulator	An automated service performance tuning tool that simulates various strategies to find optimal parameters under latency constraints.
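
The Mean Absolute Error entry above compares the MAE with the RMSE to tell concentrated deviations from scattered ones. The sketch below computes both with their usual definitions on two toy data sets that have the same MAE. The function names and the data are hypothetical.

```python
import math


def mae(measured, actual):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(m - a) for m, a in zip(measured, actual)) / len(actual)


def rmse(measured, actual):
    """Root mean square error between two equal-length sequences."""
    return math.sqrt(sum((m - a) ** 2 for m, a in zip(measured, actual)) / len(actual))


actual = [0.0, 0.0, 0.0, 0.0]
uniform_offset = [1.0, 1.0, 1.0, 1.0]   # every element is off by the same amount
single_outlier = [0.0, 0.0, 0.0, 4.0]   # one element carries the whole error

print(mae(uniform_offset, actual), rmse(uniform_offset, actual))  # RMSE equals MAE
print(mae(single_outlier, actual), rmse(single_outlier, actual))  # RMSE exceeds MAE
```

Both data sets have the same MAE, but squaring the errors makes the RMSE grow when the error is concentrated in a few elements, which is what the comparison in the entry relies on.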

N

Term/Acronym/Abbreviation	Definition
NCS	Neural Compute Server. NCS encapsulates AscendCL runtime service interfaces to accept remote hardware execution requests and return corresponding performance data.
NIC	network interface controller. Also known as network interface card, network adapter, LAN adapter, or other similar terms. It refers to a hardware component that connects a computer to a computer network.
NLP	natural language processing. A subdiscipline of artificial intelligence and linguistics that explores how to process and utilize natural language. NLP involves various aspects and stages, primarily including cognition, understanding, and generation.
NN	neural network. In the fields of machine learning and cognitive science, a neural network is a mathematical model or computing model that emulates the structure and functions of a biological neural network.
NPU	Neural-Network Processing Unit. Utilizing a "data-driven parallel computing" architecture, it is specifically designed to handle massive computational tasks in AI applications.
NUMA	non-uniform memory access. NUMA is a distributed memory access architecture where processors can access different memory addresses simultaneously to significantly enhance parallelism. In this mode, processors are divided into multiple nodes, with each node allocated its own local memory space. While processors in any node can access all physical memory, the latency for accessing local memory is much lower than that for accessing remote nodes.
NVMe	Non-Volatile Memory Express. A logical device interface specification. It is a bus transport protocol based on a logical device interface (equivalent to the application layer in communication protocols), used to access non-volatile storage media (such as flash-based SSDs) attached through the PCI Express (PCIe) bus.

O

Term/Acronym/Abbreviation	Definition
OM	offline model
ONNX	Open Neural Network Exchange ONNX is an open-source file format designed for machine learning to store trained models. It enables different AI frameworks to share and exchange model data using a unified format.
OOM	out of memory
OP	Operator. An operator is the fundamental unit for executing specific mathematical calculations or operations within deep learning algorithms, such as activation functions (for example, ReLU), convolution, pooling, and normalization (for example, Softmax). Neural network models are constructed by combining these operators.
OPAT	operator auto-tuning OPAT is an optimizer designed to enhance operator performance. When AOE inputs a full graph into OPAT, OPAT performs operator fusion internally and partitions the fused graph at the operator level. It generates different tuning strategies for each fused operator subgraph to achieve optimal performance, subsequently saving these strategies in the operator knowledge base.
OpenPGP	Open Pretty Good Privacy Pretty Good Privacy (PGP) is an encryption program that provides cryptographic privacy and authentication for data communication. It is commonly used for signing, encrypting, and decrypting texts, emails, and files. OpenPGP is a non-proprietary protocol that defines a unified standard for encrypted messages, signatures, private keys, and certificates used for public key exchange.
OPP	operator package
OS	operating system
OS	optimizer state
OCI	Open Container Initiative Established by the Linux Foundation in June 2015, the OCI aims to create open industry standards for container formats and runtimes.
OM Adapter	Reports MindIE heartbeats, alarms, resource information, and logs to external alarm and management platforms, enabling service status monitoring and integration with management systems.

P

Term/Acronym/Abbreviation	Definition
PCIe	Peripheral Component Interconnect Express A high-speed serial expansion bus standard commonly used for peripheral expansion in computer systems.
PCB	printed circuit board
PFC	priority-based flow control A flow control mechanism based on priorities.
PMU	performance monitor unit A hardware unit provided by the CPU that enables the reading of CPU performance data by accessing relevant registers.
PNGD	PNG decoder A component that provides the capability to decode images in PNG format.
Pod	The smallest deployable unit that can be created in Kubernetes and a top-level resource type in the Kubernetes REST API.
PP	pipeline parallelism A technique that distributes different layers of a model across various computing devices to reduce individual device memory consumption, enabling the training of ultra-large-scale models.
PWM	pulse width modulation A modulation technique where the pulse duration (width) of a pulse carrier varies according to the sample values of the modulating wave.
on-chip memory	Memory integrated directly onto a microprocessor chip.

Q

Term/Acronym/Abbreviation	Definition
QAT	quantization-aware training A quantization method that introduces quantization during retraining so that the model becomes robust to quantization effects, yielding higher accuracy in the quantized model.
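
The QAT entry above does not say how quantization is introduced during retraining. One common approach, assumed here purely for illustration, is "fake quantization": values are rounded to an integer grid and immediately converted back to floating point, so the forward pass sees the rounding error. The function name `fake_quantize`, the int8 range, and the fixed `scale` are hypothetical choices, not part of any Ascend tool's API.

```python
def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Round x onto an integer grid, clamp to [qmin, qmax], and dequantize.

    Assumption: symmetric int8 quantization with a fixed, hand-picked scale.
    """
    q = round(x / scale)          # nearest integer level
    q = max(qmin, min(qmax, q))   # saturate values outside the int8 range
    return q * scale              # back to float, now carrying rounding error


weights = [0.26, -0.49, 1.30, 100.0]  # illustrative values only
scale = 0.1
simulated = [fake_quantize(w, scale) for w in weights]

for w, s in zip(weights, simulated):
    print(f"original={w:8.2f}  fake-quantized={s:8.2f}  error={s - w:+.2f}")
```

In this sketch the last value saturates at the top of the int8 range, which is the kind of effect retraining is meant to make the model tolerate.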

R

Term/Acronym/Abbreviation	Definition
RDMA	Remote direct memory access, a technology that transfers data directly from the memory of one machine to the memory of another without involving the operating system of either host. It generally refers to a memory access method that spans a network.
RED	relative Euclidean distance An accuracy comparison algorithm. The computation result ranges from 0 to infinity. A result value closer to 0 indicates higher similarity, while a larger result value indicates a greater discrepancy.
RoCE	RDMA over Converged Ethernet A network protocol that enables remote direct memory access (RDMA) over Ethernet. There are currently two versions: RoCE v1 and RoCE v2. RoCE v1 is a data link layer protocol that allows communication between any two hosts within the same Ethernet broadcast domain. RoCE v2 is a network layer protocol and its packets can be routed.
RMSE	root mean square error An accuracy comparison algorithm. The result ranges from 0 to infinity. It is read together with the mean absolute error (MAE). If both the MAE and the RMSE approach 0, the measured values are close to the actual values. If the MAE approaches 0 while the RMSE is noticeably larger, local outliers with excessively large errors are present. If the MAE is large and the RMSE is equal or close to the MAE, an overall deviation is present and it is highly concentrated (the errors are of similar size). If the MAE is large and the RMSE is significantly larger than the MAE, an overall deviation is present and it is scattered. These cases are exhaustive because RMSE ≥ MAE always holds.
Runtime	Provides applications with functions such as memory management, device management, stream management, event management, and kernel loading and execution specifically for Ascend AI Processors.
RAM	random access memory A type of semiconductor-based memory that can be read and written by the CPU or other hardware devices. The storage locations can be accessed in any order.
runC	A client tool for creating and running containers according to the OCI (Open Container Initiative) specification.
RoPE	rotary position embedding A position encoding method that integrates relative position dependencies into self-attention, enhancing the performance of the Transformer architecture.
RAS	Reliability, availability, and serviceability. It refers to capabilities that enhance the reliability, availability, and serviceability of prefill-decode (PD) disaggregation services.
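
The relationship between MAE and RMSE described in the RMSE entry above can be checked with a minimal sketch. The function names and sample data are illustrative assumptions; the formulas are the standard definitions of the two metrics.

```python
import math


def mae(measured, actual):
    """Mean absolute error between two equal-length sequences."""
    return sum(abs(m - a) for m, a in zip(measured, actual)) / len(actual)


def rmse(measured, actual):
    """Root mean square error between two equal-length sequences."""
    squared = sum((m - a) ** 2 for m, a in zip(measured, actual))
    return math.sqrt(squared / len(actual))


golden = [1.0, 2.0, 3.0, 4.0]

# Concentrated deviation: every element is off by 0.5, so RMSE equals MAE.
uniform = [1.5, 2.5, 3.5, 4.5]

# Local outlier: three elements match and one is off by 2.0, so RMSE > MAE.
outlier = [1.0, 2.0, 3.0, 6.0]

for name, data in (("uniform", uniform), ("outlier", outlier)):
    print(name, "MAE =", mae(data, golden), "RMSE =", rmse(data, golden))
```

Both data sets have the same MAE, and only the RMSE distinguishes the evenly spread error from the single outlier.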

S

Term/Acronym/Abbreviation	Definition
scalar	The scalar computing unit within the AI Core. It is primarily responsible for scalar data operations and issuing instructions to other units, such as the memory transfer engine (MTE), vector unit, and cube unit.
SDMA	System direct memory access, also known as direct memory access (DMA). This technology allows peripheral devices to access system memory directly without CPU intervention.
SiP	Ascend Signal Processing Boost A signal processing acceleration library that provides a series of high-performance operators for AI models (supporting PyTorch calls) and signal processing (supporting direct C++ calls).
SGAT	subgraph auto-tuning SGAT is an optimizer that improves the performance of subgraphs. A complete network can be partitioned into multiple subgraphs. SGAT can be used to generate different tiling policies for those subgraphs. By acquiring performance data for each iteration, the SGAT algorithm identifies the optimal tuning strategy to achieve peak performance for the corresponding subgraph.
SPI	serial peripheral interface A synchronous serial communication interface that enables information exchange between the microcontroller unit (MCU) and peripherals.
SP	sequence parallelism A parallel computing method that performs column partitioning on input sequences to further improve efficiency on top of tensor parallelism (TP). Since it does not introduce additional communication overhead, it is recommended to enable SP concurrently with TP.
SRAM	static random access memory A type of computer memory that is faster and more reliable than common DRAM. It is typically used for caches, registers, and other applications requiring high-speed access.
SwiGLU	Swish-Gated Linear Units An activation function variant of gated linear units (GLU) that incorporates the Swish activation function.
SIMD	single instruction multiple data A parallel computing architecture where a single instruction processor fetches an instruction and distributes it to multiple processing elements to operate on different data points simultaneously.
SSL	Secure Sockets Layer A security protocol operating at the socket layer, situated between the TCP and application layers. It is used for data encryption/decryption and entity authentication.
standard deviation	An accuracy comparison algorithm with a result ranging from 0 to infinity. A smaller standard deviation indicates lower dispersion, meaning values are closer to the mean.
STARS	system task and resource scheduler
spine-leaf	A two-tier network topology consisting of leaf switches and spine switches. Spine switches act as the core, typically using high-port-density switches rather than traditional large chassis switches. Leaf switches serve as the access layer, providing connectivity to endpoints and servers while up-linking to the spine switches. This topology is designed to handle rapid traffic growth and large-scale data center expansion, overcoming the limitations of traditional three-tier architectures in high-speed internal interconnection.
sample-based	A profiling method where AI Core performance data is collected at fixed time intervals (AI Core-sampling interval).
step trace	Iteration trace. It captures start and end times for forward and backward passes, gradient updates, and data augmentation trailing phases.
ST	system test A black-box testing phase based on the system requirement specifications, covering all integrated components. It evaluates the complete product system to verify compliance with requirement specifications and identify discrepancies. The scope includes not only the software but also the underlying hardware, peripherals, data, support software, and interfaces, requiring testing within the system's actual operating environment.
Ascend Virtual Instance (AVI)	A feature that uses resource virtualization to partition a single NPU into multiple virtual NPU (vNPU) instances. These instances can be mounted to VMs or containers, allowing one NPU to support multiple concurrent tasks. By partitioning compute resources, AVI enables virtualized reuse and ensures secure isolation, significantly reducing the cost and entry barrier for NPU utilization while supporting on-demand multi-tenant resource management.
SLO	service-level objective A target value or range of values for a specific service level that is measured over a predefined period.
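
The SwiGLU entry above can be made concrete with a small sketch. It assumes the commonly used form, in which a Swish-activated gate projection is multiplied elementwise with a linear value projection, and it omits bias terms. The weight matrices and helper names are hypothetical.

```python
import math


def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))


def matvec(weights, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def swiglu(x, w_gate, w_value):
    """Elementwise product of Swish(w_gate @ x) and (w_value @ x)."""
    gate = matvec(w_gate, x)
    value = matvec(w_value, x)
    return [swish(g) * v for g, v in zip(gate, value)]


x = [1.0, -2.0]
w_gate = [[1.0, 0.0], [0.0, 1.0]]    # hypothetical gate projection
w_value = [[2.0, 0.0], [0.0, 3.0]]   # hypothetical value projection

out = swiglu(x, w_gate, w_value)
print(out)
```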

T

Term/Acronym/Abbreviation	Definition
task-based	A profiling method where AI Core performance data is collected at the task level.
TCP	Transmission Control Protocol
TDP	thermal design power The maximum amount of heat generated by a computer chip or component that the cooling system in a computer is designed to dissipate under any workload.
tensor	A container for data used in operator computations. It is an N-dimensional data structure, most commonly represented as a scalar, vector, or matrix. Tensor elements can include integers, floating-point values, or strings.
TFT	training fault tolerance
TIK	Tensor Iterator Kernel An operator development method that allows developers to write custom operators using Python-based APIs provided by TIK. The TIK compiler converts these into binary files compatible with Ascend AI Processor applications.
TGI	Text Generation Inference A toolkit for deploying and serving LLMs. TGI enables high-performance text generation for popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, and GPT-NeoX.
TLS	Transport Layer Security
TP	tensor parallelism A technique that partitions tensors within a network across different devices to reduce memory consumption per device, enabling the training of ultra-large-scale models.
Triton	Triton Inference Server An open-source inference serving software that lets teams deploy, run, and scale trained AI models from any framework on any GPU- or CPU-based infrastructure (cloud, data center, or edge).
TTP	try to persist
TPOT	Time per output token. The latency for each output token, excluding the first token. In offline batch processing applications, TPOT is a critical metric as it determines the overall duration of the inference process.
TTFT	Time to first token. A key metric in LLM inference representing the latency from the initial input to the generation of the first output token.
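
The TTFT and TPOT entries above can be illustrated with a minimal sketch that derives both metrics from token arrival timestamps. The function names and timestamps are hypothetical, and averaging TPOT over all tokens after the first is an assumption about how the per-token latency is aggregated.

```python
def ttft(request_time, token_times):
    """Latency from request submission to the first output token."""
    return token_times[0] - request_time


def tpot(token_times):
    """Average latency per output token, excluding the first token."""
    if len(token_times) < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)


# Hypothetical timestamps in seconds: request sent at t=0, five tokens returned.
request_time = 0.0
token_times = [0.40, 0.45, 0.50, 0.55, 0.60]

print(f"TTFT = {ttft(request_time, token_times):.3f} s")
print(f"TPOT = {tpot(token_times):.3f} s/token")
```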

U

Term/Acronym/Abbreviation	Definition
UCE	uncorrectable memory error
Unified Buffer	An internal storage unit within the AI Core, primarily used for vector computations.
UDF	user-defined function
UUID	universally unique identifier A standard used in software construction and part of the distributed computing environment (DCE) as defined by the Open Software Foundation (OSF).
UT	unit test The lowest level of testing performed during software development, where individual units of software are tested in isolation from the rest of the application.
UDP	User Datagram Protocol A protocol in the TCP/IP suite that provides a simple interface between the network and application layers. UDP provides unreliable data transfer: once data is handed to the network layer, no copy is retained for retransmission. Beyond what IP provides, UDP adds only port-based multiplexing and a checksum.

V

Term/Acronym/Abbreviation	Definition
vcjob	VolcanoJob, a job type managed by Volcano, a batch scheduling system for Kubernetes.
VDEC	video decoder A component that provides the capability to decode video streams of specific formats.
VENC	video encoder A component that provides the capability to encode images into video streams of specific formats.
vector	The vector computing unit within the AI Core, responsible for performing vector operations. It offers lower computing power than the cube unit but higher flexibility (for example, supporting reciprocal and square root operations).
vLLM	An open-source high-throughput serving and memory-efficient inference engine for LLMs.
VPC	vision preprocessing core A hardware unit for processing images in formats such as YUV and RGB, supporting functions like resizing, cropping, image pyramid generation, and color space conversion.

W

Term/Acronym/Abbreviation	Definition
watchdog	A hardware device (typically a timer or driver) used to monitor whether a continuously running system is functioning correctly. It communicates with system software through dedicated drivers. As a timer used to monitor software resource states, it starts counting automatically after the program launches. The program must periodically reset the counter (known as "feeding the dog"). If the counter overflows due to a timeout, a watchdog interrupt is triggered, causing a system reset to prevent infinite loops or hangs.
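
The feed-or-reset behavior in the watchdog entry above can be sketched as a small software simulation. This is not a hardware driver: the `Watchdog` class, its tick-based counter, and the timeout value are all hypothetical.

```python
class Watchdog:
    """Counter-based watchdog simulation (an illustration, not a driver)."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.counter = 0
        self.reset_triggered = False

    def feed(self):
        """Reset the counter ("feeding the dog")."""
        self.counter = 0

    def tick(self):
        """Advance one timer tick and flag a reset once the timeout is reached."""
        self.counter += 1
        if self.counter >= self.timeout_ticks:
            self.reset_triggered = True
        return self.reset_triggered


# A healthy program feeds the watchdog on every iteration.
healthy = Watchdog(timeout_ticks=3)
for _ in range(10):
    healthy.tick()
    healthy.feed()

# A hung program never feeds it, so the counter reaches the timeout.
hung = Watchdog(timeout_ticks=3)
for _ in range(3):
    hung.tick()

print("healthy reset:", healthy.reset_triggered)
print("hung reset:", hung.reset_triggered)
```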

Y

Term/Acronym/Abbreviation	Definition
service plane	The plane where inference and other service interfaces reside.

Z

Term/Acronym/Abbreviation	Definition
ZeRO	Zero Redundancy Optimizer An optimizer designed to address memory bottlenecks in large-scale distributed training. It optimizes memory usage by eliminating redundant data, enabling the training of larger models. Compared with traditional data parallelism, ZeRO significantly improves memory efficiency while maintaining computation granularity and communication efficiency.
network-wide comparison	A method of tensor comparison within accuracy comparison tools. It involves performing accuracy comparisons across all operators involved in computations within a network model.
