Ascend Glossary

A

Term/Acronym/Abbreviation

Definition

AccDECS

Accelerator for Device-Edge-Cloud Synergy

accumulated relative error

Accumulated relative error algorithm.

An accuracy comparison algorithm with a result ranging from 0 to infinity. A value closer to 0 indicates higher similarity, while a larger value indicates a greater discrepancy.

accuracy comparison

Accuracy comparison.

The process of comparing dump data generated on an NPU with the ground truth (.npy data generated on a GPU or CPU). It is used to analyze the output differences between proprietary operators and industry-standard operators.

ACP

async checkpoint persistence

AI

artificial intelligence

A technical science that researches and develops theories, methods, technologies, and application systems to simulate, extend, and expand human intelligence.

AI Core

The computing core of the Ascend AI Processor, responsible for executing compute-intensive matrix, vector, and scalar tasks. Operators developed using the Ascend C programming language run on AI Cores.

AI CPU

A general-purpose CPU provided on the Ascend AI Processor, primarily responsible for executing AI CPU operators and scheduling deterministic tasks.

AIPP

artificial intelligence pre-processing

A feature used to perform image preprocessing on AI Cores, including image resizing, color space conversion (CSC), and mean subtraction/multiplication (pixel adjustment), prior to model inference.

AMCT

Ascend Model Compression Toolkit

A deep learning model compression toolkit optimized for Ascend AI Processors. It provides features such as quantization and tensor decomposition to reduce model size. Once deployed on Ascend AI Processors, compressed models enable low-bit operations, improving computing efficiency and overall performance.

AMP

asymmetric multiprocessing

A multiprocessing architecture where multiple processors exist, and each CPU is assigned a specific task at any given time. Before symmetric multiprocessing matured, it was used as a software workaround to enable multiple processors to run simultaneously. Even with the advent of symmetric multiprocessing, asymmetric multiprocessing remains a simpler and more cost-effective software option for certain applications.

AMP

automatic mixed precision

A technique in deep learning to accelerate training and improve efficiency. It achieves this by combining different numerical precisions, typically low-precision floating-point formats (for example, FP16) and high-precision formats (for example, FP32).
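
Why some values are kept in FP32 can be seen in a minimal Python sketch. It emulates FP16 and FP32 rounding with the standard struct module. The helper names to_fp16 and to_fp32 and the update size are illustrative assumptions, not part of any Ascend API.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]

def to_fp32(x):
    """Round a Python float to the nearest IEEE 754 single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

update = 1e-4        # hypothetical small weight update
w16 = to_fp16(1.0)   # weight stored only in FP16
w32 = to_fp32(1.0)   # weight stored in FP32

for _ in range(1000):
    # Near 1.0 the FP16 spacing is about 0.001, so each small update rounds away.
    w16 = to_fp16(w16 + to_fp16(update))
    w32 = to_fp32(w32 + to_fp32(update))
```

After the loop, w16 is still 1.0 while w32 is close to 1.1, which is one reason mixed-precision schemes commonly accumulate in the higher-precision format.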

AOE

Ascend Optimization Engine

A tool that encapsulates Ascend Tensor Compiler (ATC) compilation and AscendCL runtime service interfaces to provide parallel tuning capabilities.

AOL

Ascend Operator Library

ARM

Advanced RISC Machine

The first RISC microprocessor designed by Acorn Computers Ltd. for the budget market. While the ARM processor features a 32-bit design, it also supports a 16-bit instruction set, which typically reduces code size by up to 35% compared to equivalent 32-bit code while retaining all 32-bit system advantages.

ARP

Address Resolution Protocol

An internet protocol used to map IP addresses to MAC addresses, allowing hosts and routers to determine link-layer addresses through ARP requests and responses.

AscendCL

Ascend Computing Language

It provides APIs for runtime management, single-operator calling, model inference, and media data processing. It enables developers to utilize underlying hardware resources on the CANN platform for deep learning inference, image/video preprocessing, and accelerated single-operator computation.

Ascend EP

Ascend Endpoint

It refers to the Ascend AI Processor operating in PCIe endpoint mode.

In this setup, the host acts as the root complex and the device as the endpoint (EP). AI applications run on the host system, while the Ascend AI Processor is connected as a PCIe endpoint device. The host interacts with the device through PCIe to load and run AI tasks. The device provides neural network (NN) computing power to the host (x86, Arm, etc.), and its CPU resources are accessible only through the host.

Ascend IR

Ascend Intermediate Representation. An abstract data structure specific to Ascend AI Processors used to represent computation flows. In Ascend documentation, "IR" refers to Ascend IR unless otherwise specified.

Ascend RC

Ascend Root Complex

It refers to the Ascend AI Processor operating in PCIe root complex mode.

In this mode, the product's CPU directly runs AI service software, and external peripherals such as IP cameras, I2C sensors, and SPI displays are connected as endpoint devices.

ASLR

address space layout randomization

ATB

Ascend Transformer Boost. An acceleration library based on the Ascend AI Processor, specifically designed for the training and inference of Transformer models.

ATC

Ascend Tensor Compiler

A model conversion tool within the CANN heterogeneous computing architecture. It converts network models from open-source frameworks and Ascend IR-defined operator descriptions (in JSON format) into offline models (.om format) supported by Ascend AI Processors. During conversion, ATC optimizes operator scheduling, weight data rearrangement, and memory usage to ensure high-performance execution in deployment scenarios.

AVI

Ascend Virtual Instance

It refers to the use of resource virtualization technology to partition a single NPU into multiple virtual NPU (vNPU) instances. These instances can be mounted to virtual machines or containers, allowing one NPU to support multiple concurrent compute tasks. By partitioning compute resources, AVI enables virtualized reuse and ensures secure isolation, significantly reducing the cost and entry barrier for NPU utilization while supporting on-demand multi-tenant resource management.

B

Term/Acronym/Abbreviation

Definition

backend

A module that interfaces the backend of the inference serving framework with the model inference layer.

batch

A set of samples used in a single iteration of model training (that is, one gradient update).

batch size

The number of samples processed in a single batch.

BIU

bus interface unit

The interface through which the AI Core interacts with the system bus.

BIOS

basic input/output system

Firmware stored on a computer motherboard that includes basic I/O control programs, power-on self-test (POST) routines, bootstrap loaders, and system configuration settings. It provides low-level hardware configuration and control functions.

BLAS

Basic Linear Algebra Subprograms

A set of software building blocks that provides optimized routines for performing basic vector and matrix operations in high-performance computing (HPC).

BOM

bill of materials

A comprehensive document used in manufacturing that lists the raw materials, primary/secondary processing flows, component breakdowns, and quantities of semi-finished and finished goods. It serves as a key reference for communication between OEMs and partners or across internal departments.

BP Point

Backpropagation point. The end position of backward operators within a training network's iteration trajectory.

C

Term/Acronym/Abbreviation

Definition

CA

Certificate Authority

CANN

Compute Architecture for Neural Networks

CANN is a heterogeneous compute architecture developed by Ascend for AI scenarios. It serves as a critical bridge by supporting various AI frameworks at the upper layer and managing AI processors and programming at the lower layer. As a key platform for enhancing the computing efficiency of Ascend AI Processors, CANN provides efficient and easy-to-use programming interfaces for diverse application scenarios, enabling developers to rapidly build AI applications and services on the Ascend platform.

CC

cluster computing

CCAE

Cluster Computing Autonomous Engine

CNN

convolutional neural network

A type of feedforward neural network in which each artificial neuron responds only to the surrounding units within a limited coverage area, making it highly effective for large-scale image processing.

Cosine Similarity

Cosine similarity algorithm.

An accuracy comparison algorithm with a result range of [-1, 1]. A value closer to 1 indicates higher similarity between the two sets of data, while a value closer to -1 indicates that they are diametrically opposed.
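
A minimal sketch of the computation, using only the Python standard library. The function name and the sample vectors are illustrative assumptions and are not taken from any Ascend tool.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

reference = [1.0, 2.0, 3.0]
same_direction = cosine_similarity(reference, [2.0, 4.0, 6.0])    # close to 1
opposite = cosine_similarity(reference, [-1.0, -2.0, -3.0])       # close to -1
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])            # 0
```

Because the result depends only on direction, two outputs that differ by a constant scale factor still score close to 1, so this metric is usually read together with an absolute or relative error.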

Cube

A computing unit within the AI Core responsible for matrix operations. In a single execution, the Cube unit can complete the multiplication of two 16 × 16 matrices of the FP16 data type.

container

A form of operating system virtualization.

It is used to run everything from small microservices or software processes to large-scale applications. A container includes all necessary executables, binary code, libraries, and configuration files required for operation.

CPU

central processing unit

CRI

container runtime interface

CRD

custom resource definition

controller

The management core and decision-making "brain" of the cluster. It manages the operational status of all "Server" services within the cluster, handles PD identity management and decision-making, and governs resource management policies.

coordinator

The entry point for user inference requests. It receives high-concurrency inference requests and performs request scheduling, management, and forwarding, serving as the data request gateway for the entire cluster.

CP

context parallelism

A technique that partitions data by splitting a long input sequence into multiple sub-sequences based on cp_size. Attention is computed block by block in parallel, and key-value (KV) data is exchanged between adjacent ranks over a ring topology. This optimizes first-token performance for long sequences and is well suited to accelerating P-node processing in long-sequence input scenarios.
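
The partitioning and ring exchange can be sketched in plain Python. The function names, the even-split requirement, and the ring direction (each rank receives from its lower-numbered neighbor) are assumptions made for illustration, and no attention is actually computed.

```python
def split_sequence(tokens, cp_size):
    """Split a sequence into cp_size equal, contiguous sub-sequences."""
    if len(tokens) % cp_size != 0:
        raise ValueError("this sketch assumes the length divides evenly")
    chunk = len(tokens) // cp_size
    return [tokens[r * chunk:(r + 1) * chunk] for r in range(cp_size)]

def ring_schedule(cp_size):
    """For each step, list whose KV block every rank currently holds.

    At step 0 each rank holds its own block. At every later step it
    holds the block passed along the ring by its neighbor.
    """
    return [[(rank - step) % cp_size for rank in range(cp_size)]
            for step in range(cp_size)]

sub_sequences = split_sequence(list(range(8)), cp_size=4)
schedule = ring_schedule(4)
```

After cp_size steps every rank has seen every KV block exactly once, so each sub-sequence can attend to the full input.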

CertTools

A set of tools used to generate, configure, encrypt, and manage certificates and keys for MindIE serving, including certificate generation, certificate/key importation, and key encryption operations.

D

Term/Acronym/Abbreviation

Definition

daemon

In Linux/Unix systems, a daemon is a background system service process that runs independently of a controlling terminal. It typically starts during system boot and terminates when the system shuts down.

DataFlow

A complete computational flow consisting of one or more processing points organized through data queues in a data-driven manner.

DCMI

Davinci Card Management Interface

DDP

distributed data parallel

DDR

double data rate

Strictly speaking, double data rate synchronous dynamic random access memory (DDR SDRAM). DDR memory is developed from the SDRAM architecture and still uses the SDRAM production system, so manufacturers can produce it with minimal modifications to existing equipment, which reduces costs.

Unlike traditional single data rate memory, DDR technology performs two read/write operations per clock cycle, one on the rising edge and one on the falling edge of the clock signal.
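
The effect of two transfers per clock cycle on peak bandwidth is simple arithmetic. The sketch below uses a hypothetical 200 MHz clock and 64-bit bus, and the function name is illustrative.

```python
def peak_bandwidth_mb_s(clock_mhz, bus_width_bits, transfers_per_clock):
    """Peak transfer rate in MB/s for a given clock, bus width, and data rate."""
    mega_transfers_per_s = clock_mhz * transfers_per_clock
    return mega_transfers_per_s * bus_width_bits // 8

# Same clock and bus width, single versus double data rate.
sdr = peak_bandwidth_mb_s(200, 64, transfers_per_clock=1)
ddr = peak_bandwidth_mb_s(200, 64, transfers_per_clock=2)
```

With these numbers the single data rate case gives 1600 MB/s and the double data rate case gives 3200 MB/s.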

DECS

device-edge-cloud synergy

DL

deep learning

A branch of machine learning that utilizes algorithms consisting of multiple processing layers with complex structures or multiple non-linear transformations to create high-level abstractions of data.

DMA

direct memory access

A critical feature of modern computing that allows hardware devices of varying speeds to communicate directly with memory, bypassing the CPU to reduce heavy interrupt overhead.

DP

data parallelism

A common parallel strategy in large-scale deep learning training where each process (device) maintains a complete copy of the model and its parameters but processes a different subset of the data.
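
A toy sketch of the idea in plain Python: every rank holds the same parameter w, computes a gradient on its own data shard, and the gradients are then averaged. The one-parameter model, the strided sharding, and the function names are assumptions for illustration. Real training would use a collective such as all-reduce for the averaging step.

```python
def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)

def shard(data, world_size):
    """Give each rank every world_size-th sample."""
    return [data[rank::world_size] for rank in range(world_size)]

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
w = 1.0            # identical model copy on every rank
world_size = 2

full_gradient = grad_mse(w, xs, ys)

# Each rank sees only its own subset of the data.
local_gradients = [
    grad_mse(w, x_part, y_part)
    for x_part, y_part in zip(shard(xs, world_size), shard(ys, world_size))
]
averaged_gradient = sum(local_gradients) / world_size
```

With equally sized shards, the averaged gradient matches the gradient computed over the whole batch on a single device.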

DPC

Distributed Parallel Client

DRAM

dynamic random access memory

A type of primary computer memory used to temporarily store data and instructions required by the CPU for processing.

DSL

domain specific language

An operator development method where developers express the computational logic through DSL interfaces. Subsequent tasks such as operator scheduling, optimization, and compilation are automatically handled by existing interfaces.

DSCP

differentiated services code point

Based on the Diff-Serv QoS classification standard, DSCP uses 6 bits of the Type of Service (ToS) byte in the IP header to differentiate traffic priorities. It combines the IP Precedence and Type of Service fields to maintain backward compatibility with older routers. Each DSCP value maps to a defined Per-Hop Behavior (PHB), allowing end devices to mark and classify traffic.
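
The bit layout can be sketched with a few shifts and masks. The function names are illustrative. The value 46 is the standard Expedited Forwarding code point.

```python
def dscp_from_tos(tos_byte):
    """Extract the 6-bit DSCP value from the upper bits of the ToS byte."""
    return (tos_byte >> 2) & 0x3F

def tos_from_dscp(dscp):
    """Place a DSCP value in the ToS byte, leaving the low two bits at 0."""
    return (dscp & 0x3F) << 2

def ip_precedence(dscp):
    """The top three DSCP bits line up with the legacy IP Precedence field."""
    return dscp >> 3

EF = 46  # Expedited Forwarding
tos = tos_from_dscp(EF)
```

For EF the resulting ToS byte is 0xB8 (184), and a router that reads only IP Precedence sees the value 5.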

bandwidth

The range of frequencies that a transmission line or channel in a network can carry. It is the difference between the highest and lowest frequencies of the channel. Greater bandwidth typically results in faster data transmission rates.

single-operator comparison

A method of tensor comparison within accuracy comparison tools. It involves selecting one or more specific operators in a network model to analyze their computational accuracy.

E

Term/Acronym/Abbreviation

Definition

ECC

error checking and correction

A technology that adds check bits to the original data bits to detect and correct data errors.

eMMC

embedded MultiMediaCard

A managed flash memory storage system. It features an external interface similar to an SD card, internal flash storage media, and an integrated bad block management system.

epoch

One complete pass of the entire dataset through the training algorithm.
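
How batch, batch size, and epoch relate can be shown with a short calculation. The function name and the drop_last flag are illustrative assumptions that mirror a common data-loader option.

```python
import math

def iterations_per_epoch(num_samples, batch_size, drop_last=False):
    """Number of batches (gradient updates) in one pass over the dataset."""
    if drop_last:
        # Discard the final, smaller batch.
        return num_samples // batch_size
    return math.ceil(num_samples / batch_size)

with_partial_batch = iterations_per_epoch(1000, 64)
without_partial_batch = iterations_per_epoch(1000, 64, drop_last=True)
```

For 1000 samples and a batch size of 64, one epoch takes 16 iterations if the final 40-sample batch is kept and 15 if it is dropped.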

EULA

End User License Agreement

ESN

equipment serial number

A unique string that identifies a device. It is a critical key for binding licenses to specific hardware, also known as a "device fingerprint."

EndPoint

An inference serving protocol and API wrapper compatible with third-party framework interfaces such as Triton, OpenAI, TGI, and vLLM.

EP

Expert parallelism. A model parallelism technique that partitions parameters by assigning different experts in a Mixture of Experts (MoE) model to different devices. A gating mechanism routes inputs to specific experts, activating the corresponding devices.

F

Term/Acronym/Abbreviation

Definition

Faiss

An open-source library developed by Meta (formerly Facebook) for efficient similarity search and clustering of dense vectors.

FEC

forward error correction

A digital signal processing technique used to enhance data reliability by introducing redundant data, allowing the receiver to detect and correct errors during data transmissions.

FFT

fast Fourier transform

An algorithm that computes the discrete Fourier transform (DFT) or its inverse (IDFT). It converts signals from their original domain (often time or space) to the frequency domain, and vice versa.
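
A minimal recursive radix-2 sketch, checked against the direct DFT definition. It assumes the input length is a power of two and is meant for illustration only, not as an optimized implementation.

```python
import cmath

def fft(x):
    """Radix-2 decimation-in-time FFT for power-of-two lengths."""
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def dft(x):
    """Direct O(n^2) evaluation of the DFT definition, used as a reference."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

signal = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
spectrum = fft(signal)
reference = dft(signal)
```

Both functions produce the same spectrum up to rounding error. The FFT needs on the order of n log n operations instead of n squared.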

FFTS

function flow task scheduler

A data-flow-driven parallel scheduling mechanism. It utilizes a subgraph data management unit (DMU) mechanism to eliminate unnecessary direct memory access (DMA) copy overhead and provides sub-task threading and inter-thread scheduling to maximize hardware parallelism across AI Cores or AI Vectors, achieving effective operator fusion.

Flash Attention

An IO-aware, exact attention algorithm used for model acceleration. It speeds up attention computations and reduces memory footprint without approximation.

It is widely implemented in LLMs such as Llama and GPT-3.

FLOPS

floating-point operations per second

A measure of computer performance, particularly in scientific computing fields that involve heavy floating-point calculation. The "S" stands for "second" and is not a plural marker, so it must not be omitted.

FP Point

Forward-propagation point. The starting position of forward operators within a training network's iteration trajectory.

FUSE

Filesystem in Userspace

An operating system mechanism that allows non-privileged users to create their own file systems without editing kernel code. It is supported in Linux through a kernel module and utilized by file systems like ZFS, GlusterFS, and Lustre.

G

Term/Acronym/Abbreviation

Definition

GDAT

gradient auto tuning

An optimization tool that minimizes communication tail latency by maximizing the parallelism between backward computation and gradient aggregation. In distributed training, fusion strategies for gradient aggregation operators impact the communication overhead after backward passes, thereby affecting overall cluster performance and scaling linearity.

GDB

GNU debugger. A standard debugging tool for monitoring the internal execution of programs or analyzing crashes. GDB supports the following four main operations to help locate defects:

  • Starting programs with specific parameters
  • Pausing execution under defined conditions
  • Inspecting program state upon termination/pause
  • Modifying program content to test fixes

GE

graph engine

A core component providing graph/operator intermediate representation (IR) as a secure and intuitive interface for model building. It allows users to build network models, define computational graphs and operators, and configure associated attributes.

GM

Global memory. The main memory on the device side. It serves as the external storage for the AI Core and is used for large-scale data, requiring optimized access patterns to maximize throughput.

gRPC

Google Remote Procedure Call

GRPO

Group Relative Policy Optimization

A reinforcement learning (RL) algorithm designed to enhance reasoning capabilities in LLMs. Unlike traditional RL methods that rely on external value functions, GRPO optimizes models by evaluating relative performance within groups of generated responses, significantly improving training efficiency.
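
The "relative performance within groups" idea can be sketched as normalizing each response's reward against its own group. The function name, the use of the population standard deviation, and the zero-variance handling are assumptions for illustration. The full algorithm also includes a policy update step that is not shown here.

```python
import statistics

def group_relative_advantages(rewards):
    """Score each response relative to the other responses in its group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        # All responses scored the same, so none is preferred.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]

# Hypothetical rewards for four responses to the same prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Responses that beat the group average receive a positive advantage and the others a negative one, without a separately trained value function.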

management plane

The architectural layer or network segment where health status and monitoring interfaces reside.

H

Term/Acronym/Abbreviation

Definition

HCC

Huawei Compiler Collection

HCCL

Huawei Collective Communication Library

A library providing high-performance collective communication functions for distributed deep learning across multiple servers.

HCCP

Huawei Collective Communication Adaptive Protocol

A protocol layer providing cross-NPU communication capabilities while abstracting away differences in underlying transport protocols for upper-layer applications.

HCCS

Huawei Cache Coherence System

A system designed for high-speed interconnect between CPUs and NPUs.

HDC

host-device communication

A communication module deployed on both the host and device sides to facilitate data exchange between them.

HDK

hardware developer kit

HDR

high dynamic range

A technique used in imaging and audio to reproduce a greater range of luminosity or signal levels than standard digital techniques.

HPA

HorizontalPodAutoscaler

A Kubernetes feature that automatically scales the number of Pods in a workload (such as a Deployment or StatefulSet) based on observed CPU utilization or other selected metrics.
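
The core scaling rule documented for the HPA controller can be written as a one-line calculation. This sketch leaves out the tolerance band, stabilization windows, and minimum/maximum replica limits, and the function name is illustrative.

```python
import math

def desired_replicas(current_replicas, current_metric, target_metric):
    """ceil(currentReplicas * currentMetricValue / desiredMetricValue)."""
    return math.ceil(current_replicas * (current_metric / target_metric))

scale_up = desired_replicas(4, current_metric=90, target_metric=60)
scale_down = desired_replicas(4, current_metric=30, target_metric=60)
```

With a 60% CPU target, four Pods running at 90% are scaled to six, and four Pods running at 30% are scaled to two.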

HPO

hyperparameter optimization

The process of automating the search for the optimal or near-optimal hyperparameters of a machine learning model, replacing manual tuning with algorithmic search strategies.

I

Term/Acronym/Abbreviation

Definition

ICS

Intellectual Collaborative Service

IOPS

input/output operations per second

An input/output performance measurement used to characterize computer storage devices.

IPC

IP camera

ISP

image signal processing

A method or specialized hardware unit used to process raw data from image sensors to render a high-quality digital image, ensuring compatibility across different sensor manufacturers.

ISV

independent software vendor

J

Term/Acronym/Abbreviation

Definition

JDK

Java Development Kit

A software development environment used for developing Java applications, containing a collection of tools and libraries.

JPEGD

JPEG decoder

A specialized hardware or software module that provides the capability to decode images from the JPEG format.

JPEGE

JPEG encoder

A specialized hardware or software module that provides the capability to encode images into the JPEG format.

K

Term/Acronym/Abbreviation

Definition

KMC

Key Management CBB

A module designed to facilitate code sharing and simplify development. It implements core functions such as encrypted key storage and encryption/decryption to enable rapid product integration.

KMC

Key Management Center

A centralized system used to manage and protect cryptographic keys. It provides secure key storage, distribution, rotation, backup, and recovery. The KMC keystore ensures key security and reliability while supporting multiple encryption algorithms and key lengths across various application scenarios.

KL divergence

Kullback-Leibler divergence

An accuracy comparison algorithm used to measure the difference between two probability distributions. Values range from 0 to infinity. A lower KL divergence indicates a closer match between the true and approximate distributions.
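
A minimal sketch for two discrete distributions, using only the Python standard library. The function name and sample distributions are illustrative, and the result is in nats because the natural logarithm is used.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) for two discrete distributions over the same events."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0.0:
            continue              # 0 * log(0 / q) is taken as 0 by convention
        if q_i == 0.0:
            return math.inf       # P has mass where Q has none
        total += p_i * math.log(p_i / q_i)
    return total

true_dist = [0.5, 0.5]
identical = kl_divergence(true_dist, [0.5, 0.5])
skewed = kl_divergence(true_dist, [0.9, 0.1])
disjoint = kl_divergence([1.0, 0.0], [0.0, 1.0])
```

Identical distributions give 0, the skewed approximation gives a positive value, and distributions with no overlap give infinity. The measure is not symmetric, so swapping the arguments generally changes the result.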

Kubernetes

An open-source system for automating the deployment, scaling, and management of containerized applications. It provides a platform for automated deployment, scaling, and operation of application containers across clusters of hosts.

KASLR

Kernel address space layout randomization. A security mechanism that randomizes the memory address layout of the kernel, increasing the difficulty of exploiting kernel vulnerabilities.

L

Term/Acronym/Abbreviation

Definition

L0A buffer

An internal physical storage unit within the AI Core, typically used to store the left matrix for matrix multiplication. It corresponds to the logical memory AscendC::TPosition::A2.

L0B buffer

An internal physical storage unit within the AI Core, typically used to store the right matrix for matrix multiplication. It corresponds to the logical memory AscendC::TPosition::B2.

L0C buffer

An internal physical storage unit within the AI Core, typically used to store the results of matrix computation. It corresponds to the logical memory AscendC::TPosition::CO1.

L1 buffer

An internal physical storage unit within the AI Core with a relatively large capacity, typically used to cache input data for matrix multiplication. Input data is generally moved from global memory (GM) to the L1 buffer, and then to the L0A and L0B buffers. It corresponds to logical memory AscendC::TPosition::A1 and AscendC::TPosition::B1.

L2 cache

level 2 cache

A secondary CPU cache used to provide faster access to frequently used data and instructions before accessing the main memory.

LLDP

Link Layer Discovery Protocol

A layer 2 discovery protocol defined in IEEE 802.1AB. It enables network management systems to quickly acquire layer 2 network topology and change information as the network scales.

LLM

large language model

A type of language model consisting of artificial neural networks with a massive number of parameters (typically billions or more), trained on large datasets of unlabeled text using self-supervised or semi-supervised learning.

local memory

The internal storage of the AI Core, including storage units such as the L1 buffer, L0A buffer, L0B buffer, L0C buffer, and unified buffer.

loss

The deviation between predicted values and actual values, serving as a primary metric in deep learning to evaluate model performance.

LTO

link time optimization

A type of program optimization performed by a compiler during the linking stage.

adjacency list

A common data structure in graph theory and computer science used to represent a graph, where each vertex stores a list or array of all other vertices to which it is connected.
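
A minimal sketch of the structure as a Python dictionary that maps each vertex to a list of its neighbors. The function name and the sample edges are illustrative, and the graph is treated as undirected.

```python
from collections import defaultdict

def build_adjacency_list(edges):
    """Build an undirected adjacency list from (u, v) edge pairs."""
    adjacency = defaultdict(list)
    for u, v in edges:
        adjacency[u].append(v)
        adjacency[v].append(u)   # store the edge in both directions
    return dict(adjacency)

edges = [("A", "B"), ("A", "C"), ("B", "D")]
graph = build_adjacency_list(edges)
```

Looking up graph["A"] returns the neighbors of A directly, and total storage grows with the number of vertices plus edges rather than with the square of the vertex count.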

LoRA

low-rank adaptation

A parameter-efficient fine-tuning (PEFT) method for large-scale models.
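
Where the parameter efficiency comes from can be sketched by counting trainable parameters. The sketch assumes the commonly described form in which the update to a frozen d_out × d_in weight matrix is the product of a d_out × rank matrix and a rank × d_in matrix. The layer size and rank are hypothetical.

```python
def full_update_params(d_out, d_in):
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_out * d_in

def low_rank_update_params(d_out, d_in, rank):
    """Trainable parameters for a (d_out x rank) times (rank x d_in) update."""
    return rank * (d_out + d_in)

full = full_update_params(4096, 4096)
low_rank = low_rank_update_params(4096, 4096, rank=8)
reduction = full // low_rank
```

For this layer the low-rank update trains 65,536 parameters instead of 16,777,216, a 256-fold reduction.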

M

Term/Acronym/Abbreviation

Definition

MAC

media access control

A data link layer protocol that manages how multiple devices share a common transmission medium to prevent data collisions.

Max Absolute Error

maximum absolute error

An accuracy comparison algorithm with a range from 0 to infinity. A value closer to 0 indicates higher similarity, while a larger value indicates a greater discrepancy.

Max Relative Error

maximum relative error

An accuracy comparison algorithm with a range from 0 to infinity. A value closer to 0 indicates higher similarity, while a larger value indicates a greater discrepancy.
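
A minimal sketch of both maximum-error metrics. The element-wise formulas and the small eps term that guards against division by zero are assumptions for illustration. The exact formulas used by a given comparison tool may differ.

```python
def max_absolute_error(output, golden):
    """Largest element-wise absolute difference."""
    return max(abs(o - g) for o, g in zip(output, golden))

def max_relative_error(output, golden, eps=1e-7):
    """Largest element-wise difference relative to the golden value."""
    # eps is a hypothetical guard value for golden elements equal to 0.
    return max(abs(o - g) / (abs(g) + eps) for o, g in zip(output, golden))

golden = [1.0, 2.0, 4.0]
output = [1.0, 2.5, 4.5]
abs_err = max_absolute_error(output, golden)
rel_err = max_relative_error(output, golden)
```

Both wrong elements are off by 0.5, so the maximum absolute error is 0.5, while the maximum relative error of about 0.25 comes from the element with the smaller golden value.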

MCU

microcontroller unit

An integrated circuit that integrates multiple functional modules such as the processor, memory, and input/output interfaces.

Mean Absolute Error

mean absolute error

An accuracy comparison algorithm with the result ranging from 0 to infinity.

  • If both the mean absolute error (MAE) and root mean square error (RMSE) tend to 0, it indicates that the measured value is closer to the actual value.
  • If the MAE tends towards 0 while the RMSE is larger, it indicates the presence of local outliers with excessively large values.
  • If the MAE is large and the RMSE is equal to or close to the MAE, it indicates that the overall deviation is highly concentrated.
  • If the MAE is large and the RMSE is significantly larger than the MAE, it indicates the presence of an overall deviation, and this overall deviation is scattered.
  • There are no exceptions to the above cases because RMSE ≥ MAE always holds true.
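
The cases above can be reproduced with two small error vectors that share the same MAE. The function names and data are illustrative.

```python
import math

def mae(pred, truth):
    """Mean absolute error."""
    return sum(abs(p - t) for p, t in zip(pred, truth)) / len(pred)

def rmse(pred, truth):
    """Root mean square error."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(pred))

truth = [0.0, 0.0, 0.0, 0.0]
uniform = [1.0, 1.0, 1.0, 1.0]   # every element deviates by the same amount
outlier = [0.0, 0.0, 0.0, 4.0]   # the whole deviation sits in one element

uniform_mae, uniform_rmse = mae(uniform, truth), rmse(uniform, truth)
outlier_mae, outlier_rmse = mae(outlier, truth), rmse(outlier, truth)
```

Both vectors have an MAE of 1.0, but the RMSE is 1.0 for the uniform deviation and 2.0 for the single outlier, so the gap between RMSE and MAE shows how unevenly the error is spread.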

Mean Relative Error

mean relative error

An accuracy comparison algorithm with results ranging from 0 to infinity. A result closer to 0 indicates higher similarity.

MemFS

memory file system

MindIE

Mind Inference Engine

A high-performance deep learning inference framework optimized for Ascend hardware, supporting acceleration, debugging, tuning, and rapid deployment.

MindFormers

MindSpore Transformers. An end-to-end suite based on the MindSpore framework.

It supports the entire lifecycle of LLMs, including training, fine-tuning, evaluation, and deployment.

MindIO

A memory-based caching system designed to accelerate the read/write speeds of training checkpoints.

MinIO

An object storage service component.

MLP

multilayer perceptron

A feed-forward artificial neural network consisting of an input layer, one or more hidden layers, and an output layer. MLP can be used to solve a variety of problems, such as classification and regression. Due to its powerful representational capabilities, MLP is widely applied in many fields, including image recognition, natural language processing, and more.

MoE

Mixture of experts. It is a technology used to train models with trillions of parameters. MoE decomposes predictive modeling tasks into several sub-tasks, training an expert model for each sub-task and developing a gating model. This gating model assigns one or more experts based on the input data, and finally integrates the calculation results from multiple experts to produce the prediction result.
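
The gating and integration steps can be sketched in plain Python. The experts are toy functions, the gate scores are passed in directly rather than produced by a trained gating model, and top-k selection with renormalized weights is one common choice assumed here for illustration.

```python
import math

def softmax(scores):
    """Convert raw gate scores into probabilities."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, experts, gate_scores, top_k=2):
    """Route x to the top_k experts and combine their weighted outputs."""
    probs = softmax(gate_scores)
    ranked = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)
    chosen = ranked[:top_k]
    weight_sum = sum(probs[i] for i in chosen)
    output = sum(probs[i] / weight_sum * experts[i](x) for i in chosen)
    return output, chosen

# Three toy experts. The third receives a low gate score and stays inactive.
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: -x]
output, chosen = moe_forward(3.0, experts, gate_scores=[0.0, 0.0, -5.0])
```

Only the selected experts are evaluated, which is why the computation per input stays small even when the total parameter count is very large.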

msDebug

An operator debugging tool.

It provides native environment debugging on Ascend processors, featuring flexible variable inspection and step-by-step execution.

msKPP

A performance modeling and tuning tool designed for operator theoretical performance and template libraries.

In the performance modeling phase, the tool utilizes built-in operator API performance data, enabling users to express implementation algorithms and evaluate performance during the initial design stage.

In the template library tuning phase, it provides capabilities for the generation, compilation, and execution of template library kernel dispatch code. Additionally, it supports code replacement within the kernel combined with automatic performance tuning.

msProf

An operator profiling tool.

It collects performance data from both hardware and simulation, visualized through MindStudio Insight to identify performance bottlenecks.

msproftx

msProf tool extension. An extension for the MindStudio system profiling tools.

msSanitizer

An operator anomaly detection tool.

It provides memory detection and contention detection capabilities, supporting precise localization of memory issues in multi-core programs.

MTE

memory transfer engine

Also known as the load-store unit (LSU), it manages data read/write between different buffers within the AI Core and handles format conversions.

MTE1

Memory transfer engine 1. Tiered memory transfer engine responsible for data movement from the L1 buffer to the L0A buffer or L0B buffer based on hardware capabilities.

MTE2

Memory transfer engine 2. Tiered memory transfer engine responsible for data movement from the global memory to the L1 buffer, L0A buffer, L0B buffer, or unified buffer based on hardware capabilities.

MTE3

Memory transfer engine 3. Tiered memory transfer engine responsible for data movement from the unified buffer to the global memory or L1 buffer based on hardware capabilities.
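
The three engine definitions can be summarized as a lookup from (source, destination) to engine. The table below encodes only the paths listed in the entries above. The names MTE_ROUTES and engine_for are illustrative, "UB" is shorthand for the unified buffer, and the paths actually available depend on hardware capabilities.

```python
# (source, destination) pairs handled by each memory transfer engine.
MTE_ROUTES = {
    "MTE1": {("L1", "L0A"), ("L1", "L0B")},
    "MTE2": {("GM", "L1"), ("GM", "L0A"), ("GM", "L0B"), ("GM", "UB")},
    "MTE3": {("UB", "GM"), ("UB", "L1")},
}

def engine_for(src, dst):
    """Return the engine that moves data from src to dst, or None."""
    for engine, routes in MTE_ROUTES.items():
        if (src, dst) in routes:
            return engine
    return None

# Typical matrix-input path: global memory -> L1 buffer -> L0A buffer.
matrix_input_path = [engine_for("GM", "L1"), engine_for("L1", "L0A")]
```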

MTU

maximum transmission unit

The maximum packet size that can be transmitted over a network. The size varies with the network type: for example, 576 bytes in X.25 networks, 1500 bytes in Ethernet, and 17,914 bytes in 16 Mbit/s Token Ring. The MTU is determined by the link layer of the network. When a packet crosses several networks, the path MTU (PMTU) is the smallest MTU among all networks on the path, that is, the largest packet size that can traverse the entire path without fragmentation.
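
The path MTU is the minimum of the per-link MTUs along the route. The sketch below uses the example values from this entry. The function name and the route are hypothetical.

```python
def path_mtu(link_mtus):
    """Largest packet size that fits every link on the path unfragmented."""
    if not link_mtus:
        raise ValueError("the path must contain at least one link")
    return min(link_mtus)

# Hypothetical route: Ethernet, then X.25, then 16 Mbit/s Token Ring.
route = [1500, 576, 17914]
pmtu = path_mtu(route)
```

For this route the PMTU is 576 bytes, because the X.25 hop is the narrowest link.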

MindIE SD

Mind Inference Engine Stable Diffusion, a suite of visual generation inference models within the MindIE ecosystem.

MindIE Turbo

Mind Inference Engine Turbo, an acceleration plugin library developed for LLM inference on Ascend hardware.

MindIE Motor

Mind Inference Engine Motor, a request scheduling framework specifically designed for LLM PD (prefill-decode) disaggregation inference. It provides inference serving capabilities through an open and extensible platform architecture, and interfaces downstream with MindIE LLM to meet the high-performance inference requirements of large language models.

MindIE LLM

Mind Inference Engine Large Language Model, the dedicated inference component for large language models within the MindIE framework.

MLA

Multi-head Latent Attention, an efficient attention mechanism that uses low-rank KV joint compression to eliminate KV cache bottlenecks during inference.

MTP

Multi-token prediction, a parallel decoding method introduced by DeepSeek to generate multiple tokens in a single step. The core logic is that the model does not limit itself to predicting only the next single token. Instead, it predicts multiple subsequent tokens simultaneously, significantly accelerating model generation speeds.

MindIE Service Tools

A toolset for Ascend inference services, featuring performance/accuracy testing, visualization, automated optimization, and configurable throughput.

MindIE Simulator

An automated service performance tuning tool that simulates various strategies to find optimal parameters under latency constraints.

N

Term/Acronym/Abbreviation

Definition

NCS

Neural Compute Server

NCS encapsulates AscendCL runtime service interfaces to accept remote hardware execution requests and return corresponding performance data.

NIC

network interface controller

Also known as network interface card, network adapter, LAN adapter, or other similar terms. It refers to a hardware component that connects a computer to a computer network.

NLP

natural language processing

A subdiscipline of artificial intelligence and linguistics that explores how to process and utilize natural language. NLP involves various aspects and stages, primarily including cognition, understanding, and generation.

NN

neural network

In the fields of machine learning and cognitive science, a neural network is a mathematical model or computing model that emulates the structure and functions of a biological neural network.

NPU

Neural-Network Processing Unit. Utilizing a "data-driven parallel computing" architecture, it is specifically designed to handle massive computational tasks in AI applications.

NUMA

non-uniform memory access

NUMA is a distributed memory access architecture where processors can access different memory addresses simultaneously to significantly enhance parallelism. In this mode, processors are divided into multiple nodes, with each node allocated its own local memory space. While processors in any node can access all physical memory, the latency for accessing local memory is much lower than that for accessing remote nodes.

NVMe

Non-Volatile Memory Express

A logical device interface specification. It is a bus transport protocol based on a logical device interface (equivalent to the application layer in communication protocols), used to access non-volatile storage media (such as flash-based SSDs) attached through the PCI Express (PCIe) bus.

O

Term/Acronym/Abbreviation

Definition

OM

offline model

ONNX

Open Neural Network Exchange

ONNX is an open-source file format designed for machine learning to store trained models. It enables different AI frameworks to share and exchange model data using a unified format.

OOM

out of memory

OP

Operator. An operator is the fundamental unit for executing specific mathematical calculations or operations within deep learning algorithms, such as activation functions (for example, ReLU), convolution, pooling, and normalization (for example, Softmax). Neural network models are constructed by combining these operators.

OPAT

operator auto-tuning

OPAT is an optimizer designed to enhance operator performance. When AOE inputs a full graph into OPAT, OPAT performs operator fusion internally and partitions the fused graph at the operator level. It generates different tuning strategies for each fused operator subgraph to achieve optimal performance, subsequently saving these strategies in the operator knowledge base.

OpenPGP

Open Pretty Good Privacy

Pretty Good Privacy (PGP) is an encryption program that provides cryptographic privacy and authentication for data communication. It is commonly used for signing, encrypting, and decrypting texts, emails, and files. OpenPGP is a non-proprietary protocol that defines a unified standard for encrypted messages, signatures, private keys, and certificates used for public key exchange.

OPP

operator package

OS

operating system

OS

optimizer state

OCI

Open Container Initiative

Established by the Linux Foundation in June 2015, the OCI aims to create open industry standards for container formats and runtimes.

OM Adapter

Reports MindIE heartbeats, alarms, resource information, and logs to external alarm and management platforms, enabling service status monitoring and integration with management systems.

P

Term/Acronym/Abbreviation

Definition

PCIe

Peripheral Component Interconnect Express

A high-speed serial expansion bus standard commonly used for peripheral expansion in computer systems.

PCB

printed circuit board

PFC

priority-based flow control

A link-level flow control mechanism that pauses traffic independently for each priority class instead of pausing the entire link.

PMU

performance monitor unit

A hardware unit provided by the CPU that enables the reading of CPU performance data by accessing relevant registers.

PNGD

PNG decoder

A component that provides the capability to decode images in PNG format.

Pod

The smallest deployable unit that can be created in Kubernetes and a top-level resource type in the Kubernetes REST API.

PP

pipeline parallelism

A technique that distributes different layers of a model across various computing devices to reduce individual device memory consumption, enabling the training of ultra-large-scale models.

PWM

pulse width modulation

A modulation technique where the pulse duration (width) of a pulse carrier varies according to the sample values of the modulating wave.
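
The following is a minimal sketch of the idea, assuming the modulating wave is already sampled and normalized to the range 0 to 1. The function name `pwm` and the tick-based period are hypothetical choices made for illustration only.

```python
def pwm(samples, period):
    """Emit `period` ticks per sample: high for a share of the period
    proportional to the sample value, low for the remainder."""
    signal = []
    for sample in samples:
        clamped = min(max(sample, 0.0), 1.0)   # keep the duty cycle within 0..1
        high_ticks = round(clamped * period)   # pulse width follows the sample value
        signal.extend([1] * high_ticks + [0] * (period - high_ticks))
    return signal

# A 25% sample yields a narrow pulse, a 75% sample a wide one.
print(pwm([0.25, 0.75], period=4))
```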

on-chip memory

Memory integrated directly onto a microprocessor chip.

Q

Term/Acronym/Abbreviation

Definition

QAT

quantization-aware training

A quantization method that introduces quantization during retraining, making the model more robust to quantization effects so that the quantized model achieves higher accuracy.
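
One common way to introduce quantization during training is a quantize-then-dequantize step ("fake quantization") applied in the forward pass. The sketch below assumes symmetric INT8 quantization with a fixed scale; the function name `fake_quantize` and its parameters are hypothetical and do not describe any specific toolkit API.

```python
def fake_quantize(x, scale, qmin=-128, qmax=127):
    """Round x to the nearest representable integer level, clamp it to the
    integer range, and map it back to floating point."""
    q = round(x / scale)           # quantize to an integer level
    q = max(qmin, min(qmax, q))    # clamp to the assumed INT8 range
    return q * scale               # dequantize so training continues in float

# The value the network sees already carries the rounding and clamping error.
print(fake_quantize(0.26, scale=0.1))
print(fake_quantize(100.0, scale=0.1))
```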

R

Term/Acronym/Abbreviation

Definition

RDMA

Remote direct memory access, a technology that transfers data directly from the memory of one machine to another without involving the operating systems of either host. It generally refers to a memory access method that spans across a network.

RED

relative Euclidean distance

An accuracy comparison algorithm. The computation result ranges from 0 to infinity. A result value closer to 0 indicates higher similarity, while a larger result value indicates a greater discrepancy.
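
As a rough illustration of how such a metric behaves, the sketch below assumes one common formulation: the Euclidean norm of the difference divided by the Euclidean norm of the reference. The function name and the handling of an all-zero reference are assumptions made for this example, not the comparison tool's actual implementation.

```python
import math

def relative_euclidean_distance(output, reference):
    """Euclidean distance between two vectors, normalized by the reference norm."""
    if len(output) != len(reference):
        raise ValueError("inputs must have the same length")
    diff_norm = math.sqrt(sum((o - r) ** 2 for o, r in zip(output, reference)))
    ref_norm = math.sqrt(sum(r ** 2 for r in reference))
    if ref_norm == 0.0:
        # Assumed convention: identical all-zero data counts as a perfect match.
        return 0.0 if diff_norm == 0.0 else math.inf
    return diff_norm / ref_norm

print(relative_euclidean_distance([1.0, 2.0, 3.0], [1.0, 2.0, 3.0]))  # identical data
print(relative_euclidean_distance([0.0, 0.0], [3.0, 4.0]))            # large discrepancy
```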

RoCE

RDMA over Converged Ethernet

A network protocol that enables remote direct memory access (RDMA) over Ethernet. There are currently two versions: RoCE v1 and RoCE v2. RoCE v1 is a data link layer protocol that allows communication between any two hosts within the same Ethernet broadcast domain. RoCE v2 is a network layer protocol and its packets can be routed.

RMSE

root mean square error

An accuracy comparison algorithm. The result ranges from 0 to infinity.

  • If both the mean absolute error (MAE) and root mean square error (RMSE) tend to 0, it indicates that the measured value is closer to the actual value.
  • If the MAE tends towards 0 while the RMSE is larger, it indicates the presence of local outliers with excessively large values.
  • If the MAE is large and the RMSE is equal to or close to the MAE, it indicates that the overall deviation is highly concentrated.
  • If the MAE is large and the RMSE is significantly larger than the MAE, it indicates the presence of an overall deviation, and this overall deviation is scattered.
  • There are no exceptions to the above cases because RMSE ≥ MAE always holds true.
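
The cases above can be reproduced with a small sketch. The data sets here are made up for illustration: one where every element deviates by the same amount, and one with a single large outlier.

```python
import math

def mae(measured, actual):
    """Mean absolute error."""
    return sum(abs(m - a) for m, a in zip(measured, actual)) / len(actual)

def rmse(measured, actual):
    """Root mean square error."""
    return math.sqrt(sum((m - a) ** 2 for m, a in zip(measured, actual)) / len(actual))

actual = [0.0] * 100
uniform = [2.0] * 100           # every element is off by the same amount
outlier = [0.0] * 99 + [10.0]   # one local outlier with a large value

# Concentrated deviation: RMSE equals MAE.
print(mae(uniform, actual), rmse(uniform, actual))
# Local outlier: MAE stays small while RMSE is much larger.
print(mae(outlier, actual), rmse(outlier, actual))
```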

Runtime

A component that provides applications with memory management, device management, stream management, event management, and kernel loading and execution functions for Ascend AI Processors.

RAM

random access memory

A type of semiconductor-based memory that can be read and written by the CPU or other hardware devices. The storage locations can be accessed in any order.

runC

A client tool for creating and running containers according to the OCI (Open Container Initiative) specification.

RoPE

rotary position embedding

A position encoding method that integrates relative position dependencies into self-attention, enhancing the performance of the Transformer architecture.
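
A minimal sketch of the rotation, assuming the widely used formulation in which consecutive pairs of vector elements are rotated by an angle proportional to the token position. The function names and the base value of 10000 are assumptions for this example.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each pair (vec[i], vec[i+1]) by position * base**(-i/d)."""
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector dimension must be even")
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]
# The attention score depends only on the position offset (2 in both cases).
print(dot(rope(q, 3), rope(k, 1)))
print(dot(rope(q, 10), rope(k, 8)))
```

Because both rotations share the same per-pair frequencies, shifting both positions by the same amount leaves the dot product unchanged, which is how relative position enters self-attention.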

RAS

Reliability, availability, and serviceability. It refers to capabilities that enhance the reliability, availability, and serviceability of prefill-decode (PD) disaggregation services.

S

Term/Acronym/Abbreviation

Definition

scalar

The scalar computing unit within the AI Core. It is primarily responsible for scalar data operations and issuing instructions to other units, such as the memory transfer engine (MTE), vector unit, and cube unit.

SDMA

System direct memory access, also known as direct memory access (DMA). This technology allows peripheral devices to access system memory directly without CPU intervention.

SiP

Ascend Signal Processing Boost

A signal processing acceleration library that provides a series of high-performance operators for AI models (supporting PyTorch calls) and signal processing (supporting direct C++ calls).

SGAT

subgraph auto-tuning

SGAT is an optimizer that improves the performance of subgraphs. A complete network can be partitioned into multiple subgraphs. SGAT can be used to generate different tiling policies for those subgraphs. By acquiring performance data for each iteration, the SGAT algorithm identifies the optimal tuning strategy to achieve peak performance for the corresponding subgraph.

SPI

serial peripheral interface

A synchronous serial communication interface that enables information exchange between the microcontroller unit (MCU) and peripherals.

SP

sequence parallelism

A parallel computing method that performs column partitioning on input sequences to further improve efficiency on top of tensor parallelism (TP). Since it does not introduce additional communication overhead, it is recommended to enable SP concurrently with TP.

SRAM

static random access memory

A type of computer memory that is faster and more reliable than common DRAM. It is typically used for caches, registers, and other applications requiring high-speed access.

SwiGLU

Swish-Gated Linear Units

An activation function variant of gated linear units (GLU) that incorporates the Swish activation function.
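
A minimal element-wise sketch follows. In a full layer, the gate and value vectors come from two separate linear projections of the same input; here they are passed in directly, and the function names are hypothetical.

```python
import math

def swish(x, beta=1.0):
    """Swish activation: x * sigmoid(beta * x)."""
    return x / (1.0 + math.exp(-beta * x))

def swiglu(gate, value):
    """Gate each value element with the Swish of the matching gate element."""
    return [swish(g) * v for g, v in zip(gate, value)]

# A zero gate closes completely; a large positive gate passes roughly gate * value.
print(swiglu([0.0, 100.0], [5.0, 2.0]))
```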

SIMD

single instruction multiple data

A parallel computing architecture where a single instruction processor fetches an instruction and distributes it to multiple processing elements to operate on different data points simultaneously.

SSL

Secure Sockets Layer

A security protocol operating at the socket layer, situated between the TCP and application layers. It is used for data encryption/decryption and entity authentication.

standard deviation

standard deviation

An accuracy comparison algorithm with a result ranging from 0 to infinity. A smaller standard deviation indicates lower dispersion, meaning values are closer to the mean.
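
For reference, a short sketch using Python's standard library, assuming the population form of the standard deviation (dividing by N). The sample data is made up.

```python
import statistics

tight = [5.0, 5.0, 5.0, 5.0]
spread = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

# Lower values mean the data sits closer to its mean.
print(statistics.pstdev(tight))
print(statistics.pstdev(spread))
```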

STARS

system task and resource scheduler

spine-leaf

A two-tier network topology consisting of leaf switches and spine switches. Spine switches act as the core, typically using high-port-density switches rather than traditional large chassis switches. Leaf switches serve as the access layer, providing connectivity to endpoints and servers while up-linking to the spine switches. This topology is designed to handle rapid traffic growth and large-scale data center expansion, overcoming the limitations of traditional three-tier architectures in high-speed internal interconnection.

sample-based

A profiling method where AI Core performance data is collected at fixed time intervals (AI Core-sampling interval).

step trace

Iteration trace.

It captures start and end times for forward and backward passes, gradient updates, and data augmentation trailing phases.

ST

system test

A black-box testing phase based on the system requirement specifications, covering all integrated components. It evaluates the complete product system to verify compliance with requirement specifications and identify discrepancies.

The scope includes not only the software but also the underlying hardware, peripherals, data, support software, and interfaces, requiring testing within the system's actual operating environment.

Ascend Virtual Instance (AVI)

It refers to the use of resource virtualization technology to partition a single NPU into multiple virtual NPU (vNPU) instances. These instances can be mounted to VMs or containers, allowing one NPU to support multiple concurrent tasks. By partitioning compute resources, AVI enables virtualized reuse and ensures secure isolation, significantly reducing the cost and entry barrier for NPU utilization while supporting on-demand multi-tenant resource management.

SLO

service-level objective

A target value or range of values for a specific service level that is measured over a predefined period.

T

Term/Acronym/Abbreviation

Definition

task-based

A profiling method where AI Core performance data is collected at the task level.

TCP

Transmission Control Protocol

TDP

thermal design power

The maximum amount of heat generated by a computer chip or component that the cooling system in a computer is designed to dissipate under any workload.

tensor

A container for data used in operator computations. It is an N-dimensional data structure, most commonly represented as a scalar, vector, or matrix. Tensor elements can include integers, floating-point values, or strings.

TFT

training fault tolerance

TIK

Tensor Iterator Kernel

An operator development method that allows developers to write custom operators using Python-based APIs provided by TIK. The TIK compiler converts these into binary files compatible with Ascend AI Processor applications.

TGI

Text Generation Inference

A toolkit for deploying and serving LLMs. TGI enables high-performance text generation for popular open-source LLMs, including Llama, Falcon, StarCoder, BLOOM, and GPT-NeoX.

TLS

Transport Layer Security

TP

tensor parallelism

A technique that partitions tensors within a network across different devices to reduce memory consumption per device, enabling the training of ultra-large-scale models.
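
As an illustration, the sketch below splits a weight matrix by columns across a number of hypothetical devices, multiplies each shard separately, and concatenates the partial results. The helper names are assumptions, and the "devices" are simulated by a plain loop.

```python
def matmul(a, b):
    """Plain matrix multiplication on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def split_columns(w, parts):
    """Split the columns of w into `parts` equally sized shards."""
    cols = len(w[0])
    if cols % parts != 0:
        raise ValueError("column count must be divisible by the number of parts")
    step = cols // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

def tensor_parallel_matmul(x, w, parts):
    """Each shard holds only 1/parts of the weights; outputs are concatenated."""
    partials = [matmul(x, shard) for shard in split_columns(w, parts)]
    return [sum((p[r] for p in partials), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]
w = [[1, 2, 3, 4], [5, 6, 7, 8]]
print(tensor_parallel_matmul(x, w, parts=2))
```

The sharded result matches the unsharded product, while each simulated device stores only half of the weight columns.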

Triton

Triton Inference Server

An open-source inference serving software that lets teams deploy, run, and scale trained AI models from any framework on any GPU- or CPU-based infrastructure (cloud, data center, or edge).

TTP

try to persist

TPOT

Time per output token. The latency for each output token, excluding the first token. In offline batch processing applications, TPOT is a critical metric as it determines the overall duration of the inference process.

TTFT

Time to first token. A key metric in LLM inference representing the latency from the initial input to the generation of the first output token.
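
Both TTFT and TPOT can be derived from the arrival timestamps of the output tokens. This is a minimal sketch; the function names and the timestamp list are hypothetical, with times in seconds.

```python
def ttft(request_time, token_times):
    """Latency from sending the request to receiving the first output token."""
    return token_times[0] - request_time

def tpot(token_times):
    """Average latency per output token, excluding the first token."""
    if len(token_times) < 2:
        raise ValueError("at least two output tokens are required")
    return (token_times[-1] - token_times[0]) / (len(token_times) - 1)

token_times = [0.5, 0.75, 1.0, 1.25]   # arrival time of each output token
print(ttft(0.0, token_times))
print(tpot(token_times))
```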

U

Term/Acronym/Abbreviation

Definition

UCE

uncorrectable memory error

Unified Buffer

An internal storage unit within the AI Core, primarily used for vector computations.

UDF

user-defined function

UUID

universally unique identifier

An identifier standard used in software construction. It was standardized by the Open Software Foundation (OSF) as part of the Distributed Computing Environment (DCE).
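
For example, Python's standard `uuid` module can generate a random (version 4) UUID. The helper name `new_request_id` is hypothetical.

```python
import uuid

def new_request_id():
    """Return a random version 4 UUID in its canonical 36-character text form."""
    return str(uuid.uuid4())

print(new_request_id())
```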

UT

unit test

The lowest level of testing performed during software development, where individual units of software are tested in isolation from the rest of the application.

UDP

User Datagram Protocol

A protocol in the TCP/IP suite that provides a simple interface between the network and application layers. UDP provides unreliable data transfer. Once data is sent to the network layer, no backup is retained. It adds only multiplexing and checksum fields to the IP datagram header.
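
The UDP header itself is only 8 bytes: source port, destination port, length, and checksum, each 16 bits. The sketch below packs such a header; the function name is hypothetical, and the checksum is left at 0 rather than computed.

```python
import struct

def build_udp_header(src_port, dst_port, payload, checksum=0):
    """Pack the four 16-bit UDP header fields in network byte order."""
    length = 8 + len(payload)   # header size plus payload size, in bytes
    return struct.pack("!HHHH", src_port, dst_port, length, checksum)

header = build_udp_header(5000, 53, b"hello")
print(len(header), struct.unpack("!HHHH", header))
```

The two port fields provide the multiplexing between applications; no sequence numbers or acknowledgments are carried.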

V

Term/Acronym/Abbreviation

Definition

vcjob

VolcanoJob, a job type managed by Volcano, a batch scheduling system for Kubernetes.

VDEC

video decoder

A component that provides the capability to decode video streams of specific formats.

VENC

video encoder

A component that provides the capability to encode images into video streams of specific formats.

vector

The vector computing unit within the AI Core, responsible for performing vector operations. It offers lower computing power than the cube unit but higher flexibility (for example, supporting reciprocal and square root operations).

vLLM

An open-source, high-throughput, memory-efficient inference and serving engine for LLMs.

VPC

vision preprocessing core

A hardware unit for processing images in formats such as YUV and RGB, supporting functions like resizing, cropping, image pyramid generation, and color space conversion.

W

Term/Acronym/Abbreviation

Definition

watchdog

watchdog

A hardware device (typically a timer or driver) used to monitor whether a continuously running system is functioning correctly. It communicates with system software through dedicated drivers.

As a timer used to monitor software resource states, it starts counting automatically after the program launches. The program must periodically reset the counter (known as "feeding the dog"). If the counter overflows due to a timeout, a watchdog interrupt is triggered, causing a system reset to prevent infinite loops or hangs.
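
The counting and feeding behavior can be modeled in software as follows. This is a simplified simulation with a hypothetical class name and a tick-driven counter; a real watchdog is driven by a hardware clock and resets the system rather than setting a flag.

```python
class Watchdog:
    """Simulated watchdog that flags a reset when it is not fed in time."""

    def __init__(self, timeout_ticks):
        self.timeout_ticks = timeout_ticks
        self.counter = 0
        self.reset_triggered = False

    def feed(self):
        """Reset the counter ("feeding the dog")."""
        self.counter = 0

    def tick(self):
        """Advance the timer by one tick and check for a timeout."""
        self.counter += 1
        if self.counter >= self.timeout_ticks:
            self.reset_triggered = True   # stands in for the system reset

wd = Watchdog(timeout_ticks=3)
wd.tick(); wd.tick(); wd.feed()       # healthy program: feeds before the timeout
wd.tick(); wd.tick()
print(wd.reset_triggered)             # still running normally
wd.tick()                             # program hangs and misses the next feed
print(wd.reset_triggered)
```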

Y

Term/Acronym/Abbreviation

Definition

service plane

The plane where inference and other service interfaces reside.

Z

Term/Acronym/Abbreviation

Definition

ZeRO

Zero Redundancy Optimizer

An optimizer designed to address memory bottlenecks in large-scale distributed training. It optimizes memory usage by eliminating redundant data, enabling the training of larger models. Compared with traditional data parallelism, ZeRO significantly improves memory efficiency while maintaining computation granularity and communication efficiency.

network-wide comparison

A method of tensor comparison within accuracy comparison tools. It involves performing accuracy comparisons across all operators involved in computations within a network model.