Introduction

Overview

Mind Inference Engine Large Language Model (MindIE LLM) is an inference component designed for large language models (LLMs) in the MindIE solution. It provides common LLM inference capabilities based on the Ascend hardware, schedules multiple concurrent requests, and supports acceleration features such as Continuous Batching, PageAttention, and FlashDecoding, enabling high-performance inference.

MindIE LLM provides C++ APIs for LLM inference and scheduling.

This document helps you quickly understand MindIE LLM, and deploy and test LLM inference.

MindIE LLM Architecture

Figure 1 MindIE LLM architecture

MindIE LLM consists of four layers: Server, LLM Manager, Text Generator, and Modeling.

Server: inference server that provides model inference serving capabilities. EndPoint provides RESTful APIs for inference service developers, encapsulates inference serving protocols and APIs, and supports request APIs of mainstream inference frameworks, such as Triton, OpenAI, TGI, and vLLM.

LLM Manager: manages status and schedules tasks, implements batch processing of user requests based on scheduling policies, manages KV cache in a unified memory pool, returns inference results, and provides status recording APIs.
- LLM Manager Interface: external API of the MindIE LLM inference engine.
- Engine: connects the scheduler and executor to implement inference processing for requests in multiple scenarios through component collaboration.
- Scheduler: implements batch request processing during the prefill or decode phase in a DP domain to fully utilize computing and communication resources.
- Block Manager: manages KV resources in a DP domain and supports location awareness for offloaded KV after pooling.
- Executor: distributes the scheduled information to the Text Generator module and supports cross-server and -device task delivery.
Text Generator: executes model configuration, initialization, loading, autoregressive inference, and postprocessing, provides a unified autoregressive inference API for the LLM Manager, and supports parallel decoding plugin running.
- Preprocess: converts scheduled tasks into model inputs.
- Generator: abstraction of the model running process.
- Sampler: selects tokens, determines the stop conditions, updates and clears context for the model's output logits.
Modeling: provides modules and built-in models after performance tuning, and supports the following two frameworks: Ascend Transformer Boost Models (ATB Models) and MindSpore Models.
- The built-in modules include Attention, Embedding, ColumnLinear, RowLinear, and Multilayer Perceptron (MLP), which support online splitting and loading of weight tensors.
- The built-in models use built-in modules for networking combinations, supporting tensor splitting and multiple quantization modes. You can also customize a model based on the built-in module networking by referring to the sample.
- After the networking combinations are complete and the models are compiled and optimized, executable graphs that can accelerate inference on the Ascend NPU are generated.

Functions and Features

The MindIE LLM features include basic model capabilities and scheduling-related capabilities. For details, see Feature List.

Introduction to basic model capabilities

Basic capabilities include floating point, quantization, and parallelism.

**Table 1** Floating-point feature
Floating-Point Feature	Floating-Point Capability
float16	√
bfloat16	√

MindIE LLM focuses on high-performance inference. Therefore, it supports the float16 and bfloat16 floating-point formats only. You can change the type by setting the torch_dtype field in the config.json file of your model.

**Table 2** Quantitative features
Quantitative Feature	Per Channel	Per Token	Per Group
W8A8	√	√	×
W8A16	√	×	√
KV Cache INT8	√	×	×
W8A8 sparse quantization	√	×	×

MindIE LLM provides multiple quantization options for inference acceleration. You can select an option as required. For details about how to obtain the quantization weight and run quantization inference, see Quantization.

**Table 3** Parallelism features
Parallelism Feature	Parallelism Capability
Tensor parallelism (TP)	√
Data parallelism (DP)	√
Pipeline parallelism (PP)	×
Expert parallelism (EP)	√
Context parallelism (CP)	√
Sequence parallelism (SP)	√

MindIE LLM provides the following parallelism strategies: TP, DP, EP, CP, and SP.

Model capabilities
MindIE LLM supports the following preset models. You can use them as required or customize and migrate your own model.
- Llama
- Baichuan
- Mixtral
- Qwen
- BLOOM
- DeepSeek
- GLM

Introduction to scheduling-related capabilities
Table 4 Serving features
Serving Feature

Serving Capability

MindIE Motor

√