Feature List

MindIE LLM supports five categories of features: basic, long sequence, scheduling, acceleration, and interaction.

Basic Features

For details about the basic features, see Table 1.

**Table 1** Basic features
Type	Name	Description
Quantization	Quantization	Reduces the numerical precision of the model, which decreases the model size, increases the inference speed, and lowers energy consumption. For details, see Quantization.
Multimodal understanding	Multimodal understanding	Processes and understands data of multiple modalities with a deep learning model. For details, see Multimodal Understanding.
Multi-LoRA	Multi-LoRA	Runs the base model with different LoRA weights for inference. For details, see Multi-LoRA.
MoE	MoE	Introduces sparsely activated expert networks to substantially expand the model parameter scale without significantly increasing the computing cost, thereby improving model capability. For details, see MoE.
MoE	Load balancing	Reduces load imbalance among NPUs and improves model inference performance. For details, see Load Balancing.
MoE	External deployment of shared experts	Deploys shared experts on an independent NPU, separate from routed experts and redundant experts. For details, see External Shared Experts.
MoE	MLA	Uses low-rank joint key-value compression to remove the KV cache bottleneck during inference, enabling efficient inference. For details, see MLA.
Parallelism policies	Expert parallelism	Deploys experts on different devices to implement expert-level parallel computing. For details, see Expert Parallelism.
Parallelism policies	Data parallelism	Divides inference requests into multiple batches and allocates them to different devices for parallel processing. For details, see Data Parallelism.
Parallelism policies	Tensor parallelism	Splits tensors (such as weight matrices and activation values) across multiple devices (such as NPUs) to implement distributed model inference. For details, see Tensor Parallelism.
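
To make the tensor parallelism entry in Table 1 more concrete, the following minimal sketch splits a weight matrix column-wise across two simulated devices and checks that the concatenated partial results match the unsplit computation. It is an illustration only: plain Python lists stand in for device tensors, and the helper names (`matmul`, `split_columns`, `tensor_parallel_matmul`) are hypothetical and not part of any MindIE LLM API.

```python
# Illustrative sketch only: plain Python lists stand in for device tensors.
# All function names here are hypothetical, not MindIE LLM APIs.

def matmul(x, w):
    """Multiply a row vector x (length k) by a matrix w (k rows, n columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def split_columns(w, num_devices):
    """Split w column-wise into num_devices equal shards."""
    n = len(w[0])
    assert n % num_devices == 0, "columns must divide evenly in this sketch"
    width = n // num_devices
    return [[row[d * width:(d + 1) * width] for row in w] for d in range(num_devices)]

def tensor_parallel_matmul(x, w, num_devices):
    """Each simulated device multiplies x by its own shard; outputs are concatenated."""
    out = []
    for shard in split_columns(w, num_devices):
        out.extend(matmul(x, shard))  # on real hardware the shards run concurrently
    return out

x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]

full = matmul(x, w)
sharded = tensor_parallel_matmul(x, w, num_devices=2)
print(full == sharded)
```

Column-wise splitting is used here because each shard produces a disjoint slice of the output, so the results only need to be concatenated. Other split layouts require a reduction step across devices instead.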

Long Sequence Features

For details about the long sequence features, see Table 2.

**Table 2** Long sequence features
Feature	Description
Context parallelism	Splits long sequences along the context dimension and allocates the segments to different devices for parallel processing, reducing the time to first token. For details, see Context Parallelism.
Sequence parallelism	Splits the KV cache so that each SP rank stores a different part of it, reducing device memory usage and enabling long sequences. For details, see Sequence Parallelism.

Scheduling Features

For details about the scheduling features, see Table 3.

**Table 3** Scheduling features
Feature	Description
Asynchronous scheduling	Overlaps the time spent in the data preparation and data return phases with the time spent in the model inference phase. This applies to scenarios where maxBatchSize is large and the input and output lengths are long, and it prevents NPU computing resources and device memory from sitting idle. For details, see Asynchronous Scheduling.
SplitFuse	Divides a long prompt into smaller chunks and schedules them in multiple forward steps to reduce the prefill latency. For details, see SplitFuse.
SLO scheduling optimization	Improves system throughput while still meeting the service level objective (SLO) when clients send highly concurrent requests. For details, see SLO Scheduling Optimization.
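
The chunking idea behind the SplitFuse entry in Table 3 can be sketched as follows. This is a minimal illustration under stated assumptions, not the MindIE LLM scheduler: `split_prompt` and `plan_forward_steps` are hypothetical helpers, the chunk size is an arbitrary value, and a real scheduler would also take running decode requests and memory limits into account.

```python
# Illustrative sketch only: hypothetical helpers, not MindIE LLM APIs.

def split_prompt(token_ids, chunk_size):
    """Split a prompt (list of token IDs) into consecutive chunks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [token_ids[i:i + chunk_size] for i in range(0, len(token_ids), chunk_size)]

def plan_forward_steps(token_ids, chunk_size):
    """Assign each chunk to its own forward step, in prompt order."""
    return [
        {"step": step, "tokens": chunk}
        for step, chunk in enumerate(split_prompt(token_ids, chunk_size))
    ]

prompt = list(range(10))          # a 10-token prompt, token IDs 0..9
steps = plan_forward_steps(prompt, chunk_size=4)
for entry in steps:
    print(entry["step"], entry["tokens"])
```

With a chunk size of 4, the 10-token prompt is processed in three forward steps of 4, 4, and 2 tokens, so no single step has to process the whole prompt.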

Acceleration Features

For details about the acceleration features, see Table 4.

**Table 4** Acceleration features
Feature	Description
Micro batch	Divides data into multiple finer-grained batches for processing, making full use of hardware resources and improving inference throughput. For details, see Micro Batch.
Buffer response	Configures the expected SLO latency for both the prefill and decode phases to balance the latency of the two phases and maximize benefits without causing timeouts. For details, see Buffer Response.
Parallel decoding	Uses spare computing power to offset the impact of limited memory bandwidth, improving computing power utilization. For details, see Parallel Decoding.
MTP	Predicts several tokens at once during inference instead of only the next token, which significantly improves the generation efficiency of the model. For details, see MTP.
Prefix cache	Reuses the KV cache of token sequences that repeat across sessions, so the KV cache for those prefix tokens does not need to be recomputed and the prefill time is reduced. For details, see Prefix Cache.
KV cache pooling	Allows larger-capacity storage media such as DRAM and SSDs to be added to the prefix cache pool, overcoming the capacity limit of the on-chip memory. This feature effectively improves the prefix cache hit ratio and significantly reduces the cost of LLM inference. For details, see KV Cache Pooling.
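
The reuse described for the prefix cache in Table 4 can be illustrated with a small block-based lookup. This is a sketch under stated assumptions rather than the MindIE LLM implementation: the `PrefixCache` class, its block size, and the placeholder strings that stand in for real KV tensors are all hypothetical.

```python
# Illustrative sketch only: the class and its fields are hypothetical,
# and strings stand in for real KV cache tensors.

class PrefixCache:
    def __init__(self, block_size):
        self.block_size = block_size
        self.blocks = {}  # maps a token-prefix tuple to a placeholder KV entry

    def match_and_store(self, token_ids):
        """Return how many leading tokens were served from the cache.

        Only whole blocks are cached. Blocks that miss are stored so that
        later requests sharing the same prefix can reuse them.
        """
        reused = 0
        hit = True
        full = len(token_ids) - len(token_ids) % self.block_size
        for end in range(self.block_size, full + 1, self.block_size):
            key = tuple(token_ids[:end])
            if hit and key in self.blocks:
                reused = end          # this block's KV entry is reused
            else:
                hit = False           # first miss ends the reusable prefix
                self.blocks[key] = f"kv[{end - self.block_size}:{end}]"
        return reused

cache = PrefixCache(block_size=2)
first = cache.match_and_store([1, 2, 3, 4, 5])      # cold cache, nothing reused
second = cache.match_and_store([1, 2, 3, 4, 9, 9])  # shares a 4-token prefix
third = cache.match_and_store([1, 2, 7, 8])         # shares a 2-token prefix
print(first, second, third)
```

Keying each block by the full token prefix up to that block means a hit is only possible when every earlier token also matches, which is what makes the stored entry safe to reuse.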

Interaction Features

For details about the interaction features, see Table 5.

**Table 5** Interaction features
Feature	Description
Function call	Enables the LLM to use tools. For details, see Function Call.
Thinking analysis	Parses the output of the LLM and separates the thinking process from the final result. For details, see Thinking Analysis.
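
As a rough illustration of the thinking analysis entry in Table 5, the sketch below separates a reasoning segment from the final answer in a model output. It assumes the reasoning is wrapped in `<think>...</think>` tags, which is a common convention but is an assumption here. The actual delimiters depend on the model, and `split_thinking` is a hypothetical helper, not a MindIE LLM API.

```python
# Illustrative sketch only: the <think> tag format is an assumption,
# and split_thinking is a hypothetical helper.
import re

THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def split_thinking(text):
    """Return (reasoning, answer); reasoning is empty if no tagged segment exists."""
    match = THINK_PATTERN.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    # Everything outside the first tagged segment is treated as the answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer

reasoning, answer = split_thinking("<think>2 + 2 is 4.</think>The answer is 4.")
print(reasoning)
print(answer)
```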
