Feature List

MindIE LLM supports five categories of features: basic, long sequence, scheduling, acceleration, and interaction.

Basic Features

For details about the basic features, see Table 1.

Table 1 Basic features

| Type | Name | Description |
|---|---|---|
| Quantization | Quantization | Reduces the model's numerical precision, which decreases model size, increases inference speed, and lowers energy consumption. For details, see Quantization. |
| Multimodal understanding | Multimodal understanding | Supports deep learning models that process and understand data of multiple modalities. For details, see Multimodal Understanding. |
| Multi-LoRA | Multi-LoRA | Runs one base model and applies different LoRA weights for inference. For details, see Multi-LoRA. |
| MoE | MoE | Introduces sparsely activated expert networks to substantially expand the model's parameter scale without significantly increasing computing cost, thereby improving model capability. For details, see MoE. |
| | Load balancing | Reduces load imbalance among NPUs and improves model inference performance. For details, see Load Balancing. |
| | External deployment of shared experts | Deploys shared experts on an independent NPU so that they are separated from routed and redundant experts. For details, see External Shared Experts. |
| | MLA | Uses low-rank joint key-value compression to remove the KV cache bottleneck during inference, enabling efficient inference. For details, see MLA. |
| Parallelism policies | Expert parallelism | Deploys experts on different devices to implement expert-level parallel computing. For details, see Expert Parallelism. |
| | Data parallelism | Divides inference requests into multiple batches and allocates them to different devices for parallel processing. For details, see Data Parallelism. |
| | Tensor parallelism | Splits tensors (such as weight matrices and activation values) across multiple devices (such as NPUs) to implement distributed model inference. For details, see Tensor Parallelism. |
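
The tensor parallelism entry in Table 1 can be illustrated with a minimal, device-free sketch. It assumes a column-wise split of a weight matrix and simulates each device as a plain Python call; the function names (split_columns, tensor_parallel_matmul) are hypothetical and are not part of any MindIE API.

```python
def matmul(x, w):
    """Multiply a vector x (length n) by a matrix w (n rows, m columns)."""
    cols = len(w[0])
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(cols)]


def split_columns(w, num_devices):
    """Split the columns of w into num_devices contiguous shards."""
    cols = len(w[0])
    if cols % num_devices != 0:
        raise ValueError("column count must be divisible by num_devices")
    width = cols // num_devices
    return [[row[d * width:(d + 1) * width] for row in w]
            for d in range(num_devices)]


def tensor_parallel_matmul(x, w, num_devices):
    """Each simulated device multiplies x by its own shard; outputs are concatenated."""
    shards = split_columns(w, num_devices)
    # In a real deployment, each of these products would run on a separate device.
    partial_outputs = [matmul(x, shard) for shard in shards]
    return [value for part in partial_outputs for value in part]


x = [1.0, 2.0]
w = [[1.0, 2.0, 3.0, 4.0],
     [5.0, 6.0, 7.0, 8.0]]
# The sharded result matches the single-device result.
assert tensor_parallel_matmul(x, w, num_devices=2) == matmul(x, w)
```

With a column split, each device needs the full input vector but only its own slice of the weights, and the only communication step is gathering the output slices.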

Long Sequence Features

For details about the long sequence features, see Table 2.

Table 2 Long sequence features

| Feature | Description |
|---|---|
| Context parallelism | Splits long sequences along the context dimension and allocates the parts to different devices for parallel processing, reducing the response time of the first token. For details, see Context Parallelism. |
| Sequence parallelism | Splits the KV cache so that each SP rank stores a different part of it, reducing device memory usage and enabling long sequences. For details, see Sequence Parallelism. |
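
When the KV cache is split across ranks as described for sequence parallelism, attention for a query can still be computed exactly by merging per-rank partial softmax results. The sketch below shows one common way to do this for a single query with scalar values. It is an illustration under that assumption, not a description of how MindIE implements the feature, and all function names are hypothetical.

```python
import math


def full_attention(q, keys, values):
    """Reference: softmax(q . k) weighted sum of scalar values on one device."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)


def partial_attention(q, keys, values):
    """What one rank computes from its KV shard: (local max, exp-sum, weighted sum)."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return top, sum(weights), sum(w * v for w, v in zip(weights, values))


def sharded_attention(q, keys, values, num_ranks):
    """Split the KV cache across ranks, then merge the per-rank partial results."""
    shard = math.ceil(len(keys) / num_ranks)
    partials = [partial_attention(q, keys[i:i + shard], values[i:i + shard])
                for i in range(0, len(keys), shard)]
    # Rescale every partial result to a shared maximum before summing.
    global_top = max(top for top, _, _ in partials)
    denom = sum(s * math.exp(top - global_top) for top, s, _ in partials)
    numer = sum(o * math.exp(top - global_top) for top, _, o in partials)
    return numer / denom


q = [0.5, -1.0]
keys = [[1.0, 0.0], [0.0, 1.0], [2.0, 1.0], [-1.0, 3.0], [0.5, 0.5], [1.5, -2.0]]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
assert abs(sharded_attention(q, keys, values, 3) - full_attention(q, keys, values)) < 1e-9
```

Each rank holds only its own slice of keys and values, so per-device memory shrinks roughly in proportion to the number of ranks, while only three numbers per rank need to be exchanged for the merge in this simplified case.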

Scheduling Features

For details about the scheduling features, see Table 3.

Table 3 Scheduling features

| Feature | Description |
|---|---|
| Asynchronous scheduling | Hides the time spent in the data preparation and data return phases behind the model inference phase when maxBatchSize is large and the input and output lengths are long. This prevents NPU computing resources and device memory from being wasted. For details, see Asynchronous Scheduling. |
| SplitFuse | Divides a long prompt into smaller chunks and schedules them over multiple forward steps to reduce the prefill latency. For details, see SplitFuse. |
| SLO scheduling optimization | Improves system throughput while meeting the SLO when clients send highly concurrent requests. For details, see SLO Scheduling Optimization. |
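
The chunking idea behind SplitFuse can be sketched as follows. This is a simplified illustration, not the MindIE scheduler: it assumes a fixed per-step token budget, assumes that each running decode request consumes one token of that budget, and uses hypothetical names (split_prompt, plan_steps, token_budget).

```python
def split_prompt(prompt_tokens, chunk_size):
    """Cut a prompt into consecutive chunks of at most chunk_size tokens."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [prompt_tokens[i:i + chunk_size]
            for i in range(0, len(prompt_tokens), chunk_size)]


def plan_steps(prompt_tokens, num_decode_requests, token_budget):
    """Fill each forward step with one token per decode request plus one prompt chunk."""
    # Budget left for prefill after reserving one token per decode request.
    chunk_size = token_budget - num_decode_requests
    steps = []
    for chunk in split_prompt(prompt_tokens, chunk_size):
        steps.append({"decode_tokens": num_decode_requests, "prefill_chunk": chunk})
    return steps


# A 10-token prompt, 2 running decode requests, 6 tokens per forward step.
for step in plan_steps(list(range(10)), num_decode_requests=2, token_budget=6):
    print(step)
```

Because no single step processes the whole prompt, running decode requests keep producing tokens while the long prompt is prefilled piece by piece.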

Acceleration Features

For details about the acceleration features, see Table 4.

Table 4 Acceleration features

| Feature | Description |
|---|---|
| Micro batch | Divides a batch into multiple finer-grained micro batches for processing, fully utilizing hardware resources and improving inference throughput. For details, see Micro Batch. |
| Buffer response | Configures the expected SLO latency for both the prefill and decode phases to balance the latency of the two phases and maximize gains without timeouts. For details, see Buffer Response. |
| Parallel decoding | Uses spare computing power to offset the impact of limited memory bandwidth, improving compute utilization. For details, see Parallel Decoding. |
| MTP | Predicts several tokens at once during inference instead of only the next token, which significantly improves the model's generation efficiency. For details, see MTP. |
| Prefix cache | Reuses the KV cache of repeated token sequences across sessions so that the KV cache for shared prefix tokens does not have to be recomputed, thereby reducing the prefill time. For details, see Prefix Cache. |
| KV cache pooling | Allows larger-capacity storage media such as DRAM and SSDs to be added to the prefix cache pool, removing the capacity limit of on-chip memory. This improves the prefix cache hit ratio and significantly reduces the cost of LLM inference. For details, see KV Cache Pooling. |
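
The prefix cache entry in Table 4 can be illustrated with a toy lookup structure. The sketch assumes that KV data is cached in fixed-size token blocks and that a block can be reused only when the entire prefix up to and including that block matches. The PrefixCache class and its methods are hypothetical, and the stored strings are placeholders for real KV tensors.

```python
class PrefixCache:
    """Toy prefix cache keyed by the token prefix ending at each full block."""

    def __init__(self, block_size):
        self.block_size = block_size
        self.blocks = {}  # prefix tuple -> placeholder for that block's KV data

    def _block_keys(self, tokens):
        """Yield one key per complete block; each key covers the whole prefix so far."""
        full = len(tokens) - len(tokens) % self.block_size
        for end in range(self.block_size, full + 1, self.block_size):
            yield tuple(tokens[:end])

    def insert(self, tokens):
        """Record KV placeholders for every complete block of a processed prompt."""
        for key in self._block_keys(tokens):
            start = len(key) - self.block_size
            self.blocks.setdefault(key, f"kv[{start}:{len(key)}]")

    def match(self, tokens):
        """Return how many leading tokens already have cached KV blocks."""
        matched = 0
        for key in self._block_keys(tokens):
            if key not in self.blocks:
                break
            matched = len(key)
        return matched


cache = PrefixCache(block_size=4)
cache.insert([1, 2, 3, 4, 5, 6, 7, 8, 9])  # caches two full blocks
# A new request sharing the first 8 tokens only needs prefill for the rest.
assert cache.match([1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13]) == 8
```

Only the tokens after the matched prefix need their KV cache computed during prefill, which is where the time saving comes from.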

Interaction Features

For details about the interaction features, see Table 5.

Table 5 Interaction features

| Feature | Description |
|---|---|
| Function call | Enables the LLM to use tools. For details, see Function Call. |
| Thinking analysis | Structurally analyzes the output content of the LLM and separates the thinking process from the output result. For details, see Thinking Analysis. |
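
The separation performed by thinking analysis can be sketched as a small parser. The example assumes that the model wraps its reasoning in <think>...</think> tags, a convention used by some reasoning models. The actual delimiters depend on the model, and split_thinking is a hypothetical helper rather than a MindIE function.

```python
import re

# Assumed delimiter convention; real models may use different tags.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (thinking, answer) from raw model output; thinking is '' if no tags are found."""
    match = THINK_PATTERN.search(text)
    if match is None:
        return "", text.strip()
    thinking = match.group(1).strip()
    # Everything outside the first tagged span is treated as the final answer.
    answer = (text[:match.start()] + text[match.end():]).strip()
    return thinking, answer


raw = "<think>2 + 2 is 4.</think>\nThe answer is 4."
thinking, answer = split_thinking(raw)
assert thinking == "2 + 2 is 4."
assert answer == "The answer is 4."
```

Returning the two parts separately lets a client show or hide the reasoning without re-parsing the raw text.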