Feature List
MindIE LLM supports five categories of features: basic features, long sequence features, scheduling features, acceleration features, and interaction features.
Basic Features
For details about the basic features, see Table 1.

| Type | Name | Description |
|---|---|---|
| Quantization | Quantization | Reduces the numerical precision of the model, which decreases the model size, increases the inference speed, and lowers energy consumption. For details, see Quantization. |
| Multimodal understanding | Multimodal understanding | Supports deep learning models that can process and understand data of multiple modalities. For details, see Multimodal Understanding. |
| Multi-LoRA | Multi-LoRA | Runs one base model and applies different LoRA weights to it for inference. For details, see Multi-LoRA. |
| MoE | MoE | Introduces sparsely activated expert networks to substantially expand the model parameter scale without significantly increasing the computing cost, thereby improving the model capability. For details, see MoE. |
| | Load balancing | Reduces the load imbalance among NPUs and improves model inference performance. For details, see Load Balancing. |
| | External deployment of shared experts | Deploys shared experts on an independent NPU so that they are separated from routed experts and redundant experts. For details, see External Shared Experts. |
| | MLA | Uses low-rank joint compression of keys and values to remove the KV cache bottleneck during inference, enabling efficient inference. For details, see MLA. |
| Parallelism policies | Expert parallelism | Deploys experts on different devices to implement expert-level parallel computing. For details, see Expert Parallelism. |
| | Data parallelism | Divides inference requests into multiple batches and allocates them to different devices for parallel processing. For details, see Data Parallelism. |
| | Tensor parallelism | Splits tensors (such as weight matrices and activation values) across multiple devices (such as NPUs) to implement distributed model inference. For details, see Tensor Parallelism. |

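
As an illustration of the tensor parallelism row in Table 1, the following minimal sketch splits a weight matrix by columns, computes each shard's partial output separately, and concatenates the results. It is plain Python rather than MindIE LLM code; the function names and the column-wise split are assumptions made for this example only.

```python
def matmul(x, w):
    """Multiply a row vector x (length k) by a k-by-n matrix w."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def split_columns(w, num_devices):
    """Split the columns of w into num_devices equal, contiguous shards."""
    n = len(w[0])
    if n % num_devices != 0:
        raise ValueError("column count must be divisible by num_devices")
    step = n // num_devices
    return [[row[d * step:(d + 1) * step] for row in w] for d in range(num_devices)]


def tensor_parallel_matmul(x, w, num_devices):
    """Hypothetical helper: each shard stands in for one device's share of w."""
    shards = split_columns(w, num_devices)
    # On real hardware each partial product would run on its own device.
    partials = [matmul(x, shard) for shard in shards]
    # Concatenating the partial outputs plays the role of an all-gather.
    return [value for part in partials for value in part]


x = [1, 2]
w = [[1, 2, 3, 4],
     [5, 6, 7, 8]]
result = tensor_parallel_matmul(x, w, num_devices=2)
```

The sharded result equals the unsharded product `matmul(x, w)`, which is the property that lets each device hold only part of the weights.
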
Long Sequence Features
For details about the long sequence features, see Table 2.

| Feature | Description |
|---|---|
| Context parallelism | Splits long sequences along the context dimension and allocates the parts to different devices for parallel processing, reducing the time to first token. For details, see Context Parallelism. |
| Sequence parallelism | Splits the KV cache so that each sequence-parallel (SP) rank stores a different part of it, reducing device memory usage and supporting long sequences. For details, see Sequence Parallelism. |

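
The idea shared by both rows of Table 2, that each rank handles a different slice of a long sequence, can be sketched as follows. The contiguous split and the `shard_sequence` helper are assumptions for illustration; this page does not specify how MindIE LLM actually assigns tokens to ranks.

```python
def shard_sequence(num_tokens, num_ranks):
    """Return one (start, end) token range per rank, as even as possible.

    Hypothetical helper: ranges are contiguous and half-open, so rank r
    would hold the KV cache entries for tokens start..end-1 only.
    """
    base, extra = divmod(num_tokens, num_ranks)
    shards = []
    start = 0
    for rank in range(num_ranks):
        # The first `extra` ranks take one additional token each.
        length = base + (1 if rank < extra else 0)
        shards.append((start, start + length))
        start += length
    return shards


shards = shard_sequence(num_tokens=10, num_ranks=4)
```

Because the ranges do not overlap, the memory needed per rank shrinks roughly in proportion to the number of ranks.
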
Scheduling Features
For details about the scheduling features, see Table 3.

| Feature | Description |
|---|---|
| Asynchronous scheduling | Overlaps the data preparation and data return phases with the model inference phase, hiding their latency, in scenarios where maxBatchSize is large and the input and output lengths are long. This prevents NPU computing resources and device memory from being wasted. For details, see Asynchronous Scheduling. |
| SplitFuse | Divides a long prompt into smaller chunks and schedules them across multiple forward steps to reduce the prefill latency. For details, see SplitFuse. |
| SLO scheduling tuning | Improves system throughput while still meeting the SLO when clients send highly concurrent requests. For details, see SLO Scheduling Optimization. |

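
The SplitFuse row in Table 3 can be made concrete with a small planning sketch. It assumes a fixed per-step token budget, that each running decode request consumes one token per step, and that the remaining budget is filled with the next chunk of the long prompt. The `plan_steps` function and its parameters are hypothetical and are not MindIE LLM configuration items.

```python
def plan_steps(prompt_len, num_decode_requests, token_budget):
    """Return a list of (prefill_tokens, decode_tokens) pairs, one per forward step.

    Hypothetical planner: decode requests are assumed to stay active
    for the whole time the long prompt is being prefilled.
    """
    if token_budget <= num_decode_requests:
        raise ValueError("token_budget must exceed the number of decode requests")
    steps = []
    remaining = prompt_len
    while remaining > 0:
        # Reserve one token per decode request, give the rest to the prompt chunk.
        chunk = min(remaining, token_budget - num_decode_requests)
        steps.append((chunk, num_decode_requests))
        remaining -= chunk
    return steps


steps = plan_steps(prompt_len=10, num_decode_requests=2, token_budget=6)
```

In this example the 10-token prompt is processed as chunks of 4, 4, and 2 tokens, and the two decode requests keep producing tokens in every step instead of waiting for the whole prefill to finish.
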
Acceleration Features
For details about the acceleration features, see Table 4.

| Feature | Description |
|---|---|
| Micro batch | Divides data into multiple finer-grained batches for processing, making full use of hardware resources and improving inference throughput. For details, see Micro Batch. |
| Buffer response | Configures the expected SLO latency for both the prefill and decode phases to balance the latency of the two phases and maximize gains without causing timeouts. For details, see Buffer Response. |
| Parallel decoding | Uses surplus computing power to offset the impact of limited memory bandwidth, improving computing power utilization. For details, see Parallel Decoding. |
| MTP | Predicts several tokens at once during inference rather than only the next token, which significantly improves the generation efficiency of the model. For details, see MTP. |
| Prefix cache | Reuses the KV cache of token sequences that repeat across sessions so that the KV cache of matching prefix tokens does not need to be recomputed, thereby reducing the prefill time. For details, see Prefix Cache. |
| KV cache pooling | Allows larger-capacity storage media such as DRAM and SSDs to be added to the prefix cache pool, removing the capacity limit of on-chip memory. This improves the prefix cache hit ratio and significantly reduces the cost of LLM inference. For details, see KV Cache Pooling. |

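
To illustrate the prefix cache row in Table 4, the sketch below stores placeholder KV entries per fixed-size block of tokens and reports how many leading tokens of a new request are already cached. The `PrefixCache` class, the block-based keys, and the string placeholders are assumptions for this example; they do not describe the MindIE LLM implementation.

```python
class PrefixCache:
    """Toy prefix cache keyed by whole token prefixes, one entry per full block."""

    def __init__(self, block_size):
        self.block_size = block_size
        self._blocks = {}  # prefix tuple -> placeholder for that block's KV data

    def _block_keys(self, tokens):
        # Only complete blocks are cacheable; a trailing partial block is ignored.
        full = len(tokens) - len(tokens) % self.block_size
        return [tuple(tokens[:end])
                for end in range(self.block_size, full + 1, self.block_size)]

    def insert(self, tokens):
        """Record every complete block of this sequence as cached."""
        for key in self._block_keys(tokens):
            label = f"kv[{len(key) - self.block_size}:{len(key)}]"
            self._blocks.setdefault(key, label)

    def match(self, tokens):
        """Return how many leading tokens already have cached KV blocks."""
        hit = 0
        for key in self._block_keys(tokens):
            if key not in self._blocks:
                break
            hit = len(key)
        return hit


cache = PrefixCache(block_size=2)
cache.insert([1, 2, 3, 4, 5])            # e.g. a shared system prompt
reused = cache.match([1, 2, 3, 4, 9, 9])  # new request with the same prefix
```

Here `reused` is 4, so only the last two tokens of the new request would need prefill computation. A pooled cache as described in the KV cache pooling row would keep more such entries by placing them on larger storage tiers.
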
Interaction Features
For details about the interaction features, see Table 5.

| Feature | Description |
|---|---|
| Function call | Enables the LLM to use tools. For details, see Function Call. |
| Thinking analysis | Parses the output of the LLM structurally and separates the thinking process from the final result. For details, see Thinking Analysis. |

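
The separation described in the thinking analysis row of Table 5 might look like the following sketch. It assumes the model wraps its reasoning in `<think>...</think>` tags, as some reasoning models do; the tag format and the `split_thinking` helper are assumptions, and the markers recognized by MindIE LLM may differ by model.

```python
import re

# Assumed delimiter format; real models may use different markers.
THINK_PATTERN = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_thinking(text):
    """Return (reasoning, answer) extracted from raw model output."""
    match = THINK_PATTERN.search(text)
    if match is None:
        # No thinking section found: treat everything as the answer.
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_thinking("<think>2 + 2 is 4</think>The answer is 4.")
```

A client can then show `answer` to the user and log or hide `reasoning` separately.
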