Introduction

Overview

Mind Inference Engine Large Language Model (MindIE LLM) is the large language model (LLM) inference component of the MindIE solution. It provides common LLM inference capabilities on Ascend hardware, schedules multiple concurrent requests, and supports acceleration features such as Continuous Batching, PageAttention, and FlashDecoding for high-performance inference.

MindIE LLM provides C++ APIs for LLM inference and scheduling.

This document helps you quickly understand MindIE LLM and then deploy and test LLM inference.

MindIE LLM Architecture

Figure 1 MindIE LLM architecture

MindIE LLM consists of four layers: Server, LLM Manager, Text Generator, and Modeling.

  • Server: inference server that provides model inference serving capabilities. EndPoint provides RESTful APIs for inference service developers, encapsulates inference serving protocols and APIs, and supports request APIs of mainstream inference frameworks, such as Triton, OpenAI, TGI, and vLLM.
  • LLM Manager: manages status and schedules tasks, implements batch processing of user requests based on scheduling policies, manages KV cache in a unified memory pool, returns inference results, and provides status recording APIs.
    • LLM Manager Interface: external API of the MindIE LLM inference engine.
    • Engine: connects the scheduler and executor to implement inference processing for requests in multiple scenarios through component collaboration.
    • Scheduler: batches requests during the prefill or decode phase within a data parallelism (DP) domain to fully utilize computing and communication resources.
    • Block Manager: manages KV resources in a DP domain and supports location awareness for offloaded KV after pooling.
    • Executor: distributes scheduling information to the Text Generator module and supports cross-server and cross-device task delivery.
  • Text Generator: handles model configuration, initialization, loading, autoregressive inference, and postprocessing, provides a unified autoregressive inference API for the LLM Manager, and supports running parallel decoding plugins.
    • Preprocess: converts scheduled tasks into model inputs.
    • Generator: abstraction of the model running process.
    • Sampler: selects tokens from the model's output logits, checks the stop conditions, and updates and clears context.
  • Modeling: provides performance-tuned modules and built-in models, and supports two frameworks: Ascend Transformer Boost Models (ATB Models) and MindSpore Models.

    • The built-in modules include Attention, Embedding, ColumnLinear, RowLinear, and Multilayer Perceptron (MLP), which support online splitting and loading of weight tensors.

    • The built-in models are networks composed of the built-in modules and support tensor splitting and multiple quantization modes. You can also build a custom model from the built-in modules by referring to the sample.

    • After a network is composed, compiled, and optimized, an executable graph that accelerates inference on the Ascend NPU is generated.
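
The Scheduler and Block Manager roles described above can be pictured with a toy example. The following is a minimal sketch in Python, not MindIE LLM code: the Request, BlockManager, and Scheduler classes, their methods, and the block size are hypothetical names chosen for illustration, and the model's decode step is replaced by a counter. It shows iteration-level (continuous) batching, where a waiting request joins the batch as soon as a batch slot and enough KV cache blocks are free.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical request record; not a MindIE LLM type."""
    req_id: str
    prompt_len: int
    max_new_tokens: int
    generated: int = 0
    blocks: list = field(default_factory=list)

    @property
    def num_tokens(self):
        return self.prompt_len + self.generated


class BlockManager:
    """Toy KV cache pool split into fixed-size blocks (assumed design)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))

    def ensure(self, req, num_tokens):
        """Grow req.blocks to cover num_tokens; return False if the pool is short."""
        needed = -(-num_tokens // self.block_size)  # ceiling division
        extra = needed - len(req.blocks)
        if extra > len(self.free):
            return False
        for _ in range(extra):
            req.blocks.append(self.free.pop())
        return True

    def release(self, req):
        self.free.extend(req.blocks)
        req.blocks.clear()


class Scheduler:
    """Toy iteration-level scheduler (illustrative only)."""

    def __init__(self, block_manager, max_batch_size):
        self.bm = block_manager
        self.max_batch_size = max_batch_size
        self.waiting = deque()
        self.running = []

    def add(self, req):
        self.waiting.append(req)

    def step(self):
        """Admit waiting requests, decode one token per running request, retire finished ones."""
        # Admission happens every iteration, not only when the whole batch has drained.
        while self.waiting and len(self.running) < self.max_batch_size:
            if not self.bm.ensure(self.waiting[0], self.waiting[0].prompt_len + 1):
                break
            self.running.append(self.waiting.popleft())
        finished = []
        for req in list(self.running):
            if not self.bm.ensure(req, req.num_tokens + 1):
                continue  # no free block this step; a real scheduler would preempt or swap
            req.generated += 1  # stand-in for one model decode step
            if req.generated >= req.max_new_tokens:
                self.bm.release(req)
                self.running.remove(req)
                finished.append(req.req_id)
        return finished


sched = Scheduler(BlockManager(num_blocks=8, block_size=4), max_batch_size=2)
sched.add(Request("a", prompt_len=3, max_new_tokens=2))
sched.add(Request("b", prompt_len=5, max_new_tokens=4))
sched.add(Request("c", prompt_len=2, max_new_tokens=1))
order = []
while sched.waiting or sched.running:
    order.extend(sched.step())
print(order)
```

In this run, request c is admitted in the step right after a finishes, while b is still decoding, so the requests complete in the order a, c, b and all blocks return to the pool.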

Functions and Features

The MindIE LLM features include basic model capabilities and scheduling-related capabilities. For details, see Feature List.

  • Introduction to basic model capabilities
    1. Basic capabilities include floating point, quantization, and parallelism.
      Table 1 Floating-point features

      Floating-Point Feature    Floating-Point Capability
      float16                   √
      bfloat16                  √

      MindIE LLM focuses on high-performance inference. Therefore, it supports the float16 and bfloat16 floating-point formats only. You can change the type by setting the torch_dtype field in the config.json file of your model.
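
As an illustration of the torch_dtype change described above, the following is a minimal sketch in Python. The helper name set_torch_dtype and the sample config contents are assumptions made for this example; only the torch_dtype field and the two supported values come from this document.

```python
import json
import tempfile
from pathlib import Path

# The two floating-point formats this document lists as supported.
SUPPORTED_DTYPES = {"float16", "bfloat16"}


def set_torch_dtype(config_path, dtype):
    """Rewrite the torch_dtype field of a model's config.json in place (hypothetical helper)."""
    if dtype not in SUPPORTED_DTYPES:
        raise ValueError(f"unsupported torch_dtype: {dtype!r}")
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["torch_dtype"] = dtype
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Demo on a throwaway file; the sample fields below are placeholders, not a real model config.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(
        json.dumps({"model_type": "llama", "torch_dtype": "float16"}),
        encoding="utf-8",
    )
    updated = set_torch_dtype(demo_path, "bfloat16")
    reloaded = json.loads(demo_path.read_text(encoding="utf-8"))
```

For a real model, pass the path of the config.json file in the model's weight directory instead of the temporary file.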

      Table 2 Quantization features

      Quantization Feature

      Per Channel

      Per Token

      Per Group

      W8A8

      ×

      W8A16

      ×

      KV Cache INT8

      ×

      ×

      W8A8 sparse quantization

      ×

      ×

      MindIE LLM provides multiple quantization options for inference acceleration; select the one that fits your requirements. For details about how to obtain quantized weights and run quantized inference, see Quantization.
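
The granularity names in Table 2 describe how many values share one quantization scale. The following minimal sketch in Python uses generic symmetric INT8 quantization to show the difference; it is not the MindIE LLM quantization algorithm, and the function names and sample data are assumptions for illustration. It assumes weights are laid out as [output channels, input features] and activations as [tokens, hidden size], so one scale per row is per-channel for weights and per-token for activations, while per-group uses one scale per group of consecutive values within a row.

```python
def quantize_row(values, group_size=None):
    """Symmetric INT8 quantization of one row; returns (quantized values, scales)."""
    group_size = group_size or len(values)  # default: one scale for the whole row
    quantized, scales = [], []
    for start in range(0, len(values), group_size):
        group = values[start:start + group_size]
        # Map the largest magnitude in the group to 127; fall back to 1.0 for all-zero groups.
        scale = max(abs(v) for v in group) / 127 or 1.0
        scales.append(scale)
        quantized.extend(max(-127, min(127, round(v / scale))) for v in group)
    return quantized, scales


def dequantize_row(quantized, scales, group_size=None):
    """Recover approximate floating-point values from INT8 values and their scales."""
    group_size = group_size or len(quantized)
    return [q * scales[i // group_size] for i, q in enumerate(quantized)]


# Sample data (illustrative only).
weights = [[0.5, -1.0, 0.25, 2.0], [0.01, 0.02, -0.03, 0.04]]   # [out_channels, in_features]
activations = [[1.0, -4.0, 2.0, 0.5], [0.1, 0.2, 0.05, -0.1]]   # [tokens, hidden]

per_channel = [quantize_row(row) for row in weights]              # one scale per output channel
per_token = [quantize_row(row) for row in activations]            # one scale per token
per_group = [quantize_row(row, group_size=2) for row in weights]  # one scale per 2 values
```

Finer granularity stores more scales but lets small-magnitude values keep more precision, because they no longer share a scale with a large outlier in the same row.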

      Table 3 Parallelism features

      Parallelism Feature          Parallelism Capability
      Tensor parallelism (TP)      √
      Data parallelism (DP)        √
      Pipeline parallelism (PP)    ×
      Expert parallelism (EP)      √
      Context parallelism (CP)     √
      Sequence parallelism (SP)    √

      MindIE LLM provides the following parallelism strategies: TP, DP, EP, CP, and SP.
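
To show how TP and DP can be combined on one set of devices, the following minimal sketch in Python partitions device ranks into groups. The function name build_groups and the layout (consecutive ranks form a TP group, ranks with the same position across TP groups form a DP group) are assumptions based on a common convention; the rank mapping that MindIE LLM actually uses may differ.

```python
def build_groups(world_size, tp_size):
    """Partition ranks 0..world_size-1 into TP groups and DP groups (assumed layout)."""
    if world_size % tp_size:
        raise ValueError("world_size must be divisible by tp_size")
    dp_size = world_size // tp_size
    # Each TP group holds one full copy of the model, sharded across tp_size ranks.
    tp_groups = [list(range(d * tp_size, (d + 1) * tp_size)) for d in range(dp_size)]
    # Each DP group links the ranks that hold the same shard in different model copies.
    dp_groups = [list(range(t, world_size, tp_size)) for t in range(tp_size)]
    return tp_groups, dp_groups


tp_groups, dp_groups = build_groups(world_size=8, tp_size=4)
print(tp_groups)
print(dp_groups)
```

With 8 ranks and a TP size of 4, this yields two model replicas (DP size 2), each sharded across four ranks.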

    2. Model capabilities

      MindIE LLM supports the following preset models. You can use them as required or customize and migrate your own model.

      • Llama
      • Baichuan
      • Mixtral
      • Qwen
      • BLOOM
      • DeepSeek
      • GLM
  • Introduction to scheduling-related capabilities
    Table 4 Serving features

    Serving Feature

    Serving Capability

    MindIE Motor