Introduction

Overview

MindIE Turbo is Huawei's acceleration plugin library for LLM inference engines on Ascend hardware. It is designed to provide Huawei-developed LLM optimization algorithms and framework-level enhancements, exposed through modular plugin interfaces so that third-party inference engines can be integrated and accelerated with minimal effort. As the product evolves, optimized module capabilities are migrated to the community repository in successive MindIE versions. Each migration is described separately.

Compared with MindIE Turbo 2.0.RC2, MindIE Turbo 2.1.RC1 migrates the W8A8-related quantization capability of the Quantize module to vLLM Ascend. The attention quantization capability of the Quantize module is supported only in MindIE Turbo 2.0.RC2 and is not carried over to 2.1.RC1.
Compared with MindIE Turbo 2.0.RC2, MindIE Turbo 2.1.RC1 also migrates the high-performance operator enablement capability to vLLM Ascend.

MindIE Turbo Architecture

Figure 1 MindIE Turbo architecture

Framework

vLLM: vLLM is an open-source, high-speed inference framework for LLMs. It aims to substantially improve the throughput and memory efficiency of language model serving in real-time scenarios, and to provide easy-to-use, fast, and low-cost LLM services. Currently, MindIE Turbo adapts to the vLLM framework through vLLM Ascend to accelerate inference. This adaptation is implemented by the vLLM Adaptor shown in the architecture diagram.

Application Scenario

MindIE Turbo is a Huawei-developed performance plugin that provides optimization algorithms and inference framework enhancements. Currently, it supports adaptation to vLLM. By integrating with vLLM and vLLM Ascend, MindIE Turbo delivers higher performance and additional inference optimization algorithms.

In practice, you only need to install MindIE Turbo in the Python environment that runs vLLM. When vLLM starts, vLLM Ascend automatically detects MindIE Turbo and activates it. MindIE Turbo then applies patches that replace or decorate the implementations of some vLLM and vLLM Ascend interfaces to optimize performance. No changes to your own code are required.
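
The detect-then-patch flow described above can be illustrated with a minimal Python sketch. It is not MindIE Turbo's actual implementation: the framework object, its forward and sample interfaces, and the helpers plugin_available, replace, decorate, and activate are all hypothetical names invented for this example. The sketch only shows the general technique of checking whether an optional package is installed and, if so, replacing one interface and decorating another at run time.

```python
import functools
import importlib.util
import types


def plugin_available(package_name: str) -> bool:
    """Return True if a top-level package can be imported in this environment."""
    return importlib.util.find_spec(package_name) is not None


def make_framework() -> types.SimpleNamespace:
    """Build a stand-in for a framework module (hypothetical, not a vLLM module)."""

    def _baseline_forward(values):
        return [v * 2 for v in values]

    def _baseline_sample(logits):
        return logits.index(max(logits))

    return types.SimpleNamespace(forward=_baseline_forward, sample=_baseline_sample)


def fast_forward(values):
    """Hypothetical optimized replacement that returns the same result."""
    return [v + v for v in values]


def with_call_count(fn):
    """Hypothetical decorator: keep the original behavior and count calls."""

    def wrapper(*args, **kwargs):
        wrapper.calls += 1
        return fn(*args, **kwargs)

    wrapper.calls = 0
    return wrapper


def replace(target, name, new_impl):
    """Patch by replacement: swap target.<name> for new_impl."""
    original = getattr(target, name)
    setattr(target, name, new_impl)
    return original


def decorate(target, name, decorator):
    """Patch by decoration: wrap target.<name> and keep its metadata."""
    original = getattr(target, name)
    setattr(target, name, functools.wraps(original)(decorator(original)))
    return original


def activate(target, package_name: str) -> bool:
    """Apply the patches only when the optional package is installed."""
    if not plugin_available(package_name):
        return False
    replace(target, "forward", fast_forward)
    decorate(target, "sample", with_call_count)
    return True


if __name__ == "__main__":
    fw = make_framework()
    # "json" is a standard-library module used here as a stand-in for an
    # installed plugin package; the real package name is not assumed.
    print(activate(fw, "json"))
    print(fw.forward([1, 2, 3]))
    print(fw.sample([0.1, 0.9, 0.3]), fw.sample.calls)
```

Code that calls the framework's interfaces stays unchanged in this pattern. It picks up the patched implementations only when the optional package is present, which is why installing the plugin is the only step the user takes.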