Introduction

Mind Inference Engine Stable Diffusion (MindIE SD) aims to provide an Ascend-affinity multimodal acceleration suite that works with industry model suites (such as Diffusers) to improve the efficiency of multimodal inference on Ascend. It focuses on providing key operators and fused operators for multimodal generation. By combining strategies such as Ascend-affinity quantization and sparsity algorithms, memory-centric computing, and multi-device parallelism, it enables rapid migration and Ascend-based acceleration of Diffusers models. Future releases are planned to extend acceleration to scenarios such as multimodal understanding and omni-modality.

The modules of MindIE SD are independent and decoupled, so they can be used individually or in combination. Similar acceleration methods, such as Cache-dit and xDiT, already exist in the industry. Because their effects overlap with those of the MindIE SD cache and parallelism modules, choosing between them involves trade-offs. The other MindIE SD components can still be used on their own or combined with these external methods.

Built on the PyTorch framework, MindIE SD exposes Ascend acceleration capabilities to external users. These capabilities can be used independently and are organized into modules such as cache, parallelism, quantization, layer, and kernel. The relevant APIs comply with the Diffusers API definitions. For Diffusers models that have been accelerated on Ascend with MindIE SD, see Modelers and ModelZoo. Simple plug-in adaptation of Diffusers models is also supported. Figure 1 shows the architecture, and Table 1 describes the modules.
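
Plug-in adaptation of an existing model usually amounts to swapping selected layers for accelerated equivalents that keep the same call signature. The following is a minimal, framework-free sketch of that idea. The names `Layer`, `Attention`, `FastAttention`, and `swap_layers` are hypothetical and chosen for illustration only; they are not MindIE SD or Diffusers APIs, and the arithmetic stands in for real tensor operations.

```python
from typing import Callable, Dict


class Layer:
    """Minimal stand-in for a model layer that holds named child layers."""

    def __init__(self, **children: "Layer") -> None:
        self.children: Dict[str, "Layer"] = dict(children)

    def __call__(self, x: float) -> float:
        # Default behaviour: run the children in insertion order.
        for child in self.children.values():
            x = child(x)
        return x


class Attention(Layer):
    """Hypothetical reference layer (placeholder arithmetic)."""

    def __call__(self, x: float) -> float:
        return x * 2.0


class FastAttention(Attention):
    """Hypothetical accelerated drop-in with the same call signature."""

    def __call__(self, x: float) -> float:
        return x + x  # same result, different implementation


def swap_layers(root: Layer,
                match: Callable[[Layer], bool],
                make: Callable[[Layer], Layer]) -> int:
    """Replace every matching descendant of `root` in place; return the count."""
    swapped = 0
    for name, child in list(root.children.items()):
        if match(child):
            root.children[name] = make(child)
            swapped += 1
        else:
            swapped += swap_layers(child, match, make)
    return swapped


model = Layer(block0=Layer(attn=Attention()), block1=Layer(attn=Attention()))
before = model(1.5)
count = swap_layers(model,
                    lambda m: type(m) is Attention,
                    lambda m: FastAttention())
after = model(1.5)
```

Because the replacement keeps the original interface, the surrounding model code does not change. In a real PyTorch model the same pattern would walk the module tree and reassign submodules.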

Figure 1 Architecture for vision generation and inference
Table 1 Main functional modules

| Category | Functional Module | Description |
| --- | --- | --- |
| Basic modules | Layer module | Provides the basic external acceleration APIs, including layers for features such as attn, moe, and quant. It is the basis of the advanced features and can be used independently. |
| Basic modules | Kernel module | Provides high-performance Ascend kernels for multimodal generation and supports operator integration via programming models such as AscendC and Triton. |
| Basic modules | Compilation module | Builds on FX graph capabilities: once compilation is enabled, fusion passes are applied to achieve automatic Ascend-affinity acceleration. |
| Advanced modules | Quantization module | Supports automatic enabling of the quantization capability. |
| Advanced modules | Cache module | Implements memory-centric computing capabilities to deliver hardware acceleration. |
| Advanced modules | Parallelism module | Provides multi-device distributed parallel acceleration, implemented in coordination with the layer module and PyTorch. |
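
One common reading of memory-centric computing in diffusion inference is trading memory for computation: a block's output that changes little between adjacent denoising steps is stored and reused instead of being recomputed. The sketch below illustrates that general idea under this assumption. `StepCache`, `expensive_block`, and the fixed `interval` policy are hypothetical and are not the MindIE SD cache module's API or algorithm.

```python
from typing import Callable, Optional


class StepCache:
    """Reuse a block's output across several denoising steps (illustrative)."""

    def __init__(self, block: Callable[[float], float], interval: int) -> None:
        if interval < 1:
            raise ValueError("interval must be >= 1")
        self.block = block
        self.interval = interval
        self.cached: Optional[float] = None
        self.compute_calls = 0

    def __call__(self, step: int, x: float) -> float:
        # Recompute on every `interval`-th step; otherwise return the stored value.
        if self.cached is None or step % self.interval == 0:
            self.cached = self.block(x)
            self.compute_calls += 1
        return self.cached


def expensive_block(x: float) -> float:
    """Placeholder for a costly transformer block."""
    return x * x + 1.0


cache = StepCache(expensive_block, interval=3)
outputs = [cache(step, float(step)) for step in range(6)]
```

Here six steps trigger only two real computations. The reused outputs are stale approximations, so any such scheme has to balance speed against output quality, which is why the choice of caching policy matters.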