Introduction

In MoE architectures, the number of input tokens routed to each expert can vary significantly, resulting in imbalanced AlltoAll communication and uneven expert workloads. NPUs hosting hot experts run short of compute and communication resources, while NPUs hosting cold experts are underutilized, and overall performance degrades. The load balancing feature is designed to reduce this NPU resource imbalance and improve model inference performance.
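To make the imbalance concrete, the following minimal sketch counts how many token slots each expert receives from a top-k routing result and reports the ratio of the hottest expert's load to the mean load. The function names and the sample routing data are hypothetical and are not part of MindIE; the ratio is just one simple way to quantify imbalance.

```python
from collections import Counter


def expert_load(topk_ids, num_experts):
    """Count how many token slots are routed to each expert."""
    counts = Counter(e for token in topk_ids for e in token)
    return [counts.get(e, 0) for e in range(num_experts)]


def imbalance_ratio(load):
    """Max load divided by mean load; 1.0 means perfectly balanced."""
    mean = sum(load) / len(load)
    return max(load) / mean if mean else 0.0


# Hypothetical routing result: 4 tokens, top-2 routing, 4 experts.
# Expert 0 is "hot": every token selects it.
topk_ids = [[0, 1], [0, 2], [0, 1], [0, 3]]
load = expert_load(topk_ids, num_experts=4)
print(load)                   # [4, 2, 1, 1]
print(imbalance_ratio(load))  # 2.0
```

A ratio of 2.0 here means the NPU hosting expert 0 handles twice the average number of token slots, which is the kind of skew the load balancing feature aims to reduce.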

MindIE supports two load balancing modes: static load balancing in redundancy mode and forcible load balancing.

Static load balancing in redundancy mode: Redundant experts are deployed to share the load of hot experts, which evens out the workload across NPUs.
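As an illustration of the idea behind redundant experts, the sketch below greedily assigns extra replicas to whichever expert currently has the highest load per replica. This is an assumed heuristic for illustration only; the source does not describe how MindIE selects or places redundant experts, and the function name and load figures are hypothetical.

```python
def place_redundant_experts(load, num_redundant):
    """Greedily give each extra replica to the expert with the highest per-replica load."""
    replicas = [1] * len(load)  # every expert starts with one copy
    for _ in range(num_redundant):
        hottest = max(range(len(load)), key=lambda e: load[e] / replicas[e])
        replicas[hottest] += 1
    return replicas


# Hypothetical per-expert token counts; expert 0 is hot.
load = [8, 2, 1, 1]
replicas = place_redundant_experts(load, num_redundant=2)
print(replicas)  # [3, 1, 1, 1]

# Load per replica after redundancy, assuming tokens split evenly across copies.
per_replica = [l / r for l, r in zip(load, replicas)]
print(max(per_replica) < max(load))  # True
```

In this example the peak load on any single expert copy drops from 8 token slots to about 2.67, at the cost of hosting two additional expert copies.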

Forcible load balancing: The outputs of the top-k operator are mocked by replacing the original top-k outputs with fake tensors that guarantee a perfectly even load across experts. This mode only provides a theoretical upper bound on what load balancing can achieve. Because it changes the actual expert routing of the model, it cannot be used in production services.
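The effect of mocked top-k outputs can be sketched as follows: instead of using the router's real selections, expert IDs are assigned round-robin so that every expert receives the same number of token slots. This is a hypothetical illustration in plain Python, not MindIE's implementation; the function name is invented, routing weights are ignored, and the result is exactly even only when `num_tokens * top_k` is a multiple of `num_experts`.

```python
from collections import Counter


def forced_balanced_topk(num_tokens, top_k, num_experts):
    """Build fake top-k expert IDs that spread token slots round-robin across experts."""
    if top_k > num_experts:
        raise ValueError("top_k must not exceed num_experts")
    return [
        [(t * top_k + j) % num_experts for j in range(top_k)]
        for t in range(num_tokens)
    ]


fake_ids = forced_balanced_topk(num_tokens=4, top_k=2, num_experts=4)
print(fake_ids)  # [[0, 1], [2, 3], [0, 1], [2, 3]]

# Every expert receives the same number of token slots.
counts = Counter(e for token in fake_ids for e in token)
print(sorted(counts.items()))  # [(0, 2), (1, 2), (2, 2), (3, 2)]
```

Because the fake IDs ignore what the router actually chose, the model's outputs are no longer meaningful; the value of such a run is only as a performance reference point.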
