Overview
Product Introduction
MindIE Motor is a request scheduling framework for LLM prefill-decode disaggregation inference. It provides inference service capabilities through an open and scalable inference service platform architecture, and connects to MindIE LLM to meet the high-performance inference requirements of LLMs. MindIE Motor provides the following capabilities:
- Prefill-decode disaggregation request scheduling: distributes external customer requests to the prefill/decode instance with the lowest load, achieving load balancing.
- Reliability, availability, and serviceability (RAS): enhances the reliability, availability, and serviceability of the prefill-decode disaggregation service.
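The lowest-load scheduling idea can be illustrated with a minimal sketch. It is not MindIE Motor's implementation: the `Instance` class, the `load` field (taken here to be a count of in-flight requests), and the `schedule` function are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    role: str  # "prefill" or "decode"
    load: int  # hypothetical load metric, e.g. number of in-flight requests


def pick_least_loaded(instances, role):
    """Return the instance of the given role with the lowest load."""
    candidates = [i for i in instances if i.role == role]
    if not candidates:
        raise RuntimeError(f"no {role} instance available")
    # min() returns the first candidate on ties, so ordering is stable.
    return min(candidates, key=lambda i: i.load)


def schedule(instances):
    """Pick one prefill and one decode instance for a new request."""
    prefill = pick_least_loaded(instances, "prefill")
    decode = pick_least_loaded(instances, "decode")
    # Account for the new request so later picks see the updated load.
    prefill.load += 1
    decode.load += 1
    return prefill.name, decode.name
```

Calling `schedule` repeatedly spreads requests across instances, because each pick raises the load of the chosen instance before the next decision is made.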
The following figure shows the interaction architecture of MindIE Motor and peripheral components.
Figure 1 MindIE Motor architecture
MindIE Motor provides service scheduling and RAS capabilities for prefill-decode disaggregation inference. The key components and modules are described as follows:
- Coordinator: the entry point for user inference requests and the data request entry of the entire cluster. It receives high-concurrency inference requests and schedules, manages, and forwards them.
  - Endpoint: exposes external RESTful APIs, such as the OpenAI API.
  - Metrics: overall metrics of the prefill-decode disaggregation service, aggregated from the metrics of all prefill/decode instances in the service.
  - Controller Monitor: receives instance status information synchronized by the Controller, such as the health status and faulty instances.
  - LoadBalancer: performs load balancing and scheduling.
  - RequestMonitor: monitors the request status, such as the request phase and request exceptions.
- Controller: implements service status management and control of all prefill/decode (PD) instances, PD identity management and decision-making, and RAS capabilities in a cluster. It is the status manager and decision-making brain of the entire cluster.
  - FaultManager: fault management module, which receives reported faults and handles them through actions such as isolation, restart, and self-healing.
  - InsManager: instance manager, which allocates and adjusts PD instance identities.
  - CCAEReporter: reports O&M management information, such as PD instances and metrics.
  - InsMonitor: monitors PD instances, including their heartbeat and load.
- MindIE LLM: provides inference serving capabilities for a single model service instance (prefiller/decoder), along with LLM acceleration features such as ContinuousBatching, PagedAttention, and speculative inference.
- ClusterD: high-level component of MindCluster, which is responsible for fault diagnosis and delivery of the global rank table (networking and device information required by the entire prefill-decode disaggregation service).
- CCAE: a visualized O&M platform that integrates computing, storage, and network resources.
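The interplay between heartbeat monitoring and fault isolation described above can be sketched as follows. This is an illustrative model only, not the actual MindIE Motor code or protocol: the class names `InsMonitorSketch` and `FaultManagerSketch`, the timeout value, and the explicit `now` timestamps are all assumptions made for the example.

```python
class InsMonitorSketch:
    """Hypothetical heartbeat tracker for prefill/decode instances."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_seen = {}  # instance id -> time of last heartbeat

    def heartbeat(self, instance_id, now):
        self.last_seen[instance_id] = now

    def faulty(self, now):
        """Return ids whose last heartbeat is older than the timeout."""
        return sorted(
            i for i, t in self.last_seen.items() if now - t > self.timeout_s
        )


class FaultManagerSketch:
    """Hypothetical fault handler that isolates faulty instances."""

    def __init__(self):
        self.isolated = set()

    def handle(self, faulty_ids):
        # Isolation only; restart and self-healing are out of scope here.
        self.isolated.update(faulty_ids)

    def schedulable(self, all_ids):
        """Instances a scheduler may still send requests to."""
        return [i for i in all_ids if i not in self.isolated]
```

In this model, a monitor with a 5-second timeout that last heard from instance `d0` at time 0 reports it as faulty at time 10. Once the fault handler isolates it, `d0` no longer appears in the schedulable set, which mirrors how the Coordinator is described as receiving faulty-instance information from the Controller.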