Micro Batch

Micro-batch processing splits a batch of data into smaller batches for execution. The current implementation creates an additional data stream and splits each batch into two micro-batches, which are executed on two separate data streams. While one data stream performs computation, the other performs communication. This communication-computation overlap keeps hardware resources busy and improves inference throughput.

Figure 1 Micro-batch processing with dual data streams
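
To make the overlap concrete, the following is a minimal timing sketch, not MindIE code. It assumes a hypothetical model in which every layer runs one compute step followed by one communication step, that the two micro-batches share one compute unit and one communication link, and that halving the batch also halves the duration of each step. The function names and the durations are illustrative only.

```python
def serial_time(num_layers, compute, comm):
    """Single stream: every compute and communication step runs back to back."""
    return num_layers * (compute + comm)


def dual_stream_time(num_layers, compute, comm):
    """Two micro-batches on two streams (toy model, see assumptions above).

    Returns (makespan, overlapped), where overlapped is the total time during
    which computation and communication run simultaneously.
    """
    c, m = compute / 2, comm / 2   # assumed cost of one micro-batch per step
    compute_free = 0.0             # time at which the compute unit becomes idle
    comm_free = 0.0                # time at which the communication link becomes idle
    ready = [0.0, 0.0]             # earliest start of each micro-batch's next compute
    busy_compute, busy_comm = [], []

    for _ in range(num_layers):
        for mb in (0, 1):
            # Compute needs an idle compute unit and this micro-batch's previous result.
            start = max(compute_free, ready[mb])
            compute_free = start + c
            busy_compute.append((start, compute_free))
            # Communication needs an idle link and the compute result it transfers.
            comm_start = max(comm_free, compute_free)
            comm_free = comm_start + m
            busy_comm.append((comm_start, comm_free))
            ready[mb] = comm_free

    # Intervals on the same resource never intersect, so pairwise sums are exact.
    overlapped = sum(
        max(0.0, min(ce, me) - max(cs, ms))
        for cs, ce in busy_compute
        for ms, me in busy_comm
    )
    return comm_free, overlapped


print(serial_time(4, 2.0, 2.0))
print(dual_stream_time(4, 2.0, 2.0))
```

With four layers and equal compute and communication cost (2.0 each), this toy model takes 16.0 time units on a single stream and 9.0 with two streams, and 7.0 of the 8.0 units of communication run concurrently with computation. These figures describe the model only, not measured MindIE performance.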

The data streams are synchronized through events, which ensures that computation and communication tasks do not conflict with each other and prevents hardware resource contention. This feature is typically used in the prefill phase, where communication operators take a long time to execute and the execution times of communication and compute operators are balanced. In this implementation, the overlap between computation and communication exceeds 70%.
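
The event-based handoff can be illustrated with ordinary Python threads. This is a conceptual sketch only: `threading.Event` stands in for device stream events, the two threads stand in for the two data streams, and a log entry replaces each real operator. None of these names come from MindIE, and the ordering rule (stream 0 goes first in every layer) is an assumption made for the example.

```python
import threading


def run_demo(num_layers=2):
    """Run two simulated streams and return the log of executed steps."""
    log = []
    log_lock = threading.Lock()
    # One event per (stream, layer) for each kind of operation.
    compute_done = [[threading.Event() for _ in range(num_layers)] for _ in range(2)]
    comm_done = [[threading.Event() for _ in range(num_layers)] for _ in range(2)]

    def record(kind, stream, layer):
        with log_lock:
            log.append((kind, stream, layer))

    def run_stream(stream):
        other = 1 - stream
        for layer in range(num_layers):
            # Stream 1 waits for stream 0's step in the same layer;
            # stream 0 waits for stream 1's step in the previous layer.
            prev = layer if stream == 1 else layer - 1
            if prev >= 0:
                compute_done[other][prev].wait()   # compute unit is free
            record("compute", stream, layer)
            compute_done[stream][layer].set()
            if prev >= 0:
                comm_done[other][prev].wait()      # communication link is free
            record("comm", stream, layer)
            comm_done[stream][layer].set()

    threads = [threading.Thread(target=run_stream, args=(s,)) for s in (0, 1)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


for entry in run_demo():
    print(entry)
```

The events guarantee that two compute steps never run at the same time, and likewise for two communication steps, while a compute step on one stream is free to run alongside a communication step on the other. The relative order of such a concurrent pair in the log may vary between runs.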

Constraints

This feature cannot be enabled together with the communication-computing fused operator feature.
This feature cannot be enabled together with the Python graph.
This feature can be enabled only together with the quantization feature.
Only the Qwen2.5-14B, Qwen3-14B, DeepSeek-R1, and DeepSeek-V3.1 models support this feature.
Enabling this feature occupies extra graphics memory. In serving scenarios, this can reduce the number of KV caches, which affects scheduling and lowers throughput. Therefore, you are advised not to enable this feature when graphics memory is limited.
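
The constraints above can be collected into a small pre-flight check. The sketch below is illustrative and not part of MindIE: the function name and its boolean arguments are hypothetical, and the caller is assumed to know whether each of the other features is enabled, since this section does not say how they are configured.

```python
# Model names taken from the constraint list above, compared case-insensitively.
SUPPORTED_MODELS = {"qwen2.5-14b", "qwen3-14b", "deepseek-r1", "deepseek-v3.1"}


def check_micro_batch_constraints(model_name, fused_comm_compute, python_graph, quantized):
    """Return a list of constraint violations; an empty list means none were found."""
    problems = []
    if model_name.lower() not in SUPPORTED_MODELS:
        problems.append(f"model {model_name} does not support micro batch")
    if fused_comm_compute:
        problems.append("communication-computing fused operators must be disabled")
    if python_graph:
        problems.append("the Python graph must be disabled")
    if not quantized:
        problems.append("quantization must be enabled")
    return problems


print(check_micro_batch_constraints("Qwen3-14B", False, False, True))
```

The memory constraint is not checked here because it depends on the deployment and cannot be decided from these flags alone.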

Parameters

Table 1 describes the parameters required for enabling the micro batch feature.

**Table 1** Supplementary parameters of the micro batch feature: models in ModelConfig

| Parameter | Value Type | Value Range | Description |
|---|---|---|---|
| stream_options | | | |
| micro_batch | Bool | true, false | Specifies whether to enable the communication-computing dual-stream overlapping feature. The default value is false (disabling the feature). |

Running Inference

Open the config.json file of the Server.

cd {MindIE installation directory}/latest/mindie-service/
vi conf/config.json

Set the serving parameters. Add the micro_batch field, nested under stream_options, to the config.json file of the Server. For details about the field, see Table 1. For details about the serving parameters, see Configuration Parameters (Service-Specific). The following is a parameter configuration example:

"ModelDeployConfig" :
{
   "maxSeqLen" : 2560,
   "maxInputTokenLen" : 2048,
   "truncation" : false,
   "ModelConfig" : [
      {
         "modelInstanceType" : "Standard",
         "modelName" : "Qwen3-14B",
         "modelWeightPath" : "/data/weights/Qwen3-14B",
         "worldSize" : 8,
         "cpuMemSize" : 5,
         "npuMemSize" : -1,
         "backendType" : "atb",
         "trustRemoteCode" : false,
         "models": {
            "qwen3": {
               "ccl": {
                  "enable_mc2": false
               },
               "stream_options": {
                  "micro_batch": true
               }
            }
         }
      }
   ]
},

Start the service.
```
./bin/mindieservice_daemon
```
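
The configuration edit described earlier can also be scripted. The following is a minimal sketch, assuming the `ModelDeployConfig` object has already been loaded as a Python dictionary. The helper name `enable_micro_batch` is hypothetical, the `qwen3` key is taken from the example above, and locating `ModelDeployConfig` inside the full config.json is left out because this section shows only the fragment.

```python
import json


def enable_micro_batch(model_deploy_config, model_key):
    """Set models.<model_key>.stream_options.micro_batch to true for every model entry."""
    for model_config in model_deploy_config["ModelConfig"]:
        models = model_config.setdefault("models", {})
        options = models.setdefault(model_key, {}).setdefault("stream_options", {})
        options["micro_batch"] = True
    return model_deploy_config


# A reduced copy of the example fragment, used here only for demonstration.
fragment = json.loads("""
{
  "maxSeqLen": 2560,
  "ModelConfig": [
    {
      "modelName": "Qwen3-14B",
      "models": {"qwen3": {"ccl": {"enable_mc2": false}}}
    }
  ]
}
""")

enable_micro_batch(fragment, "qwen3")
print(json.dumps(fragment, indent=2))
```

Parsing the file with `json.loads` also catches syntax errors such as trailing commas before the service is started.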
