Micro Batch

Micro-batch processing splits a batch of data into smaller batches for execution. The current implementation creates one additional data stream and splits each batch into two micro-batches, which run on two separate data streams. While data stream 1 performs computation, data stream 2 performs communication. Overlapping communication with computation keeps hardware resources more fully utilized and improves inference throughput.

Figure 1 Micro-batch processing with dual data streams
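The dual-stream schedule described above can be approximated with a toy timeline model. The following is a minimal sketch, not the MindIE scheduler: the names Task, schedule_dual_stream, and overlapped_time are hypothetical, and the per-layer durations are made-up values chosen only to show how the two streams interleave.

```python
from dataclasses import dataclass


@dataclass
class Task:
    stream: int   # data stream that issued the task (1 or 2)
    kind: str     # "compute" or "comm"
    layer: int
    start: float
    end: float


def schedule_dual_stream(num_layers, compute_ms, comm_ms):
    """Greedy toy schedule for two micro-batches on two streams.

    Each stream runs compute then comm for every layer, in order. The
    compute engine and the comm engine each run one task at a time.
    """
    engine_free = {"compute": 0.0, "comm": 0.0}
    stream_free = {1: 0.0, 2: 0.0}
    tasks = []
    for layer in range(num_layers):
        for kind, duration in (("compute", compute_ms), ("comm", comm_ms)):
            for stream in (1, 2):
                # A task starts once both its engine and its stream are free.
                start = max(engine_free[kind], stream_free[stream])
                end = start + duration
                engine_free[kind] = stream_free[stream] = end
                tasks.append(Task(stream, kind, layer, start, end))
    return tasks


def overlapped_time(tasks):
    """Total time during which compute and comm run simultaneously."""
    compute = [t for t in tasks if t.kind == "compute"]
    comm = [t for t in tasks if t.kind == "comm"]
    return sum(
        max(0.0, min(a.end, b.end) - max(a.start, b.start))
        for a in compute
        for b in comm
    )


# Hypothetical durations: 1 ms compute and 1 ms comm per layer per micro-batch.
tasks = schedule_dual_stream(num_layers=2, compute_ms=1.0, comm_ms=1.0)
makespan = max(t.end for t in tasks)
serial = sum(t.end - t.start for t in tasks)  # same tasks with no overlap
print(makespan, serial, overlapped_time(tasks))
```

With these made-up durations the script prints 5.0 8.0 3.0: the overlapped schedule finishes in 5 ms instead of 8 ms, and 3 ms of the 4 ms of communication runs concurrently with computation. Real overlap depends on the actual operator durations.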

Data streams are synchronized using the event mechanism, which ensures that computation and communication tasks do not conflict with each other and prevents hardware resource contention. This feature is typically used in the prefill phase, where communication operators take a long time to execute and the execution times of communication and compute operators are roughly balanced. In this implementation, the overlap between computation and communication exceeds 70%.
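Event-based ordering between two streams can be illustrated on the host side with Python threads. This is only an analogy under stated assumptions: threads stand in for device data streams, threading.Event stands in for device events, and run_dual_streams is a hypothetical name, not a MindIE or ATB API. Each stream waits for the other stream's event before using the compute or communication engine, so the two streams never use the same engine at the same time.

```python
import threading


def run_dual_streams(num_layers):
    """Run two 'streams' (threads) whose tasks are ordered by events."""
    logs = {"compute": [], "comm": []}
    # done[kind][stream][layer] is set once that task has finished.
    done = {
        kind: {s: [threading.Event() for _ in range(num_layers)] for s in (1, 2)}
        for kind in ("compute", "comm")
    }

    def stream(stream_id):
        for layer in range(num_layers):
            for kind in ("compute", "comm"):
                # Wait until the other stream has released this engine.
                if stream_id == 2:
                    done[kind][1][layer].wait()
                elif layer > 0:
                    done[kind][2][layer - 1].wait()
                logs[kind].append((stream_id, layer))  # stand-in for real work
                done[kind][stream_id][layer].set()     # "record" the event

    threads = [threading.Thread(target=stream, args=(s,)) for s in (1, 2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return logs["compute"], logs["comm"]


compute_order, comm_order = run_dual_streams(num_layers=2)
print(compute_order)
print(comm_order)
```

On each engine the streams alternate: stream 1 handles a layer, then stream 2 handles the same layer. While one stream holds the compute engine, the other is free to use the communication engine, which is where the overlap comes from.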

Constraints

  • This feature cannot be enabled together with the communication-computing fused operator feature.
  • This feature cannot be enabled together with the Python graph.
  • This feature can be enabled only together with the quantization feature.
  • Only the Qwen2.5-14B, Qwen3-14B, DeepSeek-R1, and DeepSeek-V3.1 models support this feature.
  • Enabling this feature occupies extra graphics memory. In serving scenarios, a reduced number of KV caches affects scheduling and lowers throughput. Therefore, you are advised not to enable this feature when graphics memory is limited.
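The constraints above can be collected into a small pre-flight check. The sketch below is illustrative only: the function name and its boolean arguments are hypothetical and do not correspond to any MindIE API. Only the model list and the rules themselves come from the list above.

```python
# Models listed as supporting the micro batch feature.
SUPPORTED_MODELS = {"Qwen2.5-14B", "Qwen3-14B", "DeepSeek-R1", "DeepSeek-V3.1"}


def micro_batch_problems(model_name, fused_comm_compute, python_graph, quantized):
    """Return a list of reasons why micro batch should not be enabled."""
    problems = []
    if model_name not in SUPPORTED_MODELS:
        problems.append(f"model {model_name!r} does not support micro batch")
    if fused_comm_compute:
        problems.append("communication-computing fused operators are enabled")
    if python_graph:
        problems.append("the Python graph is enabled")
    if not quantized:
        problems.append("quantization is not enabled")
    return problems


# Hypothetical deployment: a supported, quantized model with no conflicts.
print(micro_batch_problems("Qwen3-14B", False, False, True))
```

An empty list means none of the listed constraints is violated. The memory constraint is not checked here because it depends on the deployment.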

Parameters

Table 1 describes the parameters required for enabling the micro batch feature.

Table 1 Supplementary parameters of the micro batch feature: models in ModelConfig

| Parameter | Value Type | Value Range | Description |
|---|---|---|---|
| stream_options (group) | | | |
| micro_batch | Bool | true, false | Specifies whether to enable the communication-computing dual-stream overlapping feature. The default value is false (disabling the feature). |
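As a reading aid, the default can be expressed as a small lookup. This sketch assumes the nesting used in the sample configuration later in this section (models, then a model key such as qwen3, then stream_options, then micro_batch). The helper name micro_batch_enabled is hypothetical.

```python
def micro_batch_enabled(model_config_entry):
    """Return True if any model block in one ModelConfig entry enables micro_batch.

    A missing models, stream_options, or micro_batch key counts as false,
    which matches the documented default.
    """
    models = model_config_entry.get("models", {})
    return any(
        bool(block.get("stream_options", {}).get("micro_batch", False))
        for block in models.values()
    )


entry = {"models": {"qwen3": {"stream_options": {"micro_batch": True}}}}
print(micro_batch_enabled(entry))   # entry with the flag set
print(micro_batch_enabled({}))      # entry without the flag
```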

Running Inference

  1. Open the config.json file of the Server.
    cd {MindIE installation directory}/latest/mindie-service/
    vi conf/config.json
  2. Set serving parameters. Add the micro_batch field, nested under stream_options, to the config.json file of the Server. For details about the fields, see Table 1. For details about the serving parameters, see Configuration Parameters (Service-Specific). The following is a parameter configuration example:
    "ModelDeployConfig" :
    {
        "maxSeqLen" : 2560,
        "maxInputTokenLen" : 2048,
        "truncation" : false,
        "ModelConfig" : [
            {
                "modelInstanceType" : "Standard",
                "modelName" : "Qwen3-14B",
                "modelWeightPath" : "/data/weights/Qwen3-14B",
                "worldSize" : 8,
                "cpuMemSize" : 5,
                "npuMemSize" : -1,
                "backendType" : "atb",
                "trustRemoteCode" : false,
                "models": {
                    "qwen3": {
                        "ccl": {
                            "enable_mc2": false
                        },
                        "stream_options": {
                            "micro_batch": true
                        }
                    }
                }
            }
        ]
    },
    
  3. Start the service.
    ./bin/mindieservice_daemon
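Step 2 can also be scripted rather than edited by hand. The following is a minimal sketch under stated assumptions: the helper names find_key and enable_micro_batch are hypothetical, the model key (qwen3 in the sample) must match the one your model uses, and because this section does not show where ModelDeployConfig sits in the full config.json, the script searches for it recursively.

```python
import json


def find_key(node, key):
    """Depth-first search for the first value stored under `key`."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def enable_micro_batch(config, model_key):
    """Set models.<model_key>.stream_options.micro_batch to true in place."""
    deploy = find_key(config, "ModelDeployConfig")
    if deploy is None:
        raise KeyError("ModelDeployConfig not found in configuration")
    for entry in deploy["ModelConfig"]:
        block = entry.setdefault("models", {}).setdefault(model_key, {})
        block.setdefault("stream_options", {})["micro_batch"] = True
    return config


# In-memory fragment shaped like the sample above (values are illustrative).
sample = {"ModelDeployConfig": {"ModelConfig": [{"modelName": "Qwen3-14B"}]}}
enable_micro_batch(sample, "qwen3")
print(json.dumps(sample, indent=2))
```

To apply this to a real file, load conf/config.json with json.load, call enable_micro_batch, and write the result back with json.dump. Back up the file first. The sketch does not touch other fields such as enable_mc2, so set those separately as needed.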