Asynchronous Scheduling

By default, the MindIE inference process is executed synchronously. An inference process can be divided into the following three phases:

  • Data preparation phase (executed on the CPU)
  • Model inference phase (executed on the NPU)
  • Data return phase (executed on the CPU)

Asynchronous scheduling uses the time spent in the model inference phase to hide the time spent in the data preparation and data return phases. That is, CPU-side operations, except for sampling-related overhead, run while the NPU is computing. The trade-off is that requests that already carry the EOS flag (inference termination) are processed again, which consumes NPU compute and device memory unnecessarily. This feature is best suited to scenarios with a large maxBatchSize and long input/output sequences.
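The effect of the overlap can be estimated with a simple timing model. The sketch below is an idealized illustration, not MindIE code: the function names and the sample durations are hypothetical. It assumes that the CPU work hidden behind one inference step is the return phase of the previous step plus the preparation of the next step, and it ignores sampling overhead and the extra processing of EOS requests.

```python
def sync_total(prep, infer, ret, steps):
    """Total time when the three phases of every step run back to back."""
    return steps * (prep + infer + ret)


def async_total(prep, infer, ret, steps):
    """Idealized total time when CPU phases overlap with NPU inference.

    Assumption (not taken from MindIE): while the NPU runs one step, the CPU
    returns the previous step's data and prepares the next step's data.
    Only the first preparation and the last return are fully exposed.
    """
    if steps == 0:
        return 0
    # Each overlapped step costs whichever side is slower.
    per_step = max(infer, prep + ret)
    return prep + infer + (steps - 1) * per_step + ret


# Hypothetical durations in milliseconds.
prep_ms, infer_ms, ret_ms, n_steps = 2, 10, 1, 100
print("synchronous :", sync_total(prep_ms, infer_ms, ret_ms, n_steps), "ms")
print("asynchronous:", async_total(prep_ms, infer_ms, ret_ms, n_steps), "ms")
```

In this model the saving grows with the number of steps, and CPU work is fully hidden only while it is shorter than the NPU work of one step, which is consistent with the feature being aimed at large batches and long sequences.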

Constraints

  • This feature is supported in the prefill-decode hybrid deployment and prefill-decode disaggregation scenarios.
  • This feature cannot be used with Look Ahead or Memory Decoding.
  • This feature does not support postprocessing parameters related to multi-sequence inference, such as n, best_of, and use_beam_search.
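A client can screen requests for the unsupported multi-sequence parameters before sending them. The helper below is a hypothetical client-side sketch, not part of MindIE. It assumes that a request is a plain dictionary and that n=1, best_of=1, and use_beam_search=False (or an absent value) mean the parameter is not in use.

```python
# Parameters named in the constraint above, mapped to the value that is
# assumed here to mean "not used".
UNSUPPORTED_DEFAULTS = {"n": 1, "best_of": 1, "use_beam_search": False}


def multi_sequence_params(request):
    """Return the multi-sequence parameters that the request actually uses."""
    return [
        name
        for name, default in UNSUPPORTED_DEFAULTS.items()
        if request.get(name) not in (None, default)
    ]


request = {"prompt": "hello", "n": 2, "use_beam_search": True}
offending = multi_sequence_params(request)
if offending:
    print("Not supported with asynchronous scheduling:", ", ".join(offending))
```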

Running Inference

  1. Set the following environment variable to enable asynchronous scheduling.
    export MINDIE_ASYNC_SCHEDULING_ENABLE=1

    In the prefill-decode disaggregation scenario, set this environment variable only on the decode node.

  2. Open the config.json file of the Server.
    cd {MindIE installation directory}/latest/mindie-service/
    vi conf/config.json
  3. Set serving parameters. For details about the parameters, see Configuration Parameters (Service-Specific).
  4. Start the service.
    ./bin/mindieservice_daemon
  5. Use the AISBench tool to start tuning. For details about the AISBench tool, see "Auxiliary Tools" > "Performance/Accuracy Test Tool" in the MindIE Motor Development Guide.
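When a launch script covers several node types, the rule from step 1 can be encoded in one place. The following is a minimal sketch under stated assumptions: the function launch_env and the role labels "hybrid", "prefill", and "decode" are hypothetical names for this example. Only the environment variable MINDIE_ASYNC_SCHEDULING_ENABLE comes from this document.

```python
import os

ASYNC_FLAG = "MINDIE_ASYNC_SCHEDULING_ENABLE"
ROLES = ("hybrid", "prefill", "decode")  # hypothetical labels for this sketch


def launch_env(role, base_env=None):
    """Return a copy of the environment adjusted for the given node role.

    The flag is set for hybrid deployment and for decode nodes. For prefill
    nodes in the disaggregation scenario it is removed, so that an inherited
    value does not enable the feature there.
    """
    if role not in ROLES:
        raise ValueError(f"unknown role: {role!r}")
    env = dict(os.environ if base_env is None else base_env)
    if role in ("hybrid", "decode"):
        env[ASYNC_FLAG] = "1"
    else:
        env.pop(ASYNC_FLAG, None)
    return env


for role in ROLES:
    print(role, "->", launch_env(role, base_env={}).get(ASYNC_FLAG, "unset"))
```

The returned dictionary can be passed as the env argument of subprocess.Popen when starting ./bin/mindieservice_daemon from a script.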