GUI Description

Function

During serving tuning, MindStudio Insight displays the end-to-end execution of each request in the timeline view, showing how long the request spends in each key phase and the status of the request. By analyzing the timeline, you can quickly identify service performance bottlenecks and adjust the tuning policy based on the observed symptoms.

GUI Display

The Timeline tab page consists of the toolbar (area 1), graphical display (area 2), and data pane (area 3), as shown in Figure 1.
Figure 1 Timeline tab page
  • Area 1: toolbar, which contains common shortcut buttons. From left to right, the buttons are Marker List, Filter (by card or unit), Search, Flow Events, Reset (page restoration), Timeline Zoom Out, and Timeline Zoom In.
  • Area 2: graphical display. The profile data collected from the service is listed on the left in two levels: the first level is the process, and the second level is the key phase information of the request. Table 1 describes the units. The timeline view is displayed on the right, one row per unit, showing the execution sequence and duration of each key phase.
    Table 1 Unit information

    Unit | Description
    CPU Usage | Average CPU usage. This unit is displayed only when the host_system_usage_freq data collection function is enabled.
    Memory Usage | System memory usage on the host. This unit is displayed only when the host_system_usage_freq data collection function is enabled.
    NPU Usage | NPU memory usage. This unit is displayed only when the npu_memory_usage_freq data collection function is enabled.
    KVCache | Amount of KV cache remaining over time.
    BatchSchedule | Batch grouping time, in nanoseconds.
    WAITING | Time during which a request is in the WAITING state.
    PENDING | Time during which a request is in the PENDING state.
    RUNNING | Time during which a request is in the RUNNING state.
    RUNNING2 | Time during which a request is in the RUNNING2 state.
    SWAPPED | Time during which a batch is in the SWAPPED state.
    RECOMPUTE | Time during which a request is in the RECOMPUTE state.
    SUSPENDED | Time during which a batch is in the SUSPENDED state.
    END | Time during which a request is in the END state.
    END_PRE | Time during which a request is in the END_PRE state.
    STOP | Time during which a batch is in the STOP state.
    PREFILL_HOLD | Time during which a batch is in the PREFILL_HOLD state.
    http | HTTP request lifetime data, covering the receipt, encoding, and decoding of the request.
    batchFrameworkProcessing | Batch data, including the batch creation time, current batch type (prefill or decode), request RID, and steps.
    preprocessBatch | Time consumed for injecting parameters into batches during IBIS data distribution, in nanoseconds.
    SerializeExecuteMessage | Time consumed for serialization during IBIS data distribution, in nanoseconds.
    setInferBuffer | Time consumed for setting the buffer during IBIS data distribution, in nanoseconds.
    grpcWriteToSlave | Time consumed for gRPC writes during IBIS data distribution, in nanoseconds.
    deserializeExecuteRequestsForInfer | Time consumed for deserialization during IBIS data distribution, in nanoseconds.
    convertTensorBatchToBackend | Time consumed for request conversion during IBIS data distribution, in nanoseconds.
    getInputMetadata | Time consumed for obtaining metadata during IBIS data distribution, in nanoseconds.
    beforemodelExec | Processing time before model execution, in nanoseconds.
    modelExec | Model execution data, including the execution time (in nanoseconds), current batch type (prefill or decode), request RID, and steps.
    instanceExecute | Model instance execution time, in nanoseconds.
    Queue | Time when the request is enqueued.
    PDcommunication | PD disaggregation communication time, in nanoseconds. This unit exists only in the PD disaggregation scenario.
    forward | Forward propagation time of model inference, in nanoseconds.
    operatorExecute | Python-side model API execution time, in nanoseconds.
    processPythonExecResult | Time consumed for response conversion, serialization, and writing to the shared memory during data receiving, in nanoseconds.
    deserializeExecuteResponse | Time consumed for deserialization during data receiving, in nanoseconds.
    saveoutAndContinueBatching | Time consumed for parsing responses into outputs during data receiving, in nanoseconds.
    continueBatching | Time consumed for enqueuing requests during data receiving, in nanoseconds.
    sendExecuteMessage | Time consumed for sending execution information, in nanoseconds.
    postprocess | Postprocessing time of model inference, in nanoseconds.
    preprocess | Preprocessing time of model inference, in nanoseconds.
    processBroadcastMessage | Time consumed for broadcasting communication information, in nanoseconds.
    sample | Sampling time, in nanoseconds.
    PullKVCache | KV cache transfer time between PD nodes, in nanoseconds. This unit exists only in the PD disaggregation scenario.
    CANN | Operator execution time, in nanoseconds. This unit is displayed only when the acl_task_time data collection function is enabled.
    dpBatch | DP domain information corresponding to each request during model inference.
    RequestState | Request status changes during model inference.

  • Area 3: data pane, which displays statistics or instruction details. If you select Slice Detail, the details of a single key phase are displayed. If you select Slice List, the list of key phases in the selected area of the unit is displayed.
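
Several units in Table 1 appear only under stated conditions, so a unit missing from area 2 does not necessarily indicate a collection failure. The following minimal sketch encodes those conditions for quick lookup. The switch and unit names are taken from Table 1. The function name, the dictionary layout, and the way enabled switches are passed in are hypothetical and are not part of MindStudio Insight.

```python
# Illustrative helper only; not a MindStudio Insight API.
# Unit and switch names come from Table 1, everything else is an assumption.

# Units displayed only when a data collection function is enabled.
SWITCH_REQUIRED = {
    "CPU Usage": "host_system_usage_freq",
    "Memory Usage": "host_system_usage_freq",
    "NPU Usage": "npu_memory_usage_freq",
    "CANN": "acl_task_time",
}

# Units that exist only in the PD disaggregation scenario.
PD_ONLY = {"PDcommunication", "PullKVCache"}


def why_unit_hidden(unit, enabled_switches, pd_disaggregation):
    """Return the Table 1 condition that hides `unit`, or None if no condition applies."""
    switch = SWITCH_REQUIRED.get(unit)
    if switch is not None and switch not in enabled_switches:
        return f"{switch} data collection function is not enabled"
    if unit in PD_ONLY and not pd_disaggregation:
        return "unit exists only in the PD disaggregation scenario"
    return None


if __name__ == "__main__":
    enabled = {"host_system_usage_freq"}  # hypothetical collection setup
    for name in ("CPU Usage", "NPU Usage", "CANN", "PullKVCache", "modelExec"):
        print(name, "->", why_unit_hidden(name, enabled, pd_disaggregation=False))
```

A return value of None means only that Table 1 states no display condition that would hide the unit under the given setup.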

You can check the duration and interval at each level in the timeline view to determine whether performance problems exist in the corresponding key phase.
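
The same duration and interval check can be approximated offline. The sketch below assumes the slices of interest have already been copied out as (unit, start_ns, duration_ns) tuples. This layout and the function name are hypothetical and do not represent an export format of MindStudio Insight. For each unit it reports the slice count, the total and longest duration, and the largest idle interval between consecutive slices, converted from nanoseconds to milliseconds.

```python
from collections import defaultdict


def summarize_slices(slices):
    """Summarize per-unit durations and gaps.

    `slices` is an iterable of (unit, start_ns, duration_ns) tuples.
    This tuple layout is an assumption made for illustration.
    """
    by_unit = defaultdict(list)
    for unit, start_ns, duration_ns in slices:
        by_unit[unit].append((start_ns, duration_ns))

    summary = {}
    for unit, items in by_unit.items():
        items.sort()  # order slices by start time
        durations = [d for _, d in items]
        max_gap_ns = 0
        # Gap = start of the next slice minus end of the current slice.
        for (start, dur), (next_start, _) in zip(items, items[1:]):
            max_gap_ns = max(max_gap_ns, next_start - (start + dur))
        summary[unit] = {
            "count": len(items),
            "total_ms": sum(durations) / 1e6,
            "longest_ms": max(durations) / 1e6,
            "max_gap_ms": max_gap_ns / 1e6,
        }
    return summary


if __name__ == "__main__":
    # Made-up sample values, in nanoseconds.
    sample_slices = [
        ("modelExec", 0, 2_000_000),
        ("modelExec", 5_000_000, 1_000_000),
        ("sample", 2_000_000, 500_000),
    ]
    for unit, stats in summarize_slices(sample_slices).items():
        print(unit, stats)
```

A unit with a long longest_ms points to a slow key phase, whereas a large max_gap_ms suggests idle time between phases that may be worth inspecting in the timeline.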