Feature Description

A multimodal understanding model is a deep learning model, built on a large language model (LLM), that can process and understand multiple data types. Currently, such models primarily handle text, image, video, and audio data. They extract and integrate features from these data types and use a base LLM to understand the input and generate corresponding content.

  • Advantages: improved cross-modal information extraction and more accurate understanding of multimodal data.
  • Application scenarios: image Q&A, sentiment analysis, natural language dialogue, video analysis, and autonomous driving.

Multimodal models introduce new challenges: larger data volumes, the need to align data representations across modalities, and higher computing resource requirements. In return, a multimodal model can draw on data from at least two modalities, such as text, image, audio, or video. It fuses the features extracted from the multimodal input, which enables more comprehensive and accurate understanding and inference.
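To make the idea of feature fusion concrete, the following is a minimal sketch of one common approach: image features are linearly projected into the text embedding space and then placed in front of the text token embeddings, so that a base LLM can consume a single sequence. This is an illustration only and does not describe how any particular supported model is implemented. The function names `project` and `fuse`, the toy dimensions, and the weight values are all assumptions made for this example.

```python
from typing import List

Vector = List[float]


def project(features: List[Vector], weights: List[Vector]) -> List[Vector]:
    """Linearly map each feature vector into the text embedding space.

    `weights` is a hypothetical projection matrix of shape (in_dim, out_dim).
    """
    out_dim = len(weights[0])
    return [
        [sum(vec[i] * weights[i][j] for i in range(len(vec))) for j in range(out_dim)]
        for vec in features
    ]


def fuse(text_embeddings: List[Vector],
         image_features: List[Vector],
         weights: List[Vector]) -> List[Vector]:
    """Place the projected image tokens in front of the text tokens."""
    return project(image_features, weights) + text_embeddings


# Toy data (assumed values): 2 text tokens of dimension 2,
# 1 image patch of dimension 3, and a 3 -> 2 projection matrix.
text_embeddings = [[0.1, 0.2], [0.3, 0.4]]
image_features = [[1.0, 0.0, 2.0]]
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]

fused = fuse(text_embeddings, image_features, weights)
print(len(fused))  # number of tokens in the fused sequence
```

In this sketch the fused sequence contains one projected image token followed by the two text tokens, all with the same embedding dimension. Real models typically use learned projection layers and much larger dimensions.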

Only multimodal understanding models capable of processing multimodal inputs and producing text outputs are supported. For details about multimodal generative models, see the MindIE SD Development Guide.

The inference method of a multimodal understanding model differs slightly from that of a text-only LLM. For inference instructions, refer to the README file of the corresponding model in the model repository. The README path is as follows:

${llm_path}/examples/models/{model}/README.md
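As a convenience, the path pattern above can be resolved programmatically. The following is a minimal sketch, assuming that the repository root is available in an environment variable named `llm_path` and that the model directory name is known. The helper name `model_readme_path`, the fallback value `/path/to/llm`, and the model name `example_model` are placeholders for illustration, not real values.

```python
import os
from pathlib import Path


def model_readme_path(llm_path: str, model: str) -> Path:
    """Build ${llm_path}/examples/models/{model}/README.md for a given model."""
    return Path(llm_path) / "examples" / "models" / model / "README.md"


# Assumption: the repository root is exported as the environment variable
# `llm_path`; the fallback below is a placeholder, not a real install path.
llm_path = os.environ.get("llm_path", "/path/to/llm")

# "example_model" is a placeholder; substitute the directory name of the
# model you intend to run.
readme = model_readme_path(llm_path, "example_model")
print(readme)
if not readme.is_file():
    print("README not found; check llm_path and the model directory name.")
```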

Constraints

  • This feature is supported by the Atlas 800I A2 inference server and Atlas 300I Duo inference card.
  • For details about the model feature matrix and related documents, see "List of Multimodal Understanding Models" in "List of Supported Models".