Class Introduction

Function

Inherits from langchain_core.document_loaders.base.BaseLoader and mx_rag.document.loader.BaseLoader, and parses .docx files. The legacy .doc format is not supported. Text content is always parsed. If a vision model is supplied, images in the document can also be recognized; layout recognition is not supported.
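
Because only .docx is accepted, it can help to tell the two formats apart before a file is handed to the loader. The following is a minimal sketch using only the Python standard library; the helper name sniff_word_format is hypothetical and is not part of mx_rag. It assumes that a .docx file is a ZIP package whose main body is stored at word/document.xml (the usual layout), and that a legacy .doc file starts with the OLE2 compound-file signature.

```python
import zipfile

# Signature at the start of legacy OLE2 compound files such as .doc.
OLE_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")


def sniff_word_format(path):
    """Return 'docx', 'doc', or 'unknown' based on file contents, not the extension.

    Hypothetical helper for illustration; not part of mx_rag.
    """
    with open(path, "rb") as fh:
        head = fh.read(8)
    if head == OLE_MAGIC:
        return "doc"
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            # Assumption: the main document part lives at word/document.xml.
            if "word/document.xml" in zf.namelist():
                return "docx"
    return "unknown"
```

A caller could skip or convert any file for which this returns "doc" rather than passing it to DocxLoader.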

Prototype

from mx_rag.document.loader import DocxLoader
DocxLoader(file_path, vlm)

Parameters

Parameter | Data Type | Required/Optional | Description
file_path | String | Required | Path to the .docx file. The path length must be in the range [1, 1024]. The path cannot be a symbolic link and cannot contain two consecutive dots (..). A single document can contain at most 500000 words and can be at most 100 MB in size.
vlm | Img2TextLLM | Optional | Vision model object used to parse image information in the document. For details, see Img2TextLLM.
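
The file_path constraints above can be checked up front, before a loader is constructed. The following is a minimal sketch under stated assumptions: the helper name check_docx_path is hypothetical and not part of mx_rag, "100 MB" is taken to mean 100 * 1024 * 1024 bytes, and the word-count limit is not checked because it requires parsing the document. How DocxLoader itself enforces these limits is not shown here.

```python
import os

MAX_PATH_LEN = 1024
# Assumption: "100 MB" is interpreted as 100 * 1024 * 1024 bytes.
MAX_SIZE_BYTES = 100 * 1024 * 1024


def check_docx_path(file_path):
    """Return a list of constraint violations; an empty list means none were found.

    Hypothetical helper for illustration; not part of mx_rag.
    """
    problems = []
    if not 1 <= len(file_path) <= MAX_PATH_LEN:
        problems.append("path length must be in [1, 1024]")
    if ".." in file_path:
        problems.append("path must not contain '..'")
    if os.path.islink(file_path):
        problems.append("path must not be a symbolic link")
    # Only check the size when the file actually exists.
    if os.path.isfile(file_path) and os.path.getsize(file_path) > MAX_SIZE_BYTES:
        problems.append("file must be at most 100 MB")
    return problems
```

Returning a list of problems rather than raising on the first one lets a caller report every violation for a path in a single pass.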

Example

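from mx_rag.document.loader import DocxLoader
from mx_rag.llm import Img2TextLLM, LLMParameterConfig
from mx_rag.utils import ClientParam

# Vision model used to describe images found in the document.
# Replace {ip} and {port} with the address of the model service.
vlm = Img2TextLLM(
    base_url="https://{ip}:{port}/openai/v1/chat/completions",
    model_name="Qwen2.5-VL-7B-Instruct",
    llm_config=LLMParameterConfig(max_tokens=512),
    client_param=ClientParam(ca_file="/path/to/ca.crt"),
)

# Pass the vision model so that images are parsed in addition to text.
loader = DocxLoader("/path/to/document.docx", vlm=vlm)

# lazy_load() returns an iterator; list() consumes it to collect all documents.
docs = loader.lazy_load()
print(list(docs))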