Class Introduction
Function
Inherits langchain_core.document_loaders.base.BaseLoader and mx_rag.document.loader.BaseLoader to parse .docx files. The .doc format is not supported. Text content is always parsed. If a vision model is provided, images in the document can also be recognized. Layout recognition is not supported.
Prototype
from mx_rag.document.loader import DocxLoader
DocxLoader(file_path, vlm)
Parameters
| Parameter | Data Type | Required/Optional | Description |
|---|---|---|---|
| file_path | String | Required | Path to the .docx file. The path length must be in the range [1, 1024]. The path cannot be a soft link and cannot contain two consecutive dots (..). A single document can contain at most 500000 words and can be at most 100 MB in size. |
| vlm | Img2TextLLM | Optional | Large vision model object used to parse image information in the document. For details, see Img2TextLLM. |
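The file_path constraints in the table can be checked before the loader is constructed, so that a bad path fails early with a clear message. The following is a minimal sketch using only the Python standard library. The helper name check_docx_path, the error messages, the case-insensitive extension check, and the reading of 100 MB as 100 * 1024 * 1024 bytes are assumptions for illustration and are not part of mx_rag. The word-count limit is not checked here because it requires parsing the document.

```python
import os

MAX_PATH_LENGTH = 1024               # documented path length range is [1, 1024]
MAX_FILE_SIZE = 100 * 1024 * 1024    # assumption: 1 MB = 1024 * 1024 bytes


def check_docx_path(file_path: str) -> None:
    """Hypothetical pre-check: raise ValueError if a documented limit is broken."""
    if not 1 <= len(file_path) <= MAX_PATH_LENGTH:
        raise ValueError("path length must be in the range [1, 1024]")
    if ".." in file_path:
        raise ValueError("path must not contain two consecutive dots (..)")
    # Assumption: the extension is compared case-insensitively.
    if not file_path.lower().endswith(".docx"):
        raise ValueError("only .docx files are supported")
    # islink() is tested before isfile() because isfile() follows links.
    # Only the final path component is inspected here.
    if os.path.islink(file_path):
        raise ValueError("path must not be a soft link")
    if not os.path.isfile(file_path):
        raise ValueError("path must point to an existing file")
    if os.path.getsize(file_path) > MAX_FILE_SIZE:
        raise ValueError("file must be at most 100 MB")
```

Calling check_docx_path("/path/to/document.docx") before DocxLoader(...) returns None when the path passes every check and raises ValueError otherwise. DocxLoader may apply its own validation in addition to this.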
Example
from mx_rag.document.loader import DocxLoader
from mx_rag.llm import Img2TextLLM, LLMParameterConfig
from mx_rag.utils import ClientParam

# Vision model used to recognize images embedded in the document.
# Replace {ip} and {port} with the address of the model service.
vlm = Img2TextLLM(
    base_url="https://{ip}:{port}/openai/v1/chat/completions",
    model_name="Qwen2.5-VL-7B-Instruct",
    llm_config=LLMParameterConfig(max_tokens=512),
    client_param=ClientParam(ca_file="/path/to/ca.crt"),
)

# lazy_load() returns an iterator of documents; list() consumes it.
loader = DocxLoader("/path/to/document.docx", vlm=vlm)
docs = loader.lazy_load()
print(list(docs))
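Because DocxLoader inherits the langchain_core BaseLoader, lazy_load() yields Document objects that expose page_content and metadata. The sketch below consumes them one at a time instead of building a full list, which can help with large files. The helper name iter_text_chunks is hypothetical, and skipping whitespace-only documents is an illustrative choice rather than documented loader behavior. How many Document objects the loader produces per file is not stated here, so the helper makes no assumption about it.

```python
from typing import Any, Iterator, Tuple


def iter_text_chunks(loader: Any) -> Iterator[Tuple[str, dict]]:
    """Hypothetical helper: yield (text, metadata) for each non-empty document."""
    for doc in loader.lazy_load():
        text = doc.page_content.strip()
        if text:  # skip documents that contain only whitespace
            yield text, dict(doc.metadata)
```

Since vlm is optional, a text-only run could pass DocxLoader("/path/to/document.docx") to this helper and loop over the yielded pairs, for example printing len(text) for each one.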