Class Introduction

Function

Inherits langchain_core.document_loaders.base.BaseLoader and mx_rag.document.loader.BaseLoader. This class parses PDF documents and supports layout parsing. Layout parsing relies on an OCR model that is downloaded from the Internet, so ensure that the network connection is available. OCR is provided by the third-party PaddleOCR, and recognition accuracy depends on PaddleOCR itself. To support image recognition, pass a large vision model object.

Prototype

from mx_rag.document.loader import PdfLoader
PdfLoader(file_path, vlm, lang, enable_ocr)

Parameters

Parameter	Data Type	Required/Optional	Description
file_path	String	Required	Path of a PDF file. The path length must be in the range [1, 1024]. The path cannot be a soft link and cannot contain two consecutive dots (..). The document can contain at most 1,000 pages and its size cannot exceed 100 MB.
vlm	Img2TextLLM	Optional	Large vision model object used to parse image information in the document. For details, see Img2TextLLM.
lang	Lang	Optional	Language type of a PDF document (see Lang). The default value is Lang.CH (Chinese).
enable_ocr	Bool	Optional	Whether to enable OCR. If the value is True, OCR is used to parse images and tables. The default value is False, meaning that images are not parsed. The width and height of each image in the PDF file cannot exceed 2048 pixels. NOTE: When enable_ocr is set to True, PaddleOCR downloads files from the Internet. This API loads models with the pickle module, so deserializing a maliciously constructed file may pose a security risk. Ensure that the loaded model files come from trusted sources.
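
The file_path limits listed above can be checked before a loader is constructed. The following is a minimal sketch, not part of mx_rag: the helper name check_pdf_path and its error messages are hypothetical, and it assumes that 100 MB means 100 * 1024 * 1024 bytes. It does not check the 1,000-page limit, which would require opening the PDF.

```python
import os

MAX_PATH_LEN = 1024
MAX_FILE_BYTES = 100 * 1024 * 1024  # assumption: "100 MB" is read as 100 MiB


def check_pdf_path(file_path: str) -> None:
    """Hypothetical pre-check: raise ValueError if a documented limit is broken."""
    # Path length must be in the range [1, 1024].
    if not 1 <= len(file_path) <= MAX_PATH_LEN:
        raise ValueError("path length must be in [1, 1024]")
    # Two consecutive dots are not allowed anywhere in the path.
    if ".." in file_path:
        raise ValueError("path must not contain '..'")
    # Soft links are rejected.
    if os.path.islink(file_path):
        raise ValueError("path must not be a soft link")
    if not os.path.isfile(file_path):
        raise ValueError("path must point to an existing file")
    # Document size must not exceed the documented maximum.
    if os.path.getsize(file_path) > MAX_FILE_BYTES:
        raise ValueError("file must be at most 100 MB")
```

Calling check_pdf_path("test.pdf") before PdfLoader("test.pdf") reports a violated limit early. PdfLoader may perform its own validation and report errors differently.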

Example

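from mx_rag.document.loader import PdfLoader

# Only file_path is required; vlm, lang, and enable_ocr keep their defaults.
loader = PdfLoader("test.pdf")
# lazy_load() returns an iterator; list() materializes all parsed documents.
docs = loader.lazy_load()
print(list(docs))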
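
Because lazy_load() returns an iterator, the documents can also be consumed one at a time instead of being collected with list(). The following is a minimal sketch under stated assumptions: the helper non_empty_texts is hypothetical and not part of mx_rag, and it relies only on the page_content attribute of the LangChain Document objects that a BaseLoader yields.

```python
from typing import Any, Iterable, Iterator


def non_empty_texts(docs: Iterable[Any]) -> Iterator[str]:
    """Hypothetical helper: yield stripped text, skipping empty documents."""
    for doc in docs:
        # page_content holds the text of a LangChain Document.
        text = (doc.page_content or "").strip()
        if text:
            yield text


# Usage with the loader from the example above (requires mx_rag):
#   for text in non_empty_texts(PdfLoader("test.pdf").lazy_load()):
#       print(text[:80])
```

This keeps only one document in memory at a time, which can matter for files near the documented size limit.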
