Class Introduction

Function

PdfLoader inherits from langchain_core.document_loaders.base.BaseLoader and mx_rag.document.loader.BaseLoader. It parses PDF documents and can optionally perform layout parsing. Layout parsing requires an OCR model that is downloaded from the Internet, so make sure the network connection is working. OCR is performed by the third-party PaddleOCR, and recognition accuracy is determined by PaddleOCR itself. To support image recognition, pass a vision large model object.

Prototype

from mx_rag.document.loader import PdfLoader
PdfLoader(file_path, vlm, lang, enable_ocr)

Parameters

| Parameter | Data Type | Required/Optional | Description |
|---|---|---|---|
| file_path | String | Required | Path to the PDF file. The path length must be in the range [1, 1024]. The path cannot be a soft link and cannot contain two consecutive dots (..). The document must have at most 1,000 pages and be no larger than 100 MB. |
| vlm | Img2TextLLM | Optional | Vision large model object used to parse image information in the document. For details, see Img2TextLLM. |
| lang | Lang | Optional | Language of the PDF document (see Lang). The default value is Lang.CH (Chinese). |
| enable_ocr | Bool | Optional | Whether to enable OCR. If the value is True, OCR is used to parse images and table information. The default value is False, meaning that images are not parsed. The height and width of each image in the PDF file cannot exceed 2048 pixels. |
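The file_path constraints listed above can be checked before the loader is constructed, so that an invalid path fails early with a clear message. The following is a minimal sketch using only the Python standard library. The helper name check_pdf_path and the constants are hypothetical and are not part of mx_rag, and the sketch assumes that 1 MB means 1024 * 1024 bytes. The page-count limit is not checked because that would need a PDF parser.

```python
import os

MAX_PATH_LEN = 1024                  # upper bound of the documented range [1, 1024]
MAX_FILE_SIZE = 100 * 1024 * 1024    # 100 MB, assuming 1 MB = 1024 * 1024 bytes


def check_pdf_path(file_path: str) -> None:
    """Raise ValueError if file_path breaks a documented constraint.

    Hypothetical helper, not part of mx_rag. The limit of 1,000 pages
    is not verified here.
    """
    if not 1 <= len(file_path) <= MAX_PATH_LEN:
        raise ValueError("path length must be in the range [1, 1024]")
    if ".." in file_path:
        raise ValueError("path must not contain two consecutive dots (..)")
    if os.path.islink(file_path):
        raise ValueError("path must not be a soft link")
    if not os.path.isfile(file_path):
        raise ValueError("path does not point to an existing file")
    if os.path.getsize(file_path) > MAX_FILE_SIZE:
        raise ValueError("file must not be larger than 100 MB")
```

Calling check_pdf_path("test.pdf") returns None when every check passes. PdfLoader may apply its own validation as well, so this sketch is only a convenience for reporting problems earlier.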

NOTE:

When enable_ocr is set to True, PaddleOCR downloads model files from the Internet. This API uses the pickle module to load models, so deserializing a maliciously crafted file is a security risk. Make sure that the loaded model files come from trusted sources.
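One way to check that a model file comes from a trusted source is to compare its SHA-256 digest with a digest published by that source before the file is loaded. The sketch below uses only the Python standard library. The function names are hypothetical and are not part of mx_rag or PaddleOCR, and both the model file location and the expected digest are assumed to be known to the caller.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_trusted_model(path: str, expected_sha256: str) -> bool:
    """Return True if the file digest equals the digest from a trusted source."""
    return hmac.compare_digest(sha256_of_file(path), expected_sha256.lower())
```

A matching digest only shows that the file is the one the source published. It does not make an untrusted source safe.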

Example

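from mx_rag.document.loader import PdfLoader

# Load test.pdf with the default settings (lang=Lang.CH, enable_ocr=False, no vlm).
loader = PdfLoader("test.pdf")
# lazy_load() returns an iterator, so convert it to a list to print every parsed document.
docs = loader.lazy_load()
print(list(docs))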
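The example above converts the whole iterator to a list. For a large PDF it can be useful to inspect only the first few results. The sketch below assumes that lazy_load() yields documents one at a time, as the langchain_core BaseLoader interface does. The helper first_documents is hypothetical and works with any object that has a lazy_load() method.

```python
from itertools import islice
from typing import Any, List


def first_documents(loader: Any, limit: int = 3) -> List[Any]:
    """Return at most `limit` items yielded by loader.lazy_load().

    islice stops pulling from the iterator once `limit` items are taken,
    so the rest of the file is not processed.
    """
    return list(islice(loader.lazy_load(), limit))
```

With the loader from the example, first_documents(loader, limit=2) would return at most two documents.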