Class Introduction
Function
Inherits from langchain_core.document_loaders.base.BaseLoader and mx_rag.document.loader.BaseLoader to parse PDF documents, with support for layout parsing. Layout parsing requires downloading an OCR model from the Internet, so make sure a network connection is available. The OCR model comes from the third-party PaddleOCR project, and recognition accuracy depends on PaddleOCR itself. To parse images in a document, pass in a large vision model object.
Prototype
from mx_rag.document.loader import PdfLoader
PdfLoader(file_path, vlm, lang, enable_ocr)
Parameters
| Parameter | Data Type | Required/Optional | Description |
|---|---|---|---|
| file_path | String | Required | Path of a PDF file. The path length range is [1,1024]. The path cannot be a soft link and cannot contain two consecutive dots (..). The number of pages in a document is less than or equal to 1,000, and the document size is less than or equal to 100 MB. |
| vlm | Img2TextLLM | Optional | Large vision model's object, which can parse image information in documents. For details, see Img2TextLLM. |
| lang | Lang | Optional | Language type of a PDF document (see Lang). The default value is Lang.CH (Chinese). |
| enable_ocr | Bool | Optional | Whether to enable OCR. If the value is True, OCR is used to parse images and table information. The default value is False, indicating that images are not parsed. The length and width of the image in a PDF file cannot exceed 2048 pixels. NOTE: When enable_ocr is set to True, PaddleOCR downloads files from the Internet. This API uses the pickle module to load models, which may bring security risks during deserialization of maliciously constructed files. Ensure that the loaded model files are from trusted sources. |
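The file_path constraints listed above can be checked before a loader is constructed. The following is only a minimal sketch using the Python standard library: check_pdf_path is a hypothetical helper, not part of mx_rag, and it does not reproduce the library's own validation. It assumes that 100 MB means 100 * 1024 * 1024 bytes, and it does not check the 1,000-page limit because that would require a PDF parser.

```python
import os

MAX_PATH_LEN = 1024
# Assumption: 1 MB is taken as 1024 * 1024 bytes.
MAX_SIZE_BYTES = 100 * 1024 * 1024


def check_pdf_path(file_path: str) -> None:
    """Hypothetical pre-check mirroring the documented file_path constraints."""
    # Path length must be in the range [1, 1024].
    if not 1 <= len(file_path) <= MAX_PATH_LEN:
        raise ValueError("path length must be in [1, 1024]")
    # Two consecutive dots are not allowed anywhere in the path.
    if ".." in file_path:
        raise ValueError("path must not contain '..'")
    # The path itself must not be a soft (symbolic) link.
    if os.path.islink(file_path):
        raise ValueError("path must not be a soft link")
    if not os.path.isfile(file_path):
        raise FileNotFoundError(file_path)
    # The document size must not exceed 100 MB.
    if os.path.getsize(file_path) > MAX_SIZE_BYTES:
        raise ValueError("file must not exceed 100 MB")
```

Calling check_pdf_path("test.pdf") returns None when every check passes and raises an exception otherwise, so it can be placed directly before the PdfLoader call.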
Example
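from mx_rag.document.loader import PdfLoader
# Create a loader with default settings (no vision model, OCR disabled).
loader = PdfLoader("test.pdf")
# lazy_load() returns an iterator; list() consumes it to collect all documents.
docs = loader.lazy_load()
print(list(docs))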
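Because lazy_load() returns an iterator, the results do not have to be materialized all at once with list(). The sketch below shows one way to take only the first few items using itertools.islice. The names take and fake_lazy_load are hypothetical: fake_lazy_load is a stand-in generator that yields placeholder strings so the snippet runs without mx_rag installed, and it is not the loader's real output.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def take(items: Iterable[T], n: int) -> List[T]:
    """Return at most the first n items without consuming the rest."""
    return list(islice(items, n))


def fake_lazy_load() -> Iterator[str]:
    # Stand-in for loader.lazy_load(); yields placeholder strings, not real documents.
    for i in range(5):
        yield f"document {i}"


first_two = take(fake_lazy_load(), 2)
print(first_two)
```

With a real loader, the same helper would be called as take(loader.lazy_load(), 2), assuming the iterator behaves like the one inherited from the LangChain base loader.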