Class Introduction
Description
Inherits from the langchain_core.document_loaders.base.BaseLoader and mx_rag.document.loader.BaseLoader classes to parse .pptx documents. Text in images and in tables (tables of at most 100 × 50) in the documents can be parsed. Image parsing is performed by a large vision model. During PowerPointLoader initialization with OCR enabled, an OCR model needs to be downloaded from the Internet, so ensure that the network connection is available. OCR relies on the third-party PaddleOCR, and recognition accuracy is determined by PaddleOCR itself.
Prototype
from mx_rag.document.loader import PowerPointLoader
PowerPointLoader(file_path, vlm, lang, enable_ocr)
# Enumerated value
from mx_rag.utils.common import Lang
class Lang(Enum):
    EN: str = 'en'
    CH: str = 'ch'
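The enumeration above can be illustrated with a standalone sketch. It redeclares a minimal mirror of Lang so that it runs without mx_rag installed. The helper parse_lang is hypothetical and not part of the mx_rag API. It only shows one way a language code from a configuration file might be mapped to an enum member before being passed as the lang argument.

```python
from enum import Enum


class Lang(Enum):
    """Standalone mirror of mx_rag.utils.common.Lang, for illustration only."""
    EN = "en"
    CH = "ch"


def parse_lang(code: str) -> Lang:
    """Hypothetical helper: map a code such as 'en' or 'CH' to a Lang member."""
    try:
        # Enum lookup by value; normalize whitespace and case first.
        return Lang(code.strip().lower())
    except ValueError:
        raise ValueError(f"unsupported OCR language: {code!r}") from None


if __name__ == "__main__":
    print(parse_lang("en"))   # Lang.EN
    print(Lang.CH.value)      # ch
```

In real code, the member would be imported from mx_rag.utils.common rather than redeclared.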
Parameters
| Parameter | Data Type | Required/Optional | Description |
|---|---|---|---|
| file_path | String | Required | Path of a .pptx file. The path length range is [1,1024]. The path cannot be a soft link and cannot contain two consecutive dots (..). The document size cannot exceed 100 MB. |
| vlm | Img2TextLLM | Optional | Large vision model object used to parse image information in documents. For details, see Img2TextLLM. |
| lang | Lang | Optional | Language of the text to be recognized in images during OCR. Currently, Chinese and English (Lang.CH/Lang.EN) are supported. The default value is Lang.CH. |
| enable_ocr | Bool | Optional | Whether to enable OCR to parse images. The default value is False, indicating that images are not parsed. Images whose resolution exceeds 4096 × 4096 pixels or whose height and width are smaller than 256 pixels cannot be parsed. NOTE: When enable_ocr is set to True, PaddleOCR downloads files from the Internet. This API uses the pickle module to load models, which may pose security risks when maliciously constructed files are deserialized. Ensure that the loaded model files are from trusted sources. |
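The file_path constraints in the table can be checked before a loader is constructed. The following is a minimal sketch using only the Python standard library. The function check_pptx_path and its error messages are hypothetical and are not part of mx_rag. The sketch also assumes that 100 MB means 100 × 1024 × 1024 bytes, which the source does not specify.

```python
import os

MAX_PATH_LEN = 1024
# Assumption: 1 MB = 1024 * 1024 bytes; the documentation does not say.
MAX_FILE_SIZE = 100 * 1024 * 1024


def check_pptx_path(file_path: str) -> None:
    """Hypothetical pre-check; raises ValueError if a documented limit is violated."""
    if not 1 <= len(file_path) <= MAX_PATH_LEN:
        raise ValueError("path length must be in [1, 1024]")
    if ".." in file_path:
        raise ValueError("path must not contain two consecutive dots (..)")
    if not file_path.lower().endswith(".pptx"):
        raise ValueError("expected a .pptx file")
    if os.path.islink(file_path):
        raise ValueError("path must not be a soft link")
    if not os.path.isfile(file_path):
        raise ValueError("file does not exist")
    if os.path.getsize(file_path) > MAX_FILE_SIZE:
        raise ValueError("file size must not exceed 100 MB")
```

Calling check_pptx_path("./test.pptx") before creating the loader surfaces these problems early. The loader's own validation may differ in order and wording.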
Example
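from mx_rag.document.loader import PowerPointLoader

# Only file_path is required. enable_ocr defaults to False, so images are not parsed.
loader = PowerPointLoader("./test.pptx")
# Collect the results of lazy_load() into a list before printing.
docs = loader.lazy_load()
print(list(docs))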