Class Introduction

Description

Inherits the langchain_core.document_loaders.base.BaseLoader and mx_rag.document.loader.BaseLoader classes to parse .pptx documents. Text in images and in tables (up to 100 × 50 in size) in the documents can be parsed. Image parsing is performed by a large vision model. During PowerPointLoader initialization, an OCR model must be downloaded from the Internet, so ensure that a network connection is available. OCR relies on the third-party PaddleOCR, and recognition accuracy is determined by PaddleOCR itself.

Prototype

from mx_rag.document.loader import PowerPointLoader
PowerPointLoader(file_path, vlm, lang, enable_ocr)
# Enumerated values accepted by the lang parameter
from mx_rag.utils.common import Lang
class Lang(Enum):
    EN: str = 'en'
    CH: str = 'ch'

Parameters

| Parameter | Data Type | Required/Optional | Description |
|---|---|---|---|
| file_path | String | Required | Path of a .pptx file. The path length must be in the range [1, 1024]. The path cannot be a soft link and cannot contain two consecutive dots (..). The file size cannot exceed 100 MB. |
| vlm | Img2TextLLM | Optional | Large vision model object used to parse image information in documents. For details, see Img2TextLLM. |
| lang | Lang | Optional | Language of the text to be recognized in images during OCR. Chinese and English (Lang.CH/Lang.EN) are currently supported. The default value is Lang.CH. |
| enable_ocr | Bool | Optional | Whether to enable OCR to parse images. The default value is False, indicating that images are not parsed. Images whose resolution exceeds 4096 × 4096 pixels or whose height and width are smaller than 256 pixels cannot be parsed. |

NOTE:

When enable_ocr is set to True, PaddleOCR downloads files from the Internet. This API uses the pickle module to load models, which poses a security risk if a maliciously constructed file is deserialized. Ensure that the loaded model files come from trusted sources.
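One way to act on the advice about trusted model files is to compare a downloaded file against a checksum published by a source you trust before it is ever loaded. The following is a minimal sketch using only the Python standard library. The helper names sha256_of_file and is_trusted_model_file are hypothetical and are not part of mx_rag or PaddleOCR. Where PaddleOCR stores its model files and which digests are authoritative are not specified here, so both are left to the caller.

```python
import hashlib
import hmac


def sha256_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large model files do not need to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def is_trusted_model_file(path: str, expected_sha256: str) -> bool:
    """Hypothetical helper: True only if the file matches the expected digest.

    expected_sha256 is assumed to come from a source the caller already trusts.
    """
    return hmac.compare_digest(sha256_of_file(path), expected_sha256.lower())
```

A check like this only detects files that differ from the expected digest. It does not make pickle deserialization safe for files of unknown origin.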

Example

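from mx_rag.document.loader import PowerPointLoader

# Only file_path is required; OCR is disabled by default (enable_ocr=False).
loader = PowerPointLoader("./test.pptx")
# lazy_load() returns an iterator, so materialize it before printing.
docs = loader.lazy_load()
print(list(docs))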
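The file_path constraints listed under Parameters can also be checked before the loader is constructed, which gives an early and specific error message. The sketch below is an illustration only: check_pptx_path is a hypothetical helper, not an mx_rag function, and it assumes that 100 MB means 100 × 1024 × 1024 bytes and that the path length is counted in characters. How PowerPointLoader itself enforces these limits is not described here.

```python
import os

MAX_PATH_LEN = 1024                # documented path length range is [1, 1024]
MAX_FILE_SIZE = 100 * 1024 * 1024  # documented 100 MB limit (assumed binary megabytes)


def check_pptx_path(file_path: str) -> None:
    """Hypothetical pre-check; raises ValueError if a documented constraint is violated."""
    if not 1 <= len(file_path) <= MAX_PATH_LEN:
        raise ValueError("path length must be in the range [1, 1024]")
    if ".." in file_path:
        raise ValueError("path must not contain two consecutive dots (..)")
    if os.path.islink(file_path):
        raise ValueError("path must not be a soft link")
    if not file_path.lower().endswith(".pptx"):
        raise ValueError("file must be a .pptx document")
    if not os.path.isfile(file_path):
        raise ValueError("path does not point to an existing file")
    if os.path.getsize(file_path) > MAX_FILE_SIZE:
        raise ValueError("file size must not exceed 100 MB")
```

Calling check_pptx_path("./test.pptx") before PowerPointLoader("./test.pptx") would surface a violated constraint as a ValueError with a specific message.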