Class Introduction

Function

Inherits from langchain_text_splitters.character.RecursiveCharacterTextSplitter and langchain_text_splitters.markdown.MarkdownHeaderTextSplitter to split Markdown files. A single string passed in for splitting cannot exceed 100 MB.
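The combination of header-based and size-based splitting can be pictured with the simplified, standard-library-only sketch below. It is not the mx_rag or LangChain implementation: the helper names split_by_headers, chunk_with_overlap, and split_markdown are hypothetical, sizes are counted in characters, header_level=0 is assumed to mean "no header splitting", and headers inside fenced code are not treated specially.

```python
import re
from typing import List


def split_by_headers(text: str, header_level: int = 3) -> List[str]:
    """Start a new section at every header of depth 1..header_level."""
    if header_level <= 0:
        # Assumption: level 0 disables header-based splitting.
        return [text] if text.strip() else []
    header = re.compile(r"^#{1,%d}\s" % header_level)
    sections: List[str] = []
    current: List[str] = []
    for line in text.splitlines():
        if header.match(line) and current:
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))
    return [s for s in sections if s.strip()]


def chunk_with_overlap(text: str, chunk_size: int = 1000, chunk_overlap: int = 50) -> List[str]:
    """Cut text into windows of at most chunk_size characters that share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
        start += step
    return chunks


def split_markdown(text: str, chunk_size: int = 1000, chunk_overlap: int = 50,
                   header_level: int = 3) -> List[str]:
    """Two-stage split: by headers first, then by size within each section."""
    chunks: List[str] = []
    for section in split_by_headers(text, header_level):
        chunks.extend(chunk_with_overlap(section, chunk_size, chunk_overlap))
    return chunks
```

In this sketch a level-4 header stays inside its parent section when header_level is 3, and a section longer than chunk_size is cut into overlapping windows.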

Prototype

from mx_rag.document.splitter import MarkdownTextSplitter
MarkdownTextSplitter(chunk_size, chunk_overlap, header_level, **kwargs)

Parameters

| Parameter | Data Type | Required/Optional | Description |
| --- | --- | --- | --- |
| chunk_size | Integer | Optional | Chunk size. The value must be greater than 0. The default value is 1000. |
| chunk_overlap | Integer | Optional | Chunk overlap size. The value must be greater than or equal to 0 and less than chunk_size. The default value is 50. |
| header_level | Integer | Optional | Header parsing level. The value ranges from 0 to 6. The default value is 3. |
| **kwargs | Dict[str, Any] | Optional | Additional keyword arguments passed to the LangChain parent class RecursiveCharacterTextSplitter. |
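The constraints listed for the parameters can be checked before the splitter is constructed. The helper below is a hypothetical sketch (validate_splitter_params is not part of mx_rag) that encodes only the ranges and defaults stated above; how the class itself reports invalid values is not described here.

```python
def validate_splitter_params(chunk_size: int = 1000,
                             chunk_overlap: int = 50,
                             header_level: int = 3) -> None:
    """Raise ValueError if a value falls outside the documented range."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be greater than 0")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and less than chunk_size")
    if not 0 <= header_level <= 6:
        raise ValueError("header_level must be between 0 and 6")


# Usage: the documented defaults pass, an overlap equal to the chunk size does not.
validate_splitter_params()
try:
    validate_splitter_params(chunk_size=100, chunk_overlap=100)
except ValueError as err:
    print(err)
```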

Example

from mx_rag.document.loader import MarkdownLoader
from mx_rag.document.splitter import MarkdownTextSplitter
from mx_rag.llm import Img2TextLLM, LLMParameterConfig
from mx_rag.utils import ClientParam

# Image-to-text model passed to the loader.
# Replace {ip} and {port} with the address of the model service.
vlm = Img2TextLLM(
    base_url="https://{ip}:{port}/openai/v1/chat/completions",
    model_name="Qwen2.5-VL-7B-Instruct",
    llm_config=LLMParameterConfig(max_tokens=512),
    client_param=ClientParam(ca_file="/path/to/ca.crt"),
)
loader = MarkdownLoader("/path/to/document.md", vlm=vlm, process_images_separately=False)
docs = loader.lazy_load()

# Split the content of each loaded document into chunks.
splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=50, header_level=3)
for doc in docs:
    chunks = splitter.split_text(doc.page_content)
    print(chunks)
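
Because a single input string cannot exceed 100 MB, it may help to check the size of page_content before calling split_text. The sketch below is an assumption-laden illustration: within_size_limit and MAX_INPUT_BYTES are hypothetical names, and it assumes the limit is measured in UTF-8 bytes with 1 MB = 1024 * 1024 bytes, which this page does not specify.

```python
# Assumption: the 100 MB limit is counted in UTF-8 encoded bytes.
MAX_INPUT_BYTES = 100 * 1024 * 1024


def within_size_limit(text: str, limit: int = MAX_INPUT_BYTES) -> bool:
    """Return True if the UTF-8 encoding of text is no larger than limit bytes."""
    return len(text.encode("utf-8")) <= limit


# Usage: skip or pre-split any content that fails the check.
print(within_size_limit("# Title\n\nSome body text."))
```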