Class Introduction
Function
Inherits from langchain_text_splitters.character.RecursiveCharacterTextSplitter and langchain_text_splitters.markdown.MarkdownHeaderTextSplitter to split Markdown files. A single string passed in for splitting cannot exceed 100 MB.
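The 100 MB limit can be checked before a string is handed to the splitter. The following is a minimal sketch, not part of mx_rag: the helper name fits_split_limit and the constant MAX_SPLIT_BYTES are hypothetical, and it assumes the limit is measured in UTF-8 encoded bytes with 100 MB taken as 100 * 1024 * 1024. The library may measure length differently.

```python
# Hypothetical guard, not part of mx_rag.
# Assumption: "100 MB" means 100 * 1024 * 1024 bytes of UTF-8 encoded text.
MAX_SPLIT_BYTES = 100 * 1024 * 1024


def fits_split_limit(text: str, limit: int = MAX_SPLIT_BYTES) -> bool:
    """Return True if the UTF-8 encoding of `text` is at most `limit` bytes."""
    return len(text.encode("utf-8")) <= limit
```

A caller could run this check on each document's page_content and skip or pre-split any document for which it returns False.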
Prototype
from mx_rag.document.splitter import MarkdownTextSplitter
MarkdownTextSplitter(chunk_size, chunk_overlap, header_level, **kwargs)
Parameters
| Parameter | Data Type | Required/Optional | Description |
|---|---|---|---|
| chunk_size | Integer | Optional | Chunk size. The value is greater than 0, and the default value is 1000. |
| chunk_overlap | Integer | Optional | Chunk overlap size. The value is greater than or equal to 0 and must be less than chunk_size. The default value is 50. |
| header_level | Integer | Optional | Title parsing level. The value ranges from 0 to 6, and the default value is 3. |
| **kwargs | Dict[str, Any] | Optional | Additional keyword parameters for the parent class RecursiveCharacterTextSplitter of LangChain. |
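The constraints in the table can be checked up front, before the splitter is constructed. The following is a minimal sketch under stated assumptions: validate_splitter_args is a hypothetical helper, not an mx_rag function, and it only mirrors the documented ranges and defaults. How the library itself reports invalid values is not covered here.

```python
# Hypothetical pre-check, not part of mx_rag; it mirrors the documented ranges only.
def validate_splitter_args(
    chunk_size: int = 1000,
    chunk_overlap: int = 50,
    header_level: int = 3,
) -> None:
    """Raise ValueError if an argument falls outside the documented range."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be greater than 0")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and less than chunk_size")
    if not 0 <= header_level <= 6:
        raise ValueError("header_level must be between 0 and 6")
```

For example, validate_splitter_args(chunk_size=100, chunk_overlap=100) raises ValueError because the overlap is not smaller than the chunk size.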
Example
from mx_rag.document.loader import MarkdownLoader
from mx_rag.document.splitter import MarkdownTextSplitter
from mx_rag.llm import Img2TextLLM, LLMParameterConfig
from mx_rag.utils import ClientParam
vlm = Img2TextLLM(
    base_url="https://{ip}:{port}/openai/v1/chat/completions",
    model_name="Qwen2.5-VL-7B-Instruct",
    llm_config=LLMParameterConfig(max_tokens=512),
    client_param=ClientParam(ca_file="/path/to/ca.crt"),
)
loader = MarkdownLoader("/path/to/document.md", vlm=vlm, process_images_separately=False)
docs = loader.lazy_load()
splitter = MarkdownTextSplitter(chunk_size=1000, chunk_overlap=50, header_level=3)
for doc in docs:
    chunks = splitter.split_text(doc.page_content)
    print(chunks)