register_splitter

Function

Registers a document splitter class for the given file types. A maximum of 1,000 splitters can be registered.

Prototype

def register_splitter(splitter_class, file_types, splitter_params=None)

Parameters

| Parameter | Data Type | Required/Optional | Description |
| --- | --- | --- | --- |
| splitter_class | TextSplitter | Required | Document splitter class, which must be a subclass of LangChain's TextSplitter. |
| file_types | List[str] | Required | List of file name extensions, for example [".txt", ".docx"]. The list must contain 1 to 32 entries, and each extension must be 1 to 32 characters long. The .jpg and .png formats are not supported. |
| splitter_params | Dict[str, Any] | Optional | Parameters passed to the document splitter class. The default value is None. A parameter string cannot exceed 1024 characters, the dictionary length cannot exceed 1024 characters, and dictionaries can be nested at most 2 layers deep. |

Take LangChain as an example. When splitter_class is RecursiveCharacterTextSplitter, splitter_params can be {"chunk_size": 4000, "chunk_overlap": 20, "keep_separator": False}. chunk_size sets the maximum size of a split block, chunk_overlap sets the size of the overlap between adjacent blocks, and keep_separator indicates whether to retain the separators in the split blocks. The separators default to ["\n\n", "\n", " ", ""].