generate_origin_document

Function

Parses a user-provided source document and splits it into chunks, which are then used to synthesize fine-tuning data.

Prototype

def generate_origin_document(document_path: str, loader_mng: LoaderMng, filter_func: Callable[[List[str]], List[str]] = None)

Parameters

| Parameter | Data Type | Required/Optional | Description |
|---|---|---|---|
| document_path | String | Required | Path where the original document is stored. The path length must be in the range [1, 1024]. The path must not contain symbolic links or two consecutive dots (..). |
| loader_mng | LoaderMng | Required | Document parser and splitter. For details, see LoaderMng. |
| filter_func | Callable | Optional | Callback function that cleans the document chunks produced by parsing and splitting. It takes a List[str] and returns a List[str]. The default value is None. |
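The parameter constraints above can be sketched in plain Python. The block below is an illustrative sketch only, not the library's implementation: `is_acceptable_document_path`, `clean_chunks`, `MAX_PATH_LENGTH`, and `MIN_CHUNK_LENGTH` are hypothetical names, and the cleaning rules (strip whitespace, drop very short chunks, drop duplicates) are assumptions chosen for the example. The only facts taken from this page are the documented path rules and the `List[str]` to `List[str]` shape of `filter_func`.

```python
import os
from typing import List

MAX_PATH_LENGTH = 1024   # upper bound stated for document_path
MIN_CHUNK_LENGTH = 10    # assumed threshold, not defined by the library


def is_acceptable_document_path(path: str) -> bool:
    """Hypothetical pre-check mirroring the documented document_path rules."""
    # Length must fall in [1, 1024].
    if not 1 <= len(path) <= MAX_PATH_LENGTH:
        return False
    # Two consecutive dots are not allowed anywhere in the path.
    if ".." in path:
        return False
    # Walk from the path up to the filesystem root and reject symbolic links.
    current = os.path.abspath(path)
    while True:
        if os.path.islink(current):
            return False
        parent = os.path.dirname(current)
        if parent == current:
            break
        current = parent
    return True


def clean_chunks(chunks: List[str]) -> List[str]:
    """Hypothetical filter_func: takes List[str] and returns List[str]."""
    seen = set()
    cleaned = []
    for chunk in chunks:
        text = chunk.strip()
        # Skip chunks that are too short or already seen.
        if len(text) < MIN_CHUNK_LENGTH or text in seen:
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```

A callback of this shape could be supplied as `filter_func=clean_chunks` when calling `generate_origin_document`. The path pre-check is optional on the caller's side and only reports problems earlier than the library would.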

Return Value

| Data Type | Description |
|---|---|
| list[str] | List of split document chunks. |