generate_train_data

Function

Generates a number of questions for each document in the list, improves the quality of the fine-tuning data through multiple rounds of filtering, rewriting, and expansion, and outputs a dataset for fine-tuning an embedding model.

Prototype

def generate_train_data(split_doc_list: list[str], data_process_config: DataProcessConfig, batch_size: int = 8)

Parameters

| Parameter | Data Type | Required/Optional | Description |
|---|---|---|---|
| split_doc_list | list[str] | Required | List of original documents. The list length must be in the range [1, 1000 × 1000], and the length of each string must be in the range [1, 128 × 1024 × 1024]. |
| data_process_config | DataProcessConfig | Required | Configuration for fine-tuning data synthesis. For details, see the description of the DataProcessConfig class in Class Introduction. |
| batch_size | int | Optional | Number of records processed concurrently during fine-tuning data synthesis. The default value is 8. The value range is (0, 1024]. |
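
The documented ranges can be checked before the call. The following is a minimal sketch, not part of the library: check_generate_train_data_args is a hypothetical helper name, and raising ValueError is an assumption made for this illustration. How generate_train_data itself reacts to out-of-range arguments is not described here.

```python
# Limits taken from the parameter descriptions above.
MAX_DOCS = 1000 * 1000
MAX_DOC_LEN = 128 * 1024 * 1024
MAX_BATCH_SIZE = 1024


def check_generate_train_data_args(split_doc_list, batch_size=8):
    """Hypothetical helper: raise ValueError if an argument is outside its documented range."""
    if not isinstance(split_doc_list, list):
        raise ValueError("split_doc_list must be a list of str")
    if not 1 <= len(split_doc_list) <= MAX_DOCS:
        raise ValueError(f"list length must be in [1, {MAX_DOCS}]")
    for index, doc in enumerate(split_doc_list):
        if not isinstance(doc, str):
            raise ValueError(f"item {index} is not a str")
        if not 1 <= len(doc) <= MAX_DOC_LEN:
            raise ValueError(f"item {index} length must be in [1, {MAX_DOC_LEN}]")
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(batch_size, bool) or not isinstance(batch_size, int):
        raise ValueError("batch_size must be an int")
    if not 0 < batch_size <= MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must be in (0, {MAX_BATCH_SIZE}]")


docs = ["First document chunk.", "Second document chunk."]
check_generate_train_data_args(docs, batch_size=8)

# With a DataProcessConfig instance `config` built as described in Class Introduction,
# the call itself would then look like this (left commented out, since it needs the SDK):
# generate_train_data(docs, config, batch_size=8)
```

Checking the arguments up front keeps range errors separate from failures that occur later, during the synthesis itself.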