generate_train_data
Function
Generates a given number of questions for each document in the list, improves the quality of the fine-tuning data through multiple rounds of filtering, rewriting, and expansion, and outputs a dataset for fine-tuning an embedding model.
Prototype
def generate_train_data(split_doc_list: list[str], data_process_config: DataProcessConfig, batch_size: int = 8)
Parameters
| Parameter | Data Type | Required/Optional | Description |
|---|---|---|---|
| split_doc_list | list[str] | Required | List of original documents. The list length must be in the range [1, 1000 × 1000], and the length of each string must be in the range [1, 128 × 1024 × 1024]. |
| data_process_config | DataProcessConfig | Required | Configuration for fine-tuning data synthesis. For details, see the description of the DataProcessConfig class in Class Introduction. |
| batch_size | int | Optional | Number of records processed concurrently during fine-tuning data synthesis. The default value is 8. The value range is (0, 1024]. |
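
The documented value ranges can be checked before calling generate_train_data. The following is a minimal sketch, not part of the library: the helper name check_generate_train_data_args and the constant names are hypothetical, and it assumes that string length is measured with Python's len(), which the description above does not specify.

```python
# Limits taken from the parameter table above.
MAX_DOCS = 1000 * 1000              # upper bound on len(split_doc_list)
MAX_DOC_LEN = 128 * 1024 * 1024     # upper bound on each string's length
MAX_BATCH_SIZE = 1024               # batch_size must be in (0, 1024]


def check_generate_train_data_args(split_doc_list, batch_size=8):
    """Hypothetical pre-check that mirrors the documented ranges.

    Raises TypeError or ValueError on the first violation found.
    Assumption: string length is measured with len().
    """
    if not isinstance(split_doc_list, list):
        raise TypeError("split_doc_list must be a list of str")
    if not 1 <= len(split_doc_list) <= MAX_DOCS:
        raise ValueError(f"list length must be in [1, {MAX_DOCS}]")
    for index, doc in enumerate(split_doc_list):
        if not isinstance(doc, str):
            raise TypeError(f"split_doc_list[{index}] is not a str")
        if not 1 <= len(doc) <= MAX_DOC_LEN:
            raise ValueError(
                f"split_doc_list[{index}] length must be in [1, {MAX_DOC_LEN}]"
            )
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(batch_size, bool) or not isinstance(batch_size, int):
        raise TypeError("batch_size must be an int")
    if not 0 < batch_size <= MAX_BATCH_SIZE:
        raise ValueError(f"batch_size must be in (0, {MAX_BATCH_SIZE}]")


# Passes silently when every argument is within the documented ranges.
check_generate_train_data_args(["first document", "second document"], batch_size=8)
```

Running such a check first makes an out-of-range argument fail early with a specific message, before any data synthesis starts.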