Class Introduction

Function

Helps users synthesize an evaluation dataset from a given document. The synthesized dataset is not used as is: users must filter it manually and keep only the QA pairs that match the characteristics of the target domain, so that model precision can be assessed accurately.

Prototype

from mx_rag.tools.finetune.generator.eval_data_generator import EvalDataGenerator
EvalDataGenerator(llm: Text2TextLLM, dataset_path: str, encrypt_fn: Callable[[str], str] = None, decrypt_fn: Callable[[str], str] = None)

Parameters

| Parameter | Data Type | Required/Optional | Description |
|---|---|---|---|
| llm | Text2TextLLM | Required | LLM used to synthesize an evaluation dataset. For details, see Text2TextLLM. |
| dataset_path | String | Required | Path that stores the evaluation dataset file. The path length range is [1, 1024]. The path cannot contain soft links and cannot contain two consecutive dots (..). The storage path cannot be in the path list: ["/etc", "/usr/bin", "/usr/lib", "/usr/lib64", "/sys/", "/dev/", "/sbin", "/tmp"]. |
| encrypt_fn | Callable[[str], str] | Optional | Callback function to encrypt the generated Q-D pairs. The return value is a string with a maximum length of 128 × 1024 × 1024. The default value is None, indicating that the Q-D pairs are not encrypted. NOTICE: If the file to be uploaded contains personal data such as bank account numbers, ID card numbers, passport numbers, and passwords, set this parameter to ensure personal data security. |
| decrypt_fn | Callable[[str], str] | Optional | Callback function to decrypt the stored Q-D pairs. The return value is a string with a maximum length of 128 × 1024 × 1024. The default value is None. |
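
The dataset_path constraints described above can be checked before the generator is constructed. The following is a minimal sketch that mirrors the documented rules only; it is not the library's own validation. The helper name check_dataset_path and the constants are hypothetical, and treating the forbidden list as directory prefixes is an assumption.

```python
import os

MAX_PATH_LEN = 1024  # documented length range is [1, 1024]
# Documented list of locations that must not hold the dataset.
FORBIDDEN_DIRS = ["/etc", "/usr/bin", "/usr/lib", "/usr/lib64",
                  "/sys/", "/dev/", "/sbin", "/tmp"]


def check_dataset_path(path: str) -> bool:
    """Hypothetical helper: True if `path` meets the documented constraints."""
    if not 1 <= len(path) <= MAX_PATH_LEN:
        return False
    # The documentation forbids two consecutive dots anywhere in the path.
    if ".." in path:
        return False

    abs_path = os.path.abspath(path)

    # Reject the path if it or any parent directory is a soft link.
    current = abs_path
    while True:
        if os.path.islink(current):
            return False
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            break
        current = parent

    # Assumption: a forbidden entry excludes the directory and everything below it.
    for forbidden in FORBIDDEN_DIRS:
        root = forbidden.rstrip("/")
        if abs_path == root or abs_path.startswith(root + "/"):
            return False
    return True
```

A check of this kind reports a problem early, before any documents are loaded or LLM calls are made. The library may apply additional or stricter checks than the ones sketched here.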

Example

from paddle.base import libpaddle
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from mx_rag.document import LoaderMng
from mx_rag.document.loader import DocxLoader
from mx_rag.llm import Text2TextLLM
from mx_rag.tools.finetune.generator.eval_data_generator import EvalDataGenerator
from mx_rag.utils import ClientParam

llm = Text2TextLLM(
    model_name="Llama3-8B-Chinese-Chat",
    base_url="https://{ip}:{port}/v1/chat/completions",
    client_param=ClientParam(ca_file="/path/to/ca.crt"),
)

dataset_path = "path to data_output" # Output path of the synthesized evaluation dataset

document_path = "path to document dir" # Path of the user-provided original document

eval_data_generator = EvalDataGenerator(llm, dataset_path)

loader_mng = LoaderMng()

# Register a loader for each supported file type.
loader_mng.register_loader(loader_class=TextLoader, file_types=[".txt", ".md"])
loader_mng.register_loader(loader_class=DocxLoader, file_types=[".docx"])

# Register the document splitter provided by LangChain.
loader_mng.register_splitter(splitter_class=RecursiveCharacterTextSplitter,
                             file_types=[".docx", ".txt", ".md"],
                             splitter_params={"chunk_size": 750,
                                              "chunk_overlap": 150,
                                              "keep_separator": False
                                              }
                             )

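# Load the documents under document_path and split them into chunks.
split_doc_list = eval_data_generator.generate_origin_document(document_path=document_path, loader_mng=loader_mng)

# Synthesize the evaluation dataset from the split documents.
eval_data_generator.generate_evaluate_data(split_doc_list)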
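
The example above constructs EvalDataGenerator without encrypt_fn and decrypt_fn. The following is a minimal sketch of a callback pair with the documented Callable[[str], str] shape. Base64 is used purely as a reversible stand-in and provides no confidentiality; a real deployment would call a vetted encryption library here. The names limit_length and MAX_RESULT_LEN are hypothetical and only illustrate the documented limit on the returned string.

```python
import base64
from typing import Callable

# Documented upper bound on the length of the string each callback returns.
MAX_RESULT_LEN = 128 * 1024 * 1024


def limit_length(fn: Callable[[str], str]) -> Callable[[str], str]:
    """Hypothetical wrapper: raise if a callback result exceeds the limit."""
    def wrapper(text: str) -> str:
        result = fn(text)
        if len(result) > MAX_RESULT_LEN:
            raise ValueError("callback result exceeds the documented maximum length")
        return result
    return wrapper


@limit_length
def encrypt_fn(plain_text: str) -> str:
    # PLACEHOLDER ONLY: Base64 is an encoding, not encryption.
    # Replace this body with a call to a real cipher.
    return base64.b64encode(plain_text.encode("utf-8")).decode("ascii")


@limit_length
def decrypt_fn(cipher_text: str) -> str:
    # Inverse of the placeholder above.
    return base64.b64decode(cipher_text.encode("ascii")).decode("utf-8")


# Passing the callbacks, using the parameter names from the prototype
# (llm and dataset_path as defined in the example above):
# eval_data_generator = EvalDataGenerator(llm, dataset_path,
#                                         encrypt_fn=encrypt_fn,
#                                         decrypt_fn=decrypt_fn)
```

Whatever cipher is used, decrypt_fn should be the exact inverse of encrypt_fn so that stored Q-D pairs can be read back unchanged.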