Class Introduction
Function
Assists users in synthesizing an evaluation dataset based on a given document. Users need to manually filter the synthesized evaluation dataset and select QA pairs that align with the target domain's characteristics to accurately assess model precision.
Prototype
from mx_rag.tools.finetune.generator.eval_data_generator import EvalDataGenerator
EvalDataGenerator(llm: Text2TextLLM, dataset_path: str, encrypt_fn, decrypt_fn)
Parameters
| Parameter | Data Type | Required/Optional | Description |
|---|---|---|---|
| llm | Text2TextLLM | Required | LLM used to synthesize the evaluation dataset. For details, see Text2TextLLM. |
| dataset_path | String | Required | Path where the evaluation dataset file is stored. The path length must be in the range [1, 1024]. The path cannot contain soft links or two consecutive dots (..). The storage path cannot be in the following list: ["/etc", "/usr/bin", "/usr/lib", "/usr/lib64", "/sys/", "/dev/", "/sbin", "/tmp"]. |
| encrypt_fn | Callable[[str], str] | Optional | Callback function that encrypts the generated Q-D pairs. The return value is a string with a maximum length of 128 × 1024 × 1024. The default value is None, indicating that the Q-D pairs are not encrypted. NOTICE: If the file to be uploaded contains personal data such as bank account numbers, ID card numbers, passport numbers, and passwords, set this parameter to ensure personal data security. |
| decrypt_fn | Callable[[str], str] | Optional | Callback function that decrypts the stored Q-D pairs. The return value is a string with a maximum length of 128 × 1024 × 1024. The default value is None. |
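The dataset_path constraints listed above can be checked before the generator is constructed. The following is a minimal sketch of such a pre-check, not the library's own validation: the helper names (check_dataset_path, in_forbidden_dir, has_soft_link) are hypothetical, and matching the forbidden list on directory boundaries is an assumption.

```python
import os

# Directories the parameter table lists as disallowed storage locations.
FORBIDDEN_DIRS = ["/etc", "/usr/bin", "/usr/lib", "/usr/lib64", "/sys/", "/dev/", "/sbin", "/tmp"]
MAX_PATH_LEN = 1024  # documented upper bound on the path length


def in_forbidden_dir(abs_path: str) -> bool:
    """Return True if abs_path is a forbidden directory or lies inside one.

    Assumption: entries are matched on directory boundaries, so
    "/etcetera" is not treated as being inside "/etc".
    """
    for entry in FORBIDDEN_DIRS:
        base = entry.rstrip("/")
        if abs_path == base or abs_path.startswith(base + "/"):
            return True
    return False


def has_soft_link(abs_path: str) -> bool:
    """Return True if resolving symbolic links changes the path."""
    return os.path.realpath(abs_path) != os.path.normpath(abs_path)


def check_dataset_path(path: str) -> None:
    """Raise ValueError if path violates one of the documented constraints."""
    if not 1 <= len(path) <= MAX_PATH_LEN:
        raise ValueError("path length must be in the range [1, 1024]")
    if ".." in path:
        raise ValueError("path must not contain two consecutive dots")
    abs_path = os.path.abspath(path)
    if has_soft_link(abs_path):
        raise ValueError("path must not contain soft links")
    if in_forbidden_dir(abs_path):
        raise ValueError("path must not be located in a forbidden directory")
```

Calling check_dataset_path(dataset_path) before creating EvalDataGenerator surfaces an invalid path early; the library may apply stricter or different checks of its own.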
Example
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from mx_rag.document import LoaderMng
from mx_rag.document.loader import DocxLoader
from mx_rag.llm import Text2TextLLM
from mx_rag.tools.finetune.generator.eval_data_generator import EvalDataGenerator
from mx_rag.utils import ClientParam
llm = Text2TextLLM(
    model_name="Llama3-8B-Chinese-Chat",
    base_url="https://{ip}:{port}/v1/chat/completions",
    client_param=ClientParam(ca_file="/path/to/ca.crt"),
)
dataset_path = "path to data_output"  # Output path of the synthesized evaluation dataset
document_path = "path to document dir"  # Path of the user-provided original documents
eval_data_generator = EvalDataGenerator(llm, dataset_path)
loader_mng = LoaderMng()
loader_mng.register_loader(loader_class=TextLoader, file_types=[".txt", ".md"])
loader_mng.register_loader(loader_class=DocxLoader, file_types=[".docx"])
# Load the document splitter provided by LangChain.
loader_mng.register_splitter(
    splitter_class=RecursiveCharacterTextSplitter,
    file_types=[".docx", ".txt", ".md"],
    splitter_params={
        "chunk_size": 750,
        "chunk_overlap": 150,
        "keep_separator": False,
    },
)
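# Load and split the original documents into chunks.
split_doc_list = eval_data_generator.generate_origin_document(document_path=document_path, loader_mng=loader_mng)
# Synthesize the evaluation dataset from the document chunks.
eval_data_generator.generate_evaluate_data(split_doc_list)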
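
The example above leaves encrypt_fn and decrypt_fn at their default of None. The sketch below shows one way a pair of callbacks with the documented Callable[[str], str] shape and the 128 × 1024 × 1024 length cap might be built by wrapping a bytes-level cipher. The names make_callbacks and _toy_cipher are hypothetical, and the XOR routine is only a placeholder to keep the sketch self-contained: it is not secure and should be replaced with a real cipher.

```python
import base64
from typing import Callable, Tuple

MAX_RETURN_LEN = 128 * 1024 * 1024  # documented cap on the returned string length


def make_callbacks(
    encrypt_bytes: Callable[[bytes], bytes],
    decrypt_bytes: Callable[[bytes], bytes],
) -> Tuple[Callable[[str], str], Callable[[str], str]]:
    """Adapt a bytes-level cipher to the str -> str callback shape."""

    def encrypt_fn(plain: str) -> str:
        # Base64 keeps the ciphertext representable as a plain string.
        token = base64.b64encode(encrypt_bytes(plain.encode("utf-8"))).decode("ascii")
        if len(token) > MAX_RETURN_LEN:
            raise ValueError("encrypted text exceeds the documented length limit")
        return token

    def decrypt_fn(token: str) -> str:
        plain = decrypt_bytes(base64.b64decode(token)).decode("utf-8")
        if len(plain) > MAX_RETURN_LEN:
            raise ValueError("decrypted text exceeds the documented length limit")
        return plain

    return encrypt_fn, decrypt_fn


def _toy_cipher(data: bytes) -> bytes:
    """Placeholder ONLY: XOR with a fixed byte is NOT encryption."""
    return bytes(b ^ 0x5A for b in data)


# XOR is its own inverse, so the same placeholder serves both directions.
encrypt_fn, decrypt_fn = make_callbacks(_toy_cipher, _toy_cipher)
```

Following the prototype, the pair would then be passed as EvalDataGenerator(llm, dataset_path, encrypt_fn, decrypt_fn). Whether the library invokes the callbacks once per Q-D pair or once per file is not stated here, so the callbacks should not rely on either behavior.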