Class Introduction
Function
Splits the user-provided document and automatically synthesizes and filters fine-tuning data based on the split document.
Prototype
- Configuration class for fine-tuning data synthesis:
from mx_rag.tools.finetune.generator import DataProcessConfig
@dataclass
class DataProcessConfig():
    generate_qd_prompt: str = GENERATE_QD_PROMPT
    llm_preferred_prompt: str = SCORING_QD_PROMPT
    question_number: int = 3
    featured: bool = True
    featured_percentage: float = 0.8
    preferred: bool = True
    llm_threshold_score: float = 0.8
    rewrite: bool = True
    query_rewrite_number: int = 2
- Method class for fine-tuning data synthesis:
from mx_rag.tools.finetune.generator import TrainDataGenerator
TrainDataGenerator(llm: Text2TextLLM, dataset_path: str, reranker: Reranker, encrypt_fn, decrypt_fn)
Parameters
The table below describes the parameters of DataProcessConfig, the configuration class for fine-tuning data synthesis.
| Parameter | Data Type | Required/Optional | Description |
|---|---|---|---|
| generate_qd_prompt | String | Optional | Prompt used to automatically synthesize fine-tuning data. You can modify the prompt for the target domain to improve the fine-tuning effect. The default value is GENERATE_QD_PROMPT. The length range is (0, 1 × 1024 × 1024]. |
| llm_preferred_prompt | String | Optional | Prompt used to automatically filter synthesized fine-tuning data. You can modify the prompt for the target domain to improve the fine-tuning effect. The default value is SCORING_QD_PROMPT. The length range is (0, 1 × 1024 × 1024]. |
| question_number | Integer | Optional | Number of questions generated for each original document chunk. A higher value gives more comprehensive coverage and improves the fine-tuning effect, at the cost of longer processing time. The default value is 3. The value range is (0, 20]. |
| featured | Bool | Optional | Whether to filter data by relevance scores from a combined BM25 + Reranker method. The default value is True. |
| featured_percentage | Float | Optional | Proportion of data retained after BM25 + Reranker-based filtering. The value range is (0.0, 1.0). The default value is 0.8. |
| preferred | Bool | Optional | Whether to filter data by relevance scores from an LLM. The default value is True. |
| llm_threshold_score | Float | Optional | Proportion after filtering based on data relevance scores from an LLM. The value range is (0.0, 1.0). The default value is 0.8. |
| rewrite | Bool | Optional | Whether to use an LLM to rewrite and expand the generated data from multiple semantic perspectives. The default value is True. |
| query_rewrite_number | Integer | Optional | Number of rewritten and expanded QA pairs. The value range is (0, 20]. The default value is 2. |
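
The documented value ranges can be checked before a long synthesis run is started. The following is a minimal sketch only: `ConfigSketch` and `check_ranges` are hypothetical names that mirror the fields and ranges in the table above, and they are not part of mx_rag. Whether the library performs the same checks internally is not stated here.

```python
from dataclasses import dataclass
from typing import List

MAX_PROMPT_LEN = 1 * 1024 * 1024  # upper bound on prompt length from the table


@dataclass
class ConfigSketch:
    """Hypothetical stand-in for DataProcessConfig; not the mx_rag class."""
    generate_qd_prompt: str = "placeholder generation prompt"
    llm_preferred_prompt: str = "placeholder scoring prompt"
    question_number: int = 3
    featured: bool = True
    featured_percentage: float = 0.8
    preferred: bool = True
    llm_threshold_score: float = 0.8
    rewrite: bool = True
    query_rewrite_number: int = 2


def check_ranges(cfg: ConfigSketch) -> List[str]:
    """Return the documented-range violations found in cfg (empty if none)."""
    errors = []
    # Prompt lengths: (0, 1 x 1024 x 1024]
    for name in ("generate_qd_prompt", "llm_preferred_prompt"):
        if not 0 < len(getattr(cfg, name)) <= MAX_PROMPT_LEN:
            errors.append(f"{name}: length must be in (0, {MAX_PROMPT_LEN}]")
    # Integer counts: (0, 20]
    for name in ("question_number", "query_rewrite_number"):
        if not 0 < getattr(cfg, name) <= 20:
            errors.append(f"{name}: must be in (0, 20]")
    # Float values: open interval (0.0, 1.0)
    for name in ("featured_percentage", "llm_threshold_score"):
        if not 0.0 < getattr(cfg, name) < 1.0:
            errors.append(f"{name}: must be in (0.0, 1.0)")
    return errors
```

For example, `check_ranges(ConfigSketch(question_number=0))` reports one violation, because 0 lies outside (0, 20].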
The GENERATE_QD_PROMPT and SCORING_QD_PROMPT are defined as follows:
GENERATE_QD_PROMPT = """Read an article and generate a related question.
Article: Climate change has severely altered marine ecosystems through escalating sea temperatures, rising sea levels, and ocean acidification. These shifts have fundamentally disrupted species distribution, ecosystem stability, and global fisheries. Consequently, in an era of accelerating global warming, the conservation of marine environments has become an urgent international priority.
Question: What are the main impacts of climate change on marine ecosystems?
Article: The retail sector represents another critical frontier for AI-driven transformation. By leveraging data analytics and machine learning algorithms, retailers can gain deeper insights into consumer behavior, emerging trends, and personal preferences. These technologies enable brands to optimize inventory management, refine recommendation engines, and sharpen marketing strategies—ultimately driving both sales growth and customer loyalty.
Question: How does AI help retailers improve customer experience and sales performance?
Ask {question_number} questions about the following article according to the preceding examples:
Article: {doc}
Output format: Number questions starting from 1 and do not include numbers after the colon in each entry.
Question 1:
...
"""
SCORING_QD_PROMPT = """Your task is to evaluate the relevance of the provided document to the given question. Assign a relevance score on a scale from 0 to 1, where 1 signifies high relevance, and 0 indicates no relevance. Your scoring must be strictly based on the directness and sufficiency of the document content in addressing the question.
Read the question and document carefully, and then provide a relevance score based on the following criteria:
- If the document directly answers the question, give a score close to 1.
- If the document is relevant to the question but does not directly answer it, give a score between 0 and 1 and decrease the score according to the relevance.
- If the document is irrelevant to the question, give 0.
Example:
Question: What did Xiao Ming eat yesterday?
Document: Xiao Ming went out with his friends yesterday and had hot pot. The day was filled with laughter.
Because the document directly answers the question, a score of 0.99 is given.
Question: How is Xiao Hong's academic performance?
Document: Xiao Hong participates actively in class, consistently submits her assignments on time, and is always willing to assist her peers. Consequently, her teacher has officially recognized her as an engaged member of the class.
While the document highlights Xiao Hong's classroom engagement and punctuality with assignments, it does not mention her academic performance. Therefore, a score of 0.10 is given.
Based on the preceding criteria, give a relevance score to the question-document pair, and round the score to two decimal places.
Question: {query}
Document: {doc}
"""
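
When you write a custom prompt, it helps to see how the placeholders and output format might be consumed. The sketch below assumes the placeholders (`{question_number}`, `{doc}`, `{query}`) are filled with Python's `str.format` and that the model reply is parsed with regular expressions; both are assumptions, since the internal handling in mx_rag is not described here. The shortened templates and the helpers `parse_questions` and `parse_score` are hypothetical.

```python
import re
from typing import List, Optional

# Shortened stand-ins for the two templates; only the placeholders matter here.
GENERATE_TEMPLATE = (
    "Ask {question_number} questions about the following article:\n"
    "Article: {doc}\n"
)
SCORING_TEMPLATE = "Question: {query}\nDocument: {doc}\n"


def parse_questions(reply: str) -> List[str]:
    """Extract the text after each 'Question N:' label, in order of appearance."""
    found = re.findall(r"^Question\s+\d+:\s*(.+)$", reply, flags=re.MULTILINE)
    return [question.strip() for question in found]


def parse_score(reply: str) -> Optional[float]:
    """Return the first number in [0, 1] found in the reply, or None if there is none."""
    for token in re.findall(r"\d+(?:\.\d+)?", reply):
        value = float(token)
        if 0.0 <= value <= 1.0:
            return value
    return None


# Fill the placeholders for one document chunk and one question-document pair.
generation_prompt = GENERATE_TEMPLATE.format(question_number=3, doc="chunk text")
scoring_prompt = SCORING_TEMPLATE.format(query="a question", doc="chunk text")
```

A custom prompt that keeps the same placeholder names and the "Question N:" output format stays compatible with this kind of parsing. Note that `parse_score` takes the first in-range number, so a reply that mentions another such number before the score would be misread.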
The table below describes the parameters of TrainDataGenerator, the method class for fine-tuning data synthesis.
| Parameter | Data Type | Required/Optional | Description |
|---|---|---|---|
| llm | Text2TextLLM | Required | LLM used for fine-tuning data synthesis and filtering. For details, see Text2TextLLM. |
| dataset_path | String | Required | Storage directory of the automatically synthesized and filtered fine-tuning dataset. The path length range is [1, 1024]. The path cannot contain soft links or two consecutive dots (..). The storage path cannot be in the following list: ["/etc", "/usr/bin", "/usr/lib", "/usr/lib64", "/sys/", "/dev/", "/sbin", "/tmp"]. |
| reranker | Reranker | Required | Reranker used when filtering the synthesized fine-tuning data. For details, see Reranker. |
| encrypt_fn | Callable[[str], str] | Optional | Function used to encrypt Q-D pairs. The default value is None, indicating that Q-D pairs are not encrypted. NOTICE: If the file to be uploaded contains personal data such as bank account numbers, ID card numbers, passport numbers, and passwords, set this parameter to ensure personal data security. |
| decrypt_fn | Callable[[str], str] | Optional | Function used to decrypt Q-D pairs. The default value is None. |
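
The constraints on dataset_path can be checked up front. The sketch below is an illustration under assumptions: `check_path_rules` and `has_soft_link` are hypothetical helpers, not mx_rag functions; the forbidden list is treated as a set of directory prefixes, which is an interpretation of the description above; and POSIX-style paths are assumed.

```python
import os

# Directory list copied from the dataset_path description above.
FORBIDDEN_DIRS = ["/etc", "/usr/bin", "/usr/lib", "/usr/lib64",
                  "/sys/", "/dev/", "/sbin", "/tmp"]


def check_path_rules(path: str) -> None:
    """Raise ValueError if the path string breaks a documented rule."""
    if not 1 <= len(path) <= 1024:
        raise ValueError("path length must be in [1, 1024]")
    if ".." in path:
        raise ValueError("path must not contain two consecutive dots")
    absolute = os.path.abspath(path)
    for forbidden in FORBIDDEN_DIRS:
        root = forbidden.rstrip("/")
        # Assumption: both the directory itself and anything below it are rejected.
        if absolute == root or absolute.startswith(root + "/"):
            raise ValueError(f"path must not be under {forbidden}")


def has_soft_link(path: str) -> bool:
    """Return True if the path or any of its parent directories is a symbolic link."""
    current = os.path.abspath(path)
    while True:
        if os.path.islink(current):
            return True
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return False
        current = parent
```

`check_path_rules` inspects only the string, while `has_soft_link` queries the filesystem, so the two can be run separately, for example the string check at configuration time and the link check just before writing.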
Example
from paddle.base import libpaddle
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from mx_rag.document import LoaderMng
from mx_rag.document.loader import DocxLoader
from mx_rag.llm import Text2TextLLM
from mx_rag.reranker.local import LocalReranker
from mx_rag.tools.finetune.generator import TrainDataGenerator, DataProcessConfig
from mx_rag.utils import ClientParam
llm = Text2TextLLM(model_name="Llama3-8B-Chinese-Chat",
                   base_url="https://{ip}:{port}/v1/chat/completions",
                   client_param=ClientParam(ca_file="/path/to/ca.crt"))
reranker = LocalReranker("/home/data/bge-reranker-large", dev_id=0)
dataset_path = "path to data_output" # Output path of the fine-tuning synthesis dataset
document_path = "path to document dir" # Path of the user-provided original document
loader_mng = LoaderMng()
loader_mng.register_loader(loader_class=TextLoader, file_types=[".txt", ".md"])
loader_mng.register_loader(loader_class=DocxLoader, file_types=[".docx"])
# Load the document splitter provided by LangChain.
loader_mng.register_splitter(splitter_class=RecursiveCharacterTextSplitter,
                             file_types=[".docx", ".txt", ".md"],
                             splitter_params={"chunk_size": 750,
                                              "chunk_overlap": 150,
                                              "keep_separator": False})
train_data_generator = TrainDataGenerator(llm, dataset_path, reranker)
split_doc_list = train_data_generator.generate_origin_document(document_path=document_path, loader_mng=loader_mng)
config = DataProcessConfig()
train_data_generator.generate_train_data(split_doc_list, config)
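
The example above leaves encrypt_fn and decrypt_fn at their default of None. The sketch below shows only the shape such a pair might take: two `Callable[[str], str]` functions where decryption reverses encryption. The function names are hypothetical, and base64 is an encoding, not encryption; it stands in here purely so the sketch runs with the standard library. A real deployment would substitute a vetted cipher.

```python
import base64
from typing import Callable


def encode_placeholder(text: str) -> str:
    """Placeholder for encrypt_fn. WARNING: base64 offers no confidentiality."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def decode_placeholder(text: str) -> str:
    """Placeholder for decrypt_fn; reverses encode_placeholder."""
    return base64.b64decode(text.encode("ascii")).decode("utf-8")


# Both functions match the documented Callable[[str], str] type.
encrypt_fn: Callable[[str], str] = encode_placeholder
decrypt_fn: Callable[[str], str] = decode_placeholder

# The pair must round-trip, otherwise stored Q-D pairs cannot be read back.
sample = "Question: What is BM25?"
restored = decrypt_fn(encrypt_fn(sample))

# Assumed usage, following the parameter names in the prototype:
# TrainDataGenerator(llm, dataset_path, reranker,
#                    encrypt_fn=encrypt_fn, decrypt_fn=decrypt_fn)
```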