Class Introduction

Function

Splits the user-provided document and automatically synthesizes and filters fine-tuning data based on the split document.

Prototype

  • Configuration class for fine-tuning data synthesis:
    from mx_rag.tools.finetune.generator import DataProcessConfig
    @dataclass
    class DataProcessConfig():
        generate_qd_prompt: str = GENERATE_QD_PROMPT
        llm_preferred_prompt: str = SCORING_QD_PROMPT
        question_number: int = 3
        featured: bool = True
        featured_percentage: float = 0.8
        preferred: bool = True
        llm_threshold_score: float = 0.8
        rewrite: bool = True
        query_rewrite_number: int = 2
  • Method class for fine-tuning data synthesis:
    from mx_rag.tools.finetune.generator import TrainDataGenerator
    TrainDataGenerator(llm: Text2TextLLM, dataset_path: str, reranker: Reranker, encrypt_fn, decrypt_fn)

Parameters

The table below describes the parameters of DataProcessConfig, the configuration class for fine-tuning data synthesis.

Parameter

Data Type

Required/Optional

Description

generate_qd_prompt

String

Optional

Prompt used for automatically synthesizing fine-tuning data. You can modify the prompt based on the target domain to improve the fine-tuning effect. The default value is GENERATE_QD_PROMPT. The length range is (0, 1 × 1024 × 1024].

llm_preferred_prompt

String

Optional

Prompt used for automatically filtering synthesized fine-tuning data. You can modify the prompt based on the target domain to improve the fine-tuning effect. The default value is SCORING_QD_PROMPT. The length range is (0, 1 × 1024 × 1024].

question_number

Integer

Optional

Number of questions generated for each original document chunk. A higher value gives more comprehensive coverage and can improve the fine-tuning effect, at the cost of longer processing time. The default value is 3. The value range is (0, 20].

featured

Bool

Optional

Whether to filter the data based on relevance scores from a combined BM25 + Reranker method. The default value is True.

featured_percentage

Float

Optional

Proportion of data retained after BM25 + Reranker-based filtering. The value range is (0.0, 1.0). The default value is 0.8.

preferred

Bool

Optional

Whether to filter the data based on relevance scores from an LLM. The default value is True.

llm_threshold_score

Float

Optional

Score threshold used when filtering based on data relevance scores from an LLM. The value range is (0.0, 1.0). The default value is 0.8.

rewrite

Bool

Optional

Whether to use an LLM to rewrite and expand the generated data from multiple semantic perspectives. The default value is True.

query_rewrite_number

Integer

Optional

Number of rewritten and expanded QA pairs. The value range is (0, 20]. The default value is 2.
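The documented value ranges can be checked before a long synthesis run starts. The following is a minimal sketch and not part of mx_rag: ConfigSketch is a hypothetical stand-in that mirrors only the numeric fields of DataProcessConfig, and validate_ranges is an illustrative helper. Whether the library performs its own validation is not stated here.

```python
from dataclasses import dataclass


@dataclass
class ConfigSketch:
    """Hypothetical stand-in mirroring the numeric fields of DataProcessConfig."""
    question_number: int = 3
    featured_percentage: float = 0.8
    llm_threshold_score: float = 0.8
    query_rewrite_number: int = 2


def validate_ranges(cfg) -> list:
    """Return one message per field that is outside its documented range."""
    problems = []
    # Integer fields use the half-open range (0, 20].
    for name in ("question_number", "query_rewrite_number"):
        value = getattr(cfg, name)
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not (is_int and 0 < value <= 20):
            problems.append(f"{name}={value!r} is outside (0, 20]")
    # Float fields use the open range (0.0, 1.0).
    for name in ("featured_percentage", "llm_threshold_score"):
        value = getattr(cfg, name)
        if not (isinstance(value, float) and 0.0 < value < 1.0):
            problems.append(f"{name}={value!r} is outside (0.0, 1.0)")
    return problems


if __name__ == "__main__":
    print(validate_ranges(ConfigSketch()))
    print(validate_ranges(ConfigSketch(question_number=25)))
```

Because the helper only reads attributes by name, it could also be applied to a real DataProcessConfig instance, which exposes the same field names.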

The GENERATE_QD_PROMPT and SCORING_QD_PROMPT are defined as follows:

GENERATE_QD_PROMPT = """Read an article and generate a related question.
Article: Climate change has severely altered marine ecosystems through escalating sea temperatures, rising sea levels, and ocean acidification. These shifts have fundamentally disrupted species distribution, ecosystem stability, and global fisheries. Consequently, in an era of accelerating global warming, the conservation of marine environments has become an urgent international priority.
Question: What are the main impacts of climate change on marine ecosystems?
Article: The retail sector represents another critical frontier for AI-driven transformation. By leveraging data analytics and machine learning algorithms, retailers can gain deeper insights into consumer behavior, emerging trends, and personal preferences. These technologies enable brands to optimize inventory management, refine recommendation engines, and sharpen marketing strategies—ultimately driving both sales growth and customer loyalty.
Question: How does AI help retailers improve customer experience and sales performance?
Ask {question_number} questions about the following article according to the preceding examples:
Article: {doc}
Output format: Number questions starting from 1 and do not include numbers after the colon in each entry.
Question 1:
...
"""
SCORING_QD_PROMPT = """Your task is to evaluate the relevance of the provided document to the given question. Assign a relevance score on a scale from 0 to 1, where 1 signifies high relevance, and 0 indicates no relevance. Your scoring must be strictly based on the directness and sufficiency of the document content in addressing the question.
Read the question and document carefully, and then provide a relevance score based on the following criteria:
- If the document directly answers the question, give a score close to 1.
- If the document is relevant to the question but does not directly answer it, give a score between 0 and 1 and decrease the score according to the relevance.
- If the document is irrelevant to the question, give 0.
Example:
Question: What did Xiao Ming eat yesterday?
Document: Xiao Ming went out with his friends yesterday and had hot pot. The day was filled with laughter.
Because the document directly answers the question, a score of 0.99 is given.
Question: How is Xiao Hong's academic performance?
Document: Xiao Hong participates actively in class, consistently submits her assignments on time, and is always willing to assist her peers. Consequently, her teacher has officially recognized her as an engaged member of the class.
While the document highlights Xiao Hong's classroom engagement and punctuality with assignments, it does not mention her academic performance. Therefore, a score of 0.10 is given.
Based on the preceding criteria, give a relevance score to the question-document pair, and round the score to two decimal places.
Question: {query}
Document: {doc}
"""

The table below describes the parameters of TrainDataGenerator, the method class for fine-tuning data synthesis.

Parameter

Data Type

Required/Optional

Description

llm

Text2TextLLM

Required

LLM used for fine-tuning data synthesis and filtering. For details, see Text2TextLLM.

dataset_path

String

Required

Storage directory of the automatically synthesized and filtered fine-tuning dataset. The path length range is [1, 1024]. The path cannot contain symbolic links or two consecutive dots (..).

The storage path cannot be in the following path list: ["/etc", "/usr/bin", "/usr/lib", "/usr/lib64", "/sys/", "/dev/", "/sbin", "/tmp"].

reranker

Reranker

Required

Reranker used when filtering the synthesized fine-tuning data. For details, see Reranker.

encrypt_fn

Callable[[str], str]

Optional

Callback function used to encrypt Q-D pairs. The default value is None, indicating that Q-D pairs are not encrypted.

NOTICE:

If the file to be uploaded contains personal data such as bank account numbers, ID card numbers, passport numbers, and passwords, set this parameter to ensure personal data security.

decrypt_fn

Callable[[str], str]

Optional

Callback function used to decrypt Q-D pairs. The default value is None.
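The constraints on dataset_path can be checked before the generator is constructed. The following is a rough sketch for POSIX systems, not the library's own validation: check_dataset_path is a hypothetical helper, and treating every path under a listed directory as forbidden is an assumption about how the path list is applied.

```python
import os

MAX_PATH_LENGTH = 1024
# Directory list copied from the dataset_path description.
FORBIDDEN_DIRS = ["/etc", "/usr/bin", "/usr/lib", "/usr/lib64",
                  "/sys/", "/dev/", "/sbin", "/tmp"]


def check_dataset_path(path: str) -> list:
    """Return one message per documented constraint that the path violates."""
    problems = []
    if not 1 <= len(path) <= MAX_PATH_LENGTH:
        problems.append("path length must be in [1, 1024]")
    if ".." in path:
        problems.append("path must not contain '..'")
    # Compare component-wise so that "/etcetera" is not mistaken for a path
    # under "/etc". Assumption: anything under a listed directory is rejected.
    absolute = os.path.abspath(path) if path else ""
    for forbidden in FORBIDDEN_DIRS:
        root = forbidden.rstrip("/")
        if absolute == root or absolute.startswith(root + "/"):
            problems.append(f"path must not be under {forbidden}")
    # Walk up from the path and flag the first component that is a symlink.
    current = absolute
    while current and current != os.path.dirname(current):
        if os.path.islink(current):
            problems.append(f"{current} is a symbolic link")
            break
        current = os.path.dirname(current)
    return problems


if __name__ == "__main__":
    for candidate in ("/home/data/output", "/etc/output", "/home/data/../x"):
        print(candidate, check_dataset_path(candidate))
```

An empty result means that none of the checks in this sketch failed; the library may apply additional checks.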

Example

from paddle.base import libpaddle
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from mx_rag.document import LoaderMng
from mx_rag.document.loader import DocxLoader
from mx_rag.llm import Text2TextLLM
from mx_rag.reranker.local import LocalReranker
from mx_rag.tools.finetune.generator import TrainDataGenerator, DataProcessConfig
from mx_rag.utils import ClientParam


# Replace {ip} and {port} with the address of the LLM service.
llm = Text2TextLLM(model_name="Llama3-8B-Chinese-Chat",
                   base_url="https://{ip}:{port}/v1/chat/completions",
                   client_param=ClientParam(ca_file="/path/to/ca.crt"))
reranker = LocalReranker("/home/data/bge-reranker-large", dev_id=0)
dataset_path = "path to data_output" # Output path of the fine-tuning synthesis dataset

document_path = "path to document dir" # Path of the user-provided original document

loader_mng = LoaderMng()

loader_mng.register_loader(loader_class=TextLoader, file_types=[".txt", ".md"])
loader_mng.register_loader(loader_class=DocxLoader, file_types=[".docx"])

# Register the document splitter provided by LangChain.
loader_mng.register_splitter(splitter_class=RecursiveCharacterTextSplitter,
                             file_types=[".docx", ".txt", ".md"],
                             splitter_params={"chunk_size": 750,
                                              "chunk_overlap": 150,
                                              "keep_separator": False
                                              }
                             )

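# Create the generator. encrypt_fn and decrypt_fn are omitted here.
train_data_generator = TrainDataGenerator(llm, dataset_path, reranker)

# Load and split the original documents.
split_doc_list = train_data_generator.generate_origin_document(document_path=document_path, loader_mng=loader_mng)
# Synthesize and filter the fine-tuning data with the default configuration.
config = DataProcessConfig()
train_data_generator.generate_train_data(split_doc_list, config)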