Overview

Embedding models are typically trained on general-purpose datasets, so their retrieval precision can be insufficient in specialized fields. To address this problem, a domain-specific fine-tuning method for embedding models is introduced. It consists of four parts: evaluation data synthesis, model evaluation, automated fine-tuning data synthesis, and model fine-tuning.

Synthetic evaluation data: An LLM generates representative QA pairs from user-provided reference texts. QA pairs containing domain-specific terminology are then manually curated. The resulting dataset is used to assess the precision of embedding models within specialized fields.
Model evaluation: Following the Sentence Transformers (SBERT) evaluation method, the synthesized and filtered evaluation dataset is used to benchmark the precision of embedding models, with a specific focus on the recall rate.
Automated fine-tuning data synthesis: An LLM automatically generates synthetic fine-tuning datasets from user-provided text sets. Multiple automatic filtering methods then select the data most relevant to the target domain.
Model fine-tuning: Following the SBERT fine-tuning method, the automatically synthesized and filtered dataset is used to fine-tune embedding models and output the fine-tuned models.
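
To make the recall-focused evaluation more concrete, the following is a minimal sketch of recall@k over QA pairs, assuming each question has exactly one relevant passage. It does not use the SBERT evaluator itself. The function names (cosine, recall_at_k) and the toy 2-D vectors are hypothetical and stand in for real model embeddings.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(query_vecs, corpus_vecs, relevant, k):
    """Fraction of queries whose relevant passage ranks in the top k."""
    if not query_vecs:
        return 0.0
    hits = 0
    for qid, qvec in query_vecs.items():
        # Rank all passage IDs by similarity to the query, best first.
        ranked = sorted(
            corpus_vecs,
            key=lambda pid: cosine(qvec, corpus_vecs[pid]),
            reverse=True,
        )
        if relevant[qid] in ranked[:k]:
            hits += 1
    return hits / len(query_vecs)


# Hypothetical toy data: in practice these vectors would come from the
# embedding model under evaluation, and the pairs from the curated QA set.
corpus_vecs = {"p1": [1.0, 0.0], "p2": [0.0, 1.0], "p3": [0.7, 0.7]}
query_vecs = {"q1": [0.9, 0.1], "q2": [0.6, 0.8]}
relevant = {"q1": "p1", "q2": "p2"}

recall_1 = recall_at_k(query_vecs, corpus_vecs, relevant, k=1)
recall_2 = recall_at_k(query_vecs, corpus_vecs, relevant, k=2)
print(recall_1, recall_2)  # 0.5 1.0
```

Computing the same metric on the same evaluation set before and after fine-tuning gives a direct view of whether retrieval in the target domain has improved.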

The figure below illustrates the process of embedding model fine-tuning.
