Class Introduction
Function
The query is first converted into a vector. A nearest-neighbor search then returns the IDs of the top k most similar vectors in the vector database, and those IDs are used to fetch the corresponding top k documents from the relational database. This class inherits from langchain_core.retrievers.BaseRetriever, and retrieval is triggered by calling the invoke method of the base class. The query must be a string of at most 1 million characters.
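The lookup flow described above can be sketched with in-memory stand-ins. This is a minimal illustration, not the mx_rag implementation: `toy_embed`, `cosine`, `DOCS`, `VECTORS`, and `retrieve` are hypothetical names, and the brute-force cosine ranking only stands in for whatever index the vector database actually uses.

```python
import math
from typing import Dict, List, Tuple


def toy_embed(texts: List[str]) -> List[List[float]]:
    """Hypothetical embedding: counts of the letters a, b, c in each text."""
    return [[float(t.count(ch)) for ch in "abc"] for t in texts]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(x * x for x in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


# Stand-in for the relational database: vector ID -> document text.
DOCS: Dict[int, str] = {1: "aaa", 2: "bbb", 3: "abc"}
# Stand-in for the vector database: vector ID -> embedding.
VECTORS: Dict[int, List[float]] = {i: toy_embed([t])[0] for i, t in DOCS.items()}


def retrieve(query: str, k: int = 1) -> List[Tuple[int, str]]:
    # 1. Vectorize the query.
    query_vec = toy_embed([query])[0]
    # 2. Rank vector IDs by similarity and keep the top k.
    ranked = sorted(VECTORS, key=lambda i: cosine(query_vec, VECTORS[i]), reverse=True)
    top_ids = ranked[:k]
    # 3. Fetch the matching documents by ID.
    return [(i, DOCS[i]) for i in top_ids]


print(retrieve("aab", k=2))
```

The point of the sketch is the separation of concerns: the vector side only returns IDs, and the document text is always resolved through the second store.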
Prototype
from mx_rag.retrievers import Retriever
# All parameters must be passed as keyword arguments.
Retriever(vector_store, document_store, embed_func, k, score_threshold)
Dependency

Parameters
All parameters must be passed as keyword arguments.
| Parameter | Data Type | Required/Optional | Description |
|---|---|---|---|
| vector_store | VectorStore | Required | Vector database instance. For details, see VectorStore. |
| document_store | Docstore | Required | Relational database instance. For details, see Docstore. |
| embed_func | Callable[[List[str]], Union[List[List[float]], List[Dict[int, float]]]] | Required | Embedding callback function. |
| k | Integer | Optional | Number of top-ranked entries to retrieve. The value range is [1, 10000], and the default value is 1. |
| score_threshold | Float | Optional | Retrieval score threshold. The default value is None, which disables threshold filtering. To use it, set a value in the range [0, 1]. A larger threshold means stricter matching, and a smaller threshold means looser matching. |
| filter_dict | Dict | Optional | Dictionary of retrieval criteria. Currently, filtering is supported only by document ID. The document IDs to filter on are passed as a list whose length cannot exceed 1000 × 1000. The default value is {}. For example, to filter the documents whose IDs are 1, 2, and 4, pass {"document_id": [1, 2, 4]}. |
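The documented ranges can be checked before a retriever is constructed. The following is a minimal sketch under stated assumptions: `check_retriever_args` is a hypothetical helper, not part of mx_rag, and the SDK may perform its own validation with different error types or messages.

```python
from typing import Any, Dict, Optional

MAX_K = 10000                  # upper bound of k from the table
MAX_ID_LIST_LEN = 1000 * 1000  # upper bound on the document ID list length


def check_retriever_args(k: int = 1,
                         score_threshold: Optional[float] = None,
                         filter_dict: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Hypothetical pre-check mirroring the documented parameter ranges."""
    if isinstance(k, bool) or not isinstance(k, int) or not 1 <= k <= MAX_K:
        raise ValueError(f"k must be an integer in [1, {MAX_K}]")
    if score_threshold is not None and not 0 <= score_threshold <= 1:
        raise ValueError("score_threshold must be None or within [0, 1]")
    filter_dict = {} if filter_dict is None else filter_dict
    unknown_keys = set(filter_dict) - {"document_id"}
    if unknown_keys:
        raise ValueError(f"unsupported filter keys: {sorted(unknown_keys)}")
    ids = filter_dict.get("document_id", [])
    if not isinstance(ids, list) or len(ids) > MAX_ID_LIST_LEN:
        raise ValueError(f"document_id must be a list of at most {MAX_ID_LIST_LEN} IDs")
    return {"k": k, "score_threshold": score_threshold, "filter_dict": filter_dict}


print(check_retriever_args(k=3, score_threshold=0.2,
                           filter_dict={"document_id": [1, 2, 4]}))
```

Failing early with a clear message is usually easier to debug than an error raised deep inside a retrieval call.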
Example
from paddle.base import libpaddle
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from mx_rag.embedding.local import TextEmbedding
from mx_rag.storage.document_store import SQLiteDocstore
from mx_rag.storage.vectorstore import MindFAISS
from mx_rag.document import LoaderMng
from mx_rag.knowledge.knowledge import KnowledgeStore
from mx_rag.knowledge.handler import upload_files
from mx_rag.knowledge import KnowledgeDB
from mx_rag.retrievers import Retriever
# Step 1 Register document handling tools before building a knowledge base offline.
loader_mng = LoaderMng()
# Load the document loader provided by RAG SDK or LangChain.
loader_mng.register_loader(loader_class=TextLoader, file_types=[".txt"])
# Load the document splitter provided by LangChain.
loader_mng.register_splitter(splitter_class=RecursiveCharacterTextSplitter,
                             file_types=[".txt"],
                             splitter_params={"chunk_size": 750,
                                              "chunk_overlap": 150,
                                              "keep_separator": False
                                              })
# Initialize the embedding model.
emb = TextEmbedding(model_path="/path/to/acge_text_embedding", dev_id=0)
# Initialize the vector database.
vector_store = MindFAISS(x_dim=1024,
                         devs=[0],
                         load_local_index="./faiss.index",
                         auto_save=True
                         )
# Initialize the relational database for document chunks.
chunk_store = SQLiteDocstore(db_path="./sql.db")
# Initialize the relational database for knowledge management.
knowledge_store = KnowledgeStore(db_path="./sql.db")
# Add a knowledge base and its administrator.
knowledge_store.add_knowledge(knowledge_name="test", user_id='Default', role='admin')
# Initialize knowledge base management.
knowledge_db = KnowledgeDB(knowledge_store=knowledge_store,
                           chunk_store=chunk_store,
                           vector_store=vector_store,
                           knowledge_name="test",
                           user_id='Default',
                           white_paths=["/home"]
                           )
# Build an offline knowledge base and upload the domain-specific knowledge file gaokao.txt.
upload_files(knowledge=knowledge_db,
             files=["/home/data/gaokao.txt"],
             loader_mng=loader_mng,
             embed_func=emb.embed_documents,
             force=True
             )
# Step 2 Initialize the retriever.
text_retriever = Retriever(vector_store=vector_store,
                           document_store=chunk_store,
                           embed_func=emb.embed_documents,
                           k=1,
                           score_threshold=0.2
                           )
res = text_retriever.invoke("Describe the requirements of the composition test of the 2024 National College Entrance Examination.")
print(res)
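Because the class inherits from BaseRetriever, invoke is expected to return a list of document objects that expose page_content and metadata attributes. The sketch below shows one way to print such a list more readably than print(res). It is illustrative only: `FakeDocument` and `format_results` are hypothetical names, the sample text and metadata are placeholders rather than real retrieval output, and the metadata keys that mx_rag actually populates are not specified here.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeDocument:
    """Stand-in exposing the same two attributes as a LangChain Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def format_results(docs: List[Any], max_chars: int = 40) -> List[str]:
    """Return one ranked, truncated line per retrieved document."""
    lines = []
    for rank, doc in enumerate(docs, start=1):
        text = doc.page_content.replace("\n", " ")
        snippet = text if len(text) <= max_chars else text[:max_chars] + "..."
        lines.append(f"{rank}. {snippet} | metadata={doc.metadata}")
    return lines


# Placeholder data; in the example above this would be `res`.
sample = [FakeDocument("placeholder chunk text", {"source": "example.txt"})]
for line in format_results(sample):
    print(line)
```

An empty list is a valid input here, which matters if a score threshold filters out every candidate.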