Class Introduction

Function

This class vectorizes the question to be searched, uses a nearest-neighbor algorithm to obtain the IDs of the top k most similar vectors in the vector database, and then uses those vector IDs to fetch the corresponding top k documents from the relational database. The class inherits from langchain_core.retrievers.BaseRetriever, and retrieval is performed by calling the invoke method of the base class. The question must be a string of no more than 1 million characters.
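The flow described above can be sketched with plain Python containers. This is only an illustration, not the mx_rag implementation: toy_embed, cosine, and retrieve are hypothetical names, a brute-force cosine similarity stands in for the vector database search, and a dict stands in for the relational database.

```python
import math
from typing import Callable, Dict, List, Optional


def toy_embed(texts: List[str]) -> List[List[float]]:
    """Hypothetical stand-in for embed_func: letter-frequency vectors."""
    vectors = []
    for text in texts:
        counts = [0.0] * 26
        for ch in text.lower():
            if "a" <= ch <= "z":
                counts[ord(ch) - ord("a")] += 1.0
        vectors.append(counts)
    return vectors


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(question: str,
             vectors: Dict[int, List[float]],
             documents: Dict[int, str],
             embed_func: Callable[[List[str]], List[List[float]]],
             k: int = 1,
             score_threshold: Optional[float] = None) -> List[str]:
    """Embed the question, rank stored vectors, then look up documents by ID."""
    if len(question) > 1_000_000:
        raise ValueError("question must not exceed 1 million characters")
    query_vec = embed_func([question])[0]
    # Brute-force search: score every stored vector and keep the k best IDs.
    scored = sorted(((cosine(query_vec, vec), vid) for vid, vec in vectors.items()),
                    reverse=True)
    top = scored[:k]
    # Assumption: scores are similarities, so a higher threshold is stricter.
    if score_threshold is not None:
        top = [(score, vid) for score, vid in top if score >= score_threshold]
    # The vector IDs act as keys into the document store.
    return [documents[vid] for _, vid in top]


chunks = {1: "vector databases store embeddings", 2: "bananas are yellow fruit"}
index = dict(zip(chunks, toy_embed(list(chunks.values()))))
results = retrieve("how do vector databases store embeddings",
                   index, chunks, toy_embed, k=1)
print(results)
```

The two-store split mirrors the description: the vector index only returns IDs, and the document text lives in a separate store keyed by those IDs.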

Prototype

from mx_rag.retrievers import Retriever
# All parameters must be passed as keyword arguments.
Retriever(vector_store, document_store, embed_func, k, score_threshold, filter_dict)

Dependency

Parameters

All parameters must be passed as keyword arguments.

| Parameter | Data Type | Required/Optional | Description |
|---|---|---|---|
| vector_store | VectorStore | Required | Vector database instance. For details, see VectorStore. |
| document_store | Docstore | Required | Relational database instance. For details, see Docstore. |
| embed_func | Callable[[List[str]], Union[List[List[float]], List[Dict[int, float]]]] | Required | Embedding callback function. |
| k | Integer | Optional | Number of top entries to retrieve. The value range is [1, 10000], and the default value is 1. |
| score_threshold | Float | Optional | Retrieval score threshold. The default value is None, indicating that threshold filtering is disabled. To use this parameter, set a value within the range [0, 1]. A larger threshold means stricter matching, and a smaller threshold means less strict matching. |
| filter_dict | Dict | Optional | Dictionary of retrieval criteria. Currently, filtering is supported only by document ID. The document IDs to filter on are passed as a list, whose length cannot exceed 1000 × 1000. The default value is {}. For example, to filter the documents whose IDs are 1, 2, and 4, pass {"document_id": [1, 2, 4]}. |
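The value constraints listed above can be checked before a retriever is constructed. The following is a minimal sketch under stated assumptions: check_retriever_params is a hypothetical helper, not part of mx_rag, and the library may validate its arguments differently or raise a different exception type.

```python
from typing import Dict, List, Optional

MAX_K = 10000                  # upper bound of k from the parameter table
MAX_FILTER_IDS = 1000 * 1000   # maximum length of the document ID list


def check_retriever_params(k: int = 1,
                           score_threshold: Optional[float] = None,
                           filter_dict: Optional[Dict[str, List[int]]] = None) -> None:
    """Hypothetical pre-check mirroring the documented ranges."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(k, bool) or not isinstance(k, int) or not 1 <= k <= MAX_K:
        raise ValueError("k must be an integer in [1, 10000]")
    # None disables threshold filtering; otherwise the value must lie in [0, 1].
    if score_threshold is not None and not 0 <= score_threshold <= 1:
        raise ValueError("score_threshold must be None or within [0, 1]")
    filter_dict = filter_dict or {}
    # Only filtering by document ID is documented as supported.
    unsupported = set(filter_dict) - {"document_id"}
    if unsupported:
        raise ValueError(f"unsupported filter keys: {sorted(unsupported)}")
    ids = filter_dict.get("document_id", [])
    if not isinstance(ids, list) or len(ids) > MAX_FILTER_IDS:
        raise ValueError("document_id must be a list of at most 1000 x 1000 IDs")


# Passes silently: all values are inside the documented ranges.
check_retriever_params(k=1, score_threshold=0.2,
                       filter_dict={"document_id": [1, 2, 4]})
```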

Example

from paddle.base import libpaddle
from langchain_community.document_loaders import TextLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from mx_rag.embedding.local import TextEmbedding
from mx_rag.storage.document_store import SQLiteDocstore
from mx_rag.storage.vectorstore import MindFAISS
from mx_rag.document import LoaderMng
from mx_rag.knowledge.knowledge import KnowledgeStore
from mx_rag.knowledge.handler import upload_files
from mx_rag.knowledge import KnowledgeDB
from mx_rag.retrievers import Retriever
# Step 1: Register document handling tools before building the knowledge base offline.
loader_mng = LoaderMng()
# Register a document loader provided by RAG SDK or LangChain.
loader_mng.register_loader(loader_class=TextLoader, file_types=[".txt"])
# Register a document splitter provided by LangChain.
loader_mng.register_splitter(splitter_class=RecursiveCharacterTextSplitter,
                             file_types=[".txt"],
                             splitter_params={"chunk_size": 750,
                                              "chunk_overlap": 150,
                                              "keep_separator": False
                                              })
# Initialize the embedding model.
emb = TextEmbedding(model_path="/path/to/acge_text_embedding", dev_id=0)
# Initialize the vector database.
vector_store = MindFAISS(x_dim=1024,
                         devs=[0],
                         load_local_index="./faiss.index",
                         auto_save=True
                         )
# Initialize the relational database for document chunks.
chunk_store = SQLiteDocstore(db_path="./sql.db")
# Initialize the relational database for knowledge management.
knowledge_store = KnowledgeStore(db_path="./sql.db")
# Add a knowledge base and its administrator.
knowledge_store.add_knowledge(knowledge_name="test", user_id='Default', role='admin')
# Initialize knowledge base management.
knowledge_db = KnowledgeDB(knowledge_store=knowledge_store,
                           chunk_store=chunk_store,
                           vector_store=vector_store,
                           knowledge_name="test",
                           user_id='Default',
                           white_paths=["/home"]
                           )
# Build an offline knowledge base and upload the domain-specific knowledge file gaokao.txt.
upload_files(knowledge=knowledge_db,
             files=["/home/data/gaokao.txt"],
             loader_mng=loader_mng,
             embed_func=emb.embed_documents,
             force=True
             )
# Step 2: Initialize the retriever.
text_retriever = Retriever(vector_store=vector_store,
                           document_store=chunk_store,
                           embed_func=emb.embed_documents,
                           k=1,
                           score_threshold=0.2
                           )
res = text_retriever.invoke("Describe the requirements of the composition test of the 2024 National College Entrance Examination.")
print(res)