Class Introduction

Function

KnowledgeDB is the entry class for knowledge base management. It provides document management functions: adding documents, deleting documents, and obtaining all documents in a knowledge base.

Prototype

from mx_rag.knowledge import KnowledgeDB
KnowledgeDB(knowledge_store, chunk_store, vector_store, knowledge_name, white_paths, max_file_count, user_id, lock)

Parameters

Parameter

Data Type

Required/Optional

Description

knowledge_store

KnowledgeStore

Required

Saves the names of uploaded documents for knowledge base management. For details about its data types, see KnowledgeStore.

chunk_store

Docstore

Required

Stores the document chunk list. For details about its data types, see Docstore.

vector_store

VectorStore

Required

Vector database storage object. For details about its data types, see VectorStore.

knowledge_name

String

Required

Knowledge base name, which can be customized based on the knowledge base theme. The length range is [1, 1024].

white_paths

List[str]

Required

Trustlist of paths from which documents can be uploaded. The trustlist can contain 1 to 1024 paths, and each path can be 1 to 1024 characters long. A path cannot be a soft link and cannot contain two consecutive dots (..).

A file can be uploaded only when its file path is in the trustlist.

max_file_count

Integer

Optional

Maximum number of documents that can be uploaded. The value range is [1, 8000]. You are advised not to set this parameter to a large value. The default value is 1000.

user_id

String

Required

User ID, which is used to distinguish different knowledge bases and must comply with the regular expression ^[a-zA-Z0-9_-]{6,64}$.

lock

multiprocessing.synchronize.Lock or _thread.LockType

Optional

Lock that protects concurrent access. If this API is called from multiple processes or threads, a lock must be passed in. The default value is None.

The values are as follows:

  • None: No lock is used. In this case, this API does not support concurrency.
  • multiprocessing.Lock(): process lock. In this case, this API supports multi-process calling.
  • threading.Lock(): thread lock. In this case, this API supports multi-thread calling.
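
The three lock options can be sketched with a small helper that maps a concurrency mode to the value passed as lock. The helper name make_lock and its mode strings are hypothetical and are not part of mx_rag; only threading.Lock() and multiprocessing.Lock() come from the Python standard library.

```python
import multiprocessing
import threading
from typing import Optional


def make_lock(concurrency: Optional[str] = None):
    """Hypothetical helper: choose the value for the `lock` parameter.

    None      -> no lock (no concurrency support)
    "thread"  -> threading.Lock() for multi-thread calling
    "process" -> multiprocessing.Lock() for multi-process calling
    """
    if concurrency is None:
        return None
    if concurrency == "thread":
        return threading.Lock()
    if concurrency == "process":
        # Create the process lock in the parent process and hand the same
        # object to every worker so that they all share it.
        return multiprocessing.Lock()
    raise ValueError(f"unknown concurrency mode: {concurrency!r}")


# Assumed usage, shown as a comment because it needs mx_rag and real stores:
# knowledge_db = KnowledgeDB(..., lock=make_lock("thread"))
```

A lock only helps if every caller shares the same lock object, so it is usually created once and reused rather than created per call.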

chunk_store and vector_store must be kept consistent with each other. For example, the relational database file and the vector database file need to be generated at the same time.
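
The documented parameter constraints can be checked before constructing KnowledgeDB. The following is a minimal sketch, assuming the constraints are exactly those listed in the parameter table. The functions find_violations and is_path_trusted are hypothetical helpers, not part of mx_rag, and the way the library itself matches a file against the trustlist may differ.

```python
import re
from pathlib import Path
from typing import List

# Regular expression for user_id, taken from the parameter table.
USER_ID_PATTERN = re.compile(r"[a-zA-Z0-9_-]{6,64}")


def find_violations(knowledge_name: str, white_paths: List[str],
                    user_id: str, max_file_count: int = 1000) -> List[str]:
    """Hypothetical pre-check: return one message per violated constraint."""
    problems = []
    if not 1 <= len(knowledge_name) <= 1024:
        problems.append("knowledge_name length must be in [1, 1024]")
    if not 1 <= len(white_paths) <= 1024:
        problems.append("white_paths must contain 1 to 1024 paths")
    for path in white_paths:
        if not 1 <= len(path) <= 1024:
            problems.append(f"path length must be in [1, 1024]: {path!r}")
        if ".." in path:
            problems.append(f"path must not contain '..': {path!r}")
        if Path(path).is_symlink():
            problems.append(f"path must not be a soft link: {path!r}")
    if not 1 <= max_file_count <= 8000:
        problems.append("max_file_count must be in [1, 8000]")
    if USER_ID_PATTERN.fullmatch(user_id) is None:
        problems.append("user_id must match ^[a-zA-Z0-9_-]{6,64}$")
    return problems


def is_path_trusted(file_path: str, white_paths: List[str]) -> bool:
    """Assumed interpretation: a file is trusted when its resolved path
    equals, or lies under, one of the trustlist paths."""
    resolved = Path(file_path).resolve()
    for trusted in white_paths:
        root = Path(trusted).resolve()
        if root == resolved or root in resolved.parents:
            return True
    return False
```

Returning a list of messages instead of raising on the first problem lets a caller report every invalid argument at once.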

Example

import pathlib
from paddle.base import libpaddle
from mx_rag.embedding.local import TextEmbedding
from mx_rag.knowledge import KnowledgeStore, KnowledgeDB
from mx_rag.storage.document_store import SQLiteDocstore
from mx_rag.storage.vectorstore import MindFAISS
# Set the NPU used for vector retrieval.
dev = 0
# Load the embedding model.
embed_func = TextEmbedding("/path/to/model", dev_id=dev)
# Initialize the vector database.
vector_store = MindFAISS(x_dim=1024, devs=[dev],
                         load_local_index="./faiss.index", auto_save=True)
# Initialize the relational database for document chunks.
chunk_store = SQLiteDocstore(db_path="./sql.db")
# Initialize the relational database for knowledge management.
knowledge_store = KnowledgeStore(db_path="./sql.db")
# Add a knowledge base and its administrator.
knowledge_store.add_knowledge(knowledge_name="test", user_id='Default', role='admin')
# Initialize knowledge management.
knowledge_db = KnowledgeDB(knowledge_store=knowledge_store, chunk_store=chunk_store, vector_store=vector_store,
                           knowledge_name="test", user_id="Default", white_paths=["/home/"])
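# The file path must be in the white_paths trustlist.
file_path = pathlib.Path("./gaokao.txt")
# Add the document together with its text chunks and their metadata.
knowledge_db.add_file(file=file_path,
                      texts=["test1", "test2"],
                      embed_func={"dense": embed_func.embed_documents},
                      metadatas=[{"source": "./gaokao.txt"}, {"source": "./gaokao.txt"}])
# Obtain the names of all documents in the knowledge base.
documents = [document.document_name for document in knowledge_db.get_all_documents()]
print(documents)
# Check whether the document exists in the knowledge base.
print(knowledge_db.check_document_exist(doc_name=file_path.name))

# Delete the document by name, then delete everything in the knowledge base.
knowledge_db.delete_file(doc_name=file_path.name)
knowledge_db.delete_all()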