upload_files

Function

Uploads documents and saves them to a knowledge base. Only the knowledge base administrator can perform this operation. If a document already exists in the knowledge base, you can choose to forcibly overwrite it by setting force. Document data is stored in plaintext, so be aware of the security risks. If the number of documents to be uploaded exceeds the value of max_file_count, the upload fails. If a document fails to be added, an exception is thrown.
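The limits on the input can be checked on the caller's side before an upload is attempted. The following is a minimal sketch, not the library's own validation: check_upload_paths and the two constants are hypothetical names, and the limits (1,000 files by default, path length in [1, 1024], no soft links, no "..") are taken from the description of the files parameter.

```python
import os
from typing import List

# Assumed to mirror the documented default limit on the number of files.
MAX_FILE_COUNT = 1000
# Documented upper bound of the path length range [1, 1024].
MAX_PATH_LENGTH = 1024


def check_upload_paths(files: List[str], max_file_count: int = MAX_FILE_COUNT) -> List[str]:
    """Return a list of problems found; an empty list means the input looks valid."""
    problems = []
    if len(files) > max_file_count:
        problems.append(f"too many files: {len(files)} > {max_file_count}")
    for path in files:
        if not 1 <= len(path) <= MAX_PATH_LENGTH:
            problems.append(f"path length outside [1, {MAX_PATH_LENGTH}]: {path[:32]!r}")
        elif ".." in path:
            problems.append(f"path contains '..': {path!r}")
        elif os.path.islink(path):
            problems.append(f"path is a soft link: {path!r}")
    return problems
```

Running such a check first lets a caller report every invalid path at once instead of stopping at the first rejected upload.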

Prototype

from mx_rag.knowledge import upload_files
def upload_files(knowledge, files, loader_mng, embed_func, force=False)

Internal Workflow

Parameters

| Parameter | Data Type | Required/Optional | Description |
| --- | --- | --- | --- |
| knowledge | KnowledgeDB | Required | Knowledge base object. For details about its data type, see KnowledgeDB. |
| files | List[str] | Required | List of document paths. The length of each path must be in the range [1, 1024]. By default, the number of files cannot exceed 1,000. A document path cannot be a soft link and cannot contain two consecutive dots (..). |
| loader_mng | LoaderMng | Required | Management class object that provides the document parsing function. For details about its data type, see LoaderMng. |
| embed_func | Callable[[List[str]], List[List[float]]] or dict | Required | Embedding function that converts document content into vectors. If a callable is passed directly, it is treated as the dense embedding function, which is equivalent to {'dense': Callable, 'sparse': None}. If a dict is passed, its format is {'dense': x, 'sparse': y}, where x and y are the callback functions for dense and sparse vectors, respectively. x and y cannot both be None. Dense and sparse callbacks can be supplied at the same time. |
| force | bool | Optional | Whether to forcibly overwrite old data. If False, an exception is thrown when a document that already exists is uploaded again. The default value is False. |
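The two accepted forms of embed_func can be illustrated with a small normalization helper. This is only a sketch of the rules stated above, not the code mx_rag actually runs: normalize_embed_func is a hypothetical name, and the choice of exception types is an assumption.

```python
from typing import Callable, Dict, Optional


def normalize_embed_func(embed_func) -> Dict[str, Optional[Callable]]:
    """Convert either accepted form into {'dense': ..., 'sparse': ...}."""
    # A bare callable is treated as the dense embedding function.
    if callable(embed_func):
        return {"dense": embed_func, "sparse": None}
    if not isinstance(embed_func, dict):
        raise TypeError("embed_func must be a callable or a dict")

    dense = embed_func.get("dense")
    sparse = embed_func.get("sparse")
    # At least one of the two callbacks must be provided.
    if dense is None and sparse is None:
        raise ValueError("'dense' and 'sparse' cannot both be None")
    for name, func in (("dense", dense), ("sparse", sparse)):
        if func is not None and not callable(func):
            raise TypeError(f"embed_func[{name!r}] must be callable or None")
    return {"dense": dense, "sparse": sparse}
```

Under these rules, passing emb.embed_documents alone and passing {'dense': emb.embed_documents, 'sparse': None} describe the same configuration.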

Return Value

| Data Type | Description |
| --- | --- |
| List[str] | List of files that failed to be added to the knowledge base. |
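A caller may want to separate the files that were stored from those reported in the return value. The sketch below assumes that failures are reported through the returned list and that its entries are the same path strings that were passed in; summarize_upload is a hypothetical helper, not part of mx_rag.

```python
from typing import Dict, List


def summarize_upload(requested: List[str], failed: List[str]) -> Dict[str, List[str]]:
    """Split the requested paths into those that succeeded and those that failed."""
    failed_set = set(failed)
    return {
        "succeeded": [path for path in requested if path not in failed_set],
        "failed": [path for path in requested if path in failed_set],
    }


# Intended usage (requires a configured knowledge base, so it is left as a comment):
#   files = ["/path/data/a.docx", "/path/data/b.pdf"]
#   failed = upload_files(knowledge_db, files, loader_mng, emb.embed_documents, force=False)
#   report = summarize_upload(files, failed)
```

The failed entries can then be logged or retried, for example with force set to True if the failure was caused by a duplicate.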

Example

from mx_rag.embedding.local import TextEmbedding
from mx_rag.knowledge import (KnowledgeStore, KnowledgeDB, FilesLoadInfo,
                              upload_files, upload_dir, delete_files)
from mx_rag.document import LoaderMng
from mx_rag.storage.document_store import SQLiteDocstore
from mx_rag.storage.vectorstore import MindFAISS
from langchain.text_splitter import RecursiveCharacterTextSplitter
from mx_rag.document.loader import DocxLoader, PdfLoader, ExcelLoader

loader_mng = LoaderMng()
loader_mng.register_loader(DocxLoader, [".docx"])
loader_mng.register_loader(PdfLoader, [".pdf"])
loader_mng.register_loader(ExcelLoader, [".xlsx"])
# loader_mng.register_loader(ImageLoader, [".png"])
loader_mng.register_splitter(RecursiveCharacterTextSplitter,
                             [".docx", ".pdf", ".xlsx"])
# Set the NPU used for vector retrieval.
dev = 0
# Load the embedding model.
emb = TextEmbedding("/path/to/model", dev_id=dev)
# Initialize the vector database.
vector_store = MindFAISS(x_dim=1024, devs=[dev],
                         load_local_index="/path/to/index", auto_save=True)
# Initialize the relational database for document chunks.
chunk_store = SQLiteDocstore(db_path="./sql.db")
# Initialize the relational database for knowledge management.
knowledge_store = KnowledgeStore(db_path="./sql.db")
# Add a knowledge base and its administrator.
knowledge_store.add_knowledge(knowledge_name="test", user_id='Default', role='admin')
# Initialize knowledge management.
knowledge_db = KnowledgeDB(knowledge_store=knowledge_store, chunk_store=chunk_store, vector_store=vector_store,
                           knowledge_name="test", user_id='Default', white_paths=["/home/"])
# Upload domain-specific knowledge documents by calling upload_files.
# The return value lists the files that failed to be added.
failed_files = upload_files(knowledge=knowledge_db, files=["/path/data/test.docx"], loader_mng=loader_mng,
                            embed_func=emb.embed_documents, force=True)
# Upload the directory of domain-specific documents.
# Call upload_dir.
params = FilesLoadInfo(knowledge=knowledge_db, dir_path="/path/data/files", loader_mng=loader_mng,
                       embed_func=emb.embed_documents, force=True, load_image=False)
upload_dir(params=params)
# Call delete_files.
delete_files(knowledge_db, ["test.docx"])