upload_dir

Function

Uploads the documents in a specified directory to a knowledge base. Only the knowledge base administrator can perform this operation.

  • If a document already exists in the knowledge base, you can choose to forcibly overwrite it.
  • Only the files directly in the specified directory are traversed. Files in subdirectories are not searched recursively.
  • Files in the directory whose types are not registered with a loader are skipped.
  • If the number of uploaded files exceeds max_file_count of the knowledge base, the program exits.
  • If a file fails to be added, an exception is thrown.

Document data is stored in plaintext. Pay attention to security risks.
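The traversal rules above (no recursion, unregistered types skipped) can be previewed before calling upload_dir. The following is a minimal sketch using only the Python standard library. preview_upload_candidates is a hypothetical helper, not part of mx_rag, and the case-insensitive extension matching is an assumption rather than documented behavior.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def preview_upload_candidates(dir_path: str,
                              registered_exts: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split the top-level files of dir_path into (candidates, skipped).

    Hypothetical helper, not part of mx_rag. It only mirrors the documented
    rules: subdirectories are not searched and unregistered types are skipped.
    """
    # Assumption: extensions are compared case-insensitively.
    exts = {ext.lower() for ext in registered_exts}
    candidates, skipped = [], []
    # iterdir() yields direct children only, so nothing is searched recursively.
    for entry in sorted(Path(dir_path).iterdir()):
        if not entry.is_file():
            continue  # subdirectories and other non-file entries are ignored
        if entry.suffix.lower() in exts:
            candidates.append(entry.name)
        else:
            skipped.append(entry.name)
    return candidates, skipped
```

Comparing the number of candidates with max_file_count of the knowledge base before uploading may help avoid the exit described above.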

Prototype

from mx_rag.knowledge import upload_dir, FilesLoadInfo
FilesLoadInfo(knowledge, dir_path, loader_mng, embed_func, force, load_image)
def upload_dir(params: FilesLoadInfo):

Parameters

| Parameter | Data Type | Required/Optional | Description |
| --- | --- | --- | --- |
| params | FilesLoadInfo | Required | Parameter object of the upload directory. For details, see Table 1. |

Table 1 FilesLoadInfo parameters

| Parameter | Data Type | Required/Optional | Description |
| --- | --- | --- | --- |
| knowledge | KnowledgeDB | Required | Knowledge base object. For details about its data type, see KnowledgeDB. |
| dir_path | str | Required | Path of the directory that stores the knowledge documents. The path length must be in the range [1, 1024]. The path cannot be a soft link and cannot contain two consecutive dots (..). By default, the directory cannot contain more than 8000 files. |
| loader_mng | LoaderMng | Required | Management class object that provides the document parsing function. For details about its data type, see LoaderMng. |
| embed_func | Callable[[List[str]], List[List[float]]] | Required | Embedding function that converts file content into vectors. |
| force | bool | Optional | Whether to forcibly overwrite old data. If this parameter is set to False, an exception is thrown when a document is uploaded again. The default value is False. |
| load_image | bool | Optional | Whether to upload image files instead of document files. The default value is False. |

The supported file types depend on load_image:

  • If load_image is set to False, only document types such as .docx, .txt, and .md are supported. The supported types are the intersection of the types supported by the loader and splitter methods in loader_mng.
  • If load_image is set to True, only image types are supported. The supported types are the intersection of the types supported by the loader method in loader_mng and the set [".jpg", ".png"].

embed_func must support the type being parsed: the corresponding document types when documents are parsed, and the corresponding image types when images are parsed. Otherwise, an error occurs.
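The dir_path constraints and the load_image type rules can be checked before building FilesLoadInfo. The following is a sketch under stated assumptions: check_dir_path and effective_types are hypothetical helpers, not mx_rag functions, the soft link check only looks at the final path component, and the library's own validation may differ.

```python
import os
from typing import Iterable, List, Set

MAX_PATH_LEN = 1024        # documented upper bound for the dir_path length
MAX_FILES_DEFAULT = 8000   # documented default limit on files in the directory
IMAGE_TYPES = {".jpg", ".png"}  # image types listed for load_image=True


def check_dir_path(dir_path: str, max_files: int = MAX_FILES_DEFAULT) -> List[str]:
    """Return the problems found; an empty list means all checks passed."""
    problems = []
    if not 1 <= len(dir_path) <= MAX_PATH_LEN:
        problems.append("path length must be in [1, 1024]")
    if ".." in dir_path:
        problems.append("path must not contain '..'")
    # Assumption: only the final path component is tested for being a link.
    if os.path.islink(dir_path):
        problems.append("path must not be a soft link")
    if os.path.isdir(dir_path):
        # Count top-level files only, matching the non-recursive traversal.
        file_count = sum(1 for entry in os.scandir(dir_path) if entry.is_file())
        if file_count > max_files:
            problems.append(f"directory holds {file_count} files, limit is {max_files}")
    else:
        problems.append("path is not an existing directory")
    return problems


def effective_types(loader_exts: Iterable[str], splitter_exts: Iterable[str],
                    load_image: bool) -> Set[str]:
    """Compute the supported types as the intersections described above."""
    loaders = set(loader_exts)
    if load_image:
        return loaders & IMAGE_TYPES
    return loaders & set(splitter_exts)
```

For example, with loaders registered for .docx, .pdf, and .png and splitters registered for .docx and .pdf, effective_types returns the two document types when load_image is False and only .png when it is True.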

Return Value

| Data Type | Description |
| --- | --- |
| List[str] | List of files that fail to be added to a knowledge base, including files of unsupported document types and files that fail to be uploaded. |
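Because unsupported types and upload failures are returned in the same list, grouping the entries by extension can make the result easier to review. The following is a minimal sketch. group_failed_by_extension is a hypothetical helper, and whether the list holds file names or full paths is not stated here, so the code accepts either.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, List


def group_failed_by_extension(failed: List[str]) -> Dict[str, List[str]]:
    """Group the entries returned by upload_dir by lowercase file extension."""
    grouped = defaultdict(list)
    for name in failed:
        # Entries without an extension are collected under "<none>".
        key = Path(name).suffix.lower() or "<none>"
        grouped[key].append(name)
    return dict(grouped)


# Intended usage (requires a configured knowledge base):
#   failed = upload_dir(params)
#   for ext, names in group_failed_by_extension(failed).items():
#       print(ext, len(names))
```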

Example

from langchain.text_splitter import RecursiveCharacterTextSplitter
from mx_rag.document import LoaderMng
from mx_rag.document.loader import DocxLoader, PdfLoader, ExcelLoader
from mx_rag.embedding.local import TextEmbedding
from mx_rag.knowledge import (KnowledgeStore, KnowledgeDB, FilesLoadInfo,
                              upload_files, upload_dir, delete_files)
from mx_rag.storage.document_store import SQLiteDocstore
from mx_rag.storage.vectorstore import MindFAISS

loader_mng = LoaderMng()
loader_mng.register_loader(DocxLoader, [".docx"])
loader_mng.register_loader(PdfLoader, [".pdf"])
loader_mng.register_loader(ExcelLoader, [".xlsx"])
# To upload images (load_image=True), import and register an image loader as well:
# loader_mng.register_loader(ImageLoader, [".png"])
loader_mng.register_splitter(RecursiveCharacterTextSplitter,
                             [".docx", ".pdf", ".xlsx"])
# Set the NPU used for vector retrieval.
dev = 0
# Load the embedding model.
emb = TextEmbedding("/path/to/model", dev_id=dev)
# Initialize the vector database.
vector_store = MindFAISS(x_dim=1024, devs=[dev], 
                         load_local_index="/path/to/index", auto_save=True)
# Initialize the relational database for document chunks.
chunk_store = SQLiteDocstore(db_path="./sql.db")
# Initialize the relational database for knowledge management.
knowledge_store = KnowledgeStore(db_path="./sql.db")
# Add a knowledge base and its administrator.
knowledge_store.add_knowledge(knowledge_name="test", user_id='Default', role='admin')
# Initialize knowledge management.
knowledge_db = KnowledgeDB(knowledge_store=knowledge_store, chunk_store=chunk_store, vector_store=vector_store,
                           knowledge_name="test", user_id='Default', white_paths=["/home/"])
# Upload domain-specific knowledge documents.
# Call upload_files.
upload_files(knowledge=knowledge_db, files=["/path/data/test.docx"], loader_mng=loader_mng,
             embed_func=emb.embed_documents, force=True)
# Upload the directory of domain-specific documents.
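# Call upload_dir. It returns the list of files that could not be added.
params = FilesLoadInfo(knowledge=knowledge_db, dir_path="/path/data/files", loader_mng=loader_mng,
                       embed_func=emb.embed_documents, force=True, load_image=False)
failed_files = upload_dir(params=params)
print("Files not added:", failed_files)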
# Call delete_files.
delete_files(knowledge_db, ["test.docx"])
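The example passes emb.embed_documents as embed_func and creates the vector store with x_dim=1024, so each returned vector is expected to have that length. The following sketch wraps any function with the documented signature Callable[[List[str]], List[List[float]]] and checks the shape of its output. checked_embed is a hypothetical helper, not part of mx_rag, and the assumption that the vector length must equal x_dim is inferred from the example rather than stated.

```python
from typing import Callable, List

EmbedFunc = Callable[[List[str]], List[List[float]]]


def checked_embed(embed_func: EmbedFunc, dim: int) -> EmbedFunc:
    """Wrap embed_func so that malformed output raises ValueError early."""

    def wrapper(texts: List[str]) -> List[List[float]]:
        vectors = embed_func(texts)
        # One vector is expected per input text.
        if len(vectors) != len(texts):
            raise ValueError(f"expected {len(texts)} vectors, got {len(vectors)}")
        # Every vector should match the dimension of the vector store.
        for vector in vectors:
            if len(vector) != dim:
                raise ValueError(f"expected dimension {dim}, got {len(vector)}")
        return vectors

    return wrapper


# Intended usage with the example above:
#   embed_func=checked_embed(emb.embed_documents, dim=1024)
```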