upload_dir
Function
Uploads the documents in a specified directory to a knowledge base. Only the knowledge base administrator can perform this operation. If a document already exists in the knowledge base, you can choose to forcibly overwrite it. Document data is stored in plaintext, so be aware of the security risks.
Only files directly in the specified directory are traversed; subdirectories are not searched recursively. Files whose types have no loader registered are skipped. If the number of files to upload exceeds the max_file_count value of the knowledge base, the program exits. If a file fails to be added, an exception is thrown.
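The traversal rules above can be illustrated with a small standard-library sketch. It is not the mx_rag implementation: list_uploadable_files, registered_exts, and the ValueError raised when the limit is exceeded are hypothetical names and choices made for illustration only.

```python
import os
from typing import List, Set


def list_uploadable_files(dir_path: str, registered_exts: Set[str],
                          max_file_count: int) -> List[str]:
    """Hypothetical helper that mirrors the documented traversal rules."""
    selected = []
    # os.scandir lists only the top level; subdirectories are never entered.
    with os.scandir(dir_path) as entries:
        for entry in sorted(entries, key=lambda e: e.name):
            if not entry.is_file():
                continue  # skip subdirectories and other non-file entries
            ext = os.path.splitext(entry.name)[1]
            if ext not in registered_exts:
                continue  # no loader registered for this file type
            selected.append(entry.path)
            if len(selected) > max_file_count:
                # Assumption: the real library "exits"; here we raise instead.
                raise ValueError("file count exceeds max_file_count")
    return selected
```

Calling list_uploadable_files("/path/data/files", {".docx", ".pdf"}, 1000) on such a directory would return only the top-level .docx and .pdf files, which is the set upload_dir is described as processing.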
Prototype
from mx_rag.knowledge import upload_dir, FilesLoadInfo
FilesLoadInfo(knowledge, dir_path, loader_mng, embed_func, force, load_image)
def upload_dir(params: FilesLoadInfo):
Parameters
| Parameter | Data Type | Required/Optional | Description |
|---|---|---|---|
| params | FilesLoadInfo | Required | Parameter object that describes the directory upload. For details, see Table 1. |

Table 1 FilesLoadInfo parameters

| Parameter | Data Type | Required/Optional | Description |
|---|---|---|---|
| knowledge | KnowledgeDB | Required | Knowledge base object. For details about its data type, see KnowledgeDB. |
| dir_path | String | Required | Path of the directory that stores the knowledge documents. The path length must be in the range [1, 1024]. The path cannot be a soft link and cannot contain two consecutive dots (..). By default, the directory cannot contain more than 8000 files. |
| loader_mng | LoaderMng | Required | Management object that provides document parsing. For details about its data type, see LoaderMng. |
| embed_func | Callable[[List[str]], List[List[float]]] | Required | Embedding function that converts file content into vectors. |
| force | Bool | Optional | Whether to forcibly overwrite existing data. If False, an exception is thrown when a document that already exists is uploaded again. The default value is False. |
| load_image | Bool | Optional | Whether to support image files. The default value is False. |

To parse documents, embed_func must support the corresponding document types. To parse images, embed_func must also support the corresponding image types. Otherwise, an error occurs.
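As a rough illustration of two of the parameters above, the following sketch shows a pre-check for the documented dir_path constraints and a stand-in callable with the embed_func signature. Both check_dir_path and toy_embed are hypothetical and are not part of mx_rag; the hash-based vectors carry no semantic meaning and only demonstrate the expected input and output shapes.

```python
import hashlib
import os
from typing import List


def check_dir_path(dir_path: str) -> None:
    """Hypothetical pre-check for the documented dir_path constraints."""
    if not 1 <= len(dir_path) <= 1024:
        raise ValueError("dir_path length must be in [1, 1024]")
    if ".." in dir_path:
        raise ValueError("dir_path must not contain '..'")
    # Only the final path component is checked here (a simplification).
    if os.path.islink(dir_path.rstrip(os.sep) or os.sep):
        raise ValueError("dir_path must not be a soft link")


def toy_embed(texts: List[str], dim: int = 8) -> List[List[float]]:
    """Stand-in with the embed_func shape: one float vector per input string."""
    vectors = []
    for text in texts:
        digest = hashlib.sha256(text.encode("utf-8")).digest()
        # Scale the first `dim` bytes of the digest into [0.0, 1.0].
        vectors.append([byte / 255.0 for byte in digest[:dim]])
    return vectors
```

A callable like toy_embed could stand in for emb.embed_documents when exercising surrounding code without an NPU, provided its vector length matches the dimension configured for the vector store.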
Return Value
| Data Type | Description |
|---|---|
| List[str] | List of files that failed to be added to the knowledge base, including files of unsupported document types and files that failed to upload. |
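One possible way to act on the returned list is to group the failed files by extension, which makes missing loader registrations easy to spot. The helper below is a hypothetical sketch; it assumes only that each returned entry is a file name or path string, which the table above does not specify further.

```python
import os
from collections import defaultdict
from typing import Dict, List


def group_failures_by_extension(failed: List[str]) -> Dict[str, List[str]]:
    """Hypothetical helper: bucket failed file names by their extension."""
    grouped = defaultdict(list)
    for path in failed:
        # Files without an extension go into a "<none>" bucket.
        ext = os.path.splitext(path)[1] or "<none>"
        grouped[ext].append(os.path.basename(path))
    return dict(grouped)
```

For example, group_failures_by_extension(["/data/notes.txt", "/data/broken.pdf"]) returns {".txt": ["notes.txt"], ".pdf": ["broken.pdf"]}.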
Example
from langchain.text_splitter import RecursiveCharacterTextSplitter
from mx_rag.document import LoaderMng
from mx_rag.document.loader import DocxLoader, PdfLoader, ExcelLoader
from mx_rag.embedding.local import TextEmbedding
from mx_rag.knowledge import (KnowledgeStore, KnowledgeDB, FilesLoadInfo,
                              upload_files, upload_dir, delete_files)
from mx_rag.storage.document_store import SQLiteDocstore
from mx_rag.storage.vectorstore import MindFAISS

# Register a loader for each file type; unregistered types are skipped by upload_dir.
loader_mng = LoaderMng()
loader_mng.register_loader(DocxLoader, [".docx"])
loader_mng.register_loader(PdfLoader, [".pdf"])
loader_mng.register_loader(ExcelLoader, [".xlsx"])
# loader_mng.register_loader(ImageLoader, [".png"])
loader_mng.register_splitter(RecursiveCharacterTextSplitter,
                             [".docx", ".pdf", ".xlsx"])

# Set the NPU used for vector retrieval.
dev = 0
# Load the embedding model.
emb = TextEmbedding("/path/to/model", dev_id=dev)
# Initialize the vector database.
vector_store = MindFAISS(x_dim=1024, devs=[dev],
                         load_local_index="/path/to/index", auto_save=True)
# Initialize the relational database for document chunks.
chunk_store = SQLiteDocstore(db_path="./sql.db")
# Initialize the relational database for knowledge management.
knowledge_store = KnowledgeStore(db_path="./sql.db")
# Add a knowledge base and its administrator.
knowledge_store.add_knowledge(knowledge_name="test", user_id='Default', role='admin')
# Initialize knowledge management.
knowledge_db = KnowledgeDB(knowledge_store=knowledge_store, chunk_store=chunk_store,
                           vector_store=vector_store, knowledge_name="test",
                           user_id='Default', white_paths=["/home/"])

# Upload individual domain-specific documents with upload_files.
upload_files(knowledge=knowledge_db, files=["/path/data/test.docx"], loader_mng=loader_mng,
             embed_func=emb.embed_documents, force=True)

# Upload a directory of domain-specific documents with upload_dir.
params = FilesLoadInfo(knowledge=knowledge_db, dir_path="/path/data/files",
                       loader_mng=loader_mng, embed_func=emb.embed_documents,
                       force=True, load_image=False)
# The return value lists the files that could not be added.
failed_files = upload_dir(params=params)

# Delete a document from the knowledge base with delete_files.
delete_files(knowledge_db, ["test.docx"])