Software Architecture

Figure 1 shows the key modules of the RAG SDK architecture.

Figure 1 Software architecture

RAG Python API: provides modular Python APIs, enabling you to flexibly call various RAG services.
Knowledge management: provides knowledge base management for RAG scenarios. You can create multiple knowledge bases and upload files such as documents, tables, and images to each of them. During retrieval, you can select a knowledge base to serve as the external knowledge base for an LLM. This module also supports document, table, and image loading and parsing, document segmentation, and efficient vector retrieval technologies, which improve retrieval effectiveness and recall. Its parsed and segmented output is the input data for subsequent vectorization and retrieval.
Indexing: includes corpus collection, corpus parsing, corpus splitting, and index building (vectorization) for vector retrieval. The index is built from the output of the knowledge management module and the vectorization results, and enables efficient matching during retrieval.
Vectorization: provides the capability of calling vector models, including the embedding and reranker classes. Both local deployment and serving deployment are supported; serving deployment uses the text-embeddings-inference framework. This module can also load an embedding model or reranker to integrate with third-party services, including LLM and image generation services. Vectorization results are the basis of retrieval, because queries are matched against the knowledge base by comparing vectors.
Retrieval: vector retrieval uses the heterogeneous retrieval acceleration framework based on Ascend NPUs to provide high-performance retrieval over massive data in high-dimensional space. After receiving a query, the system calls an embedding model to convert the query text into a query vector. The system then searches the index for the vectors most similar to the query vector, re-ranks the matches, and returns the search results to the LLM for further processing. Retrieval relies on efficient semantic matching through vector comparison.
Cache: connects to the open-source GPTCache and supports memory caching and semantic similarity caching to accelerate RAG applications. Caching query results reduces repeated computation and improves retrieval speed.
Application acceleration operator layer: provides model optimization and acceleration based on Ascend affinity to achieve higher throughput and shorter response time. It also improves the running efficiency of core modules such as vectorization and retrieval, so that the entire system responds quickly.
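
The data flow through the knowledge management, indexing, vectorization, retrieval, and cache modules can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the RAG SDK API: the names split_text, embed, and ToyKnowledgeBase are hypothetical, the hashed bag-of-words function is only a stand-in for a real embedding model, the search is a brute-force cosine comparison rather than NPU-accelerated retrieval, the cache is a simple exact-match memory cache (no semantic similarity cache), and no reranker is applied.

```python
import hashlib
import math
from collections import Counter

DIM = 64  # toy vector dimension (assumption, not an SDK setting)


def split_text(text, max_words=8):
    """Corpus splitting: cut a document into fixed-size word chunks."""
    words = text.split()
    return [" ".join(words[i:i + max_words])
            for i in range(0, len(words), max_words)]


def embed(text):
    """Toy stand-in for an embedding model: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * DIM
    for word, count in Counter(text.lower().split()).items():
        digest = hashlib.md5(word.encode("utf-8")).hexdigest()
        vec[int(digest, 16) % DIM] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class ToyKnowledgeBase:
    """Hypothetical knowledge base: chunks, their vectors, and a query cache."""

    def __init__(self):
        self.chunks = []   # parsed and split text
        self.vectors = []  # the "index": one vector per chunk
        self.cache = {}    # exact-match memory cache of query results

    def add_document(self, text):
        """Split a document, vectorize each chunk, and add it to the index."""
        for chunk in split_text(text):
            self.chunks.append(chunk)
            self.vectors.append(embed(chunk))
        self.cache.clear()  # cached results may be stale after an update

    def retrieve(self, query, top_k=2):
        """Return the top_k chunks most similar to the query."""
        key = (query, top_k)
        if key in self.cache:
            return list(self.cache[key])  # cache hit: skip embedding and search
        q = embed(query)
        # Vectors are unit length, so the dot product equals cosine similarity.
        scored = [(sum(a * b for a, b in zip(q, v)), chunk)
                  for v, chunk in zip(self.vectors, self.chunks)]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        result = [chunk for _, chunk in scored[:top_k]]
        self.cache[key] = result
        return list(result)


kb = ToyKnowledgeBase()
kb.add_document("Ascend NPUs accelerate vector retrieval")
kb.add_document("GPTCache stores answers to repeated queries")
kb.add_document("Documents are split into chunks before indexing")

# The retrieved chunks would be passed to the LLM as external knowledge.
print(kb.retrieve("Ascend NPUs accelerate vector retrieval", top_k=1))
```

In this sketch a repeated query is answered from the cache without recomputing the query vector, which mirrors the role of the cache module. A real deployment would replace embed with a call to the vector model and the brute-force loop with the accelerated retrieval framework.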