Enhancing Retrieval-Augmented Generation with a Multimodal Knowledge Extraction and Retrieval System
The rapid evolution of AI has produced powerful tools for knowledge retrieval and question answering, particularly with the rise of Retrieval-Augmented Generation (RAG) systems. This blog post introduces my capstone project, created as part of the IXN program at UCL in collaboration with Microsoft, which aims to enhance RAG systems by integrating multimodal knowledge extraction and retrieval capabilities. The system enables AI agents to process both textual and visual data, offering more accurate and contextually relevant responses. In this post, I’ll walk you through the project’s goals, development journey, technical implementation, and outcomes.
Project Overview
The main goal of this project was to improve the performance of RAG systems by refining how multimodal data is extracted, stored, and retrieved. Current RAG systems primarily rely on text-based data, which limits their ability to generate accurate responses when queries require a combination of text and images. To address this, I developed a system capable of extracting, processing, and retrieving multimodal data from Wikimedia, allowing AI agents to generate more accurate, grounded, and contextually relevant answers.
Key features include:
Multimodal Knowledge Extraction: Data from Wikimedia (text, images, tables) is preprocessed, run through the transformation pipeline, and stored in vector and graph databases for efficient retrieval.
Dynamic Knowledge Retrieval: A custom query engine, combined with an agentic approach using the ReAct agent, ensures flexible and accurate retrieval of information by dynamically selecting the best tools and strategies for each query.
The project began by addressing the limitations of existing RAG systems, particularly their difficulties with handling visual data and delivering accurate responses. After reviewing various technologies, a system architecture was developed to support both text and image data. Throughout the process, components were refined to ensure compatibility between LlamaIndex, Qdrant, and Neo4j, while optimising performance for managing large datasets. The primary challenges lay in handling the large volumes of data from Wikimedia, especially the processing of images, and refactoring the system for Dockerisation. These challenges were met through iterative improvements to the system architecture, ensuring efficient multimodal data handling and reliable deployment across environments.
Implementation Overview
This project integrates both textual and visual data to enhance RAG systems’ retrieval and response generation. The system’s architecture is split into two main processes:
Knowledge Extraction: Data is fetched from Wikimedia and transformed into embeddings for text and images. These embeddings are stored in Qdrant for efficient retrieval, while Neo4j captures the relationships between the nodes, ensuring the preservation of data structure.
Knowledge Retrieval: A dynamic query engine processes user queries, retrieving data from both Qdrant (using vector search) and Neo4j (via graph traversal). Advanced techniques like query expansion, reranking, and cross-referencing ensure the most relevant information is returned.
Tech Stack
The following technologies were used to build and deploy the system:
Python: Core programming language for data pipelines
LlamaIndex: Framework for indexing, transforming, and retrieving multimodal data
Qdrant: Vector database for similarity searches based on embeddings
Neo4j: Graph database used to store and manage relationships between data entities
Azure OpenAI (GPT-4o): Used for handling multimodal inputs, with models deployed via Azure App Services
Text Embedding Ada-002: Model for generating text embeddings
Azure Computer Vision: Used for generating image embeddings
Gradio: Provides an interactive interface for querying the system
Docker and Docker Compose: Used for containerisation and orchestration, ensuring consistent deployment
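For orientation, here is a minimal sketch of how these services might be wired up from Python. The environment variable names, endpoints, and API version are assumptions for illustration rather than the project’s actual configuration.

```python
import os

from openai import AzureOpenAI          # GPT-4o chat + text-embedding-ada-002
from qdrant_client import QdrantClient  # vector similarity search
from neo4j import GraphDatabase         # node and relationship storage

# Azure OpenAI: a single client serves both the chat and embedding deployments.
aoai = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],          # assumed variable names
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_version="2024-02-01",
)

# Qdrant: holds the text and image embeddings used for similarity search.
qdrant = QdrantClient(url=os.environ.get("QDRANT_URL", "http://localhost:6333"))

# Neo4j: holds the full LlamaIndex nodes and the relationships between them.
graph = GraphDatabase.driver(
    os.environ.get("NEO4J_URI", "bolt://localhost:7687"),
    auth=(os.environ.get("NEO4J_USER", "neo4j"), os.environ["NEO4J_PASSWORD"]),
)
```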
Implementation Details
Multimodal Knowledge Extraction
The system starts by fetching both textual and visual data from Wikimedia, using the Wikimedia API and web scraping techniques. The key steps in the knowledge extraction implementation are as follows (a simplified code sketch follows the list):
Data Preprocessing: Text is cleaned, images are classified into categories such as plots or images for appropriate handling during later transformations, and tables are structured for easier processing.
Node Creation and Transformation: Initial LlamaIndex nodes are created from this data and then undergo several transformations in the transformation pipeline, using the GPT-4o model deployed via Azure OpenAI:
Text and Table Transformations: Text data is cleaned, split into smaller chunks using semantic chunking, and new derived nodes are created from various transformations, such as key entity extraction or table analysis. Each node has a unique LlamaIndex ID and carries metadata such as title, context, and relationships reflecting both the hierarchical structure of the Wikimedia page and the parent-child links to the newly derived nodes.
Image Transformations: Images are processed to generate descriptions, perform plot analysis, and identify key objects based on the image type, resulting in the creation of new text nodes.
Embeddings Generation: The last stage of the pipeline is to generate embeddings for images and transformed text nodes:
Text Embeddings: Generated using the text-embedding-ada-002 model deployed with Azure OpenAI on Azure App Services.
Image Embeddings: Generated using the Azure Computer Vision service.
Storage: Both text and image embeddings are stored in Qdrant with reference node IDs in the payload for fast retrieval, while the full nodes and their relationships are stored in Neo4j.
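As a concrete illustration of the extraction side, the sketch below fetches a Wikipedia article through the Wikimedia API, splits it into LlamaIndex nodes, embeds the chunks, and upserts them into Qdrant with the node IDs in the payload. The collection name, the chunking strategy (a plain sentence splitter standing in for the project’s semantic chunking), and the deployment names are assumptions; image handling, table analysis, and the Neo4j writes are omitted for brevity.

```python
import os
import requests
from llama_index.core import Document
from llama_index.core.node_parser import SentenceSplitter
from openai import AzureOpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

WIKI_API = "https://en.wikipedia.org/w/api.php"

def fetch_plain_text(title: str) -> str:
    """Fetch the plain-text extract of a Wikipedia article via the Wikimedia API."""
    params = {"action": "query", "format": "json", "prop": "extracts",
              "explaintext": 1, "titles": title}
    pages = requests.get(WIKI_API, params=params, timeout=30).json()["query"]["pages"]
    return next(iter(pages.values()))["extract"]

aoai = AzureOpenAI(api_key=os.environ["AZURE_OPENAI_API_KEY"],
                   azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
                   api_version="2024-02-01")
qdrant = QdrantClient(url="http://localhost:6333")

# 1. Wrap the page as a parent document, then split it into chunk nodes.
title = "Retrieval-augmented generation"
doc = Document(text=fetch_plain_text(title), metadata={"title": title})
nodes = SentenceSplitter(chunk_size=512, chunk_overlap=64).get_nodes_from_documents([doc])

# 2. (Re)create the collection: ada-002 embeddings are 1536-dimensional.
qdrant.recreate_collection(
    collection_name="wiki_text",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

# 3. Embed each chunk and store it with its LlamaIndex node ID in the payload,
#    so retrieval can later cross-reference the full node in the graph store.
points = []
for node in nodes:
    emb = aoai.embeddings.create(model="text-embedding-ada-002",  # deployment name
                                 input=node.get_content()).data[0].embedding
    points.append(PointStruct(id=node.node_id, vector=emb,
                              payload={"node_id": node.node_id, "title": title}))
qdrant.upsert(collection_name="wiki_text", points=points)
```

In the full pipeline, the derived nodes (entity extractions, image descriptions, table analyses) and their parent-child relationships would be written to Neo4j alongside this step.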
Knowledge Retrieval
The retrieval process involves several key steps (a condensed code sketch follows the list):
Query Expansion: The system generates multiple variations of the original query, expanding the search space to capture relevant data.
Vector Search: The expanded queries are passed to Qdrant for a similarity-based search using cosine similarity.
Reranking and Cross-Retrieval: Results are then reranked by relevance. The nodes retrieved from Qdrant carry LlamaIndex node IDs in their payloads; these are used to fetch the corresponding nodes from Neo4j and then, by traversing the graph, to reach the nodes containing the original Wikimedia data, ensuring the final response is based only on original source content.
ReAct Agent Integration: The ReAct agent dynamically manages the retrieval process by selecting tools based on the query context. It integrates with the custom-built query engine to balance AI-generated insights with the original data from Neo4j and Qdrant.
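To make the flow concrete, here is a condensed sketch of the retrieval path, reusing the illustrative collection and payload fields from the extraction sketch. The Cypher labels and relationship type are hypothetical, reranking is omitted, and the ReAct agent would sit above this function as one of its tools.

```python
import os
from openai import AzureOpenAI
from qdrant_client import QdrantClient
from neo4j import GraphDatabase

aoai = AzureOpenAI(api_key=os.environ["AZURE_OPENAI_API_KEY"],
                   azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
                   api_version="2024-02-01")
qdrant = QdrantClient(url="http://localhost:6333")
graph = GraphDatabase.driver("bolt://localhost:7687",
                             auth=("neo4j", os.environ["NEO4J_PASSWORD"]))

def expand_query(query: str) -> list[str]:
    """Query expansion: ask GPT-4o for paraphrases to widen the search space."""
    rsp = aoai.chat.completions.create(
        model="gpt-4o",  # assumed deployment name
        messages=[{"role": "user",
                   "content": f"Give three short paraphrases of: {query}"}],
    )
    lines = [l.strip("-* ").strip() for l in rsp.choices[0].message.content.splitlines()]
    return [query] + [l for l in lines if l]

def retrieve(query: str, top_k: int = 5) -> list[str]:
    """Vector search in Qdrant, then cross-retrieval of source text from Neo4j."""
    sources: list[str] = []
    for variant in expand_query(query):
        emb = aoai.embeddings.create(model="text-embedding-ada-002",
                                     input=variant).data[0].embedding
        hits = qdrant.search(collection_name="wiki_text", query_vector=emb, limit=top_k)
        node_ids = [hit.payload["node_id"] for hit in hits]
        # Follow the stored node IDs into the graph and walk up to the nodes
        # that hold the original Wikimedia content (hypothetical schema).
        with graph.session() as session:
            records = session.run(
                "MATCH (c:Chunk)-[:PARENT*0..]->(src:Source) "
                "WHERE c.node_id IN $ids RETURN DISTINCT src.text AS text",
                ids=node_ids,
            )
            sources.extend(r["text"] for r in records)
    return sources
```

A reranking step, such as scoring the pooled hits against the original query, would slot in between the vector search and the graph traversal.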
Dockerisation with Docker Compose
To ensure consistent deployment across different environments, the entire application is containerised using Docker. Docker Compose orchestrates multiple containers, including the knowledge extractor, retriever, Neo4j, and Qdrant services. This setup simplifies the deployment process and enhances scalability.
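For a sense of what this looks like, a pared-down docker-compose.yml might resemble the following; the service names, images, and build contexts are assumptions rather than the project’s actual configuration.

```yaml
version: "3.9"

services:
  qdrant:
    image: qdrant/qdrant:latest            # vector database
    ports:
      - "6333:6333"
    volumes:
      - qdrant_data:/qdrant/storage

  neo4j:
    image: neo4j:5                         # graph database
    environment:
      - NEO4J_AUTH=neo4j/${NEO4J_PASSWORD}
    ports:
      - "7474:7474"                        # browser UI
      - "7687:7687"                        # Bolt protocol
    volumes:
      - neo4j_data:/data

  extractor:
    build: ./knowledge_extractor           # assumed build context
    env_file: .env
    depends_on: [qdrant, neo4j]

  retriever:
    build: ./knowledge_retriever           # assumed build context; serves the Gradio UI
    env_file: .env
    ports:
      - "7860:7860"                        # Gradio's default port
    depends_on: [qdrant, neo4j]

volumes:
  qdrant_data:
  neo4j_data:
```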
Results and Outcomes
The system effectively enhances the grounding and accuracy of responses generated by RAG systems. By incorporating multimodal data, it delivers contextually relevant answers, particularly in scenarios where visual information is critical. The integration of Qdrant and Neo4j proved highly efficient, enabling fast retrieval and accurate results.
Additionally, a user-friendly interface built with Gradio allows users to interact with the system and compare the AI-generated responses with standard LLM output, offering an easy way to evaluate the improvements.
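As a rough illustration, such a comparison view can be put together with a handful of Gradio components; the two answer functions below are placeholder stubs standing in for the project’s query engine and a plain LLM call.

```python
import gradio as gr

def rag_answer(question: str) -> str:
    # Placeholder: in the real system this would call the custom query engine
    # backed by Qdrant, Neo4j, and the ReAct agent.
    return f"[multimodal RAG answer for: {question}]"

def plain_llm_answer(question: str) -> str:
    # Placeholder: in the real system this would be a direct LLM call.
    return f"[standard LLM answer for: {question}]"

def compare(question: str) -> tuple[str, str]:
    """Return the grounded RAG answer side by side with a plain LLM answer."""
    return rag_answer(question), plain_llm_answer(question)

demo = gr.Interface(
    fn=compare,
    inputs=gr.Textbox(label="Ask a question"),
    outputs=[gr.Textbox(label="Multimodal RAG answer"),
             gr.Textbox(label="Standard LLM answer")],
    title="Multimodal RAG vs. standard LLM",
)

if __name__ == "__main__":
    demo.launch()
```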
Future Development
Several directions for future development have been identified to further enhance the system’s capabilities:
Agentic Framework Expansion: A future version of the system could incorporate an autonomous tool capable of determining whether the existing knowledge base is sufficient for a query. If the knowledge base is found lacking, the system could automatically initiate a knowledge extraction process to address the gap. This enhancement would bring greater adaptability and self-sufficiency to the system.
Knowledge Graph with Entities: Expanding the knowledge graph to include key entities such as individuals, locations, and events, or other entities appropriate to the domain, would add considerable depth and precision to the retrieval process. Integrating such entities would provide a more comprehensive and interconnected knowledge base, improving both the relevance and accuracy of results.
Enhanced Multimodality: Future iterations could expand the system’s capabilities in handling image data. This may include adding support for image comparison, object detection, or breaking images down into distinct components. Such features would enable more sophisticated queries and increase the system’s versatility in handling diverse data formats.
Incorporating these advancements will position the system to play an important role in the evolving field of multimodal AI, further bridging the gap between text and visual data integration in knowledge retrieval.
Summary
This project demonstrates the potential of enhancing RAG systems by integrating multimodal data, allowing AI to process both text and images more effectively. Through the use of technologies like LlamaIndex, Qdrant, and Neo4j, the system delivers more grounded, contextually relevant answers at high speed. With a focus on accurate knowledge retrieval and dynamic query handling, the project showcases a significant advancement in AI-driven question-answering systems. For more insights and to explore the project, please visit the GitHub repository.
If you’d like to connect, feel free to reach out to me on LinkedIn.