RAG on PDFs with Text and Embedded Images, with Citations Referencing the Images That Answer the User Query
In today’s era of Generative AI, customers can unlock valuable insights from their unstructured or structured data to drive business value. By infusing AI into existing or new products, customers can create powerful applications that put the power of AI into the hands of their users. For these Generative AI applications to work on customer data, implementing an efficient Retrieval-Augmented Generation (RAG) solution is key to ensuring that the right data context is provided to the LLM for a given user query.
Customers have PDF documents containing text and embedded figures, such as images or diagrams, that hold valuable information they would like to supply as context to the LLM when answering a user query. Parsing those PDFs to implement an efficient RAG solution is challenging, especially when the relationship between the text and the extracted image context must be maintained. Referencing an image as part of the citation that answers the user query is also difficult if the images are not extracted and retrievable. This blog post addresses the challenge of extracting PDF content with text and images as part of a RAG solution, where the relationship between the searchable text and its extracted images is maintained so that the images can be returned as references within the citations.
Below we outline a simple architecture for building a RAG application on PDF data, where image content extracted from the PDF is also retrievable in the LLM output as part of the citation references.
Solution Overview
Azure OpenAI Service provides REST API access to OpenAI’s powerful language models including GPT-4o, GPT-4o mini, GPT-4 Turbo with Vision, GPT-4, GPT-3.5-Turbo, and Embeddings model series. These models can be easily adapted to your specific task including but not limited to content generation, summarization, image understanding, semantic search, and natural language to code translation.
Azure AI Search provides secure information retrieval at scale over user-owned content in traditional and generative AI search applications. Information retrieval is foundational to any app that surfaces text and vectors. Common scenarios include catalog or document search, data exploration, and increasingly feeding query results to prompts based on your proprietary grounding data for conversational and copilot search.
Azure Blob Storage is Microsoft’s object storage solution for the cloud. Blob Storage is optimized for storing massive amounts of unstructured data. Unstructured data is data that doesn’t adhere to a particular data model or definition, such as text or binary data.
Azure Functions is a serverless solution that allows you to write less code, maintain less infrastructure, and save on costs. Instead of worrying about deploying and maintaining servers, the cloud infrastructure provides all the up-to-date resources needed to keep your applications running.
In this solution, we leverage Azure OpenAI models for text generation and embeddings, Azure AI Search for information retrieval grounded in our data, Azure Blob Storage for storing both the raw PDF files and the prepared data used by Azure AI Search for efficient retrieval, and Azure Functions as a serverless component to prepare the data for populating the Azure AI Search index.
Figure 1: Document data management
The document data management flow operates as follows:
A raw PDF document file is uploaded to Azure Blob Storage.
An event trigger in Azure Blob Storage invokes an Azure Function, which then splits large PDFs, extracts text chunks, and maps images to the corresponding text chunks (a sketch of this step follows the list).
Once the Azure Function prepares the data, it uploads the prepared data back to Azure Blob Storage.
An index scheduler is then invoked to initiate the indexing process for the prepared data.
The prepared data is retrieved from Azure Blob Storage by Azure AI Search.
Azure AI Search processes the text chunks in parallel, using the Azure OpenAI embedding model to vectorize the text.
The Azure AI Search index is populated with the prepared data and vectorized chunks. Additionally, it maps the relevant images to their corresponding text chunks using a custom index field.
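For concreteness, the following is a minimal sketch of what the Azure Function in steps 2 and 3 might do, assuming PyMuPDF (fitz) for PDF parsing and the Azure Blob Storage Python SDK for uploads. The function name, container name, and page-level chunking strategy are illustrative assumptions rather than the repository's actual implementation.

```python
import json

import fitz  # PyMuPDF, used here to extract text and embedded images
from azure.storage.blob import BlobServiceClient


def prepare_pdf(pdf_bytes: bytes, doc_name: str,
                blob_service: BlobServiceClient, container: str = "docs") -> None:
    """Split a PDF into page-level text chunks, extract its images, and upload
    the prepared JSON plus images back to Blob Storage (illustrative sketch)."""
    container_client = blob_service.get_container_client(container)
    chunks = []

    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page_number, page in enumerate(doc, start=1):
            # One text chunk per page; a real implementation may split further.
            text = page.get_text()

            # Extract and upload the images found on the same page so they
            # remain individually retrievable for citations later.
            image_urls = []
            for img_index, img in enumerate(page.get_images(full=True)):
                xref = img[0]
                image = doc.extract_image(xref)
                image_path = f"images/{doc_name}/p{page_number}_{img_index}.{image['ext']}"
                container_client.upload_blob(image_path, image["image"], overwrite=True)
                image_urls.append(image_path)

            # Keep the text-to-image relationship inside the chunk record itself.
            chunks.append({
                "id": f"{doc_name}-page-{page_number}",
                "content": text,
                "page_number": page_number,
                "image_urls": image_urls,
            })

    # Upload the prepared data where the Azure AI Search data source expects it.
    container_client.upload_blob(f"prepared_data/{doc_name}.json",
                                 json.dumps(chunks), overwrite=True)
```

Keeping the image URLs inside each chunk record is what later allows the index, and ultimately the citations, to point back to the extracted images.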
Figure 2: Application runtime
The application runtime flow operates as follows:
User makes a query request through the client-side application.
The server-side AI chatbot application forwards the user’s query to Azure OpenAI. Note: This step is an ideal point to implement controls such as safety measures using the Azure AI Content Safety service.
Given the user’s query, Azure OpenAI makes a request to Azure AI Search to retrieve relevant text and images. Notably, the responsibility for calling Azure AI Search shifts from the application code to the Azure OpenAI service itself (a sketch of this pattern follows the list).
With the user’s query and the relevant text retrieved from Azure AI Search, Azure OpenAI generates the response.
Azure OpenAI returns the generated response and associated metadata (e.g., citation data) to the server-side AI chatbot application.
The server-side AI chatbot application remaps the response data, creating a payload that includes text and image URLs. This step is another excellent point to implement additional controls before sending the payload back to the client-side application.
The server-side AI chatbot application sends the response to the user’s query back to the client-side application.
The client-side application displays the generated response text and downloads any images from Azure Blob Storage, rendering them in the user interface.
Note: Steps 9a and 9b are conceptual components of the reference architecture but are not currently part of the deployable artifact. We welcome your feedback and may potentially extend the implementation to include these steps.
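To illustrate steps 2 through 5, here is a minimal sketch of the “On Your Data” pattern with the openai Python package, where Azure OpenAI itself queries Azure AI Search. The endpoint variables, deployment name, index name, and API version below are placeholders; the exact request shape varies by API version, so treat this as an assumption-laden outline rather than the repository's code.

```python
import os

from openai import AzureOpenAI  # pip install openai

# Placeholder configuration; the deployed app reads these from its own settings.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="gpt-4o",  # your Azure OpenAI deployment name
    messages=[{"role": "user", "content": "How do I scale an AKS node pool?"}],
    # "On Your Data": Azure OpenAI calls Azure AI Search on the app's behalf.
    extra_body={
        "data_sources": [{
            "type": "azure_search",
            "parameters": {
                "endpoint": os.environ["AZURE_SEARCH_ENDPOINT"],
                "index_name": "pdf-chunks",
                "authentication": {
                    "type": "api_key",
                    "key": os.environ["AZURE_SEARCH_KEY"],
                },
            },
        }]
    },
)

message = response.choices[0].message
print(message.content)

# Citation metadata is returned alongside the answer; its exact layout depends on
# the API version. The server-side app remaps it, including any image URL fields,
# before responding to the client.
context = getattr(message, "context", None)
if context:
    print(context.get("citations", []))
```

The server-side application would then resolve the image references carried by the cited chunks into Azure Blob Storage URLs for the client to render.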
Figure 3: Azure Blob directory and file structure
The directory and file structure serve the following primary purposes (a configuration sketch follows the list):
Azure Function: To retrieve raw PDF files and upload the prepared data back. The event trigger is configured to receive events under the raw_data directory.
Azure AI Search: To download the prepared data for populating the index. The Azure AI Search data source is configured to retrieve data from the prepared_data directory.
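As an illustration of this wiring, the sketches below show an Azure Functions blob trigger scoped to the raw_data directory (Python v2 programming model) and an Azure AI Search data source scoped to the prepared_data directory. The container name, connection settings, and resource names are placeholders, not the repository's actual configuration.

```python
# Azure Function (Python v2 model): fire only for blobs under raw_data/.
import azure.functions as func

app = func.FunctionApp()


@app.blob_trigger(arg_name="blob",
                  path="docs/raw_data/{name}",   # "docs" container is a placeholder
                  connection="AzureWebJobsStorage")
def on_pdf_uploaded(blob: func.InputStream):
    # Split the PDF, extract images, and write the results under
    # docs/prepared_data/ (see the earlier pre-processing sketch).
    ...
```

```python
# Azure AI Search: point the indexer's data source at the prepared_data directory.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexerClient
from azure.search.documents.indexes.models import (
    SearchIndexerDataContainer, SearchIndexerDataSourceConnection)

indexer_client = SearchIndexerClient("https://<search-service>.search.windows.net",
                                     AzureKeyCredential("<admin-key>"))
data_source = SearchIndexerDataSourceConnection(
    name="prepared-data",
    type="azureblob",
    connection_string="<storage-connection-string>",
    # "query" restricts ingestion to the prepared_data virtual directory.
    container=SearchIndexerDataContainer(name="docs", query="prepared_data"),
)
indexer_client.create_or_update_data_source_connection(data_source)
```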
Deployment and Implementation Details
In this section, we will delve deeper into the essential components of our solution and their specific functionalities. We will start with the implementation, followed by an overview of the deployment process.
Implementation Details
Before Azure AI Search can index a raw PDF document, it must undergo pre-processing, facilitated by an Azure Function. This function is configured to listen to Azure Blob Storage events and is triggered whenever a new document is uploaded. The function performs the following tasks:
Splits a Single PDF into Text Chunks: This involves breaking down the PDF document into smaller text chunks.
Generates a JSON File: The text chunks are then organized into a JSON file, which is subsequently uploaded back to Azure Blob Storage. Each element in the JSON array represents a text chunk (an example follows this list).
Extracts and Maps Images: Images from the PDF are extracted and mapped to their corresponding text chunks. Specifically, images found on a given PDF page are associated with text chunks from the same page.
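As a hypothetical example (field names are assumptions for illustration; see the repository for the actual schema), one element of the prepared JSON might look like this:

```json
[
  {
    "id": "aks-doc-page-12",
    "content": "Text chunk extracted from page 12 of the document ...",
    "page_number": 12,
    "image_urls": ["images/aks-doc/p12_0.png"]
  }
]
```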
Once the data is prepared, the Azure AI Search indexer is activated to handle the actual ingestion and index population. During this process, Azure AI Search skills are employed to map the data to the fields defined in the index. Upon completion of the indexing, the ingested data is mapped to the specified fields, making it ready for query execution.
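For reference, below is a minimal sketch of such an index using the azure-search-documents Python SDK. The field names, in particular image_urls (the custom field that maps images to a text chunk), and the vector settings are illustrative assumptions rather than the repository's exact schema.

```python
# Minimal sketch of an index whose documents keep the text-to-image mapping.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    HnswAlgorithmConfiguration, SearchField, SearchFieldDataType, SearchIndex,
    SearchableField, SimpleField, VectorSearch, VectorSearchProfile)

index = SearchIndex(
    name="pdf-chunks",
    fields=[
        SimpleField(name="id", type=SearchFieldDataType.String, key=True),
        SearchableField(name="content", type=SearchFieldDataType.String),
        SimpleField(name="page_number", type=SearchFieldDataType.Int32),
        # Custom field: the image blobs associated with this text chunk,
        # returned with each search result so citations can reference them.
        SimpleField(name="image_urls",
                    type=SearchFieldDataType.Collection(SearchFieldDataType.String)),
        SearchField(name="content_vector",
                    type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                    searchable=True,
                    vector_search_dimensions=1536,  # e.g., text-embedding-ada-002
                    vector_search_profile_name="default-profile"),
    ],
    vector_search=VectorSearch(
        algorithms=[HnswAlgorithmConfiguration(name="default-hnsw")],
        profiles=[VectorSearchProfile(name="default-profile",
                                      algorithm_configuration_name="default-hnsw")],
    ),
)

SearchIndexClient("https://<search-service>.search.windows.net",
                  AzureKeyCredential("<admin-key>")).create_or_update_index(index)
```

Because image_urls is stored and retrievable, every search hit carries the URLs of its page's images, which is what allows the runtime flow to return them as citation references.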
Deployment
Having gained a comprehensive understanding of how Azure Functions and Azure AI Search are utilized in this solution, the next step is to deploy the solution and explore the demo application. This will enable you to see the implementation in action and understand its practical applications.
To deploy the solution, refer to the GitHub repository, follow the provided steps, and complete the sections on prerequisites and deployment.
Extending Deployment with Your Own Documents
The provided repository includes a small snippet of the Azure Kubernetes Service (AKS) documentation, offering a glimpse into the end-user experience. However, you may be interested in trying the solution with your own PDF documents. The section on extending the deployment with your own documents was created with this in mind. It provides a quick and straightforward way to incorporate your documents into the solution and start querying them using the already operational demo application.
By following these steps, you will be able to efficiently manage and query your PDF documents using Azure AI Search, ensuring a seamless and effective search experience.
Clean up
After you’ve tested the solution, you can clean up all the Azure resources created by deleting the deployment. To do so, follow the steps in the cleanup section.
Conclusion
In this post, we demonstrated how to use Azure OpenAI and Azure AI Search to build a Retrieval-Augmented Generation (RAG) application with your own data. By offloading AI Search communication to Azure OpenAI, this solution not only enhances text-based queries but also provides a powerful way to identify and retrieve relevant images based on the user’s query. This capability ensures that your query responses are enriched with relevant visual content whenever available.
If you have any feedback, questions, or suggestions to help improve this solution, please submit them through the GitHub repository. We welcome all input and look forward to hearing from you.