From Pixels to Intelligence: Introduction to OCR-Free Vision RAG using ColPali for Complex Documents
In the rapidly evolving landscape of artificial intelligence, the ability to understand and process complex documents is becoming increasingly vital. Traditional Optical Character Recognition (OCR) systems have served us well in extracting text from images, but they often fall short when it comes to interpreting the intricate visual elements that accompany textual information. In this blog we will use ColPali, a groundbreaking approach that leverages multi-vector retrieval through late interaction mechanisms and Vision Language Models (VLMs) to enhance Retrieval-Augmented Generation (RAG) processes. This blog post will take you on a deep dive into how ColPali is revolutionizing document understanding and retrieval.
In practice, retrieval pipelines for PDF documents have a huge impact on overall RAG performance, yet they are non-trivial to build well.
The Limitations of Traditional OCR
What is OCR?
Optical Character Recognition (OCR) is a technology that converts different types of documents—such as scanned paper documents, PDFs, or images captured by a digital camera—into editable and searchable data. While OCR has made significant strides in accuracy, it primarily focuses on text extraction, often overlooking the contextual and visual elements present in complex documents.
Challenges with Complex Documents
Complex documents, such as financial reports, legal contracts, and academic papers, often contain:
Tables and Charts: These elements convey critical information that cannot be captured through text alone.
Images and Diagrams: Visual aids play a significant role in understanding the content but are often ignored by traditional OCR systems.
Layout and Formatting: The arrangement of text and visuals can significantly impact meaning, yet OCR typically treats each element in isolation.
Due to these limitations, traditional OCR can lead to incomplete or misleading interpretations of complex documents.
What is ColPali?
ColPali builds upon recent developments in VLMs, which combine the power of Large Language Models (LLMs) with Vision Transformers (ViTs). By inputting image patch embeddings through a language model, ColPali maps visual features into a latent space aligned with textual content. This alignment is crucial for effective retrieval because it ensures that the visual elements of a document contribute meaningfully to the matching process with user queries.
Key Features of ColPali
Integrated Vision Language Models (VLMs):
ColPali utilizes VLMs like PaliGemma to interpret document images effectively. These models are trained on vast datasets that include not just text but also images, diagrams, and layouts.
By understanding the relationship between visual elements and text, ColPali can provide richer insights into complex documents.
Enhanced Contextual Understanding:
Unlike traditional OCR systems that treat text as isolated data points, ColPali analyzes the entire layout of a document.
This means it can recognize how tables relate to surrounding text or how diagrams illustrate key concepts, leading to more accurate interpretations.
Dynamic Retrieval-Augmented Generation (RAG):
ColPali seamlessly integrates into RAG frameworks, allowing for real-time information retrieval based on user queries.
This dynamic approach ensures that responses are not only relevant but also contextually rich, providing users with comprehensive insights.
Beyond improved accuracy, ColPali also offers significant efficiency gains:
Simplified Indexing: By eliminating the need for complex preprocessing steps, ColPali accelerates the document indexing process. Traditional methods can be time-consuming due to the multiple stages involved in parsing and chunking documents.
Low Query Latency: ColPali maintains low latency during querying, a critical requirement for real-time applications. Its end-to-end trainable architecture optimizes the retrieval process, ensuring swift responses to user queries.
Now let's implement this using Azure AI services.
First, let's load the required libraries.
import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoProcessor
from PIL import Image
from io import BytesIO
from colpali_engine.models.paligemma_colbert_architecture import ColPali
from colpali_engine.utils.colpali_processing_utils import process_images, process_queries
from colpali_engine.utils.image_utils import scale_image, get_base64_image
import os
from dotenv import load_dotenv
load_dotenv('azure.env', override=True)
Next, select a compute device and load the ColPali model.
if torch.cuda.is_available():
    device = torch.device("cuda")
    if torch.cuda.is_bf16_supported():
        dtype = torch.bfloat16
    else:
        dtype = torch.float32
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    dtype = torch.float32
else:
    device = torch.device("cpu")
    dtype = torch.float32
Let's load the model onto the device selected above.
model_name = "vidore/colpali-v1.2"
model = ColPali.from_pretrained("vidore/colpaligemma-3b-pt-448-base", torch_dtype=dtype).eval()
model.load_adapter(model_name)
model = model.eval()
model.to(device)
processor = AutoProcessor.from_pretrained(model_name)
Once the model is loaded, the first step in the process is to get the page images from the PDF.
import requests
from pdf2image import convert_from_path
from pypdf import PdfReader

def download_pdf(url):
    response = requests.get(url)
    if response.status_code == 200:
        return BytesIO(response.content)
    else:
        raise Exception(f"Failed to download PDF: Status code {response.status_code}")

def get_pdf_images(pdf_url):
    # Download the PDF
    pdf_file = download_pdf(pdf_url)
    # Save the PDF temporarily to disk (pdf2image requires a file path)
    temp_file = "temp.pdf"
    with open(temp_file, "wb") as f:
        f.write(pdf_file.read())
    reader = PdfReader(temp_file)
    page_texts = []
    for page_number in range(len(reader.pages)):
        page = reader.pages[page_number]
        text = page.extract_text()
        page_texts.append(text)
    images = convert_from_path(temp_file)
    assert len(images) == len(page_texts)
    return (images, page_texts)
Let's go ahead and download a sample PDF and convert its pages into images. Note that pdf2image requires the Poppler utilities to be installed on the system.
sample_pdfs = [
    {
        "title": "Attention Is All You Need",
        "url": "https://arxiv.org/pdf/1706.03762"
    }
]
We then attach the page images and the extracted text to each PDF entry.
for pdf in sample_pdfs:
    page_images, page_texts = get_pdf_images(pdf['url'])
    pdf['images'] = page_images
    pdf['texts'] = page_texts
Now let's create an embedding for each page image.
for pdf in sample_pdfs:
    page_embeddings = []
    dataloader = DataLoader(
        pdf['images'],
        batch_size=2,
        shuffle=False,
        collate_fn=lambda x: process_images(processor, x),
    )
    for batch_doc in tqdm(dataloader):
        with torch.no_grad():
            batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
            embeddings = model(**batch_doc)
            # Mean-pool the patch-level (multi-vector) output into a single 128-dim vector per page
            mean_embedding = torch.mean(embeddings, dim=1).float().cpu().numpy()
            # page_embeddings.extend(list(torch.unbind(embeddings.to("cpu"))))
            page_embeddings.extend(mean_embedding)
    pdf['embeddings'] = page_embeddings
During indexing, ColPali strips away much of this complexity by working directly with images ("screenshots") of the document pages.
A Vision LLM (PaliGemma-3B) encodes each image by splitting it into a series of patches, which are fed to a vision transformer.
At query time, the user query is embedded by the language model to obtain token embeddings. A ColBERT-style "late interaction" (LI) operation then efficiently matches query tokens to document patches. To compute an LI(query, document) score, for each term in the query we find the document patch with the most similar ColPali representation, and then sum the scores of those best-matching patches across all query terms to obtain the final query-document score. Intuitively, this late-interaction operation allows a rich interaction between all query terms and document patches, while still benefiting from the fast matching and offline computation offloading that more standard (bi-encoder) embedding models enable.
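To make this concrete, here is a minimal PyTorch sketch of the MaxSim-style LI(query, document) score described above. It is purely illustrative and not part of the pipeline below.

def late_interaction_score(query_embs, doc_embs):
    # query_embs: (num_query_tokens, dim) token embeddings from the language model
    # doc_embs:   (num_patches, dim) patch embeddings for a single page
    # Similarity of every query token against every document patch
    sim = query_embs @ doc_embs.T  # (num_query_tokens, num_patches)
    # For each query token keep its best-matching patch, then sum over tokens
    return sim.max(dim=1).values.sum()

Note that in this walkthrough we mean-pool ColPali's multi-vector output into a single 128-dimensional embedding per page so it can be stored in a standard Azure AI Search vector field; this trades some of the late-interaction precision for a much simpler index. With the page embeddings computed, we next assemble the documents to be indexed.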
import numpy as np

lst_feed = []
for pdf in sample_pdfs:
    url = pdf['url']
    title = pdf['title']
    for page_number, (page_text, embedding, image) in enumerate(zip(pdf['texts'], pdf['embeddings'], pdf['images'])):
        base_64_image = get_base64_image(scale_image(image, 640), add_url_prefix=False)
        page = {
            "id": str(hash(url + str(page_number))),
            "url": url,
            "title": title,
            "page_number": page_number,
            "image": base_64_image,
            "text": page_text,
            "embedding": embedding.tolist()
        }
        lst_feed.append(page)
Now that we have the embeddings, we need to store them in a vector store. We will use Azure AI Search as our vector store. Let's create a search index.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SimpleField,
    SearchFieldDataType,
    SearchableField,
    SearchField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    VectorSearchProfile,
    SemanticConfiguration,
    SemanticPrioritizedFields,
    SemanticField,
    SemanticSearch,
    SearchIndex
)
def create_pdf_search_index(endpoint: str, key: str, index_name: str) -> SearchIndex:
    # Initialize the search index client
    credential = AzureKeyCredential(key)
    index_client = SearchIndexClient(endpoint=endpoint, credential=credential)

    # Define vector search configuration
    vector_search = VectorSearch(
        algorithms=[
            HnswAlgorithmConfiguration(
                name="myHnsw",
                parameters={
                    "m": 4,  # Default HNSW parameter
                    "efConstruction": 400,  # Default HNSW parameter
                    "metric": "cosine"
                }
            )
        ],
        profiles=[
            # No vectorizer is configured: embeddings are generated client-side with ColPali
            VectorSearchProfile(
                name="myHnswProfile",
                algorithm_configuration_name="myHnsw"
            )
        ]
    )

    # Define the fields
    fields = [
        SimpleField(
            name="id",
            type=SearchFieldDataType.String,
            key=True,
            filterable=True
        ),
        SimpleField(
            name="url",
            type=SearchFieldDataType.String,
            filterable=True
        ),
        SearchableField(
            name="title",
            type=SearchFieldDataType.String,
            searchable=True,
            retrievable=True
        ),
        SimpleField(
            name="page_number",
            type=SearchFieldDataType.Int32,
            filterable=True,
            sortable=True
        ),
        SimpleField(
            name="image",
            type=SearchFieldDataType.String,
            retrievable=True
        ),
        SearchableField(
            name="text",
            type=SearchFieldDataType.String,
            searchable=True,
            retrievable=True
        ),
        SearchField(
            name="embedding",
            type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
            searchable=True,
            vector_search_dimensions=128,
            vector_search_profile_name="myHnswProfile"
        )
    ]

    # Create the index definition
    index = SearchIndex(
        name=index_name,
        fields=fields,
        vector_search=vector_search
    )

    # Create or update the index in Azure AI Search
    result = index_client.create_or_update_index(index)
    return result
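The function above only defines and creates the index. A minimal usage sketch follows; the environment variable names and index name here are assumptions, so adjust them to whatever is in your azure.env file.

# Assumed variable names; align these with the keys in your azure.env file.
SEARCH_ENDPOINT = os.environ["AZURE_SEARCH_ENDPOINT"]
SEARCH_KEY = os.environ["AZURE_SEARCH_KEY"]
INDEX_NAME = "colpali-pdf-index"  # hypothetical index name

create_pdf_search_index(SEARCH_ENDPOINT, SEARCH_KEY, INDEX_NAME)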
Once the index is created, we should upload the documents.
from azure.search.documents import SearchClient

credential = AzureKeyCredential(SEARCH_KEY)
index_client = SearchClient(endpoint=SEARCH_ENDPOINT, credential=credential, index_name=INDEX_NAME)
index_client.upload_documents(documents=lst_feed)
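A single upload_documents call is fine for one short paper. For larger collections, the base64-encoded page images make each document fairly large, so you may want to upload in smaller batches to stay within the service's request size limits. A minimal sketch, where the batch size is an illustrative choice:

BATCH_SIZE = 50  # illustrative value; tune to your page image sizes
for i in range(0, len(lst_feed), BATCH_SIZE):
    batch = lst_feed[i:i + BATCH_SIZE]
    index_client.upload_documents(documents=batch)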
Once document ingestion is complete, the next step is handling the user query. As shown in the code below, we create an embedding for the input query using the same model.
def process_query(query: str, processor: AutoProcessor, model: ColPali) -> np.ndarray:
    # ColPali expects an image input, so pass a blank placeholder image alongside the query text
    mock_image = Image.new('RGB', (224, 224), color='white')
    inputs = processor(text=query, images=mock_image, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}
    with torch.no_grad():
        embeddings = model(**inputs)
    # Mean-pool the token embeddings to match the single-vector page embeddings stored in the index
    return torch.mean(embeddings, dim=1).float().cpu().numpy().tolist()[0]
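As a quick sanity check, the pooled query embedding should have the same 128 dimensions as the page embeddings stored in the index. The example query below is purely illustrative.

query_embedding = process_query("What is multi-head attention?", processor, model)
print(len(query_embedding))  # expected: 128, matching vector_search_dimensions in the index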
Now let's create the Azure OpenAI and search clients.
from IPython.display import display, HTML
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ['AZURE_OPENAI_API_KEY'],
    azure_endpoint=os.environ['AZURE_OPENAI_ENDPOINT'],
    api_version=os.environ['OPENAI_API_VERSION']
)
search_client = SearchClient(
    SEARCH_ENDPOINT,
    index_name=INDEX_NAME,
    credential=credential,
)
def display_query_results(query, response, hits=5):
    html_content = f"<h3>Query text: '{query}', top results:</h3>"
    for i, hit in enumerate(response):
        title = hit["title"]
        url = hit["url"]
        page = hit["page_number"]
        image = hit["image"]
        score = hit["@search.score"]
        html_content += f"<h4>PDF Result {i + 1}</h4>"
        html_content += f'<p><strong>Title:</strong> <a href="{url}">{title}</a>, page {page+1} with score {score:.2f}</p>'
        html_content += (
            f'<img src="data:image/png;base64,{image}" style="max-width:100%;">'
        )
    display(HTML(html_content))
Once we have retrieved the relevant page images, we can pass them to any VLM along with the user's question to generate an answer.
from azure.search.documents.models import VectorizedQuery

query = "What is the projected global energy related co2 emission in 2030?"
vector_query = VectorizedQuery(
    vector=process_query(query, processor, model),
    k_nearest_neighbors=3,
    fields="embedding",
)
results = search_client.search(search_text=None, vector_queries=[vector_query])
# display_query_results(query, results)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "system",
            "content": """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise. You will be given a mix of text, tables, and image(s), usually of charts or graphs."""
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": query},
                *map(lambda x: {"type": "image_url", "image_url": {"url": f'data:image/jpg;base64,{x["image"]}', "detail": "low"}}, results),
            ],
        }
    ],
    max_tokens=300,
)
print("Answer: " + response.choices[0].message.content)
Conclusion
As we move further into an era where data is increasingly complex and multifaceted, tools like multimodal LLMs are essential for unlocking valuable insights from our documents. By integrating advanced Vision Language Models with Retrieval-Augmented Generation techniques, ColPali sets a new standard for document understanding that transcends traditional OCR limitations. Whether you're a researcher looking to streamline your workflow or a developer interested in AI advancements, embracing technologies like VLMs and ColPali will undoubtedly enhance your ability to navigate complex information landscapes. Stay tuned for more updates as we continue to explore the fascinating intersection of AI and document processing!
** Do check the licenses of the open-source models before using them.
Learn more about AI Search: https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search
Model: https://huggingface.co/vidore/colpali-v1.2
Thanks
Manoranjan Rajguru
https://www.linkedin.com/in/manoranjan-rajguru/