Migrating OCR Enhancement from GPT-4 Turbo Vision Preview to GPT-4 Turbo GA
The introduction of Optical Character Recognition (OCR) enhancement as a component of the GPT-4 Turbo Vision Preview was aimed at generating higher-quality responses for dense texts, transformed images, and number-heavy financial documents. Although, the recent announcement regarding the GPT-4 Turbo 2024-04-09 General Availability (GA) model indicated that the OCR enhancement is not included in this GA version. This blog post delves into the details of how OCR enhancement functions, the additional system prompts used for OCR enhancement, and provides a code snippet that demonstrates how to replicate the OCR enhancement behavior in the GPT-4 Turbo 2024-04-09 GA model by modifying the input prompt.
How OCR enhancement works
OCR enhancement modifies input messages before sending them to the GPT-4 Vision model using the following steps:
Find the user prompt message that contains an image.
Call the OCR API for this image and obtain the OCR text.
Add the OCR text as additional content to the user prompt message.
Add an additional system prompt message to instruct the model on how to leverage the OCR text to improve the accuracy of the result.
Why OCR enhancement is not supported in GA
Although OCR enhancement functionality provides simplicity by orchestrating OCR API call and prompt modification, it lacks customization of OCR technology and prompt instructions. Running the OCR enhancement process manually provides the following benefits:
Flexibility to choose a different version of OCR (i.e., documents with complex layout and table may benefit from markdown support instead of using plain text).
Flexibility to modify system instructions how OCR text is leveraged by GPT model (i.e., based on document type/quality/etc, leverage OCR text for numbers extraction but rely on GPT vision for signature detection, etc).
Agility to run OCR enhancement for prompt with multiple images (i.e., multi-page documents, comparison scenario, etc). Preview API only supports OCR enhancement for prompts with 1 image.
Running OCR enhancement manually
The goal of the code sample is to illustrate how OCR enhancement can be done manually. It creates two sample GPT payloads:
The first payload is with OCR enhancement enabled.
The second payload is identical to the first one but with 2 additional messages with OCR instructions, added the same way as OCR enhancement is doing for Preview model in Azure OpenAI service backend.
Prerequisites
An Azure OpenAI resource(s) with deployments of GPT-4 Turbo Vision Preview and GPT-4 Turbo 2024-04-09 GA models.
A Document Intelligence resource to call OCR API.
Install Document Intelligence Python SDK:
pip install azure-ai-documentintelligence
Setup environment variables
Create and assign environment variables for resource endpoints and API keys and load them in Python. Also, provide deployment names for both GPT models by replacing ‘<your-deployment-name>’ strings.
import os
# GPT-4 Turbo Vision Preview model
GPT4V_PREVIEW_ENDPOINT = os.getenv(“AZURE_GPT4V_PREVIEW_ENDPOINT”)
GPT4V_PREVIEW_KEY = os.getenv(“AZURE_GPT4V_PREVIEW_KEY”)
GPT4V_PREVIEW_DEPLOYMENT = ‘<your-deployment-name>’
# GPT-4 Turbo 2024-04-09 General Availability (GA) model
GPT4_GA_ENDPOINT = os.getenv(“AZURE_GPT4_GA_ENDPOINT”)
GPT4_GA_KEY = os.getenv(“AZURE_GPT4_GA_KEY”)
GPT4_GA_DEPLOYMENT = ‘<your-deployment-name>’
# Azure Document Intelligence API
DI_ENDPOINT = os.getenv(“AZURE_DI_ENDPOINT”)
DI_KEY = os.getenv(“AZURE_DI_KEY”)
Sample GPT payload
Python code below creates sample Json payload for GPT-4 Turbo Vision Preview API with OCR enhancement enabled. It uses a sample image of Japanese receipt as input and asks to extract Total from the receipt. This receipt is selected because without OCR enhancement GPT-4 Turbo Vision gives wrong answer – 5000, but with OCR enhancement answer is correct – 4500:
import requests
import base64
# Sample image data
IMAGE_BYTES = requests.get(“https://documentintelligence.ai.azure.com/documents/samples/prebuilt/receipt-japanese.jpg”).content
encoded_image = base64.b64encode(IMAGE_BYTES).decode(‘ascii’)
payload_sample = {
“messages”: [
{
“role”: “system”,
“content”: [
{
“type”: “text”,
“text”: “You are AI assistance to help extract information.”
}
]
},
{
“role”: “user”,
“content”: [
{
“type”: “image_url”,
“image_url”: {
“url”: “data:” + “image/jpeg;base64,” + encoded_image
}
},
{
“type”: “text”,
“text”: “Receipt Total as number. Just the number, no currency symbol or additional text.”
}
]
}
],
“temperature”: 0.0, “max_tokens”: 1000
}
Run OCR using Azure Document Intelligence
Define the function which calls Document Intelligence OCR API for the image and returns OCR content:
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential
def run_ocr(image_bytes):
document_analysis_client = DocumentIntelligenceClient(endpoint=DI_ENDPOINT, credential=AzureKeyCredential(DI_KEY))
poller = document_analysis_client.begin_analyze_document(‘prebuilt-read’, analyze_request=image_bytes, content_type=’application/octet-stream’)
return poller.result().content
Manually OCR-enhanced GPT payload
Below is an example of manually OCR-enhanced payload. It has 2 changes to the original payload above:
System message is added after all system messages in original prompt with text DEFAULT_OCR_SYSTEM_PROMPT (see below).
OCR text content is added as a first element of user message content. OCR text is produced by “run_ocr” function above.
Same way users can modify any of their own prompts to achieve same results as current OCR enhancements. Payload below illustrate this update: additional system message is at lines 15-23, additional user message is at lines 27-30:
DEFAULT_OCR_PROMPT = “**OCR text:**”
DEFAULT_OCR_SYSTEM_PROMPT = f”Additional instructions: – You’re known to be good at recognizing equations and complete partially occluded text. Additional information has been generated by an OCR model regarding text present in the image in format of {DEFAULT_OCR_PROMPT}. However, you should refrain from incorporating the text information into your description, and only consult the OCR text when the text within the image is unclear to you. Follow your original analysis if OCR text does not help. “
payload_manually_ocr_enhanced = {
“messages”: [
{
“role”: “system”,
“content”: [
{
“type”: “text”,
“text”: “You are AI assistance to help extract information.”
}
]
},
{
“role”: “system”,
“content”: [
{ # OCR enhanced system message with additional instructions
“type”: “text”,
“text”: DEFAULT_OCR_SYSTEM_PROMPT
}
]
},
{
“role”: “user”,
“content”: [
{ # OCR enhanced user message with OCR text
“type”: “text”,
“text”: f”{DEFAULT_OCR_PROMPT} {run_ocr(IMAGE_BYTES)}”
},
{
“type”: “image_url”,
“image_url”: {
“url”: “data:” + “image/jpeg;base64,” + encoded_image
}
},
{
“type”: “text”,
“text”: “Receipt Total as number. Just the number, no currency symbol or additional text.”
}
]
}
],
“temperature”: 0.0, “max_tokens”: 1000
}
Compare results
Code below makes 4 different Azure OpenAI calls:
GPT-4 Turbo Vision Preview model with OCR Enhancement Disabled.
GPT-4 Turbo Vision Preview model with OCR Enhancement Enabled.
GPT-4 Turbo Vision Preview model for manually OCR-enhanced payload (OCR enhancement disabled).
GPT-4 Turbo 2024-04-09 GA model for manually OCR-enhanced payload.
def run_gpt(scenario, payload, url, api_key):
response_json = requests.post(url, json=payload, headers={“Content-Type”: “application/json”, “api-key”: api_key}).json()
print(f”{scenario}:n{response_json[‘usage’]}n{response_json[‘choices’][0][‘message’][‘content’]}n”)
return
# 1. GPT-4 Turbo Vision Preview model with OCR Enhancement disabled
payload_sample[‘enhancements’] = {‘ocr’: {‘enabled’: False}} # Disabled OCR enhancement
run_gpt(“1. GPT-4 Turbo with Vision Preview Results with OCR enhancement Disabled”,
payload_sample, f”{GPT4V_PREVIEW_ENDPOINT}/openai/deployments/{GPT4V_PREVIEW_DEPLOYMENT}/extensions/chat/completions?api-version=2023-12-01-preview”, GPT4V_PREVIEW_KEY)
# 2. GPT-4 Turbo Vision Preview model with OCR Enhancement enabled
payload_sample[‘enhancements’] = {‘ocr’: {‘enabled’: True}} # Enabled OCR enhancement
run_gpt(“2. GPT-4 Turbo with Vision Preview Results with OCR enhancement Enabled”,
payload_sample, f”{GPT4V_PREVIEW_ENDPOINT}/openai/deployments/{GPT4V_PREVIEW_DEPLOYMENT}/extensions/chat/completions?api-version=2023-12-01-preview”, GPT4V_PREVIEW_KEY)
# 3. GPT-4 Turbo Vision Preview model with manually OCR-enhanced payload (OCR enhancement disabled)
run_gpt(“3. GPT-4 Turbo with Vision Preview Results for manually OCR-enhanced payload (OCR enhancement Disabled)”,
payload_manually_ocr_enhanced, f”{GPT4V_PREVIEW_ENDPOINT}/openai/deployments/{GPT4V_PREVIEW_DEPLOYMENT}/chat/completions?api-version=2023-12-01-preview”, GPT4V_PREVIEW_KEY)
# 4. GPT-4 Turbo 2024-04-09 GA model with manually OCR-enhanced payload.
run_gpt(“4. GPT-4 Turbo 2024-04-09 GA Results for manually OCR-enhanced payload”,
payload_manually_ocr_enhanced, f”{GPT4_GA_ENDPOINT}/openai/deployments/{ GPT4_GA_DEPLOYMENT}/chat/completions?api-version=2024-02-15-preview”,GPT4_GA_KEY)
Script outputs GPT results for 4 scenarios and as well as token usage (see below).
Number of “prompt_tokens” for scenario #2 (OCR enhancement enabled) is 1032. It is larger than 801 “prompt_tokens” for scenario #1 (OCR enhancement disabled) even exactly same payload payload_sample was sent to the Azure OpenAI API. It happens because OCR enhancement adds OCR text into the input prompt, and it counts as additional tokens.
Results #2 (with OCR enhancement enabled) and #3 (manually OCR-enhanced payload) are identical. Most important is that number of “prompt_tokens” are identical and equals to 1032. It illustrates that manually OCR-enhanced payload payload_manually_ocr_enhanced is exactly same as the original payload_sample modified by OCR enhancement by Azure OpenAI service backend.
Results #3 (GPT-4 Turbo with Vision Preview) and #4 (GPT-4 Turbo 2024-04-09 GA) are identical for payload_manually_ocr_enhanced, but they may be slightly different for other prompts and/or images since GPT-4 Turbo 2024-04-09 GA model may behave differently than GPT-4 Turbo Vision Preview model.
1. GPT-4 Turbo with Vision Preview Results with OCR enhancement Disabled:
{‘completion_tokens’: 2, ‘prompt_tokens’: 801, ‘total_tokens’: 803}
5000
2. GPT-4 Turbo with Vision Preview Results with OCR enhancement Enabled:
{‘completion_tokens’: 2, ‘prompt_tokens’: 1032, ‘total_tokens’: 1034}
4500
3. GPT-4 Turbo with Vision Preview Results for manually OCR-enhanced payload (OCR enhancement Disabled):
{‘completion_tokens’: 2, ‘prompt_tokens’: 1032, ‘total_tokens’: 1034}
4500
4. GPT-4 Turbo 2024-04-09 GA results for manually OCR-enhanced payload:
{‘completion_tokens’: 2, ‘prompt_tokens’: 1032, ‘total_tokens’: 1034}
4500
We hope this blog post helps to understand exactly how OCR enhancement works and allows you to successfully migrate to GPT-4 Turbo GA model.
Thanks
Microsoft Tech Community – Latest Blogs –Read More