The Azure Multimodal AI & LLM Processing Solution Accelerator
Introducing the Azure Multimodal AI & LLM Processing Solution Accelerator
The Azure Multimodal AI & LLM Processing Accelerator is a one-stop shop for AI + LLM processing use cases such as content summarization, data extraction, classification and enrichment. This single accelerator supports all types of input data (text, documents, audio, images, video, etc.) and combines the best of Azure AI Services (e.g. Document Intelligence and AI Speech) with Large Language Models (such as those available in Azure OpenAI and Azure AI Studio) to achieve accurate, reliable and scalable automation of tasks. Best of all, it enables development teams to build and maintain all of their applications from a single codebase, letting you deliver use cases to production much faster and keep them far more maintainable than if you used 5-10 different accelerators that each focus on a specific use case or data type.
Get started with the GitHub repository: https://github.com/Azure/multimodal-ai-llm-processing-accelerator
This accelerator implements many of the traditional machine learning techniques that are missing from existing GenAI applications and that are needed for true automation of backend processing tasks, including the merging of confidence scores and metadata from Azure AI services with raw LLM outputs. This makes it possible to automatically identify and accept confident, reliable results while escalating the less reliable ones to human review, rather than reviewing all LLM outputs to ensure their accuracy. This helps your organization transition to true automation, instead of the current state of play where every GenAI application is a chatbot (since every LLM result needs to be reviewed).
The accelerator also resolves many of the shortcomings of existing demo applications and code samples. It includes native support for the full range of real-world integration options (HTTP API, CosmosDB, SQL, Blob Storage, Event Grid, AI Search, Fabric and more), a selection of pre-built pipeline templates so that developer teams can get started immediately (for use cases like document classification and extraction, contact center processing and analytics, and content summarization), a demo web app for showcasing the pre-built pipelines or your own custom solutions to non-technical stakeholders, and infrastructure templates for full deployment to Azure in 7 minutes flat.
Form Field Extraction Pipeline – Example Output
Key Features
Pre-built processing pipelines included: The solution comes with a number of pre-built processing pipelines that can be easily customized and deployed straight to production, such as document processing, text summarization, contact center call analysis and more.
Azure Function host: The solution uses Azure Function App as the pipeline host, offering scalability, cost-efficiency and a full suite of integration options out of the box. Azure Functions also makes it easy to configure the rate and concurrency limits that are crucial in large-scale production deployments (e.g. to ensure LLM endpoints are not overloaded when a batch process is triggered). A minimal sketch of a pipeline endpoint appears after this feature list.
Front-end demo app included: A simple demo web app makes it easy to test backend APIs through a UI, and to share solutions and collaborate with non-technical users and business stakeholders.
Data converters and processors: Many of the core components required for multimodal processing pipelines are included, such as Azure Document Intelligence, Azure AI Speech, Azure OpenAI and more. These help you easily convert your data and Azure AI API responses into the best format for consumption by LLMs, and the pipelines are built in an open and extensible way so that you can easily incorporate custom-built or external pipeline components.
Enriched outputs & confidence scores: A number of components are included for merging the outputs of the pre-processing steps with those from the LLM. For example, the confidence scores, bounding boxes and writing styles from Document Intelligence can be merged with the values returned by the LLM. This allows for reliable automation of tasks instead of having to trust that the LLM is correct (or reviewing every result). A sketch of this enrichment step also appears after this feature list.
Data validation & intermediate outputs: All pipelines validate that results conform to the required schema and return not just the final result but all intermediate outputs. This lets you reuse the data (e.g. the raw Azure AI Speech transcription) for other downstream tasks, avoiding paying to process the same input data multiple times, while also providing more useful information to the end client. A sketch of this pattern is shown after the feature list.
Powerful and flexible: The application is built to support both simple and complex pipelines. This allows you to build pipelines for all of your use cases without needing to start from scratch with a new codebase when you need to do something a little more complex (a common occurrence with many GenAI frameworks and accelerators). A single deployment of this accelerator can support all of your backend processing pipelines, from proof of concepts to business-critical applications.
Infrastructure-as-code: An Azure Bicep template and instructions for local & cloud deployment are included, guiding the customization process and enabling immediate deployment to Azure.
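To make the Azure Functions hosting model concrete, here is a minimal sketch of what an HTTP-triggered pipeline endpoint could look like using the Python v2 programming model. The route name and the run_pipeline helper are hypothetical placeholders for illustration, not the accelerator's actual code:

```python
import json

import azure.functions as func

app = func.FunctionApp()

def run_pipeline(text: str) -> dict:
    # Hypothetical stand-in: a real pipeline would call Azure AI services
    # and an LLM, then merge, enrich and validate the results.
    return {"summary": text[:100]}

@app.route(route="summarize_text", methods=["POST"])
def summarize_text(req: func.HttpRequest) -> func.HttpResponse:
    # HTTP-triggered pipeline endpoint. Rate and concurrency limits can be
    # tuned in the Function App's host.json to protect downstream LLM endpoints.
    body = req.get_json()
    result = run_pipeline(body["text"])
    return func.HttpResponse(json.dumps(result), mimetype="application/json")
```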
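The enrichment of LLM outputs with pre-processing metadata can be illustrated with a small sketch. This is a simplified, hypothetical version of the idea, assuming the Document Intelligence response has been flattened into a list of words with confidence scores and bounding polygons; it is not the accelerator's actual merging component:

```python
def enrich_llm_field(llm_value: str, di_words: list[dict]) -> dict:
    """Locate an LLM-extracted value among Document Intelligence words and
    attach their confidence scores and bounding polygons, e.g. so a review
    UI can highlight exactly where the value came from."""
    matched = [w for w in di_words if w["content"] in llm_value.split()]
    if not matched:
        # Value not grounded in the extracted text: flag as unreliable
        return {"value": llm_value, "confidence": 0.0, "polygons": []}
    return {
        "value": llm_value,
        "confidence": min(w["confidence"] for w in matched),  # Weakest link
        "polygons": [w["polygon"] for w in matched],
    }

# Example with a mocked, flattened Document Intelligence word list:
di_words = [
    {"content": "John", "confidence": 0.98, "polygon": [1.0, 2.0, 3.0, 2.0, 3.0, 2.5, 1.0, 2.5]},
    {"content": "Smith", "confidence": 0.95, "polygon": [3.2, 2.0, 5.0, 2.0, 5.0, 2.5, 3.2, 2.5]},
]
print(enrich_llm_field("John Smith", di_words))  # confidence 0.95, both polygons
```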
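Likewise, the data validation and intermediate-output pattern might look something like the following sketch, which uses Pydantic for schema validation (an assumed choice for illustration; the schema fields are hypothetical):

```python
from pydantic import BaseModel, ValidationError

class CallAnalysis(BaseModel):
    call_type: str
    summary: str
    customer_sentiment: str

def build_response(raw_transcription: str, llm_output: dict) -> dict:
    """Validate the LLM's output against the schema, then return both the
    final result and the intermediate outputs (e.g. the raw transcription)
    so downstream tasks can reuse them without reprocessing the audio."""
    try:
        result = CallAnalysis(**llm_output)
    except ValidationError as exc:
        return {"success": False, "errors": exc.errors()}
    return {
        "success": True,
        "result": result.model_dump(),
        "intermediate_outputs": {"raw_transcription": raw_transcription},
    }
```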
Process Flow & Solution Architecture
Common scenarios & use cases
The overall design means the accelerator can be used for the vast majority of GenAI use cases, but below are a handful of the most popular ones that our customers are building with it:
Call center analysis: Transcribe and diarize call center audio with Azure AI Speech, then use Azure OpenAI to classify the call type, summarize the topics and themes in the call, analyze the sentiment of the customer, and ensure the customer service agent complied with standard procedures (e.g. following the appropriate script, outlining the privacy policy and sending the customer a Product Disclosure Statement).
Document processing: Ingest PDFs, Word documents and scanned images, extract the raw text content with Document Intelligence, then use Azure OpenAI to classify the document by type, extract key fields (e.g. contact information, document ID numbers), classify whether the document was stamped and signed, and return the result in a structured format (a simplified sketch of this flow follows this list).
Insurance claim processing: Process all emails and documents in long email chains. Use Azure Document Intelligence to extract information from the attachments, then use Azure OpenAI to generate a timeline of key events in the conversation, determine whether all required documents have been submitted, summarize the current state of the claim, and determine the next-best-action (e.g. auto-respond asking for more information, or escalate to human review for processing).
Customer email processing: Classify incoming emails into categories, summarize their content, determine the sender's sentiment, and triage them into a severity category for human processing.
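As a hedged sketch of the document processing flow above, the snippet below chains the Document Intelligence and Azure OpenAI SDKs together. The prompt, field set and deployment name are illustrative assumptions, not the accelerator's actual pipeline code:

```python
import json

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential
from openai import AzureOpenAI

di_client = DocumentAnalysisClient("<DI_ENDPOINT>", AzureKeyCredential("<DI_KEY>"))
llm = AzureOpenAI(azure_endpoint="<AOAI_ENDPOINT>", api_key="<AOAI_KEY>", api_version="2024-06-01")

# Step 1: extract the raw text (plus layout metadata) with Document Intelligence
with open("scanned_form.pdf", "rb") as f:
    result = di_client.begin_analyze_document("prebuilt-layout", document=f).result()

# Step 2: ask the LLM to classify the document and extract key fields as JSON
response = llm.chat.completions.create(
    model="gpt-4o",  # Azure OpenAI deployment name (assumed)
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": (
            "Classify the document type and extract the contact name and document "
            "ID from the text. Respond with JSON keys: document_type, contact_name, "
            "document_id, is_signed."
        )},
        {"role": "user", "content": result.content},
    ],
)
fields = json.loads(response.choices[0].message.content)
```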
Background & Problem Statement
The promise (and challenges) of using LLMs for task automation
Most organizations have a huge number of simple tasks and processes that consume large amounts of time and energy. These could be things like classifying and extracting information from documents, summarizing and triaging customer emails, or transcribing and running compliance tasks on contact center call recordings. While some of these tasks can be automated with existing tools and AI approaches, those tools often require a lot of up-front investment to fully configure, train and customize before you have a reliable, working solution. They can also perform poorly when dealing with input data that is slightly different from what was expected, and may never be the right fit for scenarios that require the solution to be flexible or adaptable.
On the other hand, Large Language Models have emerged as a powerful and general-purpose approach that is able to handle these complex and varied situations. And more recently, with the move from text-only models to multimodal models that can incorporate text, audio and video, they are a powerful tool that we can use to automate a wide variety of everyday tasks. But while LLMs are powerful and flexible, they have their own shortcomings when it comes to providing precise and reliable outputs, and they too can be sensitive to the quality of raw and unprocessed input data.
The challenge for AI systems: Reliability
However, the biggest problem with LLMs is their lack of reliability. Unlike almost every other type of AI model, LLMs do not give reliable or usable confidence scores with their outputs (primarily due to their token-in-token-out architecture). And if an LLM is asked to rate the reliability of its own response, research has shown that these self-assessments can be unreliable and uncalibrated, especially for single predictions (1, 2, 3). The end result is that the only way to be confident in the results of an LLM is to review every output and turn everything into a chatbot – not exactly the automation that we hoped for.
For decades we have been using machine learning and AI models to automate tasks. In most cases where these models are used to assist in human processes, confidence scores are used to help decide whether an AI prediction can be accepted automatically, or whether it needs to be reviewed (usually referred to as ‘human-in-the-loop’ processing). The idea of this approach is to use the model confidence scores and other metadata to let the model solve the majority of the easy and routine cases while escalating all of the difficult and challenging cases (where the model has lower confidence scores) to a human. With this approach, it is common to automate 80+% of tasks with extremely high accuracy and reliability, while ensuring that the most challenging cases are always reviewed by a human.
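The heart of this pattern is very simple. Below is a minimal, hedged sketch of the routing logic (the threshold value and labels are illustrative; in practice thresholds are tuned per use case on evaluation data):

```python
CONFIDENCE_THRESHOLD = 0.90  # Illustrative; tune on a labelled evaluation set

def route_prediction(prediction: dict) -> str:
    """Accept high-confidence predictions automatically and escalate the rest.
    In production, escalation would enqueue the case for human review along
    with the metadata (scores, bounding boxes, etc.) needed to review it quickly."""
    if prediction["confidence"] >= CONFIDENCE_THRESHOLD:
        return "auto_accept"
    return "human_review"

print(route_prediction({"value": "INV-0042", "confidence": 0.97}))  # auto_accept
print(route_prediction({"value": "INV-0042", "confidence": 0.41}))  # human_review
```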
For data processing tasks like summarization, data extraction and classification, LLMs will usually give strong results during a proof of concept. But most development teams discover that it is challenging or even impossible to move from a good-enough solution to a reliable, production-ready solution, because the models' inner workings are largely a 'black box'. And unlike traditional machine learning pipelines, there are only a handful of things that can be tuned to improve performance (such as changing the model or tweaking the prompt), and those changes usually have far-reaching effects that alter performance across the entire dataset – a truly frustrating experience.
Unlocking the best of both worlds: Combining domain-specific AI models with LLMs
By combining the consistency, reliability and rich outputs of domain-specific AI models (such as Azure Document Intelligence, Azure AI Speech, etc.) with the general knowledge and flexibility of LLMs, it is possible to build solutions that are accurate, reliable, fast and cost-effective. Most importantly, it is possible to create systems that truly automate tasks without the need for manual review of every result.
In a recent customer project that involved extracting Order IDs from scanned PDFs and phone images, we used a number of these techniques to increase performance from ~60% with GPT-4o alone to near-perfect accuracy:
Giving GPT-4o the image alone resulted in an overall recall of 60% of the order IDs, and it was impossible to know whether the results for a single PDF could be trusted.
Adding Document Intelligence text extraction to the pipeline meant that GPT-4o could analyze both the image and the extracted text. This increased the overall recall to over 80%.
Many images were rotated incorrectly, and GPT-4o performed poorly on these files (50% lower performance due to rotation alone). Document Intelligence returns a page rotation value in its response, and using this to correct those images prior to processing by GPT-4o drastically improved performance on those images – from 50% to over 80%.
Finally, by cross-referencing the order IDs returned by GPT-4o against the Document Intelligence result, we could assign a confidence score to each of the extracted IDs. With these confidence scores, we were able to tune and set confidence thresholds based on the type of image, resulting in 100% recall on 87% of the documents in the dataset, with the final 13% of preliminary results being automatically flagged for human review.
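A simplified sketch of that final cross-referencing step is shown below. The fuzzy-matching logic is an illustrative stand-in; the real project tuned separate thresholds per image type:

```python
import difflib

def order_id_confidence(llm_order_id: str, di_words: list[dict]) -> float:
    """Cross-reference an order ID returned by GPT-4o against the words
    extracted by Document Intelligence. An exact match inherits the OCR
    word's confidence; otherwise fall back to a penalized fuzzy match."""
    best = 0.0
    for word in di_words:
        if word["content"] == llm_order_id:
            return word["confidence"]  # Exact match: reuse OCR confidence
        similarity = difflib.SequenceMatcher(None, word["content"], llm_order_id).ratio()
        best = max(best, similarity * word["confidence"])
    return best

di_words = [{"content": "ORD-12345", "confidence": 0.96}]
print(order_id_confidence("ORD-12345", di_words))  # 0.96 -> above threshold, auto-accept
print(order_id_confidence("ORD-12845", di_words))  # lower -> flagged for human review
```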
At the conclusion of this project, our customer was able to deploy the solution and automate the majority of their processing workload with confidence, knowing that any cases that were too challenging for the LLM would automatically be escalated for review. Reviews can now be completed in a fraction of the time thanks to the additional metadata returned with each result.
Future Roadmap, FAQs & More
For more information on the future roadmap, FAQs and more, head over to the GitHub repository. If you have questions about the repository, have feature requests or bug reports, or want to share your experience with the accelerator, please submit your feedback using the GitHub Issues section.
I hope you enjoy the accelerator and it helps you create value in your organization!