Designing and running a Generative AI Platform based on Azure AI Gateway
Summary
Are you in a platform team that has been tasked with building an AI Platform to serve the Generative AI needs of your internal consumers? What does that mean? It's a daunting challenge, and even harder if you're operating in a highly regulated environment.
As enterprises scale usage of Generative AI beyond a few initial use-cases, they will face a new set of challenges – scaling, onboarding, security, and compliance, to name a few.
This article discusses such challenges and approaches to building an AI Platform to serve your internal consumers.
Needs and more needs
To successfully run Generative AI at scale, organisations are adopting new features in API Management platforms, such as Azure API Management's AI Gateway (https://techcommunity.microsoft.com/t5/azure-integration-services-blog/introducing-genai-gateway-capabilities-in-azure-api-management/ba-p/4146525). Success for these platforms rests on effective CI/CD and automation strategies. As we will see, an architecture to run Azure Open AI safely at scale involves deploying and managing many moving pieces, which together solve for scenarios such as:
How many Azure Open AI (AOAI) APIs should I create?
How do I version AOAI APIs?
How do I support consumers with different content-safety and model requirements?
How do I restrict throughput per Consumer, per deployment?
How do I scale out AOAI services?
How do I log all prompts and responses including streaming, without disruption?
What other value add services should a platform offer consumers?
Further, we need to understand how common services and libraries involved in building Generative AI Services fit into the architecture. We can build the best AI Platform in the world but if our consumers find they cannot use common Generative AI Libraries with it, have we really succeeded?
This document iterates through use-cases to build out a reference implementation that can safely run Azure API Management (AI Gateway) and Azure Open AI at scale, supporting the most common libraries and services. You can find it here:
https://github.com/graemefoster/APImAIPlatform
Target Azure Architecture
Decision Matrix
You might not need all the components. This matrix should help you understand what each iteration brings, allowing you to make an informed decision on when to stop.
Capability | AOAI | APIm / AOAI | Proxy / APIm / AOAI | Proxy / APIm / AOAI / Defender
Chargeback | ✗ | ✓ | ✓ | ✓
Prompt flow (JWT & key) | ✗ | ✓ | ✓ | ✓
Advanced Logging (PII redaction / streaming) | ✗ | ✗ | ✓ | ✓
SIEM | ✗ | ✗ | ✗ | ✓
Out of scope
We are focusing on the requirements for running Azure Open AI and Generative AI services, rather than specific application stacks. A Generative AI Orchestrator may involve multiple services such as an Azure App Service, Azure AI Search, and Storage. These will vary between applications and are considered out of scope for this document.
Roles
We have identified the following roles involved in running Generative AI Applications. NB: these may not map one-to-one to people; we leave team structure to you.
Role | Responsibilities
Gen AI Developers | Building and testing AI Orchestrators, including Prompt flow
Gen AI Operators | Understanding consumer usage, prompt performance, and overall system response
AI Platform | Managing Azure Open AI services, access, security, and audit logging
AI Governance | Monitoring AI safety, abuse, and groundedness
The remainder of the document will introduce fictitious use-cases and map them to the above roles. We will iterate on the platform and show how it provides for the requirements. It can be overwhelming to see a target architecture without insight into why individual services exist. My hope is that by the end of the document you see why each service is there. Also, this is an evolving space. As services gain new features, knowing why something is there will enable you to consolidate accordingly.
Gen AI Application Engineers
Let's start with a simple use-case. Ultimately, we want our feature teams to ship applications that deliver real business value. There is no application or business value without the Application Engineers.
Use Case:
So that I can build Generative AI Applications
As a Gen AI Application Engineer
I need access to Azure Open AI API models
Use Case:
So that I can iterate on a Generative AI Application
As a Gen AI Application Engineer
I want to see diagnostic information, like latency and response time, for my prompts
These two use-cases can be delivered with a single Azure Open AI resource:
It’s simple, but it might be all you need. We run the Orchestrator in a PaaS like Azure App Service, send telemetry to Application Insights, and use a single Azure Open AI service.
We can use Managed Identities for secure authentication, and the content filters inside Azure Open AI to keep us safe. They can do things like detect jailbreaks and moderate responses.
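To make this concrete, here's a minimal sketch of an orchestrator calling Azure Open AI with a Managed Identity, assuming the Python openai and azure-identity packages; the endpoint and deployment name are placeholders:

# Minimal sketch: an orchestrator calling Azure Open AI directly with a Managed Identity.
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "https://cognitiveservices.azure.com/.default"
)

client = AzureOpenAI(
    azure_endpoint="https://my-aoai.openai.azure.com",  # placeholder endpoint
    azure_ad_token_provider=token_provider,             # no API keys to manage
    api_version="2024-02-01",
)

response = client.chat.completions.create(
    model="gpt-35-turbo",  # the *deployment* name, not just the model
    messages=[{"role": "user", "content": "Hello from the orchestrator"}],
)
print(response.choices[0].message.content)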
We can provision PTUs (Provisioned Throughput Units https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/provisioned-throughput-onboarding) to guarantee capacity for Generative AI.
If you can stop now, then great. But most organisations will soon hit the next set of challenges:
I need to fairly share my PTU amongst multiple consumers
My consumers need access to embedding models which are not part of my PTU, and one deployment isn’t enough
I want more control over the identities accessing Azure Open AI, such as using my own IdP
I need to log all prompts and responses for a central AI safety team
If this sounds like you, then read on as we build a platform!
Gen AI Platform Engineers
Use Case:
So that my organisation can run many AI applications on a PTU
As a Gen AI Platform Engineer
I need to share a PTU fairly across consumers
Use Case:
So that I can charge-back my PTU deployments
As a Gen AI Platform Engineer
I need to report over token usage metrics
Use Case:
So that we can leverage existing security controls
As a Gen AI Platform Engineer
I need to use identities from an existing IdP
To deliver these features we're going to reach for an AI Gateway. These are becoming commonplace in API Management products. In our case, let's drop Azure API Management with its AI Gateway capabilities into the architecture and see what happens:
This is nice – we've now introduced an AI Gateway which gives us lots of extra functionality:
We'll expose an API from Azure APIm that looks and feels like the Azure Open AI API. This makes it easier for consumers using frameworks that 'expect' the AOAI surface area, and simpler for our platform team as there's only one API for them to maintain (a consumer-side sketch follows this list).
We can create new versions of this API as Open AI releases new versions.
We can also restrict the surface area of the Azure Open AI API.
Let’s say we don’t want to expose Assistants, or the ability to upload files. With this model we can just choose not to expose those APIs.
We can use APIm’s load balanced pools to balance requests.
We can get it to prioritise our PTU and fail over to PAYG if we run out of capacity.
This will reduce the likelihood of 429s getting back to our consumers.
We can use APIm to authorise incoming requests using our own IdP or a Platform Entra Application.
This lets us use custom roles and have more control over the JWTs our consumers need.
We can use APIm policies like token-rate-limiting to share our PTU fairly and stop greedy consumers.
We’ll use “products” to lower the blast radius of breaking our consumers.
When (not if!) the platform gets popular we will need to support many consumers with different token limits. Products let us model this as lots of small policy files. These will be easier to manage than one big one.
We can use policies that emit token count metrics (including for streaming endpoints) allowing chargeback scenarios.
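To see what this looks like from a consumer's point of view, here is a minimal sketch of calling the AOAI-shaped API exposed by APIm. The gateway hostname, base path, token scope, and product key are assumptions; the Ocp-Apim-Subscription-Key header is APIm's standard way of passing a subscription key to trigger product policies:

# Sketch: a consumer calling the gateway instead of Azure Open AI directly.
from azure.identity import DefaultAzureCredential, get_bearer_token_provider
from openai import AzureOpenAI

# A token for the platform's own Entra application, not for Azure Open AI itself.
token_provider = get_bearer_token_provider(
    DefaultAzureCredential(), "api://my-ai-platform/.default"  # placeholder scope
)

client = AzureOpenAI(
    azure_endpoint="https://my-apim.azure-api.net",  # base path depends on how the API is exposed in APIm
    azure_ad_token_provider=token_provider,
    api_version="2024-02-01",
    default_headers={"Ocp-Apim-Subscription-Key": "<product-key>"},  # triggers product policies
)

# Because the surface area matches Azure Open AI, existing SDK code is unchanged.
response = client.chat.completions.create(
    model="gpt35-for-my-purpose",  # the 'outside' deployment name agreed with the platform team
    messages=[{"role": "user", "content": "ping"}],
)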
Gen AI Application Engineers… part II
This is looking great… it's an amazing platform, but the engineers are not happy. What we've built is a brilliant launchpad for engineers who want to build Generative AI Applications, but it doesn't cater for all the tooling they might want to use.
Use Case:
So that I can support libraries like Prompt flow
As a Gen AI Platform Engineer
I need to cater for assumptions the libraries make on authentication
Some AI Libraries were built before the idea of AI Platforms existed and have taken tight dependencies on the way the ‘old world’ worked. These will be fixed over time but for now our platform is going to have to deal with them.
Let’s take Promptflow as an example. Our platform makes heavy use of API Management’s AI Gateway products to reduce the blast radius of change.
As of August 2024, APIm requires a Subscription Key to trigger Product behaviour. Our fictitious security team has mandated OAuth2 for all API calls. If you use Prompt flow's OAuth flow, it acquires tokens against Azure Open AI's scope, and it's tricky to attach a subscription key to its calls. It's a low-level detail, but it causes friction between the application teams and the security team.
Prompt flow makes our authentication life a bit more difficult… We’re faced with a risk-based decision. Do we:
Allow Generative AI applications built with Prompt flow to take a less secure approach sending an APIm subscription-key
Allow Gen AI Apps to have direct permissions against Azure Open AI, introducing a risk that they could bypass our gateway protections and call Azure Open AI directly.
Introduce something ‘in the middle’ to adapt Prompt flow’s requests to meet our security requirements.
There are a few ways to approach this. Let us start with an approach that doesn’t add any new services. We are going to introduce a new product to our APIm AI Gateway which will:
Authorise the incoming JWT provided by Prompt flow and map the caller to a subscription key.
Make a new call back into APIm with the subscription key appended as a header to the original request
This will have an impact on APIm's performance, so it's not a free lunch. But it reduces the risk of Prompt flow relying on API keys alone.
Great – what we can do now is:
Let our AI Applications use tokens acquired for Azure Open AI without giving them permissions to Azure Open AI
This is a nice trick: acquiring a token for AOAI doesn't by itself authorise the caller – authorisation is enforced at the AOAI service.
The APIm product can (sketched in code after this list):
Check the incoming claims on the JWT to authenticate the caller
Acquire a new token on-behalf-of the original (optional step)
Append a subscription key (using a pre-configured lookup)
Make the outbound call back to APIm where it will trigger the expected product behaviour.
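For illustration only – the real adapter is implemented as an APIm product policy, and the claim name and key lookup below are assumptions – its behaviour amounts to something like this:

# Mirror of the adapter product's steps: check the JWT, map the caller to a
# subscription key, then replay the request against the gateway.
import httpx
import jwt  # PyJWT; signature checks omitted here - APIm's validate-jwt policy does this properly

SUBSCRIPTION_KEYS = {"<consumer-1-app-id>": "<consumer-1-product-key>"}  # pre-configured lookup
GATEWAY_BASE = "https://my-apim.azure-api.net"  # placeholder

def forward(original_path: str, body: bytes, auth_header: str) -> httpx.Response:
    token = auth_header.removeprefix("Bearer ")
    claims = jwt.decode(token, options={"verify_signature": False})  # APIm validates for real
    app_id = claims["appid"]                         # authenticate the caller from its claims
    key = SUBSCRIPTION_KEYS[app_id]                  # map the caller to a subscription key
    return httpx.post(
        f"{GATEWAY_BASE}{original_path}",
        content=body,
        headers={
            "Authorization": auth_header,            # or a token acquired on-behalf-of the original
            "Ocp-Apim-Subscription-Key": key,        # triggers the expected product behaviour
            "Content-Type": "application/json",
        },
    )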
For many organisations this might be enough. But there are still a few things we might care about. The OWASP LLM Top 10 identifies common vulnerabilities in LLM usage. Wouldn't it be great if we could help our consumers detect some of them?
Bigger organisations tend to have AI Safety teams who need to report over all LLM consumption. They are asking questions like:
Can we be confident prompts aren't being jailbroken?
Can we be confident prompts are using grounding data, and not hallucinating?
Can we maintain a redacted audit log of all prompts and responses?
Let’s step into the shoes of our Responsible AI team…
Responsible AI team
Use Case:
So that my organisation stays within responsible AI boundaries
As a Responsible AI team
We want to spot check redacted prompt inputs, and outputs
Use Case:
So that my organisation stays within responsible AI boundaries
As a Responsible AI team
We want centralised alerting on attempted attacks on our Gen AI applications
Use Case:
So that my organisation stays within responsible AI boundaries
As a Responsible AI team
We want to know about CVEs from LLM models and images our applications are using
We are going to reach for a few new tools to achieve these use-cases. AI Threat Detection from Azure Defender (https://learn.microsoft.com/en-us/azure/defender-for-cloud/ai-threat-protection) will help us with centralised alerts and CVE detection. This is great news if we're using Azure Open AI, as it can take signals directly from the Content Safety layer, meaning we don't have to build anything to integrate the services.
For logging those prompts and responses we're going to have to think outside the box. Most API Management platforms offer support for logging the bodies of requests and responses, but there's a caveat: they often don't support streaming (Server-Sent Events), which is used to provide better response latency to callers. If you've ever used ChatGPT you'll have seen a streaming response in action – that typewriter-like experience as your response appears is a streaming response.
API Management solutions currently buffer these responses in order to log them, which degrades the user experience.
I'll avoid the question of 'do you need to use Server-Sent Events?'. It's safe to say that, like most questions in IT, the answer is 'it depends'. But what if you do need them?
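If you do, one option is a proxy that forwards each chunk to the caller as it arrives while keeping a copy for later logging, so the stream is never buffered. A minimal sketch of the idea (illustrative only, assuming httpx and an asyncio.Queue for hand-off to a background logger):

# Relay a streamed (SSE) completion without buffering it for the caller.
import httpx

async def relay_sse(upstream_url: str, payload: dict, headers: dict, log_queue):
    captured = []
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", upstream_url, json=payload, headers=headers) as upstream:
            async for chunk in upstream.aiter_bytes():
                captured.append(chunk)   # keep a copy for the audit log
                yield chunk              # forward immediately - the caller still sees a live stream
    # once the stream ends, hand the full transcript to a redaction / logging worker
    await log_queue.put(b"".join(captured))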
To handle logging, I'll introduce a second Gen AI Gateway into the architecture. It's called AI Central (https://github.com/microsoft/aicentral – there are lots of these out there; disclaimer, I am the primary maintainer of AI Central), and it runs as a Docker Container / Web API sitting in front of APIm. Another good option is AI Sentry – https://github.com/microsoft/ai-sentry
AI Central will drop prompts and responses into a queue for PII redaction and logging. It works with streaming responses and doesn't buffer, so it won't degrade the end-user experience. It currently uses the Azure Language service, but we are investigating using the PTU overnight when it's not used as much, or Phi-3 models running in a sidecar.
AI Central logs PII Redacted prompts and responses to a Cosmos Database that a Responsible AI Team can use.
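A minimal sketch of that redact-and-store step might look like the following, assuming a Storage queue of prompt/response records, the Azure AI Language PII detection API, and a Cosmos DB container; the resource names and message shape are illustrative:

# Pull raw prompt/response pairs off a queue, redact PII, store the redacted copy.
import json
from azure.identity import DefaultAzureCredential
from azure.storage.queue import QueueClient
from azure.ai.textanalytics import TextAnalyticsClient
from azure.cosmos import CosmosClient

credential = DefaultAzureCredential()
queue = QueueClient("https://mystorage.queue.core.windows.net", "prompt-log", credential=credential)
language = TextAnalyticsClient("https://mylanguage.cognitiveservices.azure.com", credential)
audit = (CosmosClient("https://mycosmos.documents.azure.com", credential)
         .get_database_client("ai-platform")
         .get_container_client("prompt-audit"))

for message in queue.receive_messages():
    record = json.loads(message.content)          # e.g. {"id": ..., "prompt": ..., "response": ...}
    results = language.recognize_pii_entities([record["prompt"], record["response"]])
    record["prompt"], record["response"] = (r.redacted_text for r in results)
    audit.upsert_item(record)                     # redacted copy for the Responsible AI team
    queue.delete_message(message)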
We've also enabled Azure Defender for AI. The AI Content Filter built into AOAI transmits signals to Azure Defender, which looks for attacks on the LLM. This is all surfaced in the standard Azure Defender dashboards.
But before we call it done, let's face our last hurdle: how do our Gen AI Application Engineers onboard to this platform?
Gen AI Platform Engineers… part II
Use Case:
So that I can simplify onboarding
As a Gen AI Platform Engineer
I want to streamline consumer onboarding
Use Case:
So that I can manage consumer demand
As a Gen AI Platform Engineer
I want a tool to simplify managing multiple Azure Open AI deployments
Use Case:
So that I can deploy on a Friday
As a Gen AI Platform Engineer
I want to deploy daily
How does a feature team express their requirements to a platform team? We are using a JSON document which could be added via a Pull Request into the Platform team’s repository. Something like this should suffice. It will capture enough information about the consumer to:
Understand their token / model requirements
Understand their content safety requirements
Understand when they want to promote into environments
Get in touch with them
Support Chargeback
{
  "consumerName": "consumer-1",
  "requestName": "my-amazing-service",
  "contactEmail": "engineer.name@myorg.com",
  "costCentre": "92304",
  "constantAppIdIdentifiers": [],
  "models": [
    {
      "deploymentName": "embeddings-for-my-purpose",
      "modelName": "text-embedding-ada-002",
      "contentSafety": "high",
      "environments": {
        "dev": {
          "thousandsOfTokens": 1,
          "deployAt": "2024-07-02T00:00:0000"
        },
        "test": {
          "thousandsOfTokens": 1,
          "deployAt": "2024-07-02T00:00:0000"
        },
        "prod": {
          "thousandsOfTokens": 15,
          "deployAt": "2024-07-02T00:00:0000"
        }
      }
    },
    {
      "deploymentName": "gpt35-for-my-purpose",
      "modelName": "gpt-35-turbo",
      "contentSafety": "high",
      "environments": {
        "dev": {
          "thousandsOfTokens": 1,
          "deployAt": "2024-07-02T00:00:0000"
        },
        "test": {
          "thousandsOfTokens": 1,
          "deployAt": "2024-07-02T00:00:0000"
        },
        "prod": {
          "thousandsOfTokens": 15,
          "deployAt": "2024-07-02T00:00:0000"
        }
      }
    }
  ]
}
This will be a conversation. When a consumer opens a Pull Request to add or update this information in the Platform team's repository, use the Pull Request to question or suggest alternative deployment approaches (maybe gpt-4o is a better fit for their requirement than gpt-4).
When you are comfortable with a request, merge the Pull Request into the platform repository.
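Before merging, you might also run a lightweight automated check over the submitted document. The sketch below validates the fields used in the example above; the check itself is hypothetical and should be adapted to your own schema:

# Validate a consumer onboarding request before it is merged into the platform repository.
import json
import sys

REQUIRED_TOP_LEVEL = {"consumerName", "requestName", "contactEmail", "costCentre", "models"}
REQUIRED_PER_MODEL = {"deploymentName", "modelName", "contentSafety", "environments"}

def validate(path: str) -> list[str]:
    doc = json.load(open(path))
    errors = [f"missing field: {field}" for field in REQUIRED_TOP_LEVEL - doc.keys()]
    for model in doc.get("models", []):
        name = model.get("deploymentName", "?")
        errors += [f"{name}: missing {field}" for field in REQUIRED_PER_MODEL - model.keys()]
        if model.get("contentSafety") not in {"low", "medium", "high"}:
            errors.append(f"{name}: unknown contentSafety level")
    return errors

if __name__ == "__main__":
    problems = validate(sys.argv[1])
    print("\n".join(problems) or "OK")
    sys.exit(1 if problems else 0)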
Platform Team Mapping
Now you have the feature teams’ requirements, as a Platform team you need to express these as Azure Open AI deployments and API Management Products. This is unlikely to be a one-to-one mapping.
For example, to maximise use of a PTU you might want to consolidate multiple consumer demands into a single deployment. To maximise it further, you will need to consolidate Content Filter policies into 'low', 'medium', or 'high' (as a content filter policy has a one-to-one affinity with a deployment).
These are all backend implementation decisions. Your contract with the consumer is to provide them access to Azure Open AI with the agreed deployment names, throughput, and content filters, secured using their provided Entra Application IDs.
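As a sketch of that consolidation, assuming the consumer request documents above as input, you could sum the requested throughput per model and content-safety level for an environment to size the shared deployments (the consolidation rule itself is an assumption):

# Consolidate consumer demand into shared deployments per (model, content-safety) pair.
from collections import defaultdict

def consolidate(consumer_requests: list[dict], environment: str) -> dict:
    demand = defaultdict(int)
    for request in consumer_requests:
        for model in request["models"]:
            key = (model["modelName"], model["contentSafety"])
            demand[key] += model["environments"][environment]["thousandsOfTokens"]
    # e.g. {("gpt-35-turbo", "high"): 30} -> one shared deployment sized at 30k tokens per minute
    return dict(demand)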
Platform Team Decisions
How many AOAI services will you need? AOAI has quota limits per region. For example, as of August 2024, Australia East allows 350k tokens per minute for text-embedding-ada-002. https://learn.microsoft.com/en-us/azure/ai-services/openai/quotas-limits
If you have more demand than available capacity for a region / subscription pair, then you will need to deploy more Azure Open AI resources to either different subscriptions, or different regions. The code sample provided uses a single subscription, scaling out over multiple regions.
We’ll leave the science of sizing PTU deployments out-of-scope for here. There’s lots of good documentation out there for sizing PTUs, e.g. https://techcommunity.microsoft.com/t5/ai-azure-ai-services-blog/right-size-your-ptu-deployment-and-save-big/ba-p/4053857.
Back to our platform, let's start simple – we will define an AOAI deployment using this JSON. Our platform will be built from a list of these.
{
  "aoaiName": "graemeopenai",
  "deploymentName": "testdeploy2",
  "enableDynamicQuota": false,
  "deploymentType": "PAYG",
  "model": "gpt-35-turbo",
  "modelVersion": "0613",
  "thousandsOfTokensPerMinute": 5
}
Finally, we need a mapping between the demands of our feature team to the platform deployments. We will use this JSON:
{
  "consumerName": "consumer-1",
  "requirements": [
    {
      "outsideDeploymentName": "text-embedding-ada-002",
      "platformTeamDeploymentMapping": "text-embedding-ada-002",
      "platformTeamPoolMapping": "graemeopenai-embedding-pool"
    },
    {
      "outsideDeploymentName": "gpt35",
      "platformTeamDeploymentMapping": "testdeploy2",
      "platformTeamPoolMapping": "graemeopenai-pool"
    }
  ]
}
The platform uses APIm policies to rewrite the 'outside' deployment name in consumer requests to the actual deployment name, allowing multiple consumers to share a single deployment. This mapping lets the Platform team introduce new deployments, potentially with different names, without affecting consumers.
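For illustration (the real rewrite happens in an APIm policy, not in application code), the mapping amounts to swapping the deployment segment of the request path:

# Rewrite the 'outside' deployment name to the platform's real deployment name.
def rewrite_path(path: str, mapping: dict) -> str:
    # e.g. /openai/deployments/gpt35/chat/completions -> /openai/deployments/testdeploy2/chat/completions
    lookup = {r["outsideDeploymentName"]: r["platformTeamDeploymentMapping"]
              for r in mapping["requirements"]}
    parts = path.split("/")
    index = parts.index("deployments") + 1
    parts[index] = lookup[parts[index]]
    return "/".join(parts)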
In future iterations we want to write a simple User Interface to help manage this mapping exercise.
Deployment
Deployment of a platform should be no different to deployment of a feature. The more you do it, the more confident you will be. The Bicep templates that form the platform (https://github.com/graemefoster/APImAIPlatform) are designed to deploy everything – Azure Open AI services, deployments, APIs, Products, AI Central, etc in a single “line-of-sight”.
Our entire platform is deployed using a single ‘az deployment sub create‘ command. If we had to rebuild the entire thing from scratch it would be the same single deployment command.
My recommendation is to deploy at least daily, and potentially more frequently.
Optional Extras
In building this document some other ideas popped up that I think would make it even easier and quicker for your internal customers to onboard:
Providing platform endpoints to test your Prompts against well-known jail-breaks and other OWASP LLM threats
Running random checks against audited prompts and responses to check the groundedness of the responses
CI / CD Pipelines that ‘collect’ an application’s prompts and responses and pro-actively run guardrail evaluations over them
Think about all the requirements that your consumers need to ‘tick’ before they can deploy into production. Platforms succeed when their customers fall into the proverbial pit-of-success. The more general requirements you can automate, the more your engineers are going to love your platform.
Final thoughts
And that's it. We've covered a lot of detail but have set up:
Azure Open AI Deployments
An API Management AI Gateway capable of authenticating, load balancing and enforcing consumer quota
An AI proxy that can provide advanced logging, as well as bridging libraries like Prompt flow which cannot easily send both Entra JWTs and APIm Subscription Keys
AI Threat Detection from Azure Defender to assist SIEM monitoring
Future enhancements that could make your platform even more valuable
What do you think? How are you approaching this problem? Try the sample. Make it better! We’d love to hear from you so please leave some feedback.