AI-as-a-Service: Architecting GenAI Application Governance with Azure API Management and Fabric
The past year has seen explosive growth for Azure OpenAI and large language models in general. Because these models rely on a token-based approach to processing requests, ensuring prompt engineering is done correctly, tracking which models and APIs are being used, load balancing across multiple instances, and creating chargeback models have all become increasingly important. Azure API Management (APIM) is key to solving these challenges, and several announcements at Microsoft Build 2024 focused on making Azure OpenAI and APIM easier to use together.
As the importance of evaluating analytics and performing data science against Azure OpenAI-based workloads grows, storing usage information becomes critical. That’s where adding Microsoft Fabric and the Lakehouse to the architecture comes in. Capturing the usage data in an open format for long-term storage while enabling fast querying rounds out the overall solution.
We must also consider that not every use case requires a Large Language Model (LLM). With the recent rise of Small Language Models (SLMs) such as Phi-3 for use cases that do not need an LLM, a typical enterprise will very likely run multiple types of Generative AI (GenAI) models, all exposed through a centrally secured and governed set of APIs so that every GenAI use case can be onboarded and adopted rapidly. An AI Center of Enablement framework that provides “AI-as-a-Service” lets organizations safely and quickly enable different GenAI models, and their numerous versions, within the allocated budget or through a chargeback model that spans the enterprise, regardless of how many teams consume the AI services or how many subscriptions and environments they end up requiring.
This model also gives organizations complete consumption visibility when they purchase Provisioned Throughput Units (PTUs) for their production GenAI workloads (at scale, with predictable latency and without having to worry about noisy neighbors), even when the individual AI use cases or business units could not justify purchasing PTUs on their own. This true economy of scale is achieved with the same architecture: PTU capacity is purchased for a particular Azure OpenAI model deployment and shared among all business-critical production use cases.
The overall architecture for this “AI-as-a-Service” solution is as follows:
Flow:
A client makes a request to an AI model through Azure API Management using a subscription key that is unique to them. This allows multiple clients to share the same AI model instance while still letting us uniquely identify each one. Clients could be different business units, internal or external consumers, or product lines (see the sample request after this list).
Azure API Management forwards the request to the AI model and receives the output of the model.
Azure API Management logs the subscription details and request/response data to Event Hubs using a log-to-eventhub policy.
Using the Real-Time Intelligence experience in Microsoft Fabric, an Eventstream processor reads the data from Event Hubs.
The output of the stream is written to a managed Delta table in a Lakehouse.
After creating a view over the Delta table in the SQL analytics endpoint for the Lakehouse, it can be queried by Power BI. We can also use a Notebook to perform any data science requirements against the prompt data.
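To make step 1 concrete, here is a minimal sketch of a client calling a chat completion through the API Management gateway with its own subscription key. The gateway URL, API suffix, deployment name, and API version are placeholders, and the sketch assumes the API is configured to accept the APIM subscription key in the api-key header (the same header the logging policy later in this post reads).
import requests

# Placeholders - replace with your own APIM gateway, API suffix, deployment, and key
APIM_ENDPOINT = "https://contoso-apim.azure-api.net/openai"   # APIM gateway URL + API suffix
DEPLOYMENT = "gpt-4o"                                         # Azure OpenAI deployment name
API_VERSION = "2024-02-01"                                    # API version exposed by the API
SUBSCRIPTION_KEY = "<client-specific-apim-subscription-key>"  # unique per client / business unit

url = f"{APIM_ENDPOINT}/deployments/{DEPLOYMENT}/chat/completions?api-version={API_VERSION}"
headers = {"api-key": SUBSCRIPTION_KEY, "Content-Type": "application/json"}
payload = {"messages": [{"role": "user", "content": "Summarize the benefits of AI-as-a-Service."}]}

response = requests.post(url, headers=headers, json=payload, timeout=60)
response.raise_for_status()
print(response.json()["usage"])  # the same token usage the APIM policy later logs to Event Hubs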
Build out
Create an Event Hub logger in API Management.
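The logger can be created in the Azure portal, through Bicep, or via the management API. As an optional sketch, the snippet below creates it with the API Management REST API using an Event Hub connection string that has send rights; the resource names and api-version are placeholders to adjust for your environment.
import requests
from azure.identity import DefaultAzureCredential

# Placeholders - adjust to your subscription, resource group, APIM instance, and Event Hub
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "<resource-group>"
APIM_NAME = "<apim-service-name>"
LOGGER_ID = "ai-usage"  # must match the logger-id used in the policy below
EVENTHUB_NAME = "<event-hub-name>"
EVENTHUB_CONN_STR = "<event-hub-connection-string-with-send-rights>"

token = DefaultAzureCredential().get_token("https://management.azure.com/.default").token
url = (
    f"https://management.azure.com/subscriptions/{SUBSCRIPTION_ID}"
    f"/resourceGroups/{RESOURCE_GROUP}/providers/Microsoft.ApiManagement"
    f"/service/{APIM_NAME}/loggers/{LOGGER_ID}?api-version=2022-08-01"
)
body = {
    "properties": {
        "loggerType": "azureEventHub",
        "description": "Logger for Azure OpenAI usage events",
        "credentials": {"name": EVENTHUB_NAME, "connectionString": EVENTHUB_CONN_STR},
    }
}
resp = requests.put(url, headers={"Authorization": f"Bearer {token}"}, json=body, timeout=30)
resp.raise_for_status()
print(resp.json())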
In the API that exposes the AI backend, add a policy that sends the data to the Event Hub. This example shows Azure OpenAI as the backend.
<policies>
<inbound>
<base />
<authentication-managed-identity resource="https://cognitiveservices.azure.com" output-token-variable-name="msi-access-token" ignore-error="false" />
<set-header name="Authorization" exists-action="override">
<value>@("Bearer " + (string)context.Variables["msi-access-token"])</value>
</set-header>
<set-variable name="requestBody" value="@(context.Request.Body.As<string>(preserveContent: true))" />
</inbound>
<backend>
<base />
</backend>
<outbound>
<base />
<choose>
<when condition="@(context.Response.StatusCode == 200)">
<log-to-eventhub logger-id="ai-usage">@{
var responseBody = context.Response.Body?.As<string>(true);
var requestBody = (string)context.Variables["requestBody"];
return new JObject(
new JProperty("EventTime", DateTime.UtcNow),
new JProperty("AppSubscriptionKey", context.Request.Headers.GetValueOrDefault("api-key", string.Empty)),
new JProperty("Request", requestBody),
new JProperty("Response", responseBody)
).ToString();
}</log-to-eventhub>
</when>
</choose>
</outbound>
<on-error>
<base />
</on-error>
</policies>
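Before building the Fabric pieces, it can help to confirm that events are actually reaching Event Hubs. This is an optional verification sketch using the azure-eventhub package; the connection string and hub name are placeholders.
from azure.eventhub import EventHubConsumerClient

# Placeholders - point at the same Event Hub the APIM logger writes to
CONN_STR = "<event-hub-namespace-connection-string-with-listen-rights>"
EVENTHUB_NAME = "<event-hub-name>"

def on_event(partition_context, event):
    # Each event is the JSON document built by the log-to-eventhub policy above
    print(event.body_as_str())

client = EventHubConsumerClient.from_connection_string(
    CONN_STR, consumer_group="$Default", eventhub_name=EVENTHUB_NAME
)
with client:
    # Reads events from the beginning of each partition; stop with Ctrl+C once you see traffic
    client.receive(on_event=on_event, starting_position="-1")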
Build an Eventstream in Fabric that lands the data into the Delta table.
The data comes across a bit too raw to use for analytics, but with the SQL analytics endpoint we can create views on top of the table.
CREATE OR ALTER VIEW [dbo].[AIUsageView] AS
SELECT CAST(EventTime AS DateTime2) AS [EventTime],
[AppSubscriptionKey],
JSON_VALUE([Response], '$.object') AS [Operation],
JSON_VALUE([Response], '$.model') AS [Model],
[Request],
[Response],
CAST(JSON_VALUE([Response], '$.usage.completion_tokens') AS INT) AS [CompletionTokens],
CAST(JSON_VALUE([Response], '$.usage.prompt_tokens') AS INT) AS [PromptTokens],
CAST(JSON_VALUE([Response], '$.usage.total_tokens') AS INT) AS [TotalTokens]
FROM
[YOUR_LAKEHOUSE_NAME].[dbo].[AIData]
We can now create a report in Power BI using a Direct Lake query.
We can also load the data into a Spark dataframe to perform data science analysis on the prompts and responses.
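As a sketch, assuming the Lakehouse is attached as the notebook's default and the table is named AIData as in the view above, the raw events can be loaded and explored like this (spark is the built-in session in a Fabric notebook):
from pyspark.sql import functions as F

# Read the managed Delta table written by the Eventstream
df = spark.read.table("AIData")

# Example exploration: total tokens per consumer, extracted from the raw response JSON
usage = (
    df.withColumn("TotalTokens", F.get_json_object("Response", "$.usage.total_tokens").cast("int"))
      .groupBy("AppSubscriptionKey")
      .agg(F.sum("TotalTokens").alias("TotalTokens"))
)
usage.show()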
You can find more detailed instructions on building this on our GitHub sample.
A Landing Zone Accelerator is also available that shows how to build the underlying foundation infrastructure in an enterprise way.
Alternative Designs
1. Azure Cosmos DB for NoSQL to persist Chat History – If your application already stores chat history (prompts and completions) in Azure Cosmos DB for NoSQL, you don’t need to log the requests and responses to Event Hub again from the APIM policy. In that case, you can simply log the key metrics to Event Hub (e.g. client identifier, deployment type, tokens consumed) and source the prompts and completions from Cosmos DB for advanced analytics. The new preview feature for mirroring a Cosmos DB database into Fabric can simplify this process.
Here is a code sample to parse the response body and log the token consumption through APIM policies.
<log-to-eventhub logger-id="ai-usage">@{
return new JObject(
new JProperty("TotalTokens", context.Response.Body.As<JObject>(preserveContent: true).SelectToken("usage.total_tokens").ToString())
).ToString();
}</log-to-eventhub>
Once the raw token counts and API consumer information (e.g. the different business units using the AI-as-a-Service model) are logged to Event Hub and land in the Fabric Lakehouse, aggregate measures can be created directly on top of the semantic model (default or custom) and displayed in a Power BI dashboard of your choice. An example of such an aggregate measure is as follows:
TokensByBU = CALCULATE(
    SUMX(
        aoaichargeback,
        VALUE(MAX(aoaichargeback[TotalTokens]))
    ),
    ALLEXCEPT(aoaichargeback, aoaichargeback[BusinessUnitName])
)
Here aoaichargeback is the name of the Lakehouse table where all events emitted from APIM are stored. The TokensByBU measure calculates the sum of the maximum TotalTokens value for each BusinessUnitName in the aoaichargeback table.
Since both the chat history data and the key usage/performance metrics are in the Lakehouse, they can be combined for any advanced analytical purposes. The same approach described earlier, using the Lakehouse SQL analytics endpoint, can be applied to analyze and govern the persisted data.
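As an illustrative sketch only, with hypothetical table and column names (a mirrored Cosmos DB container surfaced as ChatHistory and a shared RequestId that both your application and the APIM policy would need to record), the two sources could be combined in a notebook like this:
# Hypothetical names: aoaichargeback holds the usage events from APIM, ChatHistory is the
# mirrored Cosmos DB chat history; both are assumed to carry a common RequestId column.
metrics = spark.read.table("aoaichargeback")
chats = spark.read.table("ChatHistory")

combined = metrics.join(chats, on="RequestId", how="left")
combined.select("BusinessUnitName", "TotalTokens", "Prompt", "Completion").show(truncate=False)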
2. Azure OpenAI Emit Token Metric Policy – With the recent announcement of GenAI gateway capabilities in Azure API Management (a set of features designed specifically for GenAI use cases), we can now get key Azure OpenAI consumption metrics straight from our Application Insights instance once this feature is enabled and implemented. A new policy, <azure-openai-emit-token-metric>, can be used to send Azure OpenAI token count metrics to Application Insights along with User ID, Client IP, and API ID as dimensions.