Expanding GenAI Gateway Capabilities in Azure API Management
In May 2024, we introduced GenAI Gateway capabilities, a set of features designed specifically for GenAI use cases. Today, we are happy to announce new policies that extend these capabilities to large language models available through the Azure AI Model Inference API. The new policies work the same way as the previously announced ones but can be applied to a much wider range of LLMs.
The Azure AI Model Inference API enables you to consume the capabilities of models available in the Azure AI model catalog in a uniform and consistent way, so you can work with different models in Azure AI Studio without changing your underlying code.
Working with large language models presents unique challenges, particularly around managing token resources. Token consumption affects both the cost and the performance of intelligent apps calling the same model, so robust mechanisms for monitoring and controlling token usage are crucial. The new policies address these challenges by providing detailed insight into, and control over, token resources, ensuring efficient and cost-effective use of models deployed in Azure AI Studio.
LLM Token Limit Policy
The LLM Token Limit policy (preview) provides the flexibility to define and enforce token limits when interacting with large language models available through the Azure AI Model Inference API.
Key Features
Configurable Token Limits: Set token limits for requests to control costs and manage resource usage effectively.
Prevents Overuse: Automatically blocks requests that exceed the token limit, ensuring fair use and mitigating the noisy neighbor problem.
Seamless Integration: Works with existing applications, requiring no changes to your application configuration.
Learn more about this policy here.
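As a minimal sketch, the policy is placed in the inbound section of an API's policy definition. The counter key, the per-minute limit, and the variable name below are illustrative values to adapt to your own workload:

<policies>
    <inbound>
        <base />
        <!-- Cap each calling IP address at 5000 tokens per minute.
             With estimate-prompt-tokens set to false, consumption is measured from the
             usage reported in the model response; remaining tokens are exposed in a variable. -->
        <llm-token-limit
            counter-key="@(context.Request.IpAddress)"
            tokens-per-minute="5000"
            estimate-prompt-tokens="false"
            remaining-tokens-variable-name="remainingTokens" />
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
</policies>

Requests that would push the counter over the limit receive a 429 Too Many Requests response until the window resets.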
LLM Emit Token Metric Policy
The LLM Emit Token Metric policy (preview) provides detailed metrics on token usage, enabling better cost management and insights into model usage across your application portfolio.
Key Features
Real-Time Monitoring: Emit metrics in real time to monitor token consumption.
Detailed Insights: Gain insights into token usage patterns to identify and mitigate high-usage scenarios.
Cost Management: Split token usage by any custom dimension to attribute cost to different teams, departments, or applications.
Learn more about this policy here.
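For example, a sketch along the following lines emits prompt, completion, and total token counts split by custom dimensions. The namespace and dimension choices here are illustrative, and the API needs an Application Insights or Azure Monitor integration configured for the metrics to be recorded:

<policies>
    <inbound>
        <base />
        <!-- Emit token consumption metrics, split by API and by calling IP address. -->
        <llm-emit-token-metric namespace="llm-metrics">
            <dimension name="API ID" />
            <dimension name="Client IP" value="@(context.Request.IpAddress)" />
        </llm-emit-token-metric>
    </inbound>
    <backend>
        <base />
    </backend>
    <outbound>
        <base />
    </outbound>
</policies>

Dimensions such as API ID use built-in default values when no value attribute is supplied.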
LLM Semantic Caching Policy
The LLM Semantic Caching policy (preview) is designed to reduce latency and token consumption by caching responses based on the semantic content of prompts.
Key Features
Reduced Latency: Cache responses to frequently asked queries to decrease response times.
Improved Efficiency: Optimize resource utilization by reducing redundant model inferences.
Content-Based Caching: Leverages semantic similarity to determine which response to retrieve from the cache.
Learn more about this policy here.
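Semantic caching is configured as a pair of policies: a lookup in the inbound section and a store in the outbound section. The sketch below assumes an embeddings backend named embeddings-backend has already been configured in the gateway; the similarity threshold and cache duration are placeholder values to tune for your scenario:

<policies>
    <inbound>
        <base />
        <!-- Check the cache for a semantically similar prompt before calling the model.
             vary-by partitions cache entries per subscription so callers do not share responses. -->
        <llm-semantic-cache-lookup
            score-threshold="0.05"
            embeddings-backend-id="embeddings-backend"
            embeddings-backend-auth="system-assigned">
            <vary-by>@(context.Subscription.Id)</vary-by>
        </llm-semantic-cache-lookup>
    </inbound>
    <outbound>
        <!-- Store the model response for reuse; duration is in seconds. -->
        <llm-semantic-cache-store duration="60" />
        <base />
    </outbound>
</policies>

Semantic caching also relies on an external cache (for example, a Redis-compatible cache) configured for your API Management instance, since prompt embeddings and responses are stored there.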
Get Started with Azure AI Model Inference API and Azure API Management
We are committed to continuously improving our platform and providing the tools you need to leverage the full potential of large language models. Stay tuned as we roll out these new policies across all regions, and watch for further updates and enhancements as we continue to expand our capabilities. Get started today and take your intelligent application development to the next level with Azure API Management.