AI + API better together: Benefits & Best Practices of using APIs for AI workloads
This blog post gives you an overview of the benefits you gain and the best practices to follow when harnessing APIs and an API Manager solution to integrate AI into your application landscape.
Adding artificial intelligence (AI) to existing applications is becoming an important part of application development. Integrating AI correctly is vital to meeting business goals and functional and non-functional requirements, and to building applications that are efficient to maintain and enhance. APIs (Application Programming Interfaces) play a key part in this integration, and an API Manager is fundamental to keeping control of the usage, performance, and versioning of APIs, especially in enterprise landscapes.
Quick Refresher: What is an API & API Manager?
An API is a connector between software components. It promotes the separation of components by adding an abstraction layer, so that someone can interact with a system and use its capabilities without understanding its internal complexity. Every AI service we leverage is accessed via an API.
An API Manager is a service that manages the lifecycle of APIs, acts as a single point of entry for all API traffic, and provides a central place to observe APIs. For AI workloads it acts as an API gateway that sits between your intelligent app and the AI endpoint. Adding an API gateway in front of your AI endpoints is a best practice for adding functionality without increasing the complexity of your application code. It also creates a continuous development environment that increases the agility and speed of bringing new capabilities into production while maintaining older versions.
This blog post will show the benefits and best practices of AI + APIs in 5 key areas:
Performance & Reliability
Security
Caching
Sharing & Monetization
Continuous Development
The best practices in bold are universal and apply to any technology. The detailed explanation focuses on the features of Azure API Management (APIM) and the Azure services surrounding it.
1. Performance & Reliability
If you aim to add AI capability to an existing application, it feels easiest to simply connect an AI endpoint to the existing app; in fact, many tutorials use exactly this scenario.
While this is the faster setup at the beginning, it eventually leads to challenges and code complexity once application requirements increase or multiple applications use the same AI service. With more calls targeting an AI endpoint, performance, reliability, and latency become requirements. Azure AI services have limits and quotas, and exceeding those limits leads to error responses or unresponsive applications. To ensure a good user experience in production workloads, placing an API manager between the intelligent app and the AI endpoint is a best practice.
Azure APIM, acting as an AI gateway, provides load balancing and monitoring of AI endpoints to ensure consistent and reliable performance of your deployed AI models and your intelligent apps. For the best result, multiple instances of an AI model should be deployed in parallel so that requests can be distributed evenly (see Figure 2). The number of instances depends on your business requirements, use cases and forecasted peak traffic scenarios. You can route the traffic randomly or via round robin to balance the load evenly; for more targeted routing, you can distribute traffic with policies, as described below.
Distributing requests across multiple AI instances is more than just load balancing. Using built-in policies or writing custom policies in Azure APIM enables you to route traffic to selected Azure AI endpoints or forward traffic to a regional endpoint closer to the user's location. For more complex workloads, backend pools can add value (see Figure 3). A backend pool defines a group of resources that can be targeted depending on their availability, response time or workload. APIM can distribute incoming requests across them based on patterns such as the circuit breaker pattern, preventing applications from repeatedly trying to execute an operation that is likely to fail. Both ways of distribution are good practice to ensure optimal performance and reliability during planned outages (upgrades, maintenance) or unplanned outages (power outages, natural disasters), in high-traffic scenarios, or under data residency requirements.
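As a minimal sketch, the policy fragment below sends every request to a backend pool and retries once when an instance is throttled. The pool name `openai-backend-pool`, the retry settings and the 429-based condition are assumptions for illustration, not a definitive configuration.

```xml
<policies>
    <inbound>
        <base />
        <!-- Route every request to a backend pool defined in APIM; the pool (assumed
             to be named "openai-backend-pool") spreads traffic across the Azure
             OpenAI instances registered as its members. -->
        <set-backend-service backend-id="openai-backend-pool" />
    </inbound>
    <backend>
        <!-- Retry once if a throttled instance returns 429, letting the pool pick a
             healthier backend on the second attempt. -->
        <retry condition="@(context.Response.StatusCode == 429)" count="1" interval="1">
            <forward-request buffer-request-body="true" />
        </retry>
    </backend>
    <outbound>
        <base />
    </outbound>
</policies>
```

The pool itself (its members, priorities and circuit breaker rules) is defined on the APIM backend resource rather than in the policy, so the policy stays short and the application code remains unchanged.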
Another method to keep performance high and requests under control is adding a rate-limiting pattern to throttle traffic to AI models. Limiting access by time, IP address, registered API consumer or API key allows you to protect the backend against volume bursts as well as potential denial-of-service attacks. Applying an AI token-based limit as a policy is good practice to define a throttling limit in tokens per minute and restrict noisy neighbours (see Figure 4).
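A hedged example of such throttling is sketched below; the limits, counter keys and header name are illustrative values rather than recommendations.

```xml
<inbound>
    <base />
    <!-- Classic rate limiting: at most 20 calls per minute per calling IP address. -->
    <rate-limit-by-key calls="20"
                       renewal-period="60"
                       counter-key="@(context.Request.IpAddress)" />
    <!-- AI token-based limiting: 5,000 Azure OpenAI tokens per minute per subscription,
         with prompt tokens estimated before the request is forwarded. -->
    <azure-openai-token-limit counter-key="@(context.Subscription.Id)"
                              tokens-per-minute="5000"
                              estimate-prompt-tokens="true"
                              remaining-tokens-header-name="x-remaining-tokens" />
</inbound>
```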
But rate limiting and load balancing alone are not enough to ensure high performance. Consistent monitoring of workloads is a fundamental part of operational excellence and includes health checks of endpoints, connection attempts, request times and failure counts. Azure API Management helps keep all of this information in one place by storing analytics and insights about requests in a Log Analytics workspace (see Figure 4). This allows you to understand how your APIs and API operations are used and how they perform over time or across geographic regions. Adding Azure Monitor on top of the Log Analytics workspace allows you to visualize, query, and archive data coming from APIM, as well as trigger corresponding actions. These actions can be anomaly alerts sent to API operators via push notification, email, SMS, or voice message for any critical event.
2. Security
Protecting an application is a key requirement for every business to prevent data loss, denial-of-service attacks and unauthorized data access. Security is a multi-layer approach spanning infrastructure, application, and data. APIs act as one of these layers, providing input validation, key management, access management and output validation in a central place.
While there is no single right way of adding security, adding input validation at a central point makes maintenance easy and allows fast adjustment when new vulnerabilities appear. For external-facing applications this should include a web application firewall in front of APIM. Input validation in Azure means that APIM scans all incoming requests against rules and regular expressions to protect the backend against malicious activity and vulnerabilities such as SQL injection or cross-site scripting, so that only valid requests are processed by the AI endpoints. Validation is not limited to input; it can also be used for output security, preventing data from being exposed to external or unauthorized resources or users (see Figure 5).
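As an illustrative sketch only (the size limit, hostnames and exact policy placement are assumptions), gateway-side input and output validation could look like this:

```xml
<policies>
    <inbound>
        <base />
        <!-- Accept only JSON bodies up to 16 KB (limit is illustrative) and block
             payloads that do not match the JSON schema defined for this API. -->
        <validate-content unspecified-content-type-action="prevent"
                          max-size="16384"
                          size-exceeded-action="prevent"
                          errors-variable-name="requestValidationErrors">
            <content type="application/json" validate-as="json" action="prevent" />
        </validate-content>
    </inbound>
    <outbound>
        <base />
        <!-- Simple output hygiene: mask an internal hostname before the response
             leaves the gateway (both hostnames are placeholders). -->
        <find-and-replace from="internal-model-host.contoso.local" to="api.contoso.com" />
    </outbound>
</policies>
```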
Access management is another pillar of security. To authenticate access to Azure AI endpoints you could hand the API keys to the developers and give them direct access to the AI endpoints. This, however, leaves you with no control over who is accessing your AI models. A better option is to store the API keys in a central place such as Azure APIM and attach them through an inbound policy, so access is restricted to authorized users and applications.
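For example, a small inbound fragment can inject the key from an APIM named value so calling applications never see it; the named value `openai-api-key` is an assumed name for this sketch:

```xml
<inbound>
    <base />
    <!-- Inject the Azure OpenAI key from an APIM named value, keeping the secret
         out of application code and developer hands. -->
    <set-header name="api-key" exists-action="override">
        <value>{{openai-api-key}}</value>
    </set-header>
    <!-- Alternative: skip keys entirely and let APIM authenticate to Azure OpenAI
         with its managed identity. -->
    <!-- <authentication-managed-identity resource="https://cognitiveservices.azure.com" /> -->
</inbound>
```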
Microsoft Entra ID (formerly Azure Active Directory) is a cloud-based identity and access management solution that can authenticate users and applications via single sign-on (SSO) credentials, passwords, or an Azure managed identity. For more fine-grained access, and as part of a defence-in-depth strategy, adding backend authorization with OAuth 2.0 and JWT (JSON Web Token) validation is good practice.
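A minimal sketch of such a token check is shown below; the tenant ID, audience and claim values are placeholders you would replace with your own Entra ID app registration details:

```xml
<inbound>
    <base />
    <!-- Validate the Entra ID token presented by the calling application before the
         request ever reaches the AI endpoint. -->
    <validate-jwt header-name="Authorization"
                  failed-validation-httpcode="401"
                  failed-validation-error-message="Unauthorized: token missing or invalid">
        <openid-config url="https://login.microsoftonline.com/{your-tenant-id}/v2.0/.well-known/openid-configuration" />
        <audiences>
            <audience>api://your-intelligent-app</audience>
        </audiences>
        <required-claims>
            <!-- Require an app role (placeholder name) so only approved callers get through. -->
            <claim name="roles" match="any">
                <value>AI.Caller</value>
            </claim>
        </required-claims>
    </validate-jwt>
</inbound>
```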
Within Azure API Management you can also fine-tune access rights per user or user group by applying role-based access control (RBAC). It is good practice to use the built-in roles as a starting point and keep the number of roles as low as possible. If the default roles do not match your company's needs, custom roles can be created and assigned. Adding users to groups and maintaining access rights at the group level is another good practice, as it minimizes maintenance effort and adds structure.
3. Caching
Do you have an FAQ page that covers the most common questions? If you do, you likely created it to lower costs for the company and save time for the user. A response cache works the same way: it stores previously requested information in memory for a predefined time and scope. Information that does not change frequently and does not contain sensitive data can be stored and reused. When a cache is used, every request from the front end is analysed semantically to check whether an answer is already available in the cache. If the semantic search is successful, the response is served from the cache; otherwise the request is forwarded to the AI model, and the response is returned to the requesting application and stored in the cache if the caching requirements are met.
There are different caching options (see Figure 7): (1) the built-in Azure APIM cache for simple use cases, or (2) an external cache such as Azure Cache for Redis for more control over the cache configuration.
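For Azure OpenAI workloads, APIM offers semantic caching policies; the sketch below assumes an embeddings backend named `embeddings-backend` and uses an illustrative similarity threshold and cache duration:

```xml
<policies>
    <inbound>
        <base />
        <!-- Look up semantically similar prompts in the cache before calling the model;
             similarity is computed via the configured embeddings backend. -->
        <azure-openai-semantic-cache-lookup score-threshold="0.05"
                                            embeddings-backend-id="embeddings-backend"
                                            embeddings-backend-auth="system-assigned-identity" />
    </inbound>
    <outbound>
        <base />
        <!-- Store fresh responses for 10 minutes so repeated questions are answered
             from the cache instead of consuming model capacity. -->
        <azure-openai-semantic-cache-store duration="600" />
    </outbound>
</policies>
```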
To get insights into frequently asked questions and cache usage, analytics data can be collected with Application Insights and visualized in real time using Grafana dashboards. This allows you to identify trends in your intelligent apps, share insights for application improvement with decision makers, and share input for model fine-tuning with engineering teams.
4. Sharing, Chargeback & Monetization
Divide and conquer is a common IT paradigm that can also help with your AI use cases. Sharing content and learnings across divisions, rather than working in isolation and repeating similar work, increases the speed of innovation and decreases the cost of developing new IP (intellectual property). While this is not possible in every company, most organisations would welcome a more collaborative approach, especially when developing and testing new AI use cases. Developing tailored AI components in a central team and reusing them throughout the company adds speed and agility. But how do you track usage across all divisions and share the costs?
Once you have overcome the difficult cultural aspect of sharing information across divisions, charging back costs is mainly an engineering problem. With APIM, you can bill and charge back per API usage. Depending on how you want to charge back or monetize your AI capability, you have different billing methods to choose from: subscription and metered. With subscription billing, the user pays a fixed fee upfront and uses the service according to the terms and conditions, like a video streaming service. This billing model gives you, as the API owner, predictable income and easier capacity planning. Conversely, with metered billing, the user pays according to the frequency of their activity, similar to an energy bill. This option gives the user the freedom to pay only for what they use, but it is better suited to organisations with highly scalable infrastructure, as metered billing can make scaling out AI instances more complex.
Monitoring the analytics of each call can help with scaling and optimization. Even without accessing the content itself, monitoring gives you a powerful tool to track real-time analytics. Through outbound policies the analytics data can be streamed via Event Hub to Power BI to create real-time dashboards, or to Application Insights to view the token usage of each client (see Figure 8). This information can help to automate internal chargeback or generate revenue by monetizing IP. An optional integration with third-party payment providers facilitates the payments. That solves the cost question. But once you share your IP widely, how can you ensure high performance for all users?
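A hedged sketch of such an instrumentation policy is shown below; the metric namespace, the chosen dimensions and the Event Hub logger name `ai-usage` are assumptions for illustration:

```xml
<policies>
    <inbound>
        <base />
        <!-- Record prompt and completion token counts per subscription and API in
             Application Insights (dimension names follow the built-in ones). -->
        <azure-openai-emit-token-metric namespace="ai-chargeback">
            <dimension name="Subscription ID" />
            <dimension name="API ID" />
        </azure-openai-emit-token-metric>
    </inbound>
    <outbound>
        <base />
        <!-- Stream a small usage record to Event Hub (logger "ai-usage" is assumed to be
             registered in APIM) for Power BI dashboards or downstream billing jobs. -->
        <log-to-eventhub logger-id="ai-usage">
            @(context.Subscription.Id + "," + context.Api.Name + "," + context.Response.StatusCode)
        </log-to-eventhub>
    </outbound>
</policies>
```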
Limiting the requests per user, token or time window (as explained in the Performance section) controls how many requests a user can send based on policies, giving each project the right performance for the APIs it uses. Sending requests to a specific AI instance based on the project's development stage helps you balance performance and cost. For example, Dev/Test workloads can be directed to less expensive pay-as-you-go instances when latency is not critical, while production workloads can be directed to AI endpoints that use provisioned throughput units (PTUs) (see Figure 9), as sketched below. This allocated infrastructure is ideal for production applications that need consistent response times and throughput. By using the capacity planner to size your PTU deployment, you will have a reserved AI instance that suits your workloads. Future increases in traffic can be routed either to another PTU instance or to a pay-as-you-go instance in the same region or another one.
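The following sketch illustrates stage-based routing; the product name and backend IDs are placeholders for your own setup, and it assumes every call arrives through an APIM product:

```xml
<inbound>
    <base />
    <!-- Send production traffic to the PTU-backed deployment and everything else to a
         pay-as-you-go deployment (quotes inside the condition are XML-escaped). -->
    <choose>
        <when condition="@(context.Product.Name == &quot;AI-Production&quot;)">
            <set-backend-service backend-id="openai-ptu-backend" />
        </when>
        <otherwise>
            <set-backend-service backend-id="openai-paygo-backend" />
        </otherwise>
    </choose>
</inbound>
```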
5. Continuous Development
Keeping up with quickly evolving AI models is challenging, as new models come to the market within months. With every new model release, companies need to decide whether to stay on a former version or adopt the newest one for their use case. To keep the development lifecycle efficient, it is good practice to have separate teams focusing on different parts of the application: divide and conquer. This can mean parallel development of the consuming application and the corresponding AI capability within a project team, or a central AI team sharing its AI capability with the wider company. In either model, using APIs to link the parts is paramount. But the more APIs are created, the more complex the API landscape becomes.
A single API Manager is a best practice for managing and monitoring all created APIs and providing one source of information for sharing APIs with your developers, allowing them to test the API operations and request access. The information shared should include an overview of the available API versions, revisions, and their status, so developers can track changes and switch to a newer API version when needed or convenient for their development. A roadmap is a nice-to-have if your development team is comfortable sharing its plans.
While such an overview of APIs can be created anywhere and is still often kept in wikis, it is best to keep the documentation directly linked to your APIs so it stays up to date. Azure APIM automatically creates a so-called developer portal, a customizable web page containing all the details about the APIs in one place and reflecting changes made in APIM immediately, as the two services are linked (see Figure 10). This additional, free-of-charge portal provides significant benefits to API developers and API consumers. API consumers can view the APIs and their documentation and test all API operations visible to them. API developers can share additional business information, set up fine-grained access management for the portal, and track API usage to see which API versions are actively used and when it is safe to retire older versions or provide long-term support.
Application development is usually brownfield, with existing applications and APIs deployed in different environments or across multiple clouds. APIM supports importing existing OpenAPI specifications and other API formats, which makes it easier to bring all APIs into one API Management instance. The APIM instance runs as a managed service in Azure, and its gateway component can also be self-hosted in other cloud or on-premises environments. This allows you and your team to decide when to move workloads, if wanted or needed.
Summary
AI-led applications usher in a new era of working, and we are still in its early stages. This blog post gave insights into why AI and APIs are a powerful combination, and how an API Manager can enhance your application to make it more agile, efficient, and reliable. The best practices I covered on performance, security, caching, sharing, and continuous development are based on Microsoft's recommendations and on customer projects I have worked on across various industries in the UK and Europe. I hope this guide helps you design, develop, and maintain your next AI-led application.