How to achieve high HTTP scale with Azure Functions Flex Consumption
Taking Azure Functions from 0 to 32,000 RPS in 7 seconds
Consider a connected car platform that processes data from millions of cars. Or a national retailer running a pop-up campaign that processes pre-orders. Or a healthcare provider running big data analytics. All of these can have variable load requirements, from zero to tens of thousands of requests per second (RPS). The serverless model has grown rapidly as developers increasingly run event-triggered code as a service, pushing platform limits, and Azure Functions customers now want to orchestrate complex serverless solutions and expect high throughput.
This feedback led us to revamp the Azure Functions platform architecture to help ensure that it meets our customers’ most demanding performance requirements. As this article describes:
We have introduced the new Azure Functions Flex Consumption plan that you can use to achieve high-volume HTTP RPS while optimizing costs.
You can customize the per instance concurrency of HTTP-triggered functions and choose between instance memory sizes to fit your throughput and cost requirements.
We demonstrate achieving 32,000 RPS in 7 seconds with a sample retail customer flash sale case study, using a .NET HTTP-triggered function app sending to Event Hubs through a VNet.
We demonstrate achieving 40,000 RPS with 1,000 instances in less than a minute with a Python app with per-instance concurrency of 1.
Understanding concurrency-driven scaling
Per-instance concurrency is the number of parallel requests that each instance of your app can handle. In Azure Functions Flex Consumption, we’ve introduced deterministic concurrency for HTTP. All HTTP-triggered functions in your Flex Consumption app are grouped and scaled together on the same instances, and new instances are allocated based on the HTTP concurrency configured for your app. Per-instance concurrency is vital to good performance, so it’s important to configure the maximum number of concurrent workloads that a given instance can process at the same time. With higher concurrency, you can push more executions through each instance and potentially pay less.
To show how this works with an example, imagine that 10 customers select the shopping cart at the same time on an e-commerce website, sending 10 requests to a function app. If concurrency is set to 1 and the app is scaled down to zero, the platform scales the app out to 10 instances and runs 1 request on each instance. If you change concurrency to 2, the platform scales out to 5 instances, each handling 2 requests.
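As a rough rule of thumb, the platform targets about ⌈concurrent requests ÷ per-instance concurrency⌉ instances: 10 ÷ 1 = 10 instances in the first case, 10 ÷ 2 = 5 in the second, up to the maximum instance count discussed later in this article.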
In general, you can trust the default values to work for most cases and let Azure Functions scale dynamically based on the number of incoming events. In other words, Flex Consumption already provides default values that make the best of each language’s capabilities. For Python apps, the default concurrency is 1 for all instance sizes. For other languages, the 2,048 MB instance size uses a default concurrency of 16 and the 4,096 MB size uses 32. In any case, you have the flexibility to choose the right per-instance settings for your workload.
You can change the HTTP concurrency using the Azure CLI’s trigger-type and perInstanceConcurrency parameters:
az functionapp scale config set -g <RESOURCE_GROUP> -n <FUNCTION_APP_NAME> --trigger-type http --trigger-settings perInstanceConcurrency=<CONCURRENCY>
This is also possible from the Azure Portal on the new Scale and Concurrency settings for Flex Consumption apps, in the Concurrency per instance section.
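To confirm what’s currently in effect, you can also read the scale configuration back. A minimal sketch with the Azure CLI, using the same placeholders as above:
az functionapp scale config show -g <RESOURCE_GROUP> -n <FUNCTION_APP_NAME>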
Concurrency and instance memory sizes
Currently, Flex Consumption supports two instance memory sizes, 2,048 MB and 4,096 MB, with more instance sizes to be added in the future. The default is 2,048 MB. Depending on your workload, a larger instance size can potentially handle higher concurrency or heavier workloads. To create your app with a different instance memory size, simply include the instance-memory parameter:
az functionapp create -g <RESOURCE_GROUP> -n <FUNCTION_APP_NAME> -s <STORAGE_ACCOUNT_NAME> --runtime <RUNTIME> --runtime-version <RUNTIME_VERSION> --flexconsumption-location "<AZURE_REGION>" --instance-memory <INSTANCE_MEMORY>
You can also change the instance memory size in the Azure Portal when creating the app, or from the same Scale and Concurrency settings mentioned above after the app is created.
Not all hosting providers support per-instance concurrency higher than 1, even though some workloads benefit from it. If your function app isn’t dominated by compute-intensive operations, per-instance concurrency control can be very helpful: running four operations concurrently on one instance, at the same price, beats paying for one operation at a time.
Cold Start
It’s worth noting that when you set concurrency to a value higher than 1, you also reduce the cold start penalty, because concurrent executions can share an instance that is already warm. We recently wrote about the improvements across Azure Functions to reduce cold starts (Azure Functions cold start improvement). In Flex Consumption you can also help ensure that a minimum number of instances are always running and available: the new always ready feature keeps a select number of instances warm for your functions.
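For example, here’s a minimal sketch of configuring always ready instances with the Azure CLI, assuming you want five warm instances for your app’s HTTP trigger group:
az functionapp scale config always-ready set -g <RESOURCE_GROUP> -n <FUNCTION_APP_NAME> --settings http=5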
Protecting downstream components
In addition to concurrency and instance size, you need to consider whether a downstream component has limited throughput capacity, like a database or an API that your function calls. You can change the maximum number of instances that your Flex Consumption app scales to by modifying the maximum instance count setting, which accepts values between 40 (the lowest allowed maximum) and 1,000 (the highest). For example, in the Azure CLI:
az functionapp scale config set -g <RESOURCE_GROUP> -n <FUNCTION_APP_NAME> --maximum-instance-count <SCALE_LIMIT>
You can also change this in the Azure Portal from the same Scale and Concurrency settings mentioned above.
Case study: HTTP endpoint writing to Azure Event Hubs
A retail customer asked us to help with a project to handle a flash online promotion projected to receive a peak of 2 million HTTP requests per minute (approximately 35,000 RPS). A function app was used for the ingestion of contact information from interested buyers. The web component of this solution was hosted by a third party that could only forward a buyer’s contact information via HTTP. Our customer wanted to protect the incoming data using Azure managed identities and to forward it for downstream processing to Azure Event Hubs secured behind a virtual network.
We developed a sample that implements the basics of this scenario and ran it through a suite of performance tests. You can deploy and test the High scale HTTP function app to Event Hubs via VNet sample yourself.
Initial test setup
The application was deployed into Flex Consumption with the following settings (a CLI sketch reproducing this configuration follows the list):
Instance memory size set to 2,048 MB
Maximum instance count set to 100
HTTP concurrency set to the system-assigned default of 16 concurrent requests per instance
1,000 concurrent clients calling the HTTPS endpoint with an HTTP POST, using Azure Load Testing, for three minutes
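For reference, here is a minimal sketch of the Azure CLI calls that produce this server-side configuration, reusing the commands shown earlier. The runtime values assume the .NET isolated worker used in this case study, and HTTP concurrency needs no explicit setting because 16 is the default at 2,048 MB:
az functionapp create -g <RESOURCE_GROUP> -n <FUNCTION_APP_NAME> -s <STORAGE_ACCOUNT_NAME> --runtime dotnet-isolated --runtime-version 8.0 --flexconsumption-location "<AZURE_REGION>" --instance-memory 2048
az functionapp scale config set -g <RESOURCE_GROUP> -n <FUNCTION_APP_NAME> --maximum-instance-count 100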
Results
The application achieved an average throughput of 15,630 requests per second.
The application handled almost 3 million requests in total during this three-minute test. Azure Load Testing reports the following latency distribution.
Request count | Latency P50 | Latency P90 | Latency P99 | Latency P99.9
2,969,090 | 50 ms | 96 ms | 166 ms | 2,188 ms
We can analyze the scale-out behavior by looking at our logs in Application Insights. This query counts how many different instances emitted a log for each second of the test—the application was successfully executing across 80 instances within 10 seconds of the workload starting.
requests
| where timestamp >= datetime(2024-05-06T01:15:00.0000000Z) and timestamp <= datetime(2024-05-06T01:25:00.0000000Z)
| summarize dcount(cloud_RoleInstance) by bin(timestamp, 1s)
| render columnchart
Test variations
We then made some modifications to the setup to push the application performance higher, with a tradeoff on cost. We ran the same client load but with the following server configuration changes:
Updated maximum instance count to 500 (and regional subscription memory quota raised accordingly)
Separate test runs with HTTP per-instance concurrency set to 8 and 4
With Azure Load Testing, you can compare your runs, so we compared the concurrency values of 16, 8, and 4 directly.
As the chart shows, dropping the concurrency to 4 made a real difference for this workload, pushing the throughput well above 32,000 RPS. This result correlates with the reduced latency numbers: just under 6.6 million requests in three minutes with a P50 latency of 23 milliseconds.
Latency profile with HTTP Concurrency = 4
Here is the latency percentile breakdown for the HTTP concurrency = 4 run:
Request count | Latency P50 | Latency P90 | Latency P99 | Latency P99.9
6,596,600 | 23 ms | 39 ms | 88 ms | 172 ms
With each instance handling fewer requests, we see a corresponding increase in the instance count with HTTP Concurrency of 4. This also translates into faster scaling, with the system scaling out to 250 instances within 7 seconds.
requests
| where timestamp >= datetime(2024-05-06T02:30:00.0000000Z) and timestamp <= datetime(2024-05-06T02:40:00.0000000Z)
| summarize dcount(cloud_RoleInstance) by bin(timestamp, 1s)
| render columnchart
Tuning for performance versus cost
The following table compares the overall performance and cost of these runs. Learn more about Flex Consumption billing meters.
Concurrency configuration | Request count | RPS | GB-seconds | Cost in USD | GB-sec cost per 1 million requests
16 (default) | 2,969,090 | 15,630 | 28,679 | $0.4588 | $0.1545
8 | 3,888,524 | 20,358 | 40,279 | $0.6444 | $0.1657
4 | 6,596,600 | 32,980 | 93,443 | $1.4951 | $0.2266
The total cost of these runs went up as we lowered the concurrency, because the lower concurrency reduced latency and allowed Azure Load Testing to send more requests during the three-minute interval. The last column shows the normalized cost per 1 million requests, indicating that the better performance from a lower concurrency value comes at a higher cost.
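To make the normalization concrete, the numbers in the table imply a GB-second meter of roughly $0.4588 ÷ 28,679 GB-s ≈ $0.000016 per GB-second, and the last column is simply cost divided by millions of requests: for the default run, $0.4588 ÷ 2.969 ≈ $0.1545.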
We recommend performing this type of analysis on your own workload to determine which configuration best suits your needs. As the results for higher concurrency demonstrate, using per-instance concurrency in Azure Functions workloads can really help you reduce costs. A great way to accomplish this is by taking advantage of the integration between Azure Functions and Azure Load Testing.
Scale to 1,000 instances in less than a minute
The Event Hubs case study above demonstrates the cost savings you can unlock if your workload can take advantage of concurrency. But what about workloads that cannot? We have made heavy investments in optimizing the system to work well when per-instance concurrency is set to 1.
Test setup
Python function app with concurrency set to 1, instance size set to 2,048 MB
Workload is a mix of I/O and CPU: an HTTP-triggered function that receives a 73 KB HTML document and then parses it
Updated maximum instance count to 1,000 and regional subscription memory quota raised accordingly
1,000 concurrent clients calling the HTTPS endpoint with an HTTP POST, using Azure Load Testing, for five minutes
Results
The system stabilizes at 40K RPS in less than a minute. The following chart shows our four most recent runs at the time of writing.
Latency profile
Here is the latency percentile breakdown for this HTTP concurrency = 1 run:
Request count | Latency P50 | Latency P90 | Latency P99 | Latency P99.9
12,567,955 | 20 ms | 34 ms | 59 ms | 251 ms
The system achieves this performance by scaling up to ~975 instances within 1 minute. The only reason it did not reach exactly 1,000 is that slightly more than 1,000 concurrent clients would be needed to push it that far, due to network travel time. Here is the first minute of scaling activity:
requests
| where timestamp between (todatetime('2024-06-06T00:57:00Z') .. 1m)
| summarize dcount(cloud_RoleInstance) by bin(timestamp, 1s)
| render columnchart
You will notice that the scaling is not linear: the system added instances much more rapidly during the first 10 seconds and then gradually decelerated. This behavior is by design. We believe this approach hits a sweet spot, delivering great burst scale performance while reducing the degree of unnecessary over-scaling. If you find that this scaling pattern does not work well for your workload, please let us know.
Compute injected into your virtual network, in milliseconds
We’ve introduced improved virtual network features to Azure Functions Flex Consumption. Your function app can reach services secured behind a virtual network (VNet) and can also be secured to your virtual network with service or private endpoints. But more importantly, your function apps can reach services that are restricted to a virtual network without sacrificing scale-out speed and with scale-to-zero.
Our customer scenario used a VNet to allow the function app to write to an event hub that had no public endpoint. You might be wondering whether this VNet injection comes at a performance cost in terms of startup latency. This latency matters not only when your application is fully idle and scaled to zero (cold start) but also when the application needs to scale out quickly.
To best answer the question for our customer, we ran a series of benchmarks on the simplest possible workload—an application with an HTTP endpoint that returns a static 200 response. We compared the startup performance with and without VNet integration and ran these tests with the following configurations in mind:
Significant load to force scaling out to many instances
Coverage across multiple language stacks (Python 3.11, Java 17, .NET 8, Node.js 20)
Coverage across six different regions
Thirty-two unique test pairs with the exact same configuration, except whether VNet injection was enabled
We collected just under 30,000 data points across a five-day period and measured the time taken to get the first response from each allocated instance:
Configuration | Sample count | Latency P50 (ms) | Latency P90 (ms) | Latency P99 (ms)
No VNet | 15,048 | 435 | 1,217 | 2,007
VNet integrated | 14,408 | 472 | 1,357 | 3,307
Our test findings demonstrate that enabling VNet injection has a very low impact on your scale-out performance: 37 ms at the 50th percentile is a reasonable cost to pay for the added security benefits of using virtual networks with Flex Consumption. These performance numbers for VNet injection are due to the deep investment we have made in the networking stack of Project Legion, the compute substrate for Flex Consumption.
Troubleshooting
We’ve touched on a few different configuration settings you need to keep in mind when running high throughput workloads on Flex Consumption, so here’s a checklist we suggest working through if you’re struggling to reach your performance goals:
Max instance count – verify that you’ve raised the maximum instance count to an appropriate value.
Regional subscription memory quota – if you have multiple Function Apps on the same subscription running in the same region, they share this quota. This means that one app might not scale out to the desired size if another app is already running at significant scale. If you need it raised, file a support ticket.
Monitor Application Insights for signs of downstream bottlenecks – during earlier iterations of the Event Hubs case study we did not have the event hub scaled out sufficiently, so we encountered transient “the request was terminated because the entity is being throttled” errors, which were visible in the traces table in Application Insights.
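A quick Kusto sketch for spotting these errors, assuming the default Application Insights traces schema (adjust the search term to the error you’re chasing):
traces
| where timestamp > ago(1h)
| where message has "being throttled"
| summarize count() by bin(timestamp, 1m)
| render columnchart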
Final thoughts
We’re proud of the performance enhancements in Azure Functions Flex Consumption. As one participant of our preview program said, “I’ve never seen any Azure service scaling like this! This is what we’ve been missing for a long time.”
To learn more and share feedback:
Learn more about Azure Functions Flex Consumption.
Deploy and run your own workloads to Flex Consumption, or try one of our samples.
Share your feedback about Azure Functions Flex Consumption scale. Your feedback and insights will be crucial in refining and enhancing this feature.