A/B Testing, Session Affinity & Regional Rules for Multi-region AKS clusters with Azure Front Door
In this article we will explore how to perform A/B testing in multi-region environments by leveraging Front Door session affinity and an ingress controller, keeping user pools consistent as we scale up our traffic. We will also explore how we can use origin group rewrite rules on existing paths to route traffic for specific user sets to specific locations.
Azure Front Door Rulesets and Session Affinity
Azure Front Door is a content delivery network (CDN) that provides fast, reliable and secure access between users and applications using edge locations across the globe. In this instance, Front Door is used to route traffic across the globe between two regionally isolated AKS clusters. Front Door also supports a Web Application Firewall, custom domains, rewrite rules and more.
Rewrite rules can be thought of as rule sets: each rule set evaluates requests against certain properties or criteria and performs actions on those that match. For example, we could create a condition on the address of a request, such as a “geomatch”, and pair it with one or multiple actions. Front Door offers several actions, including modifying request headers, modifying response headers, redirects, rewrites and route overrides. In this case we may want to use a route configuration override action to ensure that every request originating from a UK location is routed to the UK origin group.
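As a minimal sketch, such a rule could look roughly like the following with the Azure CLI afd extension (all resource names here are illustrative, and the exact flags should be verified against your CLI version):

# Route any request that geo-matches the United Kingdom to the UK origin group.
az afd rule create \
  --resource-group rg-frontdoor \
  --profile-name afd-profile \
  --rule-set-name regionalrules \
  --rule-name ukgeomatch \
  --order 1 \
  --match-variable RemoteAddress \
  --operator GeoMatch \
  --match-values "GB" \
  --action-name RouteConfigurationOverride \
  --origin-group og-services-uksouth \
  --forwarding-protocol MatchRequest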
Front Door has a number of routing methods available that are set at the origin group level. Most people are familiar with latency-based routing, which routes the incoming request to the origin with the lowest latency, usually the origin in closest proximity to the user. Azure Front Door also supports weighted traffic routing at the origin group level, which is perfect for A/B testing. With the weighted method, traffic is distributed in a round-robin fashion according to the ratio of the weights specified. It is important to note that this still honours the “acceptable” latency sensitivity set by the user: if the latency sensitivity is set to 0, weighted routing will not take effect unless both origins have the exact same latency.
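For example, a 99/1 split across two origins can be sketched as follows (again with hypothetical names; origin weights are relative values between 1 and 1000):

# Send ~99% of new sessions to the US origin and ~1% to the UK origin.
az afd origin update \
  --resource-group rg-frontdoor \
  --profile-name afd-profile \
  --origin-group-name og-services-shared \
  --origin-name aks-eastus \
  --weight 990

az afd origin update \
  --resource-group rg-frontdoor \
  --profile-name afd-profile \
  --origin-group-name og-services-shared \
  --origin-name aks-uksouth \
  --weight 10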
Although Front Door offers multiple traffic routing methods, when rolling out A/B testing we may want to be more granular about which users or requests land on our test origin. Say, for example, we initially want to route only internal customers to a certain app version on a specific cluster based on the request IP, or only a certain request protocol to a specific version of our API on a cluster. In these cases rule sets can be implemented to give us granular control over the users or requests that are sent to our test application.
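A rule gating the test origin group to a known internal egress range might look like this sketch (the CIDR, rule set name and origin group are all placeholders):

# Send only requests from a known internal IP range to the test origin group.
az afd rule create \
  --resource-group rg-frontdoor \
  --profile-name afd-profile \
  --rule-set-name abtestrules \
  --rule-name internalusers \
  --order 2 \
  --match-variable RemoteAddress \
  --operator IPMatch \
  --match-values "203.0.113.0/24" \
  --action-name RouteConfigurationOverride \
  --origin-group og-services-test \
  --forwarding-protocol MatchRequest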
Using rewrite rules will involve multiple origin groups. We could create an origin group per region holding the routes specific to that region’s applications, as well as a shared services origin group containing both regions’ origins for services that can be accessed regardless of the user’s location. There are some benefits to this group split (a sketch of the group creation follows the list below).
Resiliency – By splitting our origin groups up in this way we maintain multi-region resiliency for the services that support it. If the East US cluster(s) go down, only regional services are affected; while DR takes place for shared services, users can still access the UK South cluster.
Data Protection – For stateful services with stringent data requirements we can ensure that users are never routed to an unsuitable service, even when using weighted routing, because our rule sets still apply.
Limitation of multiple routes for one path – Front Door does not allow multiple identical route paths, and a path can be associated with only one origin group. If we take the example of a route “/blue” that exists across both clusters, it can only be associated with the “services-shared” origin group; however, using rewrite rules we can reroute the request to an origin group of our choice, such as “services-uksouth”.
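Creating the per-region and shared groups is straightforward; the sketch below shows one of the three (all names, probe settings and the latency sensitivity value are illustrative):

# A shared origin group; repeat for og-services-uksouth and og-services-eastus.
az afd origin-group create \
  --resource-group rg-frontdoor \
  --profile-name afd-profile \
  --origin-group-name og-services-shared \
  --probe-request-type GET \
  --probe-protocol Https \
  --probe-path / \
  --probe-interval-in-seconds 60 \
  --sample-size 4 \
  --successful-samples-required 3 \
  --additional-latency-in-milliseconds 50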
It is worth being aware that there is a hard limit of 200 origin groups per Front Door profile. If you need to surpass 200 origin groups, it is advised to create an additional Front Door profile.
One of the challenges when performing A/B testing is that, as we change the weighting or expand the rule set we are evaluating, other global load balancers and CDNs will often reset the user pools. With Front Door we can avoid this by enforcing session affinity on our origin group. Without session affinity, Front Door may route a single user’s requests to multiple origins. Once enabled, Azure Front Door adds a cookie to the user’s session; the cookies are called ASLBSA and ASLBSACORS. Cookie-based session affinity allows Front Door to identify different users even when they are behind the same IP address. This allows us to dynamically adjust the weighting of our A/B testing without disrupting the existing user pool on either our A or B cluster.
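Session affinity is a single toggle on the origin group; assuming the same illustrative names as above, enabling it could look like:

# Enable cookie-based session affinity for the shared origin group.
az afd origin-group update \
  --resource-group rg-frontdoor \
  --profile-name afd-profile \
  --origin-group-name og-services-shared \
  --session-affinity-state Enabled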
Before we take a look at the example, let’s first look at how we set up session affinity when using Front Door and AKS.
AKS & Reverse Proxies
When using sticky sessions with most Azure PaaS services, no additional setup is required. For AKS, as we use a reverse proxy in most cases to expose our services, we need to take an additional step to ensure that our sessions remain sticky. This is because, as mentioned, Front Door uses a session affinity cookie, and it will not set that cookie on a cacheable response, since doing so would disrupt the cookies of every other client requesting the same response. As a result, if Front Door receives a cacheable response from the origin, no session cookie is set.
To ensure our responses are not cacheable we need to add a Cache-Control header to them. We have multiple options to do this; below are two examples, one for NGINX and one for Traefik.
1. NGINX
The NGINX ingress controller supports an annotation called configuration-snippet, which injects raw NGINX directives. We can use it to set the response header:
nginx.ingress.kubernetes.io/configuration-snippet: |
  more_set_headers "Cache-Control: no-store";
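For context, here is a minimal (hypothetical) Ingress showing where the annotation sits; the service name, path and port are placeholders:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: servicedemo
  annotations:
    # Mark responses as non-cacheable so Front Door will set its affinity cookie.
    nginx.ingress.kubernetes.io/configuration-snippet: |
      more_set_headers "Cache-Control: no-store";
spec:
  ingressClassName: nginx
  rules:
    - http:
        paths:
          - path: /blue
            pathType: Prefix
            backend:
              service:
                name: servicedemo
                port:
                  number: 80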
2. Traefik
Traefik does not support configuration snippets, so on Traefik we can use its custom response headers annotation instead (note it must be the response headers annotation, since the goal is to make the response non-cacheable):
ingress.kubernetes.io/custom-response-headers: "Cache-Control: no-store"
It’s important to note that here we are talking about session affinity at the origin (cluster) level. For pod-level affinity, please review the specific guidance for your selected ingress controller; that will be used in conjunction with Front Door session affinity.
Example – Session Affinity for A/B Testing
I will admit this is not the most thrilling demo to see as text and images, but it does show how this can be validated. We use a container image that provides node and pod information so we can tell which pod/version of our application we have landed on. This is a public image and can be pulled as scubakiz/servicedemo:1.0. The application runs on the same path across both clusters in the services-shared origin group, Front Door has session affinity enabled, and the Cache-Control header is set on both ingress paths. It is important to note that this application refreshes the browser every 10 seconds; without session affinity you would notice your pod changing.
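A minimal sketch of the Deployment and Service behind each cluster’s ingress might look like this (the container port is an assumption; check the image’s documentation):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: servicedemo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: servicedemo
  template:
    metadata:
      labels:
        app: servicedemo
    spec:
      containers:
        - name: servicedemo
          # Public demo image that reports node/pod details on each refresh.
          image: scubakiz/servicedemo:1.0
          ports:
            - containerPort: 80  # assumed port
---
apiVersion: v1
kind: Service
metadata:
  name: servicedemo
spec:
  selector:
    app: servicedemo
  ports:
    - port: 80
      targetPort: 80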
We initially set the US origin within the origin group to receive 99% of the incoming traffic; when we access the web application we can see we are routed to a US deployment of our application, and that this pod exists in our US cluster.
When we adjust the weighting to send 99% to the UK cluster and open a new incognito tab, we can see that we are now routed to our UK deployments. This weighting change takes about 5 minutes to take effect.
As mentioned, this application refreshes every 10 seconds, which means we can observe our original US user pool remaining on that cluster while new users are directed to the UK pool. We can see this by comparing the new pod details in the incognito window on the right to our UK pods, and in the bottom left our constantly refreshing US session is still connected.
Although this is an extreme example, if we think of the UK pool as our B test pool under the original weightings, we could slowly increase the percentage of traffic from 1% to onboard more users without interrupting existing ones. Similarly, at the point we wanted to go to 100% on a shared services cluster, we could flip the traffic with assurance that users on the old version will not suddenly be moved onto the new version.