SDWAN+Firewall in self-managed hub-and-spoke designs
Designing network connectivity in public cloud can very quickly become a daunting task. Of course, public cloud providers do offer native networking services, and with those it is fairly easy. This should always be your primary route (pun intended). For example, in the case of Azure, this means using Virtual WAN and its native integration with both Microsoft and third-party connectivity appliances.
However, sometimes you have requirements that justify not using those native networking services, for example when you require more flexibility and control, or when your networking vendors of choice are not supported by the cloud provider solution.
By the way, before I continue, if you want to have a comprehensive read on all of the possible options of SDWAN connectivity with both Virtual WAN and hub and spoke topologies in Azure, I would recommend Adam Stuart’s blog on SDWAN in Azure. Another interesting resource is SDWAN integration with Hub and Spoke topologies, although that article lacks a couple of aspects such as integration with a firewall NVA as well as design alternatives without Azure Route Server.
So why write more about this topic? I recently had a discussion with a couple of customers about this, and I created a framework that I use to guide the design decisions that will determine how to integrate SDWAN connectivity and firewalling into an Azure hub and spoke design.
It is essentially six questions whose answers are more or less orthogonal to each other, although there might be some dependencies. The sections below will go over each of the questions in detail, but here they are as an appetizer:
Question 1: will FW-to-SDWAN routing be static or dynamic?
Question 2: will SDWAN-to-FW routing be static or dynamic?
Question 3: will route advertisement into SDWAN be static or dynamic?
Question 4: will spoke-to-FW routing be static or dynamic?
Question 5: do you need routing to ExpressRoute?
Question 6: will routing to other Azure regions be static or dynamic?
Typically, environments with good IP address planning and few topology changes can get away with static routing. If maintaining static routes is going to be a high operational burden, or the SDWAN technology of choice demands dynamic routing, then chances are you are going to need the dynamic routing capabilities that Azure Route Server gives you.
But I am getting ahead of myself. Let’s start looking into each of the questions.
Decision 1: Routing Firewall-to-SDWAN
The first question refers to how the firewall NICs learn which packets to send to SDWAN. Firewalls will typically have two NICs that have different names depending on your firewall vendor: public/private, external/internal, untrusted/trusted, etc. I am going to go with external/internal here.
Inside of the firewall’s OS you will probably have some kind of static route saying “send public IP addresses to the external NIC, and private IP addresses to the internal NIC”. You need to remember that an Azure NIC is a fully functional routing device, so you need to teach that NIC how to forward the packets it gets from the firewall.
The firewall NIC will know how to forward packets addressed to Azure VNets, but it will not know anything about the SDWAN branches. There are a couple of ways to “teach” the NIC.
Option 1a: Your first option is of course static routing, for example for all RFC 1918 prefixes (172.16.0.0/12, 10.0.0.0/8 and 192.168.0.0/16). You will need a destination, which has to be an Azure Load Balancer in front of the SDWAN appliances.
This option is very easy to configure, but it does not allow sending specific branches to specific SDWAN NVAs, which some SDWAN vendors such as VMware VeloCloud might require.
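To make the forwarding decision of the NIC in option 1a more tangible, here is a minimal Python sketch of the route selection Azure applies: longest prefix match first, and for identical prefixes user-defined routes are preferred over BGP routes, which are preferred over system routes. The route list, the VNet prefixes and the next hop labels are assumptions for illustration only, not the effective routes of any real deployment.

```python
import ipaddress

# Tie-breaker for identical prefixes: lower number wins
# (user-defined routes over BGP routes over system routes).
SOURCE_PRIORITY = {"user": 0, "bgp": 1, "system": 2}

# Hypothetical effective routes of the firewall's internal NIC.
ROUTES = [
    ("10.1.0.0/24", "system", "VNet local"),
    ("10.1.1.0/24", "system", "VNet peering"),
    ("10.0.0.0/8", "user", "SDWAN load balancer"),
    ("172.16.0.0/12", "user", "SDWAN load balancer"),
    ("192.168.0.0/16", "user", "SDWAN load balancer"),
    ("0.0.0.0/0", "system", "Internet"),
]


def next_hop(destination: str) -> str:
    """Return the next hop label the NIC would pick for a destination IP."""
    ip = ipaddress.ip_address(destination)
    candidates = []
    for prefix, source, hop in ROUTES:
        network = ipaddress.ip_network(prefix)
        if ip in network:
            candidates.append((network.prefixlen, source, hop))
    # Longest prefix first, then source priority as the tie-breaker.
    best = min(candidates, key=lambda r: (-r[0], SOURCE_PRIORITY[r[1]]))
    return best[2]
```

With these assumed routes, a packet for a spoke address such as 10.1.1.4 still follows the more specific peering route, while any other private address falls through to the RFC 1918 routes pointing at the SDWAN load balancer.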
Option 1b: If you need to send specific branches to specific SDWAN NVAs, you can inject routes in the firewall NICs with BGP. The SDWAN appliances will send routes to an Azure Route Server, which will program them in the NICs inside of the FW’s subnet.
The Azure Route Server will actually try to program the learnt routes in every subnet in the VNet, so you will need a couple of route tables with the option “Propagate Gateway Routes” disabled. At a minimum, you need one on the internal SDWAN subnet (otherwise you would create a routing loop).
This option is harder to configure than Option 1a, and you need to consider Azure Route Server route limits. If the number of routes you need to inject exceeds the Route Server’s limits, summarization at the SDWAN NVAs might be required.
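As a rough illustration of what summarization at the SDWAN NVAs buys you, the following Python sketch merges adjacent branch prefixes with the standard ipaddress module and compares the result against a route limit. The branch prefixes are made up, and the limit is a parameter you would have to take from the current Azure Route Server documentation; neither comes from a real environment.

```python
import ipaddress


def summarize(prefixes):
    """Merge adjacent or overlapping prefixes into the fewest equivalent networks."""
    networks = [ipaddress.ip_network(p) for p in prefixes]
    return [str(n) for n in ipaddress.collapse_addresses(networks)]


def fits_limit(prefixes, limit):
    """True if the number of prefixes stays within the given route limit."""
    return len(prefixes) <= limit


# Hypothetical branch prefixes learnt by the SDWAN NVAs.
BRANCHES = [
    "192.168.0.0/24",
    "192.168.1.0/24",
    "192.168.2.0/24",
    "192.168.3.0/24",
    "192.168.8.0/24",
]
SUMMARIZED = summarize(BRANCHES)
```

Note that this kind of lossless merging only helps when branch prefixes are contiguous. Otherwise you would need coarser summaries that also cover unused address space, which is a design decision rather than something a script should make for you.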
Option 1c: for the sake of completeness I am going to describe this option, although frankly I have hardly seen this due to the additional complexity that it entails. The previous two options describe how to teach the firewall’s NICs about the SDWAN prefixes, but what if you could bypass those NICs altogether? This is what an overlay of tunnels between the SDWAN and Firewall appliances would do, either with dynamic or static routing.
Traffic going through the tunnels would be opaque to Azure, so the NICs don’t need to know how to route it.
One downside of this approach is the potential bandwidth limitation that the NVAs might have for the overlay technology of choice. For example, many virtual machines have per-tunnel throughput limits if using IPsec encapsulation.
This design has an important particularity: the firewall should know when to send a packet over the overlay, and when to send it to the underlay for Azure to route it to the spokes or to any ExpressRoute connections. The firewall will usually have some routes pointing private prefixes to its internal interface. You should be careful not to override these routes with summaries sent from the SDWAN NVAs (although static routes would usually have preference). Long story short: with this setup you typically want to send specific routes from the SDWAN NVAs to the firewall over the VXLAN/IPsec tunnels, not summaries.
Decision 2: Routing SDWAN-to-Firewall
Now let us look at the opposite direction. You have here the same three options: static, BGP and overlay:
Option 2a: using static routing in the SDWAN subnet is the most popular option, since it is cost-effective and simple. The only downside is that static routes need to be maintained matching the prefixes of the spoke VNets, which might be a considerable overhead in very dynamic environments. The SDWAN NVAs will only send traffic to Azure that needs to go to Azure, so the only job of the SDWAN NICs is to send the traffic on to the firewall.
Option 2b: This design is not recommended, since it is overly complex and expensive. If you really want to use dynamic routing to send traffic from the SDWAN subnet to the firewall appliances, you will need to change the topology, since using BGP for overriding the system routes introduced by VNet peerings is not supported. Additionally, you want to have separate Route Servers for injecting routes into the NVA and firewall subnets.
Option 2c: If you are using an overlay set of tunnels between your SDWAN and firewall appliances (option 1c in the previous section), this will naturally take care of the return traffic.
Decision 3: Route advertisement to SDWAN
A different problem is what prefixes the SDWAN appliances are going to advertise to the rest of the SDWAN network, which should include the VNet prefixes where the Azure workloads are deployed (10.1.1.0/24 and 10.1.2.0/24 in the diagram below).
Option 3a: Of course, if you can summarize the routes, you could just use static route redistribution in your SDWAN NVAs. In the example below, you could redistribute 10.1.0.0/16 and that should cover all of the VNets in the region.
This approach is very simple, but it requires a good IP Address Management concept for Azure, so that you can easily define summary prefixes that include all your spokes.
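A quick way to validate a summary for option 3a is to check that every spoke prefix actually falls inside it. The Python sketch below does this with the standard ipaddress module, using the prefixes from the example above; the function name is mine, and the extra 10.2.1.0/24 prefix in the usage note is a hypothetical outlier.

```python
import ipaddress


def uncovered(spoke_prefixes, summary):
    """Return the spoke prefixes that the summary prefix does not contain."""
    summary_net = ipaddress.ip_network(summary)
    return [
        prefix
        for prefix in spoke_prefixes
        if not ipaddress.ip_network(prefix).subnet_of(summary_net)
    ]


SPOKES = ["10.1.1.0/24", "10.1.2.0/24"]
MISSING = uncovered(SPOKES, "10.1.0.0/16")
```

An empty result means the summary is safe to redistribute. If a spoke such as 10.2.1.0/24 were added later, it would show up in the returned list and the static redistribution would need updating.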
Option 3b: If defining those summary prefixes is not that easy, you could have the SDWAN appliances dynamically learning the spoke prefixes from Azure Route Server. In the simplest topology, you would use the Azure Route Server in the hub VNet, as the following figure shows:
Option 3c: A variant of the previous design is using dedicated Route Servers for the spokes, as the following diagram describes. The advantage of this approach is that you can have one Azure Route Server to control spoke routing, and another one to control hub routing. This will come in handy in further sections:
Decision 4: Spoke-to-Firewall routing
Alright, we have routing between our network appliances, but how are the workloads in the spokes going to route traffic to the firewall? Here again we have the options of configuring static and dynamic routing.
Option 4a: User-Defined Routes in the spokes is the most common approach, by far. One downside is that preventing the possibility of spokes bypassing the firewall can be challenging, since you need to make sure that the route tables in each spoke VNet are correct, and associated to all subnets.
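The kind of check that option 4a demands can be sketched in a few lines of Python. This is only an illustration under strong assumptions: the inventory format, the subnet names and the firewall IP address are all hypothetical, and in practice you would export this data from your own tooling. The sketch also verifies only the default route, not more specific prefixes.

```python
# Assumed IP address of the internal load balancer in front of the firewalls.
FIREWALL_IP = "10.0.1.100"


def bypassing_subnets(subnets, firewall_ip=FIREWALL_IP):
    """Return the subnets lacking a default route that points at the firewall.

    `subnets` maps a subnet name to its list of (prefix, next hop) user-defined
    routes. An empty list stands for a subnet without an associated route table.
    """
    return sorted(
        name
        for name, routes in subnets.items()
        if ("0.0.0.0/0", firewall_ip) not in routes
    )


# Hypothetical inventory of spoke subnets and their user-defined routes.
INVENTORY = {
    "spoke1/app": [("0.0.0.0/0", "10.0.1.100")],
    "spoke1/db": [],
    "spoke2/app": [("0.0.0.0/0", "10.0.9.9")],
}
OFFENDERS = bypassing_subnets(INVENTORY)
```

Here the second subnet has no route table at all and the third one points at the wrong next hop, so both would be reported.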
Another potential downside of this approach appears when the firewall technology does not support an Azure Load Balancer for internal traffic, for example because there is already a load balancer for the external NIC, and the firewall appliances do not support sending the load balancer probes to the right interfaces (these probes always come from the same IP address, so routing them in dual-NIC virtual machines is not trivial).
Here again, this 4a option is what you would be looking at if using Azure Firewall, unless you want to do this trick.
Option 4b: You know what is coming now: the firewalls could use Azure Route Server (here I am going for the topology where an extra ARS is dedicated to control the spokes). Of course, this one increases the design complexity, but it automates route injection to the spokes without the need for route tables.
Decision 5: redistribution to ER
Chances are that you have some ExpressRoute connectivity in your design. Maybe because some locations have not been migrated yet to SDWAN, or maybe because you are using Azure VMware Solution, which connects to Azure VNets with ExpressRoute circuits.
In these cases, you would want to redistribute SDWAN prefixes into ExpressRoute. Here you cannot get around Azure Route Server: you will need one in your hub VNet.
By the way, if you had a dedicated spoke Route Server (option 3c) in your design, it would make things much better, because you could control which prefixes go to the ExpressRoute connections, and which ones to the spokes. Otherwise it wouldn’t be possible, since Azure Route Servers do not support filtering routes today.
I am not discussing VPN gateways in this article, because you have two options for them: either you apply the same principle that we described for the ExpressRoute gateways with BGP, or you configure static routing in the Local Network Gateway.
Decision 6: Routing across regions
I will assume here that you have workloads in more than one region. Do those workloads need to talk to each other across region boundaries? Essentially, do you have to support traffic patterns like Spoke1-Hub1-Hub2-Spoke2?
Option 6a: let’s start again with static routing. You could just configure some static UDRs in each hub’s firewall internal subnet, to tell the NIC to route traffic for the remote region to the remote firewall cluster. In the example below, when the firewall NICs in hub1 get a packet for anything in the other region (10.2.0.0/16), they will send it to the load balancer in front of the firewalls in that remote region.
This requires a good IPAM concept and can be complex to maintain with many regions.
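To get a feeling for the maintenance effort of option 6a, the following Python sketch derives the static routes that each hub’s firewall subnet would need from a small region inventory. The region summaries follow the example above, while the hub names and the load balancer IP addresses are assumptions for illustration.

```python
# Hypothetical inventory: one summary prefix and one firewall load balancer per hub.
REGIONS = {
    "hub1": {"summary": "10.1.0.0/16", "firewall_lb": "10.1.0.100"},
    "hub2": {"summary": "10.2.0.0/16", "firewall_lb": "10.2.0.100"},
    "hub3": {"summary": "10.3.0.0/16", "firewall_lb": "10.3.0.100"},
}


def remote_region_routes(local_hub, regions):
    """Static routes for the local firewall subnet: one per remote region."""
    return [
        {"prefix": region["summary"], "next_hop": region["firewall_lb"]}
        for name, region in sorted(regions.items())
        if name != local_hub
    ]


def total_routes(regions):
    """Number of static routes to maintain across all hubs."""
    return sum(len(remote_region_routes(hub, regions)) for hub in regions)
```

Every hub needs one route per remote region, so the total grows as N x (N - 1) with the number of regions N. With three regions that is already six routes, and each region without a clean summary prefix multiplies that further.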
Option 6b: What if you cannot easily summarize each region’s prefixes? Or what if you cannot have load balancers for the internal firewall NICs? Another design choice is connecting the SDWAN NVAs in different Azure regions to each other.
From the perspective of an Azure region, on-premises branches and remote Azure regions would be indistinguishable: both are reachable via the SDWAN appliances.
This option puts more load on the SDWAN data plane and control plane (number of routes), but it is more dynamic. Here again, you would have to be careful with the route limits of Route Server, if you have one in the design. Of course, this design option is possible when using static routing, or dynamic routing with Route Server:
If using dynamic routing with Azure Route Server so that the SDWAN NVAs both learn and advertise prefixes to/from the Virtual Network, you need to be careful with the BGP Autonomous System Numbers (ASN). Azure Route Server today always has the ASN 65515, and it is not configurable. Consequently, if one of the NVAs on the left region in the diagram learns a route from its local Route Server, the AS path will include 65515. When the route is propagated to the region on the right, the NVAs should strip that ASN from the routes before sending them to their Route Server (this feature is sometimes called as-override), otherwise the Route Server would drop the routes when seeing its own ASN in them, thinking that there is a routing loop.
If your NVA does not support a feature like as-override, you need to either go for design option 3a (the SDWAN NVAs would generate their routes statically, instead of learning them from Azure Route Server) or 6a (using global VNet peerings for inter-region communication).
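The AS path problem described for option 6b can be reproduced with a tiny Python model of BGP loop prevention. The Route Server ASN is the one mentioned above; the NVA ASNs are hypothetical, and the as-override function models the common behavior of replacing the offending ASN with the local one, which may differ in detail from what your SDWAN vendor implements.

```python
ROUTE_SERVER_ASN = 65515  # fixed ASN of Azure Route Server, as stated above

# Hypothetical private ASNs for the SDWAN NVAs in each region.
LEFT_NVA_ASN = 65001
RIGHT_NVA_ASN = 65002


def accepts(receiver_asn, as_path):
    """BGP loop prevention: reject a route whose AS path has the receiver's ASN."""
    return receiver_asn not in as_path


def as_override(as_path, peer_asn, local_asn):
    """Replace every occurrence of the peer's ASN with the local ASN."""
    return [local_asn if asn == peer_asn else asn for asn in as_path]


# AS path of a left-region prefix as the right-hand NVA would advertise it:
# originated by the left Route Server, then passed through both NVAs.
PATH = [RIGHT_NVA_ASN, LEFT_NVA_ASN, ROUTE_SERVER_ASN]
OVERRIDDEN = as_override(PATH, ROUTE_SERVER_ASN, RIGHT_NVA_ASN)
```

In this model the right-hand Route Server rejects the original path because it contains 65515, and accepts the overridden one because that ASN no longer appears in it.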
Option 6c: of course, you could build overlay tunnels across your firewalls too. I have hardly ever seen this option, since it is too complex for most organizations. I will not even create a diagram for this one, please excuse my laziness.
Options 6a and 6b would both work if using Azure Firewall as the firewalling device.
What’s your design?
So what are you going to do? Will you route everything statically (options 1a/2a/3a/4a/5/6a)? Or will you go full-on BGP with Azure Route Server (options 1b/2a/3c/4b/5/6b)? Or maybe a mix, where you do dynamic routing for certain aspects of the design, and static routing otherwise?
Please let me know with your comments, and do not hesitate to reach out if I have forgotten any requirement or dependency!