Stop Worrying and Love the Outage, Vol II: DCs, custom ports, and Firewalls/ACLs
Hello, it’s Chris Cartwright from the Directory Services support team again. This is the second entry in a series where I try to provide the IT community with some tools and verbiage that will hopefully save you and your business many hours, dollars, and frustrations. Here we’re going to focus on some major direct and/or indirect changes to Active Directory that tend to be pushed onto AD administrators. I want to arm you with the knowledge required for those conversations, and hopefully some successful deflection. After all, isn’t it better to learn the hard lessons from others, if you can?
Periodically, we get cases about replication failures, many of them involving lingering objects. Almost without fail, the cause is one of the following:
DNS
SACL/DACL size
Network communication issues
Network communications issues almost always come down to a “blinky box” in the middle that is doing something it shouldn’t be, whether due to defective hardware/software, misconfiguration, or the ever-present misguided configuration. Today, we’re going to focus on the third: the misguided configuration. That is to say, the things your compliance team has said must be done that have little to no security benefit, but can easily result in a multi-hour, multi-day, or even (yes) multi-week outage. To be fair, the portents of an outage should be readily apparent with any monitoring in place. However, sometimes reporting agents are not installed, are misconfigured, or fail to function properly, or the events themselves are missed (alert fatigue). So, one of the things to do when the compliance team starts talking about locking down DC communications is to ask them…
What is the problem you are trying to solve?
Have you been asked to isolate DCs? Create a lag site? Make sure that X DCs can only replicate with Y DCs?
The primary effect of doing any of this is alert fatigue from replication errors, which is a path to an outage later. Additionally, if you have “Bridge all site links” enabled, you are giving the KCC the wrong information about what can actually reach what, and it will build its replication topology accordingly.
Don’t permanently isolate DCs
Don’t create lag sites
Do make sure you have backups in place
Do make sure the KCC has the correct information, and then leave it alone unless your network topology changes.
Do make sure all DCs in the forest can reach all DCs in the forest (if your networks are fully routable); a quick reachability check is sketched below.
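Here is a minimal version of that check. It is only a sketch: it assumes the ActiveDirectory RSAT module is installed where you run it, and it probes just TCP 135 (RPC Endpoint Mapper) and TCP 389 (LDAP) as representative ports. For actual replication health, repadmin /replsummary remains the quickest view.

```powershell
# Quick reachability sweep: from one admin workstation or DC, test the RPC
# Endpoint Mapper (TCP 135) and LDAP (TCP 389) on every DC in the forest.
# Requires the ActiveDirectory RSAT module.
Import-Module ActiveDirectory

$allDCs = (Get-ADForest).Domains |
    ForEach-Object { Get-ADDomainController -Filter * -Server $_ }

foreach ($dc in $allDCs) {
    foreach ($port in 135, 389) {
        $result = Test-NetConnection -ComputerName $dc.HostName -Port $port -WarningAction SilentlyContinue
        '{0,-45} TCP {1,-4} reachable: {2}' -f $dc.HostName, $port, $result.TcpTestSucceeded
    }
}
```

Anything that comes back unreachable is either a topology change the KCC needs to know about or a blinky box doing something it shouldn’t.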
Have you been asked to configure DCs to restrict the RPC ports they can use?
Every AD administrator should be familiar with the firewall ports a DC needs and with how RPC works. By default, DCs register their services on ports in the RPC dynamic port range and listen for traffic there. The RPC Endpoint Mapper (TCP 135) keeps track of those ports and tells incoming requests where to go. One thing the RPC Endpoint Mapper will not do is keep track of firewall or ACL changes that were made.
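To make that concrete, here is a quick way to see the dynamic range and what is actually listening in it on a DC. This is a sketch, run from an elevated prompt on the DC itself, and it assumes the default dynamic range of 49152–65535; adjust the filter if your range has been customized.

```powershell
# Show the effective RPC dynamic port range, then list which processes on
# this DC are listening within it (default range 49152-65535 assumed).
netsh int ipv4 show dynamicport tcp

Get-NetTCPConnection -State Listen |
    Where-Object { $_.LocalPort -ge 49152 } |
    Sort-Object LocalPort |
    Select-Object LocalPort, OwningProcess,
        @{ Name = 'Process'; Expression = { (Get-Process -Id $_.OwningProcess).ProcessName } } |
    Format-Table -AutoSize
```

On a healthy DC the listeners in that range will typically include lsass.exe, which hosts the directory service itself; that is RPC working as designed, not a rogue application.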
Again, what is the security benefit here? It is one thing to control DC communications outbound from the perimeter. It is another thing to suggest that port X is more secure than port Y, especially when we’re talking about ports on which DCs are listening. If your compliance team is worried about rogue applications listening on DCs, you have bigger problems…like rogue applications existing on your DCs, presumably put there by rogue agents who now have control over your entire Active Directory.
The primary effect of “locking down” a DC in this way is not to improve security, but to mandate the creation or modification of some document with fine print, “Don’t forget to add these registry keys to avoid an outage”, that will inevitably be lost during turnover. Furthermore, going too far can lead to port exhaustion, another type of outage.
Don’t restrict AD/Netlogon to static ports without exhaustively discussing the risks involved and heavily documenting it (the registry values involved are sketched after this list).
Don’t restrict the RPC dynamic range without exhaustively discussing the risks involved and heavily documenting it.
Do restrict inbound/outbound perimeter traffic to your DCs.
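If someone has already walked this road before you, it pays to find out before the next firewall change. The sketch below only reads the registry values described in the “Restrict Active Directory RPC traffic to a specific port” article in the references; it does not set anything. Run it on the DC you are auditing.

```powershell
# Audit, don't set: report whether this DC already has the documented
# static-port registry values for NTDS (AD replication) and Netlogon.
# If they exist, make sure they are documented before anyone touches the network.
$staticPortValues = @(
    @{ Path = 'HKLM:\SYSTEM\CurrentControlSet\Services\NTDS\Parameters'
       Name = 'TCP/IP Port' },       # static port for AD replication
    @{ Path = 'HKLM:\SYSTEM\CurrentControlSet\Services\Netlogon\Parameters'
       Name = 'DCTcpipPort' }        # static port for Netlogon
)

foreach ($v in $staticPortValues) {
    $value = Get-ItemProperty -Path $v.Path -Name $v.Name -ErrorAction SilentlyContinue
    if ($null -ne $value) {
        '{0}\{1} is set to {2}' -f $v.Path, $v.Name, $value.($v.Name)
    }
    else {
        '{0}\{1} is not set (default dynamic ports in use)' -f $v.Path, $v.Name
    }
}
```

If either value comes back set and nobody on the current team knew about it, that is exactly the fine-print document that got lost during turnover.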
“Hey, you said multi-day or multi-week outages. It’s not that hard to fix replication!”
It is true that once you’ve found the network issue preventing replication, the fix itself is usually easy. However, if the “easy” fix is to rehost all of your global catalog partitions with tens of thousands of objects on 90+ DCs, requiring manual administrative intervention and a specific sequence of commands because your environment is filled with lingering objects, you’re going to be busy for a while.
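For a sense of scale, this is roughly what just the advisory-mode pass of that cleanup looks like. Everything specific below is a placeholder: the reference DC’s DSA GUID, the contoso.com partition DN, and the convenient assumption that one partition is enough, which it never is. You repeat this for every partition each DC holds, in the right order, before you run it again without /advisory_mode.

```powershell
# Advisory-mode lingering object scan of one naming context on every DC in
# the forest, compared against a known-good reference DC. Placeholders only.
$referenceDsaGuid = 'aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee'   # DSA object GUID of a clean DC (visible in repadmin /showrepl output)
$namingContext    = 'DC=contoso,DC=com'                      # repeat per partition, including GC-hosted ones

Import-Module ActiveDirectory
$allDCs = (Get-ADForest).Domains |
    ForEach-Object { Get-ADDomainController -Filter * -Server $_ }

foreach ($dc in $allDCs) {
    # /advisory_mode only reports; removing it actually deletes lingering objects.
    repadmin /removelingeringobjects $dc.HostName $referenceDsaGuid $namingContext /advisory_mode
}
```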
Wrapping it up
As the venerable Aaron Margosis said, “…if you stick with the Windows defaults wherever possible or industry-standard configurations such as the Microsoft Windows security guidance or the USGCB, and use proven enterprise management technologies instead of creating and maintaining your own, you will increase flexibility, reduce costs, and be better able to focus on your organization’s real mission.”
Security is critical in this day and age, but so is understanding the implications and the reasons beyond some check box on an audit document. Monitoring is also critical, but of little use if polluted with noise. Remember who will be mainlining caffeine all night to get operations back online when the lingering objects start rolling in, because it will not be the people who click the “scan” button once a month…
References
Creating a Site Link Bridge Design | Microsoft Learn
You Are Not Smarter Than The KCC | Microsoft Learn
Configure firewall for AD domain and trusts – Windows Server | Microsoft Learn
RPC over IT/Pro – Microsoft Community Hub
Remote Procedure Call (RPC) dynamic port work with firewalls – Windows Server | Microsoft Learn
Restrict Active Directory RPC traffic to a specific port – Windows Server | Microsoft Learn
10 Immutable Laws of Security | Microsoft Learn
Sticking with Well-Known and Proven Solutions | Microsoft Learn
Chris “Was it really worth it” Cartwright