Why Are You Talking To THAT Domain Controller!?

by Ryan 9. February 2013 11:05

I was in Salt Lake City most of this week. Being surrounded by stark snow-covered mountains made for some wonderful scenery... it could not be more different than it is here in Texas. Plus I got to meet and greet with a bunch of Novell and NetIQ people. And eat an enormous bone-in ribeye that no human being has any business eating in one sitting.

But anyway, here's a little AD mystery I ran in to a couple weeks ago, and it may not be as simple as you first think.

As Active Directory admins, something we're probably all familiar with is member servers authenticating with the "wrong" domain controller. By wrong, I mean a DC that is in a different site than the member server, when there's a perfectly fine DC right there in the same site as the member server, and so the member server is incurring cross-site communication when it doesn't need to be. Everything might still function well enough as long as the communication between DC and member server is successful, but now you're saturating your slower inter-site WAN links with AD traffic when you don't need to be. You should want your AD replication, group policy application, DFS referrals, etc., to run like a well-oiled machine.

I often work in a huge environment with AD sites in many countries and on multiple continents, and thousands of little /26 subnets that can't always be easily grouped into a predictable supernet for the purposes of linking subnets to sites in AD Sites & Subnets. So I'm always alert to the fact that if I log on to a server, and I notice that logon takes an abnormally long time, I very well could be logging on to the wrong DC. First, I run set log to see which DC I have logged on to:

set log*DC01 is in Amsterdam*

So in this case, I noticed that while I had logged on to a member server in Dallas, that server's logon server was a DC in Europe. :(

You immediately think "The server's IP subnet isn't defined in AD Sites & Services or is associated to the wrong site," don't you?  Yeah, me too. So I went and checked. Lo and behold, the server's IP subnet was properly defined and associated to the correct site in AD.

Now we have a puzzle. Back on the member server, I run nltest /dsgetsite to verify that the domain member does know to which site it belongs. (Which the domain member's NetLogon service stores in the registry in the DynamicSiteName value once it's discovered.)

I also ran nltest /dsgetdc:domain.com /Account:server01$ to essentially emulate the DC locator and selection process for that server, which basically just confirmed what we already knew:

C:\Users\Administrator>nltest /dsgetdc:domain.com /Account:server01$ 
           DC: \\DC01.DOMAIN.COM (In Amsterdam) 
      Address: \\10.0.2.55 
     Dom Guid: blah-blah-blah 
     Dom Name: DOMAIN.COM 
  Forest Name: DOMAIN.COM 
 Dc Site Name: Amsterdam 
Our Site Name: Arlington 
        Flags: GC DS LDAP KDC TIMESERV WRITABLE DNS_DC DNS_DOMAIN DNS_FOREST FUL
L_SECRET WS 
The command completed successfully

So where do we look next if there's no problem with the IP subnets in AD Sites & Services?  I'm going with DNS. We know that domain controllers register site-specific SRV records so that clients who know to which site they belong will know what DNS query to make to find domain controllers specific to their own site.  So what DNS records did we find for the Arlington site?

Forward Lookup Zones
    _msdcs.domain.com
        dc
            _sites
                Arlington
                    _kerberos SRV NewYorkDC
                    _kerberos SRV SanDiegoDC
                    _kerberos SRV MadridDC
                    _kerberos SRV ArlingtonDC
                    _ldap     SRV NewYorkDC
                    _ldap     SRV SanDiegoDC
                    _ldap     SRV MadridDC
                    _ldap     SRV ArlingtonDC

OK, now things are getting weird.  All of these other domain controllers that are not part of the Arlington site have registered their SRV records in the Arlington site.  The only way I can imagine that happening is because of Automatic Site Coverage, whereby domain controllers will register their own SRV records into sites where it is detected that the site has no domain controllers of its own... combined with the fact that scavenging is turned off for the DNS server, including the _msdcs zone.  So someone, once upon a time, must have created the Arlington site in AD before the actual domain controllers for Arlington were ready.  What's more is that Automatic Site Coverage is supposed to intelligently use site link costing so that only the domain controllers in the next closest site provide "coverage" for the site with no DCs, not every DC in the domain. Turns out the domain did not have a site link strategy either - it used DEFAULTIPSITELINK for everything - the entire global infrastructure. So even after Arlington did get some domain controllers, the SRV records from all the other DCs stayed there because of no scavenging.

Here's the thing though - did you notice that almost every other domain controller in the domain had SRV records registered in the Arlington site, except for the domain controller in Amsterdam that our member server actually authenticated to!?

This is getting kinda' nuts.  So what else, besides the DNS query, does a member server perform in order to locate a suitable domain controller?

So after a client does a DNS query for _ldap._tcp.SITENAME._sites.ForestDnsZones.domain.com, and gets a response, the client then begins to do LDAP queries against the DCs given in the DNS response to make sure that the DCs are alive and servicing requests. If you want to see this for yourself, I recommend starting Wireshark, and then restarting the NetLogon service while the capture is running. If it turns out that none of the DCs in the list that was returned by the site-specific DNS query is responding to your LDAP queries, then the client has to back up and try again with a domain-wide query.

And that is what was happening. The client, server01, was getting a list of DCs for its site, even ones that were erroneously there, but I confirmed that it was unable to contact any of those domain controllers over port 389. So after that failed, the server was forced to try again with a domain-wide query, where it finally found one domain controller that it could perform an LDAP query on... a domain controller in Amsterdam.

Moral of the story: Always blame the network guys.

 

Comments (5) -

Rafael France
8/5/2013 12:39:47 PM #

Hi !

We've exactly the same problem, what was the solution to fix that ?

Thank you !

Reply

Ryan United States
8/5/2013 1:31:03 PM #

In this case, even though the sites and subnets were configured correctly in AD, the client did not have network connectivity to the DC in its own site because of a poorly configured firewall, so the client was doing what it was designed to do and going outside of its site to find a DC.

Reply

James Cate
1/21/2014 3:21:51 PM #

And that is what was happening. The client, server01, was getting a list of DCs for its site, even ones that were erroneously there, but I confirmed that it was unable to contact any of those domain controllers over port 389. "So after that failed, the server was forced to try again with a domain-wide query", where it finally found one domain controller that it could perform an LDAP query on... a domain controller in Amsterdam.

When you say, "So after that failed, the server was forced to try a domain-wide query". What did it look up specifically for a "domain-wide" query?

Reply

Ryan Ries
1/22/2014 5:02:26 PM #

The client queries for the name _LDAP._TCP.dc._msdcs.domainName and receives a list of all domain controllers in the domain, essentially in random order.  It then walks the list, attempting LDAP binds with each server in the list until it finds one it can talk to.

Reply

Eric
4/8/2014 3:25:21 AM #

I love the moral of the story :)

Reply

Add comment

About Me

Name: Ryan Ries
Location: Texas, USA
Occupation: Systems Engineer 

I am a Windows engineer and Microsoft advocate, but I can run with pretty much any system that uses electricity.  I'm all about getting closer to the cutting edge of technology while using the right tool for the job.

This blog is about exploring IT and documenting the journey.


Blog Posts (or Vids) You Must Read (or See):

Pushing the Limits of Windows by Mark Russinovich
Mysteries of Windows Memory Management by Mark Russinovich
Accelerating Your IT Career by Ned Pyle
Post-Graduate AD Studies by Ned Pyle
MCM: Active Directory Series by PFE Platforms Team
Encodings And Character Sets by David C. Zentgraf
Active Directory Maximum Limits by Microsoft
How Kerberos Works in AD by Microsoft
How Active Directory Replication Topology Works by Microsoft
Hardcore Debugging by Andrew Richards
The NIST Definition of Cloud by NIST


MCITP: Enterprise Administrator

VCP5-DCV

Profile for Ryan Ries at Server Fault, Q&A for system administrators

LOPSA

GitHub: github.com/ryanries

 

I do not discuss my employers on this blog and all opinions expressed are mine and do not reflect the opinions of my employers.