I was in Salt Lake City most of this week. Being surrounded by stark snow-covered mountains made for some wonderful scenery... it could not be more different than it is here in Texas. Plus I got to meet and greet with a bunch of Novell and NetIQ people. And eat an enormous bone-in ribeye that no human being has any business eating in one sitting.
But anyway, here's a little AD mystery I ran into a couple of weeks ago, and it may not be as simple as you first think.
As Active Directory admins, something we're probably all familiar with is member servers authenticating with the "wrong" domain controller. By wrong, I mean a DC that is in a different site than the member server when there's a perfectly fine DC right there in the same site, so the member server incurs cross-site communication it doesn't need. Everything might still function well enough as long as the DC and the member server can talk to each other, but now you're loading your slower inter-site WAN links with AD traffic for no good reason. You should want your AD replication, group policy application, DFS referrals, etc., to run like a well-oiled machine.
I often work in a huge environment with AD sites in many countries and on multiple continents, and thousands of little /26 subnets that can't always be easily grouped into a predictable supernet for the purposes of linking subnets to sites in AD Sites & Services. So I'm always alert to the fact that if I log on to a server and logon takes an abnormally long time, I very well could be logging on to the wrong DC. First, I run set log (which lists the environment variables beginning with "log", including LOGONSERVER) to see which DC I have logged on to:
*DC01 is in Amsterdam*
So in this case, I noticed that while I had logged on to a member server in Dallas, that server's logon server was a DC in Europe. :(
You immediately think "The server's IP subnet isn't defined in AD Sites & Services or is associated with the wrong site," don't you? Yeah, me too. So I went and checked. Lo and behold, the server's IP subnet was properly defined and associated with the correct site in AD.
Now we have a puzzle. Back on the member server, I run nltest /dsgetsite to verify that the domain member knows which site it belongs to. (The domain member's NetLogon service stores this in the registry in the DynamicSiteName value once it's discovered.)
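For anyone who wants to sanity-check the subnet side of things without clicking through the console, here is a minimal Python sketch of the idea: a client maps to the site of the most specific subnet that contains its IP address. The subnets and IP addresses below are made up for illustration; the real table is whatever you have defined in AD.

```python
import ipaddress

# Hypothetical subnet-to-site table; the real one lives in AD Sites & Services.
SUBNET_TO_SITE = {
    "10.20.0.0/16": "Amsterdam",
    "10.20.30.0/26": "Arlington",
}

def site_for_ip(ip, subnet_to_site):
    """Return the site of the most specific subnet containing ip, or None."""
    addr = ipaddress.ip_address(ip)
    best = None  # (network, site) of the longest matching prefix so far
    for cidr, site in subnet_to_site.items():
        net = ipaddress.ip_network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, site)
    return best[1] if best else None

print(site_for_ip("10.20.30.17", SUBNET_TO_SITE))
```

An address that falls in no defined subnet comes back as None, which is the "subnet isn't defined" case everybody suspects first.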
I also ran nltest /dsgetdc:domain.com /Account:server01$ to essentially emulate the DC locator and selection process for that server, which basically just confirmed what we already knew:
C:\Users\Administrator>nltest /dsgetdc:domain.com /Account:server01$
DC: \\DC01.DOMAIN.COM (In Amsterdam)
Dom Guid: blah-blah-blah
Dom Name: DOMAIN.COM
Forest Name: DOMAIN.COM
Dc Site Name: Amsterdam
Our Site Name: Arlington
Flags: GC DS LDAP KDC TIMESERV WRITABLE DNS_DC DNS_DOMAIN DNS_FOREST FUL
The command completed successfully
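If you have a lot of servers to check, eyeballing that output gets old fast. Here is a rough Python sketch that parses text shaped like the output above and flags a site mismatch. It assumes the simple "Name: value" layout shown here; the function names are my own, and the sample is trimmed down from the output above.

```python
def parse_nltest(text):
    """Split 'Name: value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields

def is_cross_site(fields):
    """True when both site names are present and they differ (case-insensitive)."""
    dc_site = fields.get("Dc Site Name")
    our_site = fields.get("Our Site Name")
    if dc_site is None or our_site is None:
        return False
    return dc_site.lower() != our_site.lower()

# Trimmed sample based on the output above.
SAMPLE = "\n".join([
    r"DC: \\DC01.DOMAIN.COM",
    "Dom Name: DOMAIN.COM",
    "Dc Site Name: Amsterdam",
    "Our Site Name: Arlington",
    "The command completed successfully",
])

fields = parse_nltest(SAMPLE)
print(fields["Dc Site Name"], fields["Our Site Name"], is_cross_site(fields))
```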
So where do we look next if there's no problem with the IP subnets in AD Sites & Services? I'm going with DNS. We know that domain controllers register site-specific SRV records so that clients who know to which site they belong will know what DNS query to make to find domain controllers specific to their own site. So what DNS records did we find for the Arlington site?
Forward Lookup Zones
_kerberos SRV NewYorkDC
_kerberos SRV SanDiegoDC
_kerberos SRV MadridDC
_kerberos SRV ArlingtonDC
_ldap SRV NewYorkDC
_ldap SRV SanDiegoDC
_ldap SRV MadridDC
_ldap SRV ArlingtonDC
OK, now things are getting weird. All of these other domain controllers that are not part of the Arlington site have registered their SRV records in the Arlington site. The only way I can imagine that happening is Automatic Site Coverage, whereby domain controllers register their own SRV records into sites that are detected to have no domain controllers of their own... combined with the fact that scavenging is turned off on the DNS server, including for the _msdcs zone. So someone, once upon a time, must have created the Arlington site in AD before the actual domain controllers for Arlington were ready. What's more, Automatic Site Coverage is supposed to intelligently use site link costing so that only the domain controllers in the next closest site provide "coverage" for the site with no DCs, not every DC in the domain. Turns out the domain did not have a site link strategy either - it used DEFAULTIPSITELINK for everything - the entire global infrastructure. So even after Arlington did get some domain controllers, the SRV records from all the other DCs stayed there because nothing was scavenging them.
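Spotting those leftovers boils down to a set comparison: take the SRV targets registered under the site and subtract the DCs that actually live there. A minimal Python sketch, using the DC names from the listing above and assuming (my assumption, for illustration) that ArlingtonDC is the only DC really in the site:

```python
def foreign_srv_targets(srv_targets, dcs_in_site):
    """Return, sorted, the SRV targets that are not DCs of the site itself."""
    local = {name.lower() for name in dcs_in_site}
    return sorted({t for t in srv_targets if t.lower() not in local})

# SRV targets seen under the Arlington site in DNS.
arlington_srv = ["NewYorkDC", "SanDiegoDC", "MadridDC", "ArlingtonDC"]

print(foreign_srv_targets(arlington_srv, ["ArlingtonDC"]))
# prints ['MadridDC', 'NewYorkDC', 'SanDiegoDC']
```

Anything this returns is either legitimate site coverage or a stale record, and deciding which still takes a human.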
Here's the thing though - did you notice that almost every other domain controller in the domain had SRV records registered in the Arlington site, except for the domain controller in Amsterdam that our member server actually authenticated to!?
This is getting kinda nuts. So what else, besides the DNS query, does a member server do in order to locate a suitable domain controller?
After a client does a DNS query for _ldap._tcp.SITENAME._sites.dc._msdcs.domain.com and gets a response, it then sends LDAP queries (the so-called LDAP ping) to the DCs given in the DNS response to make sure that the DCs are alive and servicing requests. If you want to see this for yourself, I recommend starting Wireshark and then restarting the NetLogon service while the capture is running. If it turns out that none of the DCs returned by the site-specific DNS query is responding to those LDAP queries, then the client has to back up and try again with a domain-wide query.
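To make the two-step lookup concrete, here is a small Python sketch that builds the SRV names in the order a client would try them: site-specific first, then domain-wide. The names follow the _ldap._tcp.<site>._sites.dc._msdcs.<domain> pattern that DCs register under the _msdcs zone; the function itself is just my illustration, not anything NetLogon exposes.

```python
def locator_srv_names(domain, site=None):
    """Return the SRV names to query, most specific first."""
    names = []
    if site:
        # Site-specific record: only DCs registered for this site.
        names.append(f"_ldap._tcp.{site}._sites.dc._msdcs.{domain}")
    # Domain-wide record: every DC in the domain.
    names.append(f"_ldap._tcp.dc._msdcs.{domain}")
    return names

for name in locator_srv_names("domain.com", "Arlington"):
    print(name)
```

You can feed either name to nslookup with the query type set to SRV and compare what comes back with what you expect to be in the site.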
And that is what was happening. The client, server01, was getting a list of DCs for its site, even ones that were erroneously there, but I confirmed that it was unable to contact any of those domain controllers over port 389. So after that failed, the server was forced to try again with a domain-wide query, where it finally found one domain controller that it could perform an LDAP query on... a domain controller in Amsterdam.
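The fallback itself is easy to model. Below is a simplified Python sketch of the behavior described above: try the site-specific candidates, and only if none answers, fall back to the domain-wide list. The reachability check is passed in as a function so you can plug in whatever probe you like; here it is faked with a set containing only DC01, which mirrors what I saw. The real locator does more than this (weighting, caching, and so on), so treat it as an illustration only.

```python
def locate_dc(site_candidates, domain_candidates, is_reachable):
    """Return (dc, scope) for the first reachable DC, preferring the site list."""
    for dc in site_candidates:
        if is_reachable(dc):
            return dc, "site"
    for dc in domain_candidates:
        if is_reachable(dc):
            return dc, "domain"
    return None, None

# Candidates from the site-specific records, and a domain-wide list that
# also includes DC01 in Amsterdam.
site_list = ["NewYorkDC", "SanDiegoDC", "MadridDC", "ArlingtonDC"]
domain_list = site_list + ["DC01"]

# Pretend only DC01 answers on port 389, as in this story.
reachable = {"DC01"}

print(locate_dc(site_list, domain_list, lambda dc: dc in reachable))
# prints ('DC01', 'domain')
```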
Moral of the story: Always blame the network guys.