DNS 101: Round Robin (Or Back When I was Young And Foolish Part II)

by Ryan 14. January 2012 19:50

I learned something today. It's something that made me feel stupid for not knowing. Something that seemed elemental and trivial - yet, I did not know it. So please, allow me to relay my somewhat embarrassing learning experience in the hopes that it will save someone else from the same embarrassment.

I did know what DNS round robin was. Or at least, I would have said that I did.

Imagine you configure DNS1, as a DNS server, to use round robin. Then, you create 3 host (A or AAAA) records for the same host name, using different IPs. Let's say we create the following A records on DNS1:

server01 - A 10.0.0.4
server01 - A 10.0.0.5
server01 - A 10.0.0.6

Then on a workstation which is configured to use DNS1 as a DNS server, you ping server01. You receive 10.0.0.4 as a reply. You ping server01 again. With no hesitation, you get a reply from 10.0.0.4 again. We assume that your local workstation has cached 10.0.0.4 locally and will reuse that IP for server01 until the entry either expires, or we flush the DNS cache on the workstation with a command like ipconfig/flushdns.

I run ipconfig/flushdns. Then I ping server01 again.

This time I receive a response from 10.0.0.5. Now I assume DNS round robin is working perfectly. I go home for the day feeling like I know everything there is to know about DNS.

But was it that the DNS server is responding to DNS queries with the single next A/AAAA record that it has on file, in a round-robin type sequential fashion to every DNS query that it receives? That is what I assumed.

But the fact of the matter is that DNS servers, when queried for a host name, actually return a list of all A/AAAA records associated with that host name, every time that host name is queried for. (To a point - the list must fit within a UDP packet, and some firewalls/filters don't let UDP packets longer than 512 bytes through. That's changing though. Our idea of how big data is and should be allowed to be is always growing.)

I assume that www.google.com, being one of the busiest websites in the world, has not only some global load balancing and other advanced load balancing techniques employed, but probably also has more than one host record associated with it. To test my theory, I fire up Wireshark and start a packet capture. I then flush my local DNS cache with ipconfig/flushdns and then ping www.google.com.

Notice how I pinged it, got one IP address in response (.148), then flushed my DNS cache, pinged it again and got another different IP address (.144)? But despite what it may look like, that name server is not returning just one A/AAAA record each time I query it:


*Click for Larger*

My workstation is ::9. My workstation's DNS server is ::1. The DNS server is configured to forward DNS requests for zones for which it is not authoritative on to yet another DNS server. So I ask for www.google.com, my DNS server doesn't know, so it forwards the request. The forwardee finally finds out and reports back to my DNS server, which in turn relays back to me a list of all the A records for www.google.com. I get a long list containing not only a mess of A records, but a CNAME thrown in there too, all from a single DNS query! (We're not worried about the subsequent query made for an AAAA record right now. Another post perhaps.)

I was able to replicate this same behavior in a sanitary lab environment running a Windows DNS server and confirmed the same behavior. (Using the server01 example I mentioned earlier.)

Where round robin comes in is that it rotates the order of the list given to each subsequent client who requests it. Keep in mind that while round robin-ing the A records in your DNS replies does supply a primitive form of load distribution, it's a pretty poor substitute for real load balancing, since if one of the nodes in the list goes down, the DNS server will be none the wiser and will continue handing out the list with the downed node's IP address on it.

Lastly, since we know that our client is receiving an entire list of A records for host names which have many IP addresses, what does it actually do with the list?  Well, the ping utility doesn't do much. If the first IP address on the list is down, you get a destination unreachable message and that's it. (Leading to a lot of people not realizing they have a whole list of IPs they could try.) Web browsers however, have a nifty feature known as "browser retry" or "client retry," where they will continue trying the other IPs in the list until they find a working one. Then they will cache the working IP address so that the user does not continue to experience the same delay in web page loading as they did the first time. Yes, there are exploits concerning this feature, and yes it's probably a bad idea to rely on this since browser retry is implemented differently across every different browser and operating system. It's a relatively new mechanism actually, and people may not believe you if you tell them. To prove it to them, find (or create) a host name which has several bad IPs and one or two good ones. Now telnet to that hostname. Even telnet (a modern version from a modern operating system) will use getaddrinfo() instead of gethostbyname() and if it fails to connect the first IP, you can watch it continue trying the next IPs in the list.

More info here, here and here. That last link is an MSDN doc on getaddrinfo(). Notice that it does talk about different implementations on different operating systems, and that ppResult is "a pointer to a linked list of one or more addrinfo structures that contains response information about the host."

Tags:

BWIWYAF | IT Professional | Linux | Software | Windows | Windows Server

Back When I Was Young and Foolish (Part I of Oh-So-Many)

by Ryan 13. December 2011 07:20

I'd like to do a little reminiscing now, of a time when I was young and foolish. I anticipate many such posts, which I guess means I spend a lot of time being young and foolish.

It was my first real IT job, and it'd be fair to call me a sysadmin. I was basically just an assistant to the lead architect. I was doing things like setting up backup schedules in DPM and creating user accounts in AD. Just the usual stuff you'd expect a guy first getting started in IT would be doing.

One day, my boss and I were discussing one of the many issues caused by being in the middle of a domain migration, and that was that we currently had employees working in two different Windows domains simultaneously. For various reasons, there would not and could not be a trust relationship established between the two. To further complicate matters, we now needed two separate ISA (now known as TMG) servers at the office, one for each domain, so as to keep the employees off of Facebook and ESPN.com. One might suggest that it was a social problem that should be dealt with by the employees' managers, but the managers were among the worst offenders. But I digress.

Now since both of these domains are on the same subnet, we can really only have one DHCP server. So if we only have one DHCP server, how are we going to push out two different sets of options to the clients, such as which proxy server to use, to members of either domain?

We settled on using two different DHCP classes.

So while we were fiddling around with the DHCP server, for no good reason we decided that right now was a great time to screw around with the basic settings of a DHCP server that had been serving us faithfully with no problems for years.

What is a good DHCP lease duration, really? Is it the default of 8 days? Well I suppose that depends on a lot of factors, such as how mobile your DHCP clients are and how often you expect them to be connecting to and disconnecting from the network, how many DHCP addresses you have to distribute, etc..  But I've already put more thought into it just now than we did on that day, for we almost completely arbitrarily set the DHCP lease duration to 30 minutes.

Fast forward a few months.

Everything had been working great and the office DHCP server, as DHCP servers should be, was all but forgotten about. I had just been offered a much higher-paying job with a much larger IT company, so I was now a short-timer at this job. Then, out of the blue, our primary domain controller (the one hosting all the FSMOs) at our remote datacenter goes dark.  I can still ILO into it, but both of its network adapters are just gone. No error messages in the logs, no sign that Windows had ever known that it did once have network connections, nothing.  It just looked like it had never had network adapters in the first place.  (This still perturbs me to this day. Why wasn't there an event in either the Windows logs or in the IML that said "Uhh dude where did your network adapters go!?")

After seizing the FSMO roles on the backup DC, we took a trip out to the datacenter to have a look. Sure enough, the link lights on the physical hardware were out, and upon rebooting, the BIOS of the machine didn't even recognize that it had ever had network adapters in it. So I had HP come out and replace the motherboard, which fixed the issue. I don't know what the point of having redundant NICs is if they're wired to both blow at the same time, but I digress again...

So now I have essentially a brand new server out of what used to be our primary domain controller. The question, my boss wanted to know, was how do we restore a former DC after its FSMO roles have been seized? My answer: You don't. You never bring it back. Ever. If I don't wipe the hard drive right now, it thinks it's still the owner of those FSMO roles, and the only way for it to learn that it no longer owns those FSMO roles is to connect it back to the domain network and let the KCC do its magic.  What's going to happen in that span of time that we now have two DCs in the domain that both simultaneously think that they own all the FSMOs? I didn't want to find out. So I argued that we just rebuild it and I eventually got my way.

Now since I had to rebuild the machine, I figured now was as good a time as any to upgrade it from Windows 2008 to Windows 2008 R2. Having spent the last several months studying my ass off for the MCITP tests and passing them, I was feeling pretty self-confident. I took the freshly rebuilt server back out to the datacenter. I re-cabled it and re-racked it, and powered it on. I grabbed the 2008 R2 version of adprep.exe and on the other domain controller, used it to run adprep /forestprep followed by adprep /domainprep. The domain was now ready to accept a 2008 R2 domain controller. I dcpromo'ed the new machine, and it went without a hitch. I then gracefully transferred the FSMOs back from the other DC. (I didn't alter the actual functional levels of the forest or domain.)

Now I had one new, fresh shiny 2008 R2 domain controller in the datacenter, with 5 other DCs in 2 other sites that were still on Windows 2008. Still high off my earlier success, I felt like this was a great time to do an in-place upgrade on the rest of those DCs.  One by one, I upgraded them from 2008 to 2008 R2, and you know what? I was actually pleasantly surprised at how smooth and painless it was. My hat goes off to Microsoft for making an in-place upgrade of a production domain controller so seamless.  I just remoted into the server, mounted the 2008 R2 media and hit "Upgrade."  That's all there was to it. The installation took about 30 minutes, and when the machine came back up, it was a pristine 2008 R2 box.

Everything was going great, until I got around to upgrading the last of the 6 DCs. That sixth DC was the DHCP server at the office. I followed the same procedure that had gone off without a hitch 5 times already. Confident that all was well, I set off to browse reddit while the DC upgrade finished. I was just thinking about how the upgrade was going kind of slow, when suddenly my Internet connection dropped. At the exact same instant, the phone rang.

Me: "Hello?"
Boss: "What did you do?"
Me: "Wha, I don't, I ..."
Boss: "RUN to the server room and set up a new DHCP scope. Now."

The entire office had just lost their Internet connections. In the intervening months, I had completely forgotten that we had set the DHCP lease duration on the domain controller to... oh, just a hair shorter than the amount of time it takes to complete an in-place upgrade on that domain controller. Had the lease time been just 5 minutes longer, the DC/DHCP server would have finished the upgrade and been back up serving renewals as if nothing had ever happened. By the time I had gotten the replacement scope set up on the other DC and done the dance of de-authorizing the first DC as as a DHCP server in Active Directory, and authorizing the new one, the original DC was back up and ready.

After about 30 minutes of repairing the damage, I sheepishly emerged from the server room, to the sound of ironic applause coming from all the employees in their cubicles.

Tags:

BWIWYAF | IT (Not Very) Professional | IT Professional | Windows Server

About Me

Name: Ryan Ries
Location: Texas, USA
Occupation: Systems Engineer 

I am primarily a Windows engineer/architect and Microsoft advocate, but I can run with pretty much any system that uses electricity.  I'm all about getting closer to the cutting edge of technology, and using the right tool for the job.

This blog is about exploring IT and documenting the journey.

 

MCITP: Enterprise Administrator

Profile for Ryan Ries at Server Fault, Q&A for system administrators

LOPSA