TL;DR: Windows, dammit.
Backplot: The infrastructure at this one particular site has its origins in Engineers who know a whole lot about engineering but not so much about IT. So, being the cheap, er, thrifty bunch that they are, they built their whole infrastructure with Linux. Linux DNS, Linux NIS, Linux DHCP. Then along came a nifty IT salesman who threw around a lot of words like "industry best practice" and convinced them that Windows was The Way Of The Future, because Windows laptops are a fact of life even in this environment and getting a consistent handle on everything would be a good thing.
Implementing this Industry Best Practice would mean:
- a Windows AD, meaning
- a Windows AD controller, meaning
- Windows DNS, so why not
- Windows DHCP, and
- IDMU to provide pseudo-NIS to the Linux systems, so why not go the whole hog and do
- Kerberos for Linux nodes so everyone has a consistent username and password everywhere.
Well, people in suits like phrases like Industry Best Practice, so the go was given, and along comes Your Hero to implement this Best Practice. And aside from some stupid things, progress is made. Active Directory is Active Directoring, all Windows systems which are not Home-strength are joined to the domain, the legacy NIS is imported into AD and then all Linux nodes are joined to the domain, DHCP is converted, and everyone starts using the new infrastructure.
And life is good.
Until it comes time to clean up the old infrastructure.
The old DHCP server got turned off when the Windows one was provisioned. Provisioning it was a bit of a pain, because these Engineers like the idea of reservation-only DHCP scopes. So all the reservations had to be extracted from the existing server, translated into Windows-speak, and imported there.
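For the curious, the extract-and-translate step looks roughly like the sketch below. It rests on assumptions the story doesn't confirm: that the old server was ISC dhcpd with `host` blocks, and that the Windows side takes a CSV whose columns are named after the parameters of the `Add-DhcpServerv4Reservation` PowerShell cmdlet. The scope ID, host name, and addresses are made up for illustration.

```python
import re

# ASSUMPTION: the old config is ISC dhcpd style, e.g.
#   host wkstn01 { hardware ethernet 00:11:22:aa:bb:cc; fixed-address 192.168.10.21; }
HOST_RE = re.compile(r"\bhost\s+(?P<name>[\w.-]+)\s*\{(?P<body>[^}]*)\}", re.S)
MAC_RE = re.compile(r"hardware\s+ethernet\s+([0-9A-Fa-f:]{17})\s*;")
IP_RE = re.compile(r"fixed-address\s+([\d.]+)\s*;")


def parse_reservations(conf_text):
    """Return one dict per host block that has both a MAC and a fixed address."""
    found = []
    for match in HOST_RE.finditer(conf_text):
        mac = MAC_RE.search(match.group("body"))
        ip = IP_RE.search(match.group("body"))
        if mac and ip:  # skip host blocks that are not real reservations
            found.append({
                "name": match.group("name"),
                "mac": mac.group(1).lower(),
                "ip": ip.group(1),
            })
    return found


def to_windows_csv(reservations, scope_id):
    """Render reservations as CSV text; column names are an assumed import format."""
    lines = ["ScopeId,IPAddress,ClientId,Name"]
    for res in reservations:
        client_id = res["mac"].replace(":", "-")  # dash-separated MAC
        lines.append(f"{scope_id},{res['ip']},{client_id},{res['name']}")
    return "\n".join(lines)


if __name__ == "__main__":
    sample = """
    host wkstn01 {
      hardware ethernet 00:11:22:AA:BB:CC;
      fixed-address 192.168.10.21;
    }
    """
    print(to_windows_csv(parse_reservations(sample), "192.168.10.0"))
```

A regex is good enough for a flat list of host blocks; anything with nested groups or includes would want a real parser.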
There are two DHCP servers, both with identical views of the world -- a list of reservations (which match) and no freely available IPs. Since there is really no dynamic information to store, there is no HA or clustering between the two DHCP servers.
Same with the DNS. Since everything was statically entered, it all had to be statically entered again. (A couple of times, actually, since the first import didn't make static entries, but that's Your Hero's fault.)
Even the NIS import worked fairly well, if rather clumsily. Your Hero is frankly impressed that it worked at all.
So to recap: the old DHCP server is turned off.
So that the AD migration can be done, the DHCP scopes are pointed to the new servers and the world is rebooted.
Migration to the new NIS servers happens as part of the Kerberos migration, and exposes some weaknesses in the Engineers' network theory. See, they discovered that NIS servers and domain names are something that can be fed to Linux hosts as DHCP options. Which is a FANTASTIC idea, until you are doing a transition and some systems are on the old name/servers, and some are on the new name/servers. Which means every time you do a system conversion you have to whack the DHCP lease for that system to explicitly add the NIS domain and servers. Or hilarity ensues.
Pro tip: just because you CAN put something in DHCP doesn't mean that you SHOULD.
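To make the bookkeeping concrete, here is a minimal sketch of working out which reservations need an explicit override mid-transition. Option codes 40 (NIS domain) and 41 (NIS servers) are the standard DHCP option numbers; everything else (host names, domain names, addresses, and the idea of tracking converted hosts in a set) is hypothetical.

```python
NIS_DOMAIN_OPT = 40   # DHCP option: NIS domain name
NIS_SERVERS_OPT = 41  # DHCP option: NIS server addresses


def reservation_overrides(hosts, converted, scope_defaults, new_nis):
    """Return per-host options for hosts whose NIS settings differ from the scope.

    hosts          -- iterable of host names with reservations
    converted      -- set of hosts already moved to the new NIS domain
    scope_defaults -- options every host inherits from the scope
    new_nis        -- options a converted host must receive
    """
    overrides = {}
    for host in hosts:
        wanted = new_nis if host in converted else scope_defaults
        if wanted != scope_defaults:  # only deviations need an explicit entry
            overrides[host] = wanted
    return overrides


if __name__ == "__main__":
    # All values below are made up for illustration.
    old = {NIS_DOMAIN_OPT: "oldnis", NIS_SERVERS_OPT: ["192.168.10.2"]}
    new = {NIS_DOMAIN_OPT: "newnis", NIS_SERVERS_OPT: ["192.168.10.5"]}
    print(reservation_overrides(["lnx01", "lnx02"], {"lnx02"}, old, new))
```

Every entry in that dictionary is one more place for the truth to drift, which is rather the point of the pro tip.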
Anyways. Everyone is converted and the NIS servers are turned off.
The final step before Your Hero can mosey on into the sunset and pour himself a cold, stiff drink is to turn off the DNS server. This he does.
The next day, Your Hero starts getting phone calls from this site. The internet is not available! Well, it is for some people. But not people with Important Computers.
Working the troubleshooting tree (have you tried turning it off and on again?), Your Hero discovers:
- it is only Windows systems which are affected
- the only Windows systems affected are Domain members
- the problem is that an affected system can't do DNS resolution while it is in the broken state
- when the system is in this state, ipconfig /all shows that the affected computer has the OLD DNS server IPs as its settings -- even though the DHCP server IP it reports is correct
- hard-coding the DNS servers fixes the problem, but since some of these computers are laptops that get taken away periodically this isn't a solution that will scale
- the problem happens between 5 and 15 minutes after the system is turned on
- the problem sometimes goes away 30 minutes after that, and comes and goes
- an ipconfig /release followed by an ipconfig /renew sometimes fixes the problem, but not permanently
- a reboot usually (but not always) fixes the problem, again not permanently
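Since the symptom comes and goes, one way to pin down when a machine flips is to check saved `ipconfig /all` output against the retired DNS servers. The sketch below only parses text handed to it; it assumes the English-locale layout with dot leaders, and the addresses are placeholders rather than anything from this site.

```python
import re

# A field line looks like "   DNS Servers . . . . . . . . . . . : 192.168.10.2".
# ASSUMPTION: English-locale ipconfig output with " ." dot leaders before the colon.
FIELD_RE = re.compile(r"^\s+(?P<key>[A-Za-z][^:]*?)(?:\s\.)+\s:\s?(?P<val>.*)$")


def dns_servers(ipconfig_text):
    """Collect every DNS server listed in `ipconfig /all` output, in order."""
    servers, in_dns = [], False
    for line in ipconfig_text.splitlines():
        field = FIELD_RE.match(line)
        if field:
            in_dns = field.group("key").strip() == "DNS Servers"
            if in_dns and field.group("val").strip():
                servers.append(field.group("val").strip())
        elif in_dns and line.strip():
            servers.append(line.strip())  # continuation line: another server
        else:
            in_dns = False
    return servers


def stale_dns(ipconfig_text, retired):
    """Return the configured DNS servers that are on the retired list."""
    return [ip for ip in dns_servers(ipconfig_text) if ip in retired]


if __name__ == "__main__":
    sample = "\n".join([
        "Ethernet adapter Ethernet:",
        "",
        "   DHCP Server . . . . . . . . . . . : 192.168.10.5",
        "   DNS Servers . . . . . . . . . . . : 192.168.10.2",
        "                                       192.168.10.3",
        "   NetBIOS over Tcpip. . . . . . . . : Enabled",
    ])
    print(stale_dns(sample, {"192.168.10.2", "192.168.10.3"}))
```

Run against output captured every few minutes, this would at least timestamp the moment the old servers reappear, which is more than Your Hero currently knows.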
Now at this point it is important to note that Your Hero is a network/Linux guy, not a Windows guy. However he sees the writing on the wall that Windows is The Way Of The Immediate Future If He Wants To Continue Getting Paid, so here he is.
His tame Windows Guys found something on the internet saying that if DHCP services are moved between servers and scope changes happen at the same time, old values can sometimes remain in the active scope even though the management applications all show the appropriate values. The Solution to this problem is to delete the scopes, delete the service, reboot, reinstall, and rebuild the scopes.
Your Hero has just done this, and It Is Still Broken. I have no ending for this story; Your Hero just needs something else to do for a bit so that Windows computers don't get punched.