If you haven’t seen my enraged tweets from yesterday, something extremely odd happened to me while using VMware ESXi 3.5. After having strengthened the redundancy of our production servers’ virtual switch, we decided to add fault tolerance to the Service Console virtual switch as well. Everything had been planned and identified weeks ahead: which NIC adapter, which patch panel ports and which switches, all of course using different cable routes to allow for fully resilient connectivity.
The D-Day finally came and we decided to proceed. The unplugged virtual NIC adapters were added to the virtual switches, cables were patched between the physical ESX servers and the patch panels, and finally we patched from the patch panel to the network switches hosting the existing Service Console connectivity. Switch ports flashed orange, then green… link was up. Victory! We headed back to the office to check our brilliant result in vCenter.
With a smile, I noticed both vmnics up and connected on the Service Console vSwitch. And suddenly, darkness. Lost connectivity to the VMware vCenter Client for Virtual Infrastructure. Pinging the vCenter host. Dead. Checking file server network shares. Also dead. Print server? Dead. All our virtual machines unresponsive. Quickly, while cursing, off to the comms room to unpatch the suspect NICs. So, a network issue?
Worse than that. I connected to both ESX consoles; one of them had gone into Isolated Mode, and the default VMware High Availability behavior had powered off all our servers… I quickly removed the NICs from the vSwitches through the command line (it’s esxcfg-vswitch vSwitch0 -U vmnic7 if you want to remove vmnic7 from vSwitch0, for example), then logged on to each ESX server separately with the vCenter Client to check the extent of the damage…
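For reference, this is roughly the sequence from the ESX 3.5 service console (using the vmnic7/vSwitch0 names from the example above; substitute your own adapter and vSwitch names):

```shell
# List all virtual switches with their uplinks and port groups,
# to identify which vmnic is attached where
esxcfg-vswitch -l

# Unlink (remove) the suspect uplink vmnic7 from vSwitch0
esxcfg-vswitch vSwitch0 -U vmnic7

# List again to confirm the uplink is gone
esxcfg-vswitch -l
```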
Phew… All virtual machines on node 1 had been properly shut down (Guest OS shutdown). All virtual machines on node 2 were up and running. We started vCenter, then the other virtual machines; after a few minutes, the time for all these folks to boot up, we were back to normal, save for the stress, the adrenaline and the wicked excitement of working under real pressure.
Lessons learned? Quite a few.
- Shit Happens: a taste of Murphy’s Law
We consulted today with our 3rd-party provider, who couldn’t see a reason why it failed. We dug deep into the matter together, and found that the only plausible explanation is that the vmnic was added to the vSwitch before the port was alive on the switch. Combined with the default Service Console failure-detection timeout of 15000 milliseconds, this may have caused one of the ESX nodes to go into Isolated Mode, assuming the whole patching / uplink / IP-acquisition process took longer than 15 seconds. You can change this default timeout by adding the das.failuredetectiontime parameter to the VMware HA Advanced Options in the vCenter Client. Supporting this hypothesis: today we successfully added the redundancy by performing all the cable patching first, and only adding the vmnic to the Service Console vSwitch after ensuring that VMware reported it as up with a configured speed.
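A sketch of the safer sequence we used on the second attempt, assuming the ESX 3.5 service-console commands (the vmnic7/vSwitch0 names are just examples):

```shell
# Check link state and negotiated speed of all physical NICs;
# wait until the new adapter (say vmnic7) reports "Up" with a real speed
# before touching the vSwitch
esxcfg-nics -l

# Only then link it as an uplink to the Service Console vSwitch
esxcfg-vswitch vSwitch0 -L vmnic7
```

The das.failuredetectiontime option itself is set in the vCenter Client (cluster settings → VMware HA → Advanced Options); a value of, say, 60000 would give the host 60 seconds before declaring isolation instead of the default 15.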
- The inevitable corollary of Murphy’s Law: Things Will Fail
Because the High Availability configuration was not discussed during implementation, the Isolation Mode response had been left at its default of shutting down the virtual machines, which, because of a single timeout, sent our whole virtualized production environment down. Since the production vSwitch wasn’t affected by the loss of connectivity to the Service Console, setting the virtual machine behavior to “Keep VMs Powered On” would have made things far easier for us. This has also been amended.
- Improving the Stethoscope, Clearer Listening to Heartbeats
We changed the default teaming policy from Port ID to IP Hash, to ensure the heartbeats are received over a fully distinct path.
- When Fail Takes Over
Because things can be complicated when they could be easy, we also found out that while node 1 was in Isolated Mode, HA tried to restart our VMs on node 2. However… the VMs on node 1 were still shutting down, so their VMware swap files were still locked, which in turn prevented node 2 from successfully starting them.
As a final note to this lengthy (but hopefully useful) post: NEVER perform changes like these during business hours, even if you’ve read a thousand articles about how safe it is… because, ultimately, things will fail! 🙂