kamshin

Dr. Redundancy, or How I Learned To Love The Isolated Mode

March 4, 2010

In case you missed my enraged tweets yesterday: something extremely odd happened to me while working with VMware ESXi 3.5. After strengthening the redundancy of our production servers’ virtual switch, we decided to add fault tolerance to the Service Console virtual switch as well. Everything was planned and identified weeks ahead: which NIC adapters, which patch panel ports and which switches, all of course using different cable routes to allow for fully resilient connectivity.

D-Day finally came and we decided to proceed. The unplugged NIC adapters were added to the virtual switches, cables were patched between the physical ESX servers and the patch panels, and finally we patched from the patch panels to the network switches that carried the existing Service Console connectivity. Switch ports flashed orange, then green… link was up. Victory! We headed back to the office to check our brilliant result in vCenter.

With a smile, I noticed both vmnics up and connected on the Service Console vSwitch. And suddenly, there was darkness. Lost connectivity to the VMware vCenter Client for Virtual Infrastructure. Pinging the vCenter host: dead. Checking the file server network shares: also dead. Print server? Dead. All our virtuals unresponsive. Quickly, while cursing, off to the comms room to unpatch the suspect NICs. So, a network issue?

Worse than that. I connected to both ESX consoles: one of them had gone into Isolated Mode. The default VMware High Availability behavior had powered off all our servers… We quickly removed the NICs from the vSwitches through the command line (it’s esxcfg-vswitch -U vmnic7 vSwitch0 if you want to remove vmnic7 from vSwitch0, for example), then logged on to each ESX server separately with the vCenter Client to check the extent of the damage…
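For reference, the unlink command and its counterpart for linking an uplink can be assembled with a small helper. This is only a sketch: `uplink_command` is a hypothetical name, and it merely builds the argument list in the option-first form (`-U` to unlink a physical NIC, `-L` to link one). It does not execute anything, so it can be tried safely on any machine.

```python
def uplink_command(action, vmnic, vswitch):
    """Build the esxcfg-vswitch argument list to link or unlink an uplink.

    Hypothetical helper for illustration: it only returns the argv list
    and never runs the command.
    """
    flags = {"link": "-L", "unlink": "-U"}
    if action not in flags:
        raise ValueError("action must be 'link' or 'unlink', got %r" % action)
    return ["esxcfg-vswitch", flags[action], vmnic, vswitch]


if __name__ == "__main__":
    # The emergency removal from the story: take vmnic7 off vSwitch0.
    print(" ".join(uplink_command("unlink", "vmnic7", "vSwitch0")))
```

Keeping the command as a list rather than a single string makes it straightforward to hand over to `subprocess.run` on a host where the tool actually exists.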

Phew… All virtuals on node 1 had been properly shut down (guest OS shutdown). All virtuals on node 2 were up and running. We started vCenter, then the other virtuals; after a few minutes, the time it took for all these folks to boot up, we were back to normal, apart from the stress, the adrenaline and the wicked excitement of working under real pressure.

Lessons learned? Quite a few.

  • Shit Happens: a taste of Murphy’s Law
    We consulted our third-party provider today, who couldn’t see a reason why it failed. We dug into the matter together and found that the only plausible explanation is that the vmnic was added to the vSwitch before the port was alive on the physical switch. Combined with the default Service Console timeout of 15000 milliseconds, this may have caused one of the ESX nodes to go into Isolated Mode, assuming the whole sequence (patching, uplink coming up, IP acquisition) took longer than 15 seconds. You can change this default timeout by adding the das.failuredetectiontime parameter to the VMware HA Advanced Options in the vCenter Client. What supports this hypothesis is that today we successfully added the redundancy by doing all the cable patching first, and only adding the vmnic to the Service Console vSwitch once VMware reported the vmnic as up with a configured speed.
  • The inevitable corollary of Murphy’s Law: Things Will Fail
    Because the High Availability configuration was never discussed during implementation, the Isolated Mode behavior was left at its default, Shut Down, which, because of a single timeout, brought our whole virtualized production environment down. As the production vSwitch wasn’t affected by the loss of connectivity to the Service Console, setting the virtual machine behavior to “Keep VMs Powered On” would have made things much easier for us. This has also been amended.
  • Improving the Stethoscope, Clearer Listening to Heartbeats
    We changed the NIC teaming load balancing policy from the default Port ID to IP Hash, to ensure the heartbeats are received over a fully distinct path.
  • When Fail Takes Over
    Because things can be complicated when they could be easy, we also found out that when node 1 went into Isolated Mode, HA tried to restart our VMs on node 2. However, the VMs on node 1 were still shutting down, so their VMware swap files were still locked, which in turn prevented node 2 from starting them.
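The chain of events in these lessons can be sketched as a small Python model. It is a deliberately simplified illustration, not how VMware HA is actually implemented: the function names, the `IsolationResponse` enum and the outage durations are hypothetical, and the model assumes a host counts as isolated as soon as its Service Console has been unreachable for longer than the failure detection time. Only the 15000 ms default and the two isolation behaviors come from the story itself.

```python
from enum import Enum

# Default das.failuredetectiontime as quoted in the post, in milliseconds.
DEFAULT_FAILURE_DETECTION_MS = 15000


class IsolationResponse(Enum):
    """Hypothetical names for the two behaviors discussed in the post."""
    SHUT_DOWN = "shut down"
    KEEP_POWERED_ON = "keep powered on"


def host_isolates(outage_ms, failure_detection_ms=DEFAULT_FAILURE_DETECTION_MS):
    """Simplifying assumption: isolated once the outage exceeds the timeout."""
    return outage_ms > failure_detection_ms


def vms_stay_up(outage_ms, response,
                failure_detection_ms=DEFAULT_FAILURE_DETECTION_MS):
    """VMs stay up if the host never isolates, or if it is told to keep them on."""
    if not host_isolates(outage_ms, failure_detection_ms):
        return True
    return response is IsolationResponse.KEEP_POWERED_ON


def restart_possible(swapfile_locked):
    """The other node cannot start a VM while its swap file is still locked."""
    return not swapfile_locked


if __name__ == "__main__":
    outage_ms = 20000  # assumed duration of the patching and link-up sequence
    for response in IsolationResponse:
        print(response.value, "->", vms_stay_up(outage_ms, response))
    # A longer, purely illustrative timeout would have ridden out the outage.
    print("60000 ms timeout ->", not host_isolates(outage_ms, 60000))
    print("restart on node 2 ->", restart_possible(swapfile_locked=True))
```

Under these assumptions, a 20-second outage with the default timeout takes the VMs down only when the response is to shut down; either keeping the VMs powered on or a longer timeout avoids the outage, and a restart on the second node stays blocked as long as the swap file lock is held.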

A final note to this lengthy (but hopefully useful) post: NEVER perform any changes during business hours, even if you’ve read a thousand articles about how safe it is, because, ultimately, things will fail! 🙂

    Copyright ©2016 kamshin