Hello everyone!
I hope you've had a nice summer and enjoyable holidays! No such fun times for me, because (if you don't know, which means you don't follow me on twitter) I've decided to move along in my career and will be starting a new role in a few days. I hope to post more once the change is fully effective. If you want to get straight to the technical stuff, skip the next two paragraphs 🙂
Writing this article took me almost two weeks (I have two kids; family comes first and sleep deficit is my constant companion), so excuse me in advance for any imprecisions or inconsistencies.
So, to get back to it, I've spent most of the summer finalizing various projects, among which was one I had been longing to see happen for a year and a half: migrating our infrastructure from ESX to ESXi. This is now Mission Accomplished, but the road wasn't without bumps. Incidentally, the major showstopper wasn't the VMware hypervisor migration itself, but our reliance on a third-party product sold by HP as Storage Mirroring Recovery for Virtual Infrastructure (a rebranded Doubletake product), which was developed in the ESX 3.5 era and relied on service console-based ESX hypervisors to perform its virtual machine replication duties by creating, scp-ing and committing snapshots. This, plus a legendary dose of bureaucracy, kept things on hold for nearly a year.
After I passed my VCP, I prepared the whole plan for upgrading our infrastructure. Submitting my resignation letter at the end of June sped things up on the administrative side, and I was able to convince my colleagues that I would take care of the upgrade alone. I will not speak of the Veeam implementation here, as our management hired a consultant (incidentally the VMware Certified Instructor who trained me on the ICM5 course, pure luck!) to carry out this task.
Our environment
- 1 vCenter server running as a VM w/ SQL Express 2005
- 3 ESX 4.0.0 hosts, 2 in a cluster in our Production site, 1 in DR site
- EMC Clariion Array in Production, Direct attached storage in DR
Challenges
- no downtime for running virtual machines in our production cluster
- migrate the ESX hosts to ESXi without having to rebuild them
- ensure no data is lost
- maintain the vCenter configuration (HA, DRS..)
- reconfigure multipathing on our storage array and on the ESXi hosts
- remote migration of our ESX host in DR site
There was only one problem with the whole upgrade, and a noteworthy one. Due to the small size of our environment (initially we were running on VMware Infrastructure 3.5 Foundation), the vCenter server database had been provisioned on Microsoft SQL Express 2005. When running the vCenter 5.0 U1 installation, I kept getting an error related to the database. The install logs showed that setup wasn't able to allocate free pages during the DB structure upgrade, meaning that the DB was full. A search of the VMware KB turned up KB1025914, which explains how to deal with this issue and free up space. There are MSSQL scripts at the end of the KB article that can be downloaded and run in SQL Server Management Studio Express. Unfortunately, purging data didn't free up enough space, and shrinking the DB didn't bring much success either. Because I had limited time and our vCenter configuration is fairly simple, I decided to skip the advanced troubleshooting options detailed in the KB, took a DB backup and reconfigured everything on a fresh DB.
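As a side note for anyone hitting the same wall: before launching the upgrade you can check how close the vCenter DB is to the 4 GB per-database cap of SQL Server 2005 Express with a trivial query. Here's an illustrative sketch in Python with pyodbc (any SQL client works just as well); the instance and database names below are the usual vCenter defaults, so adjust them to your own setup:

```python
# Illustrative sketch only: report how large the vCenter database files are,
# since SQL Server 2005 Express caps each database at 4 GB of data.
# SQLEXP_VIM / VIM_VCDB are the typical vCenter defaults; adjust as needed.
import pyodbc

conn = pyodbc.connect(
    r"DRIVER={SQL Server};SERVER=.\SQLEXP_VIM;"
    r"DATABASE=VIM_VCDB;Trusted_Connection=yes;"
)
cursor = conn.cursor()

# sys.database_files reports sizes in 8 KB pages
cursor.execute("SELECT name, type_desc, size * 8 / 1024 AS size_mb "
               "FROM sys.database_files")
for name, type_desc, size_mb in cursor.fetchall():
    print(f"{name} ({type_desc}): {size_mb} MB")
```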
Ten minutes later, the whole environment was reconfigured as it was before. To avoid memory overcommitment, I powered off the test VMs and low-tier production VMs to make sure all critical VMs could run properly on a single host in the production cluster while I migrated the other one.
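If you have more than a handful of VMs to power off, this step scripts nicely against the vSphere API. Here's a rough, untested sketch using pyVmomi; the vCenter address, credentials and VM names are made-up placeholders:

```python
# Rough sketch: power off a list of non-critical VMs before the migration
# window. All names and credentials are placeholders, not my real environment.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

LOW_TIER = {"test-web01", "test-db01", "lowtier-app01"}  # hypothetical VM names

si = SmartConnect(host="vcenter.example.local", user="administrator",
                  pwd="********",
                  sslContext=ssl._create_unverified_context())
content = si.RetrieveContent()
vm_view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)

for vm in vm_view.view:
    if vm.name in LOW_TIER and vm.runtime.powerState == "poweredOn":
        print(f"Powering off {vm.name}")
        # PowerOffVM_Task() is a hard power-off; ShutdownGuest() is the
        # cleaner option if VMware Tools is running in the guest.
        vm.PowerOffVM_Task()

Disconnect(si)
```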
As one has come to expect with VMware products, everything went on without a hitch. I successively put each host in maintenance mode, powered it off, disconnected the FC adapters to guard against any involuntary data loss (business would be happy…) and finally booted the ESXi 5 install ISO. A few minutes later, the first host was successfully migrated. I made sure the service console port group had been migrated to a management network and enabled the vMotion checkbox. I then shut the host down, reconnected the FC adapters and booted it up again; everything showed up perfectly, so I repeated the process for the second host, vMotioning everything back onto the freshly migrated ESXi 5 host.
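For those who prefer scripting to clicking, the maintenance mode step and the post-install vmkernel check can also be driven through the vSphere API. Another hedged pyVmomi sketch, reusing the connection from the previous snippet (the host name is again a placeholder):

```python
# Sketch: put a host into maintenance mode before the reinstall, then after
# the ESXi 5 install verify that management and vMotion vmkernel interfaces
# exist. Reuses `content` from the previous snippet; the host name is made up.
from pyVim.task import WaitForTask
from pyVmomi import vim

host_view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.HostSystem], True)
host = next(h for h in host_view.view if h.name == "esx01.example.local")

# Evacuate the host and wait for maintenance mode before powering it off
WaitForTask(host.EnterMaintenanceMode_Task(timeout=600))

# ... reinstall as ESXi 5, reconnect, then verify the vmkernel configuration:
nic_mgr = host.configManager.virtualNicManager
for nic_type in ("management", "vmotion"):
    cfg = nic_mgr.QueryNetConfig(nic_type)
    vmks = cfg.selectedVnic if cfg else []
    print(f"{nic_type}: {'OK' if vmks else 'NOT CONFIGURED'}")
```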
Once both hosts were upgraded, I re-enabled DRS and selected the most aggressive setting to rebalance the cluster while powering the low-tier and test VMs back on. In my environment, VMs do not compete for RAM or CPU cycles, so the default DRS threshold does not automatically rebalance the cluster since it sees no valid reason to improve performance.
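The same DRS change can be made through the API as well. One more illustrative pyVmomi sketch; the cluster name is a placeholder, and keep in mind that the API's vmotionRate scale (1 to 5) is, as far as I recall, inverted compared to the UI slider, so double-check before applying it anywhere:

```python
# Sketch: re-enable DRS in fully automated mode on the production cluster and
# push the migration threshold towards the aggressive end. Cluster name is a
# placeholder; `content` comes from the earlier connection snippet.
from pyVmomi import vim

cl_view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.ClusterComputeResource], True)
cluster = next(c for c in cl_view.view if c.name == "PROD-Cluster")

spec = vim.cluster.ConfigSpecEx(
    drsConfig=vim.cluster.DrsConfigInfo(
        enabled=True,
        defaultVmBehavior=vim.cluster.DrsConfigInfo.DrsBehavior.fullyAutomated,
        # Migration threshold, 1..5: in the API, 1 appears to be the
        # aggressive end (opposite of the UI slider); verify in the API docs.
        vmotionRate=1,
    )
)
cluster.ReconfigureComputeResource_Task(spec, modify=True)
```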
Remote migration of our DR site ESX host through Integrated Lights-Out also went flawlessly… almost 🙂 If you use the HP iLO2 Advanced Remote Control with the browser plugin, you already know that pressing F11 does not send the keystroke to the ESXi installer but only maximizes your browser window. Just launch the Java version of the console instead and the problem goes away!
The last step was to reconfigure the multipathing behavior on our array and on the ESXi hosts attached to it. EMC recommends using ALUA (failover mode 4) for Clariion CX4-120 arrays running FLARE 26 and above, with ESXi 5 hosts configured to use Round Robin. Using the failover wizard in Navisphere, I moved both hosts from failover mode 1 to failover mode 4 (ALUA), then configured all LUNs to use Round Robin instead of what we had been using since the prehistoric era of ESX 3.5 (failover mode 1 on the array and MRU on the hosts).
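With a lot of LUNs, flipping the path selection policy one by one in the vSphere Client gets old fast. Here's one last pyVmomi sketch, written from memory and untested, that switches every Clariion LUN on a host to Round Robin; double-check the SetMultipathLunPolicy() call against the API reference before running anything like this:

```python
# Sketch: switch the path selection policy to Round Robin (VMW_PSP_RR) for
# all Clariion LUNs on one host. Untested and from memory; verify against the
# HostStorageSystem.SetMultipathLunPolicy() documentation before using it.
from pyVmomi import vim

storage = host.configManager.storageSystem  # `host` from the earlier snippet

# Map ScsiLun keys to LUN objects so we can filter on the vendor string
scsi_luns = {lun.key: lun for lun in storage.storageDeviceInfo.scsiLun}

for mp_lun in storage.storageDeviceInfo.multipathInfo.lun:
    lun = scsi_luns.get(mp_lun.lun)
    # Clariion arrays report the SCSI vendor string "DGC"
    if lun is not None and lun.vendor.strip() == "DGC":
        print(f"{lun.canonicalName}: {mp_lun.policy.policy} -> VMW_PSP_RR")
        policy = vim.host.MultipathInfo.LogicalUnitPolicy(policy="VMW_PSP_RR")
        storage.SetMultipathLunPolicy(lunId=mp_lun.id, policy=policy)
```

Note that this only covers the host side; the failover mode change on the array itself still goes through Navisphere, as described above.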
I've left the fun of upgrading VMware Tools and the virtual hardware version from 7 to 8 to my colleagues, and since I'm a good guy I've even installed the latest Update Manager to make the job easier for them.
Overall the migration went flawlessly and was very successful, except for the vCenter Server DB issue. Looking back, the only drawback was having to start over with a fresh DB, but considering our tiny environment it would have been counter-productive to spend more time troubleshooting it. The job was carried out over a weekend, there was no failure or service interruption of any kind, and as usual no users noticed any change when they came to the office on Monday.
Feel free to share your comments about this article and contact me on twitter!