“Size doesn’t matter” is not something I like to hear in my data center environment, at least when referring to virtual machines. Size does matter, and what matters most is for a VM to be right-sized. This article covers oversized VMs and proper compute resource allocation for virtual machines. We will not cover the so-called “noisy neighbours”, although a bloated VM can sometimes act as one. For the sake of easy understanding, and also because I’m a noob suffering from imposter syndrome, I will speak only of compute resources, i.e. CPU and RAM.
Why are oversized VMs a problem?
Oversized VMs are the plague of data centers because they are allocated far more resources than they need or consume. In doing so, they leave fewer resources available for regular VMs and may cause illegitimate overprovisioning in your environment. The problem is that this not only affects the performance of the VM itself, but may also affect the performance of its neighbours and cause overspending. And the larger your data center, the higher the chances that many oversized VMs are lurking around.
Let’s imagine a hypothetical environment where an oversized VM is created on a host where other VMs are already running. Now let’s imagine that we are at a busy time of the year (financial closure) and all systems are running at maximum capacity, which leads to increased RAM/CPU consumption. Because of the oversized VM, the host is unable to hand out the needed resources and its contention mechanisms kick in. Contention occurs when compute resources are no longer available (i.e. no more free CPU or free RAM).
Now let’s imagine our oversized VM runs lazily, with real usage that is only a fraction of what was allocated. Since the compute resources are allocated anyway, the hypervisor has to rely on reclamation mechanisms whenever a contention situation is encountered. All the other VMs hardly consume RAM except in the end-of-year scenario above. In our case, to give more RAM to the other VMs, ballooning will be triggered: the hypervisor leverages VMware Tools on the oversized VM to reclaim its allocated but unused RAM and present it to the other VMs competing for resources. Resource requests will be satisfied, but a performance penalty will be incurred because the reclaimed RAM is not directly consumable.
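If you want to see whether ballooning is actually happening, the vSphere API exposes it per VM. Below is a minimal sketch using pyVmomi (the open-source vSphere Python SDK); the vCenter address and credentials are hypothetical placeholders, and skipping certificate validation is for lab use only. The balloonedMemory quick-stat reports, in MB, how much guest RAM the balloon driver currently holds.

```python
# Minimal sketch, assuming pyVmomi is installed and you have read access
# to vCenter. Host name and credentials below are made-up placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

context = ssl._create_unverified_context()  # lab use only: skips cert checks
si = SmartConnect(host="vcenter.example.local",
                  user="readonly@vsphere.local",
                  pwd="********",
                  sslContext=context)

content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)

for vm in view.view:
    qs = vm.summary.quickStats
    # balloonedMemory is in MB; any non-zero value means the hypervisor
    # has had to reclaim guest RAM through the balloon driver
    if qs.balloonedMemory:
        print(f"{vm.name}: {qs.balloonedMemory} MB currently ballooned")

Disconnect(si)
```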
In terms of CPU scheduling, it can also get very ugly. Performance issues arise from the fact that virtualization is not properly understood. If a VM has 2 vCPUs, the hypervisor scheduler will look for a moment when 2 physical cores are free and let the VM’s instructions execute on those 2 cores. With 4 vCPUs, the scheduler has to find 4 available cores at once (strictly speaking, modern ESXi uses relaxed co-scheduling, but vCPUs that drift too far apart still get co-stopped). The more VMs you have, and the more vCPUs each of them has, the harder it is for the scheduler to find a window where the required number of physical cores is available. If a 4-vCPU VM can only get 3 of its 4 cores scheduled, a co-stop occurs and the VM waits until all 4 physical cores are available so the instructions of all 4 vCPUs can execute. In practice this leads to poor performance and indicators showing that the VM is not doing anything, because it is in fact waiting! In terms of best practices, we try to limit the number of VMs with more than 8 vCPUs, but your mileage may vary.
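To put a number on co-stop: vSphere exposes the cpu.costop.summation counter, i.e. the milliseconds a VM spent co-stopped during each 20-second real-time sample. Turning that into a percentage is simple arithmetic, sketched below. The sample values are made up, and the ~3% threshold is a commonly cited rule of thumb, not an official VMware limit.

```python
# Back-of-the-envelope conversion of a raw co-stop counter to a percentage.
# Sample values are hypothetical; adjust the threshold to your environment.

def costop_percent(costop_ms: float, interval_s: int, num_vcpu: int) -> float:
    """Co-stop time as a percentage of total vCPU time in the sample.

    costop_ms  -- summed co-stop milliseconds over the sampling interval
                  (e.g. vSphere's cpu.costop.summation counter)
    interval_s -- sampling interval length in seconds (20s for real-time stats)
    num_vcpu   -- number of vCPUs configured on the VM
    """
    return 100.0 * costop_ms / (interval_s * 1000.0 * num_vcpu)

# Hypothetical 4-vCPU VM that spent 4800 ms co-stopped in a 20 s sample:
pct = costop_percent(costop_ms=4800, interval_s=20, num_vcpu=4)
print(f"%CSTP = {pct:.1f}%")  # 6.0% -- well above the ~3% comfort zone
```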
How did we get there?
Why do we have oversized VMs? Should we blame it on the user? Yes and No. Everybody’s responsible.
First of all, we have to put the blame on ourselves (infrastructure folks) for not having the proper controls and processes in place to detect and review these incoming troublemakers. We can also blame our predecessors for not having done the hard work, but that will not help in any way. If you are new to an environment that you must support in production, it’s worth engaging all the necessary teams to understand how VM creation requests are submitted, processed and approved. This will save you trouble down the line.
Secondly, we have to put the blame on many application vendors for still providing, in 2016, configuration prerequisites dating from the all-physical era. Do we need to keep reminding them that nearly every major corporation has had a virtualize-first policy in place since at least 2008-2010, and that at least 75% of workloads are virtualized? It’s time for them to wake up and realize what world we live in. Let’s not forget application/system integrators who do not perform their due diligence when they are involved in an application’s design.
Thirdly, application engineers and application architects should also be blamed for a lack of critical thinking and for blindly accepting any requirements the vendors send them. One of the very first questions they should ask is whether these specifications are meant for physical environments or for virtualized environments.
Finally, to avoid public shaming, finger pointing and eventually enterprise politics, the best course of action is to involve the infrastructure teams upfront for advice in any project, from week 1 if not from day 1.
How to address oversized VMs
The process of assigning optimal resources to a VM is called right-sizing, which is nothing more than proper allocation of resources based on VM requirements and effective consumption. Right-sizing a VM requires an analysis of the VM’s load and its consumption of compute resources. If you use vCOps or vROps in your environment, you should be able to leverage existing reports or build your own. VMs with many vCPUs, low CPU usage and a high number of co-stops are generally candidates for right-sizing. Similarly, VMs with low memory usage and a large amount of allocated but untouched memory are also candidates. In both cases, however, take care to look not only at immediate data but also at historical data: some VMs may sit idle most of the day, then peak for 2-3 hours at night when data batches are received and processed. Caution and good judgement must be used before committing to any decision.
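As a back-of-the-envelope illustration, here is a sketch that works over historical samples rather than a single snapshot, so the nightly batch peaks described above are not averaged away. It assumes you have exported per-VM metrics (e.g. from vROps) to a CSV; the column names, the 95th percentile and the 25% headroom are all assumptions to adapt to your own environment.

```python
# Minimal right-sizing report from exported historical metrics.
# File name, column names, percentile and headroom are hypothetical.
import csv
from collections import defaultdict

samples = defaultdict(list)  # vm name -> list of (cpu_pct, mem_mb) samples
with open("vm_metrics.csv", newline="") as f:
    for row in csv.DictReader(f):
        samples[row["vm"]].append(
            (float(row["cpu_usage_pct"]), float(row["mem_active_mb"])))

def percentile(values, p):
    """Nearest-rank percentile; enough for a back-of-the-envelope report."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(p / 100 * len(ordered)))]

HEADROOM = 1.25  # keep 25% above the observed 95th percentile

for vm, data in samples.items():
    cpu95 = percentile([c for c, _ in data], 95)
    mem95 = percentile([m for _, m in data], 95)
    # The 95th percentile over weeks of data keeps the nightly batch
    # peaks in the picture instead of averaging them away.
    print(f"{vm}: size for ~{cpu95:.0f}% CPU and ~{mem95 * HEADROOM:.0f} MB RAM")
```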
However, the biggest problem in right-sizing a VM isn’t the analysis and data correlation. It’s convincing a customer that it needs to be done, that it makes sense, and that they won’t be penalized. It’s never easy to go to a customer and explain that their VM, which was provisioned one or two years ago and went through the provisioning and approval process without a single comment or concern, is now a problematic VM. You are now responsible for the mistakes or negligence of your predecessors, and there’s really nothing to do but take the blame and move on.
You are likely to face fierce resistance from a customer if they are not experiencing any performance issues with their bloated VM. If they have been experiencing problems, however, you can, after some resistance, get down to the hard facts. And the cold, hard facts can be that the VM is slow and not doing anything because it is hitting too many co-stops.
Preemptive strikes
Mature organizations may want to have a VM right-sizing / recertification process in place to ensure that unused resources are reclaimed. It also makes sense to tie resource consumption to proper chargeback mechanisms. Penalties for oversized VMs might be envisioned, but that may be subject to internal politics.
If you want peace, prepare for war. If you are lucky enough to work in an environment where automation and self-service provisioning are in place (or at least piloted or planned), creating right-sized templates will significantly decrease the provisioning of oversized VMs. A common approach is a small catalog of T-shirt sizes, as sketched below.
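For illustration, a right-sized catalog can be as simple as a few T-shirt sizes enforced at request time. The sketch below is hypothetical: the sizes, the limits and the review rule are placeholders for whatever standards your organization agrees on.

```python
# Minimal sketch of a T-shirt-size catalog for self-service provisioning.
# All sizes and limits are hypothetical; align them with your standards.
from dataclasses import dataclass

@dataclass(frozen=True)
class VMSize:
    vcpu: int
    ram_gb: int

CATALOG = {
    "S": VMSize(vcpu=1, ram_gb=4),
    "M": VMSize(vcpu=2, ram_gb=8),
    "L": VMSize(vcpu=4, ram_gb=16),
    # anything larger than "L" falls outside self-service and requires
    # an infrastructure review before it is provisioned
}

def request_vm(size: str) -> VMSize:
    if size not in CATALOG:
        raise ValueError(f"Size {size!r} requires an infrastructure review")
    return CATALOG[size]

print(request_vm("M"))  # VMSize(vcpu=2, ram_gb=8)
```

The point of the hard cap is less the exact numbers than the workflow: defaults are right-sized, and anything bigger forces a conversation with the infrastructure team before provisioning, not after.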
Why it’s important
It’s important to reduce the footprint of oversized VMs because you will achieve at least the first two things in the list below:
- reducing the pressure in overpopulated clusters, which will in turn reduce potential contention situations, which will in turn reduce potential tickets and unhappy customers
- that unique peace-of-mind moment when we infrastructure people know we have done our work properly and have had the courage to do the right thing
- by reducing pressure in overpopulated clusters you can also, directly or indirectly, create a cost-avoidance scenario, by eliminating the need to invest in additional compute or by further delaying that investment
Final comments
Our data centers are living organisms populated by treacherous and unforgiving pests. Not only do we have to deal with VM sprawl, but also with oversized-VM sprawl. Contain the sprawl by putting control mechanisms and review processes for oversized VMs in place. Treat the infected organism by identifying, analyzing and applying the cure. Wash, rinse, repeat. Or better, see whether implementing automation and self-service provisioning isn’t the more durable solution.