During Tech Field Day Extra in Berlin last week (at Cisco Live Europe) we had a fantastic and unique opportunity to meet with delegates working in various fields. We all know some general concepts, but we are so deep into our own areas of specialization that we tend to forget that other individuals (who are also highly technical) may not fully understand what we are talking about.
Since I’ve spent most of my time lately working with hyper-converged solutions, I thought I would write a post about them.
It all begins with convergence
Hyper-convergence derives from convergence. The initial idea was to converge various technologies and integrate them into a single solution. Before converged solutions, customers would generally buy off-the-shelf, « best of breed » dedicated solutions: compute (read x86 servers) from one vendor, storage from another vendor, network switches from yet another vendor, and whatever else (backup solution, Fibre Channel switches, etc.) from the vendor that fit them best. A more or less lengthy period of research was usually necessary to verify interoperability between these components, configure them, and make them work together. Some vendors thus came up with the idea of converging these items into a single stack, integrating the various parts of a solution into a form factor where integration and testing between the components had been thoroughly performed in order to maximize performance and eliminate potential compatibility issues.
The emergence of so-called converged switches also played a prominent role in this. Converged switches provide ports that can be configured as either 10 Gbps Ethernet, 8 Gbps Fibre Channel or 10 Gbps Fibre Channel over Ethernet. This convergence at the connectivity and networking levels allowed for easier integration of the components into a single solution, and thus were born integrated solutions such as vBlock, FlexPod, VSPEX and the likes. In addition, efforts were made to reduce the network device footprint by virtualizing network switches (as virtual appliances).
Real integration?
When we speak about integrated solutions, we have to take this with a grain of salt. First of all, the convergence we see lies mainly in the fact that Fibre Channel connectivity and network connectivity can share a ‘single’ hardware fabric instead of requiring separate network and FC switches. We still have to deal with separate physical components for compute power (x86 servers equipped with CPU and RAM), network switches (for management purposes), converged switches (for interconnecting storage and compute) and finally storage arrays; we could even extend the list to dedicated backup appliances. The integration therefore isn’t physical so much as it is logical. And by logical we don’t mean a single unit or logical entity, but rather a predetermined configuration (or bill of materials) that has been thoroughly tested and approved beforehand. Any untested or unapproved deviation from the predetermined configuration might work, or it might cause incompatibilities.
We must also distinguish real converged infrastructures, where the « building blocks » conform to a rigorous bill of materials sold under a single SKU and supported by a single vendor, from reference architectures, where the vendors engaged in the solution publish a recommended BoM but the customer needs to purchase the items separately and may assemble them with or without the help of a vendor. Deviations from the recommended reference architecture, although potentially unsupported, are possible. On a converged architecture, by contrast, such deviations are not permitted and are subject to review by the vendor.
The challenge of convergence
Are there any limitations with convergence? It may be a brave or foolish statement, but converged solutions work. When properly engineered and deployed, there is no reason they wouldn’t, provided the solution was tailored end to end to fit the customer’s needs and proper sizing was performed, including the much-ignored aspect of IOps (I/O operations per second). The challenges with converged solutions include, among others, the time needed to build and deploy them, as well as their complexity. Generally speaking, converged systems are scalable, but only within certain boundaries determined by the model you have chosen. Due to their size and environmental requirements (weight, cooling, power, rack space), converged solutions are generally built in advance to fit a given need, with extra headroom factored in from day one. This is acceptable for a majority of customers. Is it possible, however, to rely on converged solutions when a given customer is, as we say in the industry, « operating at scale »?
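To give an idea of what that IOps sizing exercise looks like, here is a back-of-the-envelope sketch in Python. Every figure in it (VM count, IOps per VM, read/write ratio, write penalty, per-drive IOps) is an assumption picked for illustration, not a vendor number.

```python
# Back-of-the-envelope IOps sizing -- all figures below are assumptions
# chosen for illustration, not vendor numbers.

VM_COUNT = 200      # assumed number of virtual machines
IOPS_PER_VM = 50    # assumed average IOps generated per VM
READ_RATIO = 0.7    # assumed 70% reads / 30% writes
WRITE_PENALTY = 2   # e.g. RAID 10: each write hits two drives
DRIVE_IOPS = 140    # assumed IOps a single 10k RPM SAS drive can sustain

front_end_iops = VM_COUNT * IOPS_PER_VM
back_end_iops = (front_end_iops * READ_RATIO
                 + front_end_iops * (1 - READ_RATIO) * WRITE_PENALTY)

print(f"Front-end IOps: {front_end_iops:.0f}")
print(f"Back-end IOps:  {back_end_iops:.0f}")
print(f"Drives needed:  {back_end_iops / DRIVE_IOPS:.0f}")
```

With these assumed numbers, roughly 13,000 back-end IOps translate into more than 90 spinning drives; skipping this step is how undersized solutions end up in production.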
Different customers, different approaches
First of all, what does it mean to « operate at scale »? This term was more or less invented to designate industry operators who operate at a massive computing scale (think of Facebook, Google, Amazon and the likes) and who incessantly need to expand their compute and storage capacity. This massive need forces them into a tight operating model where compute and storage capacity must be added almost constantly. The meticulous work of sizing, planning, building to spec and delivering doesn’t fit this model. Instead, these operators rely on commodity hardware that is readily available off the shelf – and lately they have grown so large that they likely even have x86 hardware built to their own specifications. The advantage for them is that this commodity hardware can be leveraged nearly anywhere.
To operate this infrastructure at scale, these operators rely on various techniques: virtualization, but also distributed applications built to remain fault tolerant even when multiple nodes fail. Besides compute, they also need to factor in storage requirements. Here again they could go for massive, high-end storage arrays delivering petabytes of capacity, but those are prone to limitations such as the number of ports to which the large fleet of x86 servers can connect, not to mention the time needed to size, order, build, install and put such arrays into operational use. To tackle this challenge, these gigantic industry operators developed in-house distributed filesystems built to support the shift from large, dedicated storage arrays to a model where a massive number of x86 servers with locally attached storage have their entire storage capacity pooled and aggregated, with data distribution and fault tolerance mechanisms on top. The Google File System (GFS) is one of these implementations, although specific to Google’s own needs.
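To illustrate the principle only (this is not how GFS or any specific vendor implements it), here is a minimal Python sketch of how such a filesystem might spread data blocks over a pool of nodes with locally attached storage, storing each block on several nodes so the loss of one node doesn’t mean the loss of data:

```python
# Minimal sketch of block placement in a distributed filesystem.
# Illustrative only: the node names and the hash-based placement rule
# are assumptions, not GFS's or any vendor's actual algorithm.

import hashlib

NODES = ["node-01", "node-02", "node-03", "node-04"]  # assumed 4-node pool
REPLICATION_FACTOR = 3  # each block is kept on 3 distinct nodes

def place_block(block_id: str) -> list:
    """Pick REPLICATION_FACTOR distinct nodes for a block, based on its hash."""
    start = int(hashlib.md5(block_id.encode()).hexdigest(), 16) % len(NODES)
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICATION_FACTOR)]

# A file is split into blocks; each block lands on several nodes.
for block in ("fileA-blk0", "fileA-blk1", "fileA-blk2"):
    print(block, "->", place_block(block))
```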
Hyper-convergence expands to the enterprise world
With this experience in mind, engineers coming from Google, Facebook and the likes saw an opportunity to bring it to the land of common mortals, i.e. the enterprise world. The commoditization of general-purpose x86 servers and the relatively « cheap » cost of SSD and HDD drives made an ideal platform on which to build a distributed filesystem controlled by a software solution. From this concept, several solutions built by different vendors have emerged. While the features and implementation of the filesystem (and the capabilities of the solution) differ, most of the available hyper-converged solutions can start with as few as three nodes, and all of them promise linear scalability and seamless expansion of compute/storage capacity without incurring any downtime for the infrastructure and the apps running on it.
Why hyper-convergence makes sense
An advantage of hyper-converged solutions is the reduced footprint in your environment. Thanks to the integration of storage and compute within an industry-standard x86 1RU or 2RU chassis, you are left with plenty of space in a standard 42 RU rack for further expansion. This compact footprint plays a role not only in rack space but also in other factors such as weight, power and cooling. Another advantage is the ability to « pay as you grow » instead of planning upfront for your estimated storage (and compute) capacity over the next N years. While it’s true that a storage array is expandable, you are asked to choose a model upfront, and the model you select may constrain your expansion capabilities. Take a large model and you may end up paying too much and underutilizing it. Take a smaller model and you may hit a hard limit on the number of shelves (and types of drives) you can add. To me, the hyper-converged trend makes sense, and regardless of the vendor you go for (and of the way each vendor solves the data distribution, data locality and data protection challenges in their own hyper-converged solution), the trend is here to stay.
What you should pay attention to with hyper-converged solutions
While hyper-converged infrastructures certainly reduce the footprint on the storage side, it is worth remembering that the burden converged solutions carried in terms of connectivity (be it 10 Gb Ethernet, 10 Gb FCoE, 8/16 Gb FC or whatever else we’ll get soon) does not fully disappear. While hyper-converged solutions seamlessly integrate compute and storage, they relegate the connectivity aspects to the customer, who has to provide their own network backplane. Some vendors claim that 1 Gbps is enough for smaller sites; that may be true, but for production workloads you should consider 10 Gbps connectivity to ensure you do not hit a bottleneck with either the VM traffic or the replication traffic between nodes. And whenever you say 10 Gbps, that may mean costly switch upgrades in your data center(s).
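A quick back-of-the-envelope calculation shows why 1 Gbps can get tight once replication traffic is added on top of VM traffic. The per-node figures below are assumptions for illustration only:

```python
# Rough sketch: why 1 Gbps can become a bottleneck once node-to-node
# replication traffic is added to VM traffic. All figures are assumptions
# chosen for illustration.

LINK_GBPS = 1.0         # assumed 1 Gbps uplink per node
VM_TRAFFIC_GBPS = 0.3   # assumed steady-state VM traffic per node

# Every write a VM issues is also sent to one or more peer nodes,
# so sustained write throughput is multiplied on the wire.
WRITE_MB_PER_S = 60     # assumed sustained writes per node (MB/s)
REMOTE_COPIES = 2       # assumed: each write is shipped to 2 remote copies

replication_gbps = WRITE_MB_PER_S * 8 / 1000 * REMOTE_COPIES
total_gbps = VM_TRAFFIC_GBPS + replication_gbps

print(f"Replication traffic: {replication_gbps:.2f} Gbps")
print(f"Total per node:      {total_gbps:.2f} Gbps on a {LINK_GBPS} Gbps link")
print("Bottleneck!" if total_gbps > LINK_GBPS * 0.8 else "Headroom remains")
```

With these modest assumed numbers the node already exceeds its 1 Gbps link, which is exactly why 10 Gbps is the safer bet for production.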
With 10 Gbps, you also have to factor in the cost of optical connectivity, SFP+ modules, etc. This is tempered by the fact that in some cases you can leverage 10GBase-T (copper) connectivity, and in others you can use Twinax cables, which provide SFP+ connectivity over a copper cable. Twinax cables are perfect for short-range connectivity up to 10 meters (passive cables up to 5 m, active cables up to 10 m). If you intend to leverage multiple 10 Gb ports per node, Twinax may come in handy. There is also the new standard proposed by the NBase-T alliance (I wrote about it in my TFDx posts), but it will be a year until that standard is fully available. However, some Intel network chips already support NBase-T (the ability to run 2.5 Gbps or 5 Gbps over existing copper infrastructure), and a pair of Catalyst 3850 switches might turn out cheaper than a pair of 10 GbE switches. Here again, your mileage may vary based on the network vendor you go with. There are documented cases where a leaf-and-spine architecture was used to deploy a prominent hyper-converged solution, but I will not venture further, as networking is akin to (black) magic to me.
Another topic you have to factor in with your hyper-converged solution is your requirements in terms of fault tolerance (replication factor, failures to tolerate) and, for some solutions, the way you will build and place your nodes (i.e. what constitutes a fault domain). Some solutions allow erasure coding on mixed (SSD+HDD) systems, others on all-SSD only.
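To make the capacity impact of these choices concrete, here is a rough Python sketch comparing usable capacity under a replication factor of 2 or 3 versus a 4+2 erasure coding scheme. The raw capacity per node and the node count are assumptions for illustration, not any vendor’s actual overhead:

```python
# Rough sketch of usable capacity under different protection schemes.
# Figures are illustrative assumptions, not any vendor's actual overhead.

RAW_TB_PER_NODE = 10
NODES = 8
raw_total = RAW_TB_PER_NODE * NODES

def usable_with_replication(raw_tb: float, replication_factor: int) -> float:
    """Every block is stored replication_factor times."""
    return raw_tb / replication_factor

def usable_with_erasure_coding(raw_tb: float, data: int, parity: int) -> float:
    """E.g. 4+2 erasure coding stores 4 data fragments plus 2 parity fragments."""
    return raw_tb * data / (data + parity)

print(f"Raw capacity:         {raw_total} TB")
print(f"Replication factor 2: {usable_with_replication(raw_total, 2):.1f} TB usable")
print(f"Replication factor 3: {usable_with_replication(raw_total, 3):.1f} TB usable")
print(f"Erasure coding 4+2:   {usable_with_erasure_coding(raw_total, 4, 2):.1f} TB usable")
```

The point is simply that the protection scheme you pick directly drives how much of the raw capacity you actually get to use, so it belongs in the sizing discussion from day one.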
Are hyper-converged solutions the only valid answer for data centers?
Obviously not. Hyper-converged solutions are excellent at providing scalable compute and storage: they may be a good building block to start with, but you may be faced with use cases where common sense dictates that you should rather look at a dedicated storage solution. You may already have more than enough compute capacity and be looking only at expanding storage. You may require a storage solution with capabilities that are not available in a hyper-converged solution. You may also be a customer with asymmetric needs: your storage consumption far exceeds your compute requirements. Or you may need to manage a multi-tenant environment where you must not only clearly segregate each tenant’s data, but also ensure that each tenant gets what they pay for, not just in terms of compute and storage capacity, but also in terms of allocated/consumed IOps. I believe that hyper-converged is an elegant solution to address many use cases, but not all of them.
Final thoughts
Hyper-convergence is the latest hype. Every solution has its pluses and minuses, and employees from every vendor are very active in spreading their gospel, which unfortunately means that you’ll hear claims from all involved sides, some legitimate, some not. As with any solution you should carefully weigh the pros and the cons, and you should avoid buying into FUD (Fear, Uncertainty and Doubt). Steer clear of benchmarks, or rather consider them as just one factor among others: do you purchase your car solely based on the time needed to go from 0 to 100 km/h, or do you check other attributes as well? Test the solution yourself or request a PoC. Look not only at the fancy features, but also at the manageability and security aspects of the solution.