2019 marks my 4th attendance at Cisco Live Europe. It also marks my 4th coverage of Cisco HyperFlex. I’ve been critical of HyperFlex in the past – but, in my defense, also honest. One would think: if this guy cares that little about a technology, why write about it? Well, the folks at Cisco want the folks at Tech Field Day to bring folks like me to these events to get our feedback and our opinion. If you want the full genesis of my HyperFlex coverage, check out my Cisco Live 2016, Cisco Live 2017 and Cisco Live 2018 articles. They will give you the background context of where I come from when sharing my thoughts about this solution.
What’s new in 2019?
Exactly one year ago, HyperFlex 3.0 was announced at CLEUR18. It was the first version that could be considered production-ready. Fast-forward to January 2019, and we learn that HyperFlex has been augmented with the following features:
- Manageability: One-click upgrades were added, so that users no longer need to upgrade hypervisor, firmware and the HCI “controller VM” separately
- Capacity: new LFF nodes allow for greater capacity; thanks to this, HX can now be used as a Veeam backup target
- Security: publication of DISA STIGs, support for VMware Lockdown Mode, as well as a new Tech Support mode
- Containers: support for Kubernetes deployments
- Critical workloads: SAP HANA certification was achieved; new HX GPU nodes are also available to support AI/ML workloads
- Resiliency: cluster split-brain protection mechanisms have been implemented, leveraging Cisco Intersight (the management single pane of glass for HyperFlex)
On top of this, Cisco developed a new optional hardware component called “HyperFlex Acceleration Engine”. Let’s talk about it in a dedicated section.
HyperFlex Acceleration Engine
To improve the performance of HyperFlex, Cisco thought it would be awesome to leverage its expertise with ASICs/FPGAs. As a result, they have come up with an optional acceleration card for HyperFlex systems.
This is truly a novel idea and surely nobody must’ve thought about it before.
What is this card about? It’s an optional purpose-built PCIe card that supposedly “improves the TCO of the solution” by offloading certain software operations and thus reducing the CPU load. According to Cisco, it currently helps reduce latency and improves compression and deduplication ratios, while saving CPU cycles. Cisco also envisions that in the future it could be used for erasure coding, crypto-hashing activities, and so on.
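To make the offload argument concrete, here is a minimal sketch of the kind of work such a card would take over: inline compression done in software costs measurable CPU time. The payload and compression level are purely illustrative assumptions, not anything HyperFlex-specific.

```python
import time
import zlib

# Hypothetical illustration: inline compression (one of the operations an
# offload card would absorb) consumes real CPU cycles when done in software.
data = b"hyperflex block payload " * 4096  # ~96 KiB of compressible data

start = time.perf_counter()
compressed = zlib.compress(data, level=6)
elapsed = time.perf_counter() - start

ratio = len(data) / len(compressed)
print(f"compressed {len(data)} -> {len(compressed)} bytes "
      f"(ratio {ratio:.1f}x) in {elapsed * 1000:.2f} ms")
```

Scale that per-block cost across every write in a busy cluster and the CPU bill becomes visible – which is exactly the cost the next section argues is already cheap enough in software.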
Now, let me take a moment to deconstruct this and put things into perspective, no matter how unpleasant it may sound.
HyperFlex runs on the Intel x86 platform, like 99% of the other HCI solutions (Nutanix on IBM Power aside, but that’s another story). Each new CPU generation brings a modest performance improvement, and core density per socket increases with each generation. It’s an audacious statement to claim an urgent need for offload cards when the $/CPU-core ratio is declining. This is especially true if you model your infrastructure around mainstream processors. On top of this, all controller-VM-based HCI solutions already budget physical CPU cores for the controller (usually 8 vCPUs, which is not a big deal on a 40-core host sporting a dual Xeon Gold 6148 setup), so this overhead is factored in and not a matter of great concern.
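The back-of-the-envelope math behind that claim can be sketched as follows. The figures (8 vCPUs for the controller VM, dual 20-core Xeon Gold 6148) follow the text above and are illustrative assumptions, not vendor sizing guidance.

```python
# Rough controller-VM overhead estimate on a dual-socket Xeon Gold 6148 host.
host_cores = 2 * 20   # Xeon Gold 6148: 20 physical cores per socket
cvm_vcpus = 8         # typical HCI controller VM allocation (assumption)

overhead = cvm_vcpus / host_cores
print(f"Controller VM reserves {overhead:.0%} of physical cores")  # 20%
```

With hyper-threading enabled (80 logical threads), the effective share is smaller still – hardly a burden that justifies dedicated offload silicon.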
Don’t get me wrong, I don’t want to be sarcastic, but you have to wonder why Cisco is taking the path once taken (with variable success) by SimpliVity a few years ago. I have a hard time coming up with a decent rationale for why an acceleration card is needed, when the HCI industry has clearly taken the path of software.
All I can think of is that there is still considerable room for software optimization, but either the HX development team is not good enough at squeezing out the necessary performance in code, or this is a way to sell more hardware to resolve imaginary problems.
TL;DR: money and development cycles are being wasted on an ephemeral piece of hardware with little efficiency. Considering that HyperFlex now has 3,000 customers (meaning 1,000 net new customers since January 2018), it’s hard to imagine these cards selling by the thousands – but then again, I could be totally off track.
Bear in mind the card is entirely optional, so this could be a long rant for little benefit.
All NVMe HyperFlex
Another big moment for Cisco is the launch of their first All NVMe HyperFlex solution, based on UCS C220 M5 servers. The solution features Optane (3D XPoint) cache and NVMe capacity drives, with up to 32 TB per node. It wasn’t clear whether this refers to raw or usable capacity per node, nor how that figure varies with the media type used. It also hasn’t been stated what kind of NVMe SSDs are being used, but we can assume 3D TLC NAND.
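As a rough illustration of why the raw-vs-usable distinction matters: HyperFlex, like most HCI platforms, protects data via a replication factor (RF2 or RF3), which divides usable capacity accordingly. The 32 TB starting point comes from the announcement; the rest is a sketch under that assumption.

```python
# Hypothetical illustration: how "32 TB / node" shrinks once the
# replication factor (RF) is applied. HyperFlex supports RF2 and RF3.
raw_per_node_tb = 32.0

for rf in (2, 3):
    usable_tb = raw_per_node_tb / rf
    print(f"RF{rf}: {raw_per_node_tb:.0f} TB raw -> "
          f"~{usable_tb:.1f} TB usable per node "
          f"(before dedupe/compression gains)")
```

Depending on which number the slide meant, the effective capacity story differs by a factor of two or three – exactly the ambiguity flagged above.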
This is certainly laudable and we appreciate that NVMe is becoming more and more mainstream (see our study on The Solid-State Memory Industry In 2019), but what instantly raised an eyebrow was the claim of “up to 50% IOPS increase” – compared to the regular All-Flash SATA/SAS based HyperFlex.
This number just drives me nuts. The HX 220c M5 node is said to have Integrated Fabric Networking. To me, this implies that some kind of optimized transport is being used (NVMe-oF?) instead of plain Ethernet (40 GbE networking was mentioned). To be fair to Cisco, their slide on the matter states “Ongoing I/O optimizations” and “Further improvements expected with future SW optimizations”.
If you are claiming just a 50% increase in IOPS going from SATA to NVMe with “integrated fabric networking”, red lights start flashing as if my whole district had gone up in flames. Giving Cisco the benefit of the doubt, this probably means the existing HX stack was transposed as-is, without any optimization, onto a new server type and a different bus. It’s like saying “here’s your Mercedes S63 AMG, but we’re still working on the motor, so in the meantime we’ve put in the engine of a Fiat 500”. It’s even worse than that. It reminds me of Ultima VIII: Pagan (check out the awesome review).
Rushing something out just to claim it’s on the market, at the risk of releasing a botched product, is a very questionable strategy. Customers who purchase an NVMe product will want the performance now, not a few months down the line.
TL;DR: announcements at conferences are a great thing, but perhaps it’s better to work on the optimizations first.
Cisco has demonstrated over the course of the last 4 years its indefatigable commitment to the HyperFlex platform and to HCI. It has grown its customer base to over 3,000, with 1,000 net new within a calendar year, and has added features year after year.
HyperFlex has brought in some indisputably good aspects: improvements from a security and usability perspective, SAP HANA certification, one-click upgrades and a couple more.
Nevertheless, my points of contention persist. The first is the questionable rationale for releasing an optional acceleration card, when performance improvements to dedupe and compression should ideally be addressed in code. The second is a seemingly rushed-out All NVMe implementation that doesn’t appear to include any I/O optimization. The NVMe shortcomings should be considered transitional and could be cleared up in a few months; it’s just not something I’d want to invest in upfront.
Setting those two contention points aside, HX is a perfectly valid HCI solution that should be (just like any other solution) evaluated adequately by anyone considering implementing it, looking beyond the technology itself and taking integrations and technological preferences into consideration. It could be a great choice, for example, for all-Cisco shops.
The question for me, in the context of Cisco’s DC strategy, is to what extent continued investments in HCI are relevant. While it is perhaps not yet “commoditized”, HCI has become the default consumption platform for x86 infrastructure in Private Cloud / managed virtualization deployments. In Cisco’s defense, HyperFlex is their only storage platform unless they were to purchase one of the remaining “pure players” out there, so perhaps it’s easier for them to invest in HX than to take the risk of buying a storage market contender.
HCI is a saturated market that primarily serves a legacy technology called x86 virtualization. I will cover this topic in another blog post (be patient!), but the estimated lifespan of x86 virtualization is another 5 to 10 years. It will continue to exist (just like we still have mainframes and Power systems) but it will eventually fade out.
Talking about “pure players” and strategy, we could converge (no pun intended) these two topics to note that NetApp, for example, has a great data story and strategy. Unfortunately, it seems to me that Cisco has a story about the future, but not a strategy around it. All I am able to grasp beyond the buzzwords is the existence of products, not how they play together to build the story.
This post is part of my TFD Extra at Cisco Live Europe 2019 post series. I was invited to the event by Gestalt IT. Gestalt IT covers expenses related to the event: travel, accommodation and food for the duration of the event. I will not receive any compensation for participation in this event, and I am also not obliged to blog or produce any kind of content. Any tweets, blog articles or any other form of content I may produce are the exclusive product of my interest in technology and my will to share information with my peers. I commit to sharing only my own point of view and analysis of the products and technologies presented during this event.