At VMworld Europe 2016 I had the opportunity to stop by Cohesity's booth, where I had a long technical discussion with Joe Putti (Director of Product Management) followed by a demo. I'd like to thank Joe for his time and his patience, as well as Patrick Rogers (Head of Marketing and Product) for making this possible. Following VMworld, Cohesity also announced the release of their latest version during Tech Field Day 12, with a substantial batch of improvements. This article thus builds upon my notes from VMworld and what I was able to ingest from the TFD12 videos. If you think Cohesity is just another backup solution or just a backup target, I urge you to read on – Cohesity is in the secondary storage business, and we'll talk about this segment.
Company Presentation
Cohesity is a startup based in Santa Clara, in Silicon Valley, founded in 2013 by Mohit Aron, who also happens to be the co-founder and former CTO of Nutanix. We can therefore tie Cohesity to the "Nutanix Generation" of startups, each of which has been quite innovative in its own field.
Cohesity have so far raised 70 million USD in two rounds from ten investors, the last round (Series B) closing in May 2015 for a total of 55 million USD. Cohesity's mission is to eliminate secondary storage silos and provide an all-encompassing approach to the multiple use cases that secondary storage targets (see the section below for a short introduction to secondary storage).
Secondary storage is a market with a much larger need for capacity than primary storage. In fact, up to 80% of all the data generated in the world qualifies for secondary storage, and so far there is no established leader in this market – a massive opportunity for Cohesity. How and why? Read on!
What is secondary storage and why should I care?
A common analogy when describing primary vs secondary storage is to consider an iceberg. What we look at when we speak of primary storage is the tip of the iceberg, i.e. the visible part above the waterline. There is however more to an iceberg than its visible part – the submerged portion is hidden from view and considerably larger. The same applies to primary and secondary storage. Primary storage mainly designates the performance storage tiers used to run business-critical workloads, while secondary storage is the sum of all the data that is moved off primary storage or that never reaches it: file services, backups, archives, test/dev workloads, unstructured data (object storage), big data analytics. Some like to call data stored on secondary storage systems "dark data". It's an apt way to put it, since there is a lack of visibility and transparency in how secondary data is handled, but also (and mainly) a lack of awareness about the very existence of such data and the potential it holds.
Because of the sheer amount of space required, this data cannot be stored cost-effectively on primary storage arrays. All we had until recently were specialized solutions: NAS or scale-out NAS for file services, object storage solutions for unstructured data, the whole motley band of backup systems (media servers, master servers, tape libraries/VTLs, backup targets, physical backup appliances, virtual backup appliances) and finally systems that integrate with cloud storage solutions or act as gateways to these services (Amazon S3/Glacier, Azure Blob Storage etc.).
It doesn't take an eagle eye to grasp that such diversification comes at the high price of complexity, increased management constraints and data fragmentation. These constraints bring their own set of inefficiencies: data moved between different tiers/platforms needs to traverse the network, at the cost of transfer time and bandwidth consumed. Data fragmentation also leads to more space being consumed: since the data is scattered across different devices, we are unable to achieve massive deduplication savings.
Needless to say, the larger an enterprise is, the larger its data center footprint is likely to be, and the more it is likely to be entangled not only with secondary data silos, but also with organizational silos, which are yet another driver of complexity: each organization/silo usually settles for a specific technology regardless of what is in use in other silos, which in turn amplifies the inefficiencies highlighted above. Cohesity claims, for example, that 10 to 12 copies of the same data exist across silos. All of this means that investments are made in an uncoordinated fashion, without considering that the same data may live on many different storage tiers. CAPEX is wasted, while OPEX increases due to the necessity of managing multiple platforms and of training and hiring personnel with different skillsets.
The Cohesity Hyper-Converged Platform
This is where Cohesity and their hyper-converged secondary storage solution come into play. Cohesity aims to provide a single platform for secondary storage which is able to address the requirements of file services, backup systems, object storage, archive and analytics, while also tightly integrating with the major cloud vendors. Let's begin by stating that Cohesity's DataPlatform aims to resolve such a broad spectrum of problems that it would take almost a blog post per secondary storage area to cover them all – I will try to be brief. At the core of the Cohesity solution is their Cohesity OASIS operating system, now in version 3.5. Version 3.5 has brought forth a full set of RESTful APIs, integration with VMware vRA/vRO, and improvements in SQL Server/Oracle as well as physical Windows/Linux backups. Another important improvement is the addition of the S3 protocol, i.e. the ability to leverage DataPlatform as an object store. At this time, I unfortunately lack details about this new feature.
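Since DataPlatform now speaks S3, any standard S3-compatible client should in principle be able to address it. As a purely illustrative sketch – the endpoint URL, port, credentials and bucket name below are my own assumptions, not documented values – here is what that could look like in Python with boto3:

```python
import boto3

# Hypothetical example: pointing a standard S3 client at a Cohesity cluster.
# Endpoint, credentials and bucket name are placeholders, not documented values.
s3 = boto3.client(
    "s3",
    endpoint_url="https://cohesity-cluster.example.com:3000",  # assumed S3 endpoint
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
    region_name="us-east-1",  # typically ignored by S3-compatible endpoints
)

s3.create_bucket(Bucket="archive-bucket")
s3.upload_file("backup-2016-11.tar.gz", "archive-bucket", "backups/backup-2016-11.tar.gz")

# List what we just stored
for obj in s3.list_objects_v2(Bucket="archive-bucket").get("Contents", []):
    print(obj["Key"], obj["Size"])
```

If the implementation is faithful to the S3 protocol, existing tools and applications written against Amazon S3 would work against the appliance with little more than a changed endpoint.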
What are the advantages of Cohesity's solution? There are many, so I've attempted to make a little list:
Single Platform: Cohesity DataPlatform provides a single platform for all the secondary storage needs of a customer. Let's start with the distributed scale-out architecture of DataPlatform, which brings global deduplication and an unlimited number of snapshots/clones (a conceptual sketch of how variable-length deduplication can work follows this list). That same data you were holding on n different devices? Gone – or rather, deduplicated now.
Distributed Architecture: the distributed architecture provides the same level of resiliency that hyper-converged customers are accustomed to: it is designed to operate seamlessly even in degraded mode, and mechanisms built into the Cohesity OASIS operating system ensure data consistency.
Backups: Cohesity is also a potent backup solution that keeps growing and maturing. VMware backups are fully supported through VADP. We spent a long time discussing Cohesity's backup capabilities, especially native, agentless Microsoft SQL Server backups through the VDI interface, then Oracle backups (Cohesity is part of the Oracle Backup Programme), often reaching beyond my knowledge in the area – but Joe and one of his colleagues patiently went to great lengths to explain, to the point where I was positively impressed. Backups can also be used to instantly spin up clones for test/dev purposes. A word should be said about the ability to back up physical workloads (Windows/Linux). These can be stored in VHD format (for Hyper-V/Azure), which enables P2V as well as DR-to-cloud scenarios, plus the ability to cut over from physical workloads at any time. I can't remember if such P2V functionality also exists for VMware platforms. File Level Recovery is supported, and for MS Exchange/SharePoint, Granular Level Recovery (object-level) is also supported.
Cloud Integration: The native cloud integration with AWS, Azure and Google means that you can move data to a variety of cloud providers, whether the goal is to leverage cloud storage for archival purposes or just as another storage tier. Organizations can adopt diverse cloud strategies, such as archiving (copy + long-term retention), tiering (moving data to the cloud to free up the appliance – blocks are deduped, i.e. not suited to replication use cases, no spin-up possible) and cloud replication (DRaaS) with the ability to spin up workloads. As for replication, let's add that normal replication between Cohesity appliances is already functional.
Workload Prioritization: built-in QoS allows jobs/workloads to be throttled based on three priority levels (High, Medium, Low).
In-place Analytics: installations where backups are managed by DataPlatform make it possible to identify the owner of specific chunks of data, and metadata can be used to search for files across an entire installation. I was shown an area in Cohesity's sleek UI called the "Application Workbench". This allows customers to develop their own applications (as JAR bundles, if I recall properly) and run them on the DataPlatform. An example could be an app that looks at specific types of data and applies compression/reduction methods (think video/audio/images, for example). I'm no expert in this area, but the possibilities sound impressive.
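To give a feel for why a single platform enables such deduplication savings, here is a minimal, conceptual sketch of variable-length (content-defined) chunking, the general family of techniques behind deduplication stores of this kind. This is a toy model of my own, not Cohesity's actual algorithm:

```python
import hashlib, os

def chunk_boundaries(data: bytes, window: int = 16, mask: int = 0x3FF):
    """Yield variable-length chunks with content-defined boundaries.

    A boundary is declared wherever the hash of the last `window` bytes
    matches a bit pattern (low 10 bits zero here, ~1 KiB average chunks).
    Hashing every sliding window is slow; real systems use rolling hashes.
    """
    start = 0
    for i in range(window, len(data)):
        h = int.from_bytes(hashlib.sha1(data[i - window:i]).digest()[:4], "big")
        if i - start >= 2 * window and (h & mask) == 0:
            yield data[start:i]
            start = i
    if start < len(data):
        yield data[start:]

def dedup_store(data: bytes, store: dict) -> list:
    """Store unique chunks keyed by SHA-1 digest; return the chunk recipe."""
    recipe = []
    for chunk in chunk_boundaries(data):
        digest = hashlib.sha1(chunk).hexdigest()
        store.setdefault(digest, chunk)  # identical chunks are stored only once
        recipe.append(digest)
    return recipe

# Two nearly identical payloads end up sharing almost all of their chunks,
# because boundaries depend on content, not on absolute offsets.
store, base = {}, os.urandom(20000)
recipe_a = dedup_store(base, store)
recipe_b = dedup_store(b"new header!" + base, store)  # insertion at the front
print(len(store), "unique chunks backing", len(recipe_a) + len(recipe_b), "references")
```

Because boundaries are derived from the content itself rather than fixed offsets, inserting a few bytes only disturbs the chunks around the edit – which is what makes global deduplication so effective when the same data lands on the platform from many different sources.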
The beauty of Cohesity's solution is that you do not necessarily need to proceed with a wholesale "rip and replace" approach. You can begin by using Cohesity as a backup target or a NAS, then gradually expand to leverage other features as contracts with other vendors and/or support expire.
Architecture
Because of his previous work on the Google File System and his heavy involvement in building the Nutanix hyper-converged platform, it's safe to say that Mohit Aron has embedded the spirit of distributed architectures and hyper-converged technologies into Cohesity's DNA – and most of the engineers there also come from Google, Nutanix, VMware and other industry heavyweights. The result is a distributed, scale-out, hyper-converged platform that focuses on secondary storage needs.
Hardware Overview
Cohesity (like Nutanix and Rubrik) have chosen an appliance-based distribution model. From a hardware standpoint, Cohesity offers two appliance models (the C2300 and C2500), for which the main differentiator is storage density (HDD and PCIe flash). The C2300 offers 48 TB HDD + 3.2 TB flash for a 4-node block, while the C2500 offers 96 TB HDD + 6.4 TB flash, again for a 4-node block (the values given are raw capacity). Each node is fitted with dual Xeon E5-2600 series CPUs (8-core, 2.4 GHz) – the exact model is unspecified in Cohesity documentation but likely to be an E5-2630 v3. The form factor of a Cohesity block is similar to a Nutanix block or a Rubrik Brik, i.e. a commodity 2U chassis with two PSUs which accepts up to 4 nodes. Besides the C2300 and C2500 appliances, customers can also decide to run the Cohesity solution on Cisco UCS hardware. Various sources also point towards the existence of a virtual appliance version, but I was not able to find tangible evidence, at least of a possible "software-only" version. The virtual appliance is needed when a Cohesity cluster needs to be spanned into the cloud; in theory it could run on any commodity hardware as well (a move that would remind readers of Nutanix Community Edition).
Software Overview
The Cohesity solution comprises multiple layers. At the bottom, the Cohesity OASIS operating system (OASIS stands for Open Architecture for Scalable Intelligent Storage) provides core services such as the Cluster Manager, I/O Engine, Metadata Store, Indexing Engine and the Integrated Data Protection Engine. While I could have gone into detail on each of these services, it should be noted that Cohesity have produced a great Architecture White Paper, which I recommend reading – it does a perfect job of explaining the innards of the solution.
On top of this layer, the Cohesity Storage Services layer provides the services that differentiate the Cohesity solution. These services are Snapshots (through Cohesity's patented SnapTree technology, a distributed B+ tree that allows data to be accessed in a reduced number of hops), Data Deduplication (Cohesity uses variable-length deduplication blocks) and Intelligent Data Placement (which is reminiscent of Nutanix's Replication Factor combined with Block Awareness).

A depiction of Cohesity SnapTree, taken from their Architecture White Paper
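To illustrate the general idea behind SnapTree – greatly simplified, and as a toy model of my own rather than Cohesity's implementation (real SnapTrees are distributed B+ trees, not binary trees) – here is a copy-on-write tree in which a clone copies only the nodes along the path it modifies, so creating snapshots stays cheap no matter how many already exist:

```python
class Node:
    """A node in a simplified copy-on-write tree (toy stand-in for a SnapTree)."""
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value = key, value
        self.left, self.right = left, right

def insert(root, key, value):
    """Return a NEW root; only nodes on the path to `key` are copied,
    so every older root remains a fully usable, immutable snapshot."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    return Node(key, value, root.left, root.right)  # overwrite in the copy only

def lookup(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None

# Each "snapshot" is just a root pointer: a clone that changes one block
# shares every untouched node with its parent snapshot.
snap1 = insert(insert(insert(None, 2, "b"), 1, "a"), 3, "c")
snap2 = insert(snap1, 2, "B-modified")     # clone with a single changed block
print(lookup(snap1, 2), lookup(snap2, 2))  # -> b B-modified
```

The number of hops to reach any block is bounded by the shallow depth of the tree, which is what allows an unlimited number of snapshots and clones without the chained-delta performance degradation of traditional snapshot mechanisms.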
The third layer is made up of Application Services. This layer provides features usable by customers; the three applications offered for now are:
- Cohesity Protection – This application covers the whole spectrum of backup/restore/DR services which I introduced in the previous section
- Cohesity DevOps – This application leverages the SnapTree technology and RESTful APIs to enable the fast creation of clones, a feature useful for test/dev environments (a hypothetical sketch of what such an API call might look like follows this list)
- Cohesity Analytics – Thanks to its comprehensive indexing of metadata and to the metrics captured in the environment, Analytics provides a clear view of a Cohesity cluster's status, while also allowing users to search for specific bits of data or run custom-built scripts against datasets
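To give a feel for what driving the DevOps application programmatically could look like, here is a minimal sketch of a clone request against the RESTful APIs introduced in version 3.5. The endpoint path, payload fields and authentication scheme below are entirely invented for illustration – the actual Cohesity API documentation should be consulted for the real interface:

```python
import requests

# Entirely hypothetical endpoint and payload, for illustration only.
CLUSTER = "https://cohesity-cluster.example.com"
HEADERS = {"Authorization": "Bearer <token>"}  # placeholder credentials

payload = {
    "sourceVm": "prod-sql-01",         # VM whose latest snapshot gets cloned
    "targetName": "dev-sql-01-clone",  # name for the zero-copy test/dev clone
    "purpose": "test-dev",
}
resp = requests.post(f"{CLUSTER}/api/clones", json=payload, headers=HEADERS)
resp.raise_for_status()
print("Clone task started:", resp.json())
```

Because clones are backed by SnapTree, a call like this would return almost instantly and consume virtually no additional space until the clone starts diverging from its parent.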
Go-To Market Strategy
Cohesity’s approach (hardware appliances + software intelligence) will remind readers of Nutanix and Rubrik: a simple, scalable and powerful solution that is conveniently bundled in a 2U form factor. Customers purchase nodes with a given capacity and can grow as needed by purchasing additional nodes.
Cohesity focuses primarily on the Enterprise segment (80% of their customers). This is where most income can be expected and where the full potential of their solution can be realized, due to the sheer size of these customers as well as their years of battling with complexity and entrenched, out-of-date legacy solutions. One of Cohesity's strategies for landing new customers is to advertise the backup target capabilities of their platform while presenting the vast array of other possibilities, and then expand from there with backup services and so on. One may argue that the entry price point is higher than that of most backup targets; the fact remains that a fully-fledged Cohesity deployment is likely to blow away what customers were accustomed to. It's therefore up to Cohesity's sales force and their partner ecosystem to choose the right strategy and messaging when approaching customers.
In terms of the market itself, Cohesity have been very clear that their core focus is secondary storage. This decision works in their favor, as this is a fragmented landscape with no leader and better growth prospects than the saturated primary storage market. It is, in my view, a clever decision on their side.
Before moving to my personal opinion, one last point: those of you who have read my article on Rubrik may find some parallels. I'm not trying to put pears and apples in the same basket, but undeniably both companies share a certain part of their DNA and are using the genius simplicity inherent to hyper-converged solutions to revolutionize the market segments in which they operate. The reader should be conscious that while Cohesity also offers a solution to the complexity of backup systems, it goes beyond the backup problem and aims to embrace and resolve the multiple challenges of secondary storage which were highlighted earlier in this article. It thus wouldn't be fair to compare both products and solutions from the same angle.
Max’s Opinion
While the secondary storage market remains a frontier to be fully explored and conquered, I think that the industry actors need to raise awareness among CIOs and decision makers about the challenges posed by dark data. Ultimately, the challenge with siloed IT organisations remains the inability to take an all-encompassing view of the data lifecycle in enterprises, from inception on hot primary data tiers to continued life on colder data tiers, until cryogenization on services such as Amazon Glacier or utter destruction. Considering that up to 80% of the data generated in enterprises is created on fragmented secondary data tiers, there is a huge opportunity for cost savings at hand.
Cohesity has developed a holistic secondary storage platform that covers multiple use cases in the enterprise, and could become the "VMware" of data management. Through the use of a single platform, massive achievements can be reached – data deduplication at scale is a striking example. The ability to manage diverse secondary storage workloads from a single pane of glass should not be underestimated either. As corporations struggle to manage the explosion of data growth, Cohesity could become the unexpected saviour that brings back hope to desperate IT administrators and puts uncontrolled, chaotic IT expenditure waste to a halt, allowing enterprises to refocus their investments on a durable solution that spans multiple problem areas and transcends the imaginary boundaries of organizational silos.
To finish, Cohesity is very well placed to become the leader of the secondary storage market: they have a strong technical offering, an easy consumption model, indisputable arguments and a market open for conquest. Will they succeed? I wish them well.
Further Reading
Some of my Tech Field Day peers have published articles on Cohesity:
- James Green – The Silent Threat of Dark Data
- Mike Preston – Cohesity bringing secondary storage to Tech Field Day
- Eric Shanks – Cohesity Provides All of Your Secondary Storage Needs
Disclosure
Cohesity offered a Timbuk2 laptop bag to 2016 VMware vExperts at VMworld, on a first-come, first-served basis. I happen to be one of the beneficiaries of this offer due to my VMware vExpert status. I was also given a battery pack and some goodies (a pen, a laptop sticker). These gifts did not influence my writing; Cohesity has been on my radar for an article since at least April 2016. I am not affiliated with Cohesity, nor did they ask me to write an article. Due to my frequent participation in Tech Field Day events, this post could also be loosely related to Tech Field Day 12, although I was not able to participate this time.