This is not the first time I’ve covered the topic of data management. In the past, I covered on my blog some of the challenges encountered with data management, specifically the ability (and necessity) to properly define the scope of data management.
A Multiverse of Data Management Realities
There is a multiverse of data management realities, all different based on context and scope. Data management will mean something very different depending on whether I talk to an IT user or a business user.
The Business User Perspective
People working in legal & compliance will be interested in topics such as legal hold, retention periods, and data classification. Data classification usually covers the broad range of documents produced within an organization, and the responsibility of applying the correct classification usually rests with the user. Users are to determine whether the content they create constitutes public records, internal business-grade information, or sensitive records.
Many organizations & corporations also have to comply with external regulatory bodies. Organizations listed on US exchanges must comply with SOX (Sarbanes-Oxley Act) regulations, while those handling credit card payments must comply with the PCI Security Standards. Healthcare & finance industries (to cite just a couple) also have to comply with their own national or international regulators. If you don’t know about these, it’s an interesting rabbit hole to dive into, but I can’t promise you’ll get out of it undamaged.
The IT User Perspective
Now go talk with an IT user and you will get a totally different perspective on what constitutes data management. To them, data management is very often about which type of data needs to be placed on which part of the infrastructure stack.
Although business requirements very often influence the storage infrastructure & architecture landscape, the assessment criteria are different and may include (to cite just a few):
- underlying storage device type (block vs. file vs. object)
- workload type (latency driven vs. capacity driven)
- workload criticality
- service level agreements
- backup and archival requirements
Because infrastructure has a cost and capacity is finite, a blend of business requirements and data type evaluation criteria will be used to determine not only where the data will reside, but also which infrastructure components will be used to store the data (and also, which data processing / management attributes these will have).
Based on the assessment criteria above (which, I repeat, are just an excerpt), an organization should determine which data qualifies as primary data and which as secondary data.
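To make this concrete, here is a minimal, purely illustrative sketch of how such an assessment could translate into a classification rule. The criteria, thresholds and names below are my own hypothetical simplifications, not any organization’s or vendor’s actual logic:

```python
# Toy sketch: score a dataset against a couple of the placement criteria
# listed above. Every rule and threshold here is a hypothetical example.
from dataclasses import dataclass


@dataclass
class Dataset:
    name: str
    latency_sensitive: bool      # workload type: latency vs. capacity driven
    business_critical: bool      # workload criticality
    days_since_last_access: int  # crude proxy for "active" data


def classify(ds: Dataset, active_window_days: int = 30) -> str:
    """Return 'primary' or 'secondary' using simple, illustrative rules."""
    recently_used = ds.days_since_last_access <= active_window_days
    if recently_used and (ds.business_critical or ds.latency_sensitive):
        return "primary"
    return "secondary"


erp_db = Dataset("erp-database", latency_sensitive=True,
                 business_critical=True, days_since_last_access=0)
old_backups = Dataset("2017-backups", latency_sensitive=False,
                      business_critical=False, days_since_last_access=400)

print(classify(erp_db))       # primary
print(classify(old_backups))  # secondary
```

In practice, of course, the decision involves far more dimensions (SLAs, backup and archival requirements, cost of the underlying storage tier), but the principle is the same: a blend of business and technical criteria feeds a placement decision.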
Because there is no common agreement on what data management is, and because every organization has its own unique set of business & technical requirements, what qualifies as primary data and secondary data will vary.
Primary and Secondary Data
As we’ve seen above, in the IT context data management relates to where data is best stored. Without getting into the muddy waters of what constitutes data archival (as opposed to long-term retention), we usually see “primary data” and “secondary data”.
Primary data can be seen as an organization’s live data set, i.e. all of the active data used and generated by its users & core business systems to support its operations.
In contrast, secondary data can be seen as the byproduct of primary data. It can be either data outputs generated by primary data for further consumption, or it could also be inputs that are used to fuel the primary data production processes. In either case, this data is not used actively but kept for a variety of reasons.
Assessing what constitutes primary and secondary data is sometimes fuzzy, because the final usage of the data may be in contradiction with the application architecture & its data storage requirements.
There are interesting solutions available on the market that help assess whether data is active or not, but ultimately the analysis & data movement capabilities require a coherent data management platform to deliver their value.
Data Management in the Context of Secondary Data
Usually, data management for primary data is a no-brainer. The challenge pops up when handling secondary data, because it comes in different formats and serves different use cases. And because of this, it has traditionally been a complex and siloed area.
The most common secondary data use cases are as follows:
- Data Protection
- File & Object Storage
- Test & Dev environment (data copies)
- Data Analytics
Data management is now getting increasingly popular with many vendors, not only because it gives them a shiny new buzzword to use, but also because it eventually helps them be perceived as doing more than their usual market segment.
A data protection company can derive more value from offering a “data management” solution than from just handling backup copies & restoring data. Several companies recently pivoted from their traditional offerings to offer data management solutions.
It remains to be seen who is best placed to win market share in the data management space. In that regard, I have a pretty solid opinion: looking at all of the vendors who brand themselves as data management companies, only Cohesity hits the mark.
The Cohesity Data Management Platform
I’ve written in the past about Cohesity, but that was a long time ago (January 2017!) and they’ve made a lot of progress since then. As a reminder, Cohesity was founded in June 2013 by Mohit Aron. He and his team have since been on a mission to solve mass data fragmentation and build a platform that resolves the hurdles of managing multiple silos of data.
Their solution, the Cohesity DataPlatform, has addressed the use cases of data protection, file & object storage, test & dev as well as data analytics for a long while now. It would be tedious to cover the full SpanFS (Cohesity file system) architecture here, although it is based on SnapTree, a technology that I covered in my January 2017 post.
Here are some of the Cohesity SpanFS attributes:
- Strict write consistency
- Multiprotocol (SMB / NFS / S3)
- Global deduplication (within cluster boundaries)
- SnapTree limitless snapshots & clones
- Self-healing
- Automated tiering
- Multicloud support
- Multi-tenancy with QoS
- Global indexing and search
From a location scope perspective, the Cohesity DataPlatform can be consumed in three form factors:
- Edge: Cohesity DataPlatform Virtual Edition – a virtual appliance relevant for RoBo and Edge use cases
- Core / On-premises: Cohesity DataPlatform – the “traditional” Cohesity hardware appliance, whether on their own appliances (C2000 / C3000 / C4000), or through technology partners Cisco and HPE
- Cloud: Cohesity DataPlatform Cloud Edition – a virtual appliance that can run on the major public cloud providers: AWS, Azure and Google Cloud
The glue that binds it all together is Helios, Cohesity’s global management environment, which can seamlessly manage all three of these consumption models. Again, it would take a series of articles to cover each of those components in detail.
From a workload perspective, DataPlatform supports a broad variety of workloads. VMware ESXi, Microsoft Hyper-V and Nutanix AHV are supported, but it doesn’t stop at the hypervisor level. Physical servers and NAS are supported too, as well as major databases (a lot of engineering effort was put into ensuring that Oracle, SQL, Exchange and SAP HANA would be supported). Finally (and obviously), cloud workloads are supported as well via the DataPlatform Cloud Edition virtual appliance.
An App Store for Data Management
In the early days of Cohesity, it was possible for customers to execute custom packaged code that they would eventually develop themselves. I can’t remember exactly, but these may have been jar (Java) packages.
It was evident from the early days that Cohesity would eventually either develop their own applications or allow external third parties to develop applications that would tap into the potential of the platform. I can remember the wide smile and vivid look of my Cohesity friend Joe Putti (from the Product Management team) when he first showed me the initial feature allowing custom jar files to run; I thought there had to be more to it!
Fast forward to 2019: Cohesity has built an SDK and a developer portal that allow third parties to develop apps specifically to run on the Cohesity DataPlatform.
Here is a list of inaugural third-party apps:
- Splunk Enterprise – yes, *that* Splunk, but running on Cohesity DataPlatform
- SentinelOne – apparently a threat prevention / antivirus solution that is allegedly powered by AI / machine learning
- Clam Anti-Virus – An open-source antivirus that can also execute directly on the Cohesity DataPlatform
- Imanis Data – A data protection solution for Hadoop / NoSQL workloads
And here are the apps developed by Cohesity themselves:
- Insight – according to Cohesity, “this application uses a powerful index to search data as it is stored on the Cohesity DataPlatform, helping customers easily locate and take action on their data for compliance, legal, or day-to-day business needs”
- Spotlight – This application monitors modifications made to files or data and can understand through anomaly detection patterns whether malicious activities are taking place, such as malware attacks, data breaches or rogue behavior. Spotlight augments the DataPlatform with logging capabilities and the ability to create alerts about data modifications if required.
- EasyScript – an app that helps with uploading, executing and managing scripts, and in general makes users’ lives easier (at least for users who can code at all)
Market Positioning & Financials
In June 2018, Cohesity secured a $250 million Series D funding round led by SoftBank (the Japanese conglomerate). That brings the total VC money poured into Cohesity to $410 million. Data management is a highly lucrative market segment; the overall spending on disparate & siloed data management solutions is currently estimated to be in the range of $60 billion.
Since Cohesity isn’t a public company, we cannot get insight into its numbers beyond press releases. The company claimed on 21-Aug-18 a +300% YoY growth in revenue for Fiscal Year 2018 (Cohesity’s fiscal year runs from 1-August to 31-July).
Cohesity also received a bunch of awards in 2018 (including one from the World Economic Forum, for what it’s worth), and Mohit Aron, their founder, received a load of accolades from various organizations. Clearly, there’s a lot of good momentum for Cohesity, and one can almost see the perfectly polished white teeth of big investors ready to seize their bite.
The company has greatly increased in size employee-wise. I like to joke about the fact that every 12 months, Cohesity has been renting 2 or 3 additional floors at their San Jose HQ building to accommodate their growth needs. Incidentally, they also announced in December 2018 their intention to hire 400 additional employees.
In my view, Cohesity has reached a maturity point and the company now needs to weigh its options. The company seems to be highly profitable, and investors will be looking for a return on their money.
The next 12 months will be exciting. I’m inclined to believe that Cohesity may file for an IPO, i.e. become a publicly traded company on a stock exchange. It would be an incredible opportunity for investors, and one that makes me regret not being in the VC business.
All indicators seem to be green. Past growth has been tremendous, and the company has an excellent product backed by a consistent vision & strategy. Finally, the solution addresses a market that is not yet saturated, so there is plenty of room for growth, as well as the ability to make incursions into the Data Protection market.
Ultimately, doing pure play Data Protection for Enterprise IT will become more and more difficult unless the vendor offers real data management capabilities and intelligence on top of the metadata, and that’s where Cohesity has a winning card.
Surely you’ve heard of the beautiful Maldives islands in the Indian Ocean. I’ve never been, but it looks like a beautiful place to visit. And yet, this country has problems of its own that are hardly known to tourists. One of those problems is Thilafushi, known as the Maldives’ garbage island.
I haven’t turned into a tourist resort review site yet, but the point is that our IT infrastructures also have a kind of data landfill problem. It’s ugly, it stinks, and nobody wants to (or can) take care of it, citing many different issues and challenges.
Perhaps it’s a long detour to claim that managing data is like managing a garbage landfill, but there are certainly parallels to be drawn. Not all secondary data is unwanted garbage: there are insights to be gained from analyzing secondary data, just as there is value in recycling waste.
As a blogger & industry analyst, I’ve been following Cohesity since 2016. I’ve had the opportunity to talk numerous times with their product management people. What struck me from the very early days has been the consistency of their vision, the clarity of their goals, and their drive to execute aggressively and coherently against their vision. As long as Cohesity maintains their focus on secondary data hurdles and builds the right partnerships, we can expect to see continuous product improvements and new features.
From an IT standpoint, the key to leading the data management market is to understand and address data fragmentation. Not only that, but it’s also critical to have a valid data / data management story when talking with customers. As far as I know, there are only two companies in the market who have such a story. And if we focus on secondary data, Cohesity is (at least to me) still the undisputed leader in that space, and the gold standard against which other offerings are compared.
Make sure to check the related Storage Field Day 18 videos from Cohesity:
Cohesity Under the Covers: SpanFS
Deep Dive on Scale-out NAS
Cohesity Comprehensive Data Protection and Compliance
Cohesity Leveraging Your Backup and Unstructured Data
This post is a part of my Storage Field Day 18 post series. I am invited to the event by Gestalt IT. Gestalt IT will cover expenses related to the event (travel, accommodation and food) for the duration of the event. I will not receive any compensation for participation in this event, and I am also not obliged to blog or produce any kind of content. Any tweets, blog articles or any other form of content I may produce are the exclusive product of my interest in technology and my will to share information with my peers. I will commit to share only my own point of view and analysis of the products and technologies I will be seeing/listening about during this event.
All Storage Field Day 18 delegates received a gift from Cohesity in the form of a sports jacket with custom embroidery – see this post for more detail. This gift has no influence on my opinions about Cohesity.