This post is part of the blog series related to Storage Field Day 10. Find all of the SFD10 content, presentations, articles, presenting companies and delegates here.
Yesterday, Wednesday May 24th 2016, we had a deep dive session with the Primary Data team: Lance Smith (CEO), David Flynn (CTO) and Kaycee Lai (SVP Product & Sales). I'm new to Primary Data (shortened to Pd from here on); I find their solution interesting, but I'll give my thoughts later on.
Per CEO Lance Smith, Primary Data's goal is to "Create a global data space between flash, NAS, SAN and Cloud (object) storage and see if applications perform better on any of these types of storage without having to rewrite the apps."
“Primary Data is doing for the storage layer what VMware did for the compute layer” – David Flynn, CTO
The Pd value proposition is to present a logical view of storage to applications/VMs and to manage the storage layer in the background. The data mobility benefits are mainly about reducing costs tied to overprovisioning of storage (i.e. purchasing more than is effectively being consumed), increasing performance on demand for mission-critical applications and breaking free from vendor lock-in.
A Data Virtualization platform?
Primary Data's interpretation of Data Virtualization is to manage any data (files, LUNs, objects) on any storage, to be able to move it at any time and to do so without interruption.
Their way of achieving this is by separating the control plane from the data plane, as you can see in the picture below. The control plane can be seen as the brains of the solution; this functionality is provided by a component called DataSphere, which controls the environment and manages the metadata. The data plane is managed by a component named "DSX" (ESX/NSX, anyone?) whose role is to take care of any data read/write/move operations.
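To make this split a little more concrete, here is a minimal sketch of the general idea, assuming a toy metadata service and a direct data mover. None of the class or method names below come from Primary Data; this is purely my illustration of an out-of-band control plane versus an in-band data plane, not their actual implementation.

```python
# Purely illustrative sketch of control plane vs. data plane separation.
# None of these classes reflect Primary Data's real API; they only show the idea:
# the control plane answers "where should this data live?" while the data plane
# moves bytes directly between the client and the chosen storage endpoint.

from dataclasses import dataclass


@dataclass
class Endpoint:
    name: str          # e.g. "flash-array-1", "nas-filer-1", "object-bucket"
    tier: str          # "flash", "nas" or "object"
    cost_per_gb: float


class ControlPlane:
    """Out-of-band metadata/policy service (DataSphere-like role, hypothetical)."""

    def __init__(self, endpoints):
        self.endpoints = endpoints
        self.metadata = {}  # object id -> endpoint name

    def place(self, obj_id, needs_low_latency):
        # Trivial placement policy: latency-sensitive data goes to flash,
        # everything else goes to the cheapest tier available.
        candidates = ([e for e in self.endpoints if e.tier == "flash"]
                      if needs_low_latency else self.endpoints)
        chosen = min(candidates, key=lambda e: e.cost_per_gb)
        self.metadata[obj_id] = chosen.name
        return chosen


class DataPlane:
    """In-band data mover (DSX-like role, hypothetical): talks to endpoints directly."""

    def write(self, endpoint, obj_id, payload):
        # In reality this would be an NFS/SMB/object write; here we just print.
        print(f"writing {len(payload)} bytes of {obj_id} to {endpoint.name}")


endpoints = [Endpoint("flash-array-1", "flash", 0.80),
             Endpoint("nas-filer-1", "nas", 0.20)]
control, data = ControlPlane(endpoints), DataPlane()

target = control.place("vm-disk-42", needs_low_latency=True)  # ask the control plane
data.write(target, "vm-disk-42", b"...")                      # move the data directly
```

The point of the sketch is simply that the metadata decision and the data movement happen in two different places, which is what allows the control plane to stay out of the I/O path.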
DataSphere is either a virtual or a physical appliance that provides out-of-band management of the data virtualization infrastructure. DataSphere is scalable and could, in Pd's eyes, be seen as the "One Protocol To Rule Them All". I'm not sure fellow SFD delegate Chris Evans agrees with that definition, though.
DSX is built on top of the Linux kernel component that handles the NFS protocol and, per David Flynn (and also to my limited understanding), it should be seen as a modified/improved version of NFS that incorporates performance metrics feedback. I'm taking a risk in assuming that DSX is a driver; it could just as well be a virtual appliance. Pd or anyone else may correct me here.
DSX natively supports Linux (i.e. KVM-like hypervisors) as well as VMware vSphere and Hyper-V. In the case of Hyper-V, customers can choose between the Pd-based SMB provider for optimal performance and the standard Microsoft SMB provider (at the likely cost of some performance). There seem to be plans to move DSX into the ESXi kernel. Data is stored on the endpoints as BLOBs, each with a tarball attached that lists the actual content of the data stored in the BLOB.
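To illustrate that last point only (Pd did not go into on-disk details, so the file names and the tarfile-based manifest below are entirely my assumption), pairing an opaque blob with a small tarball that lists its contents could look roughly like this:

```python
# Hypothetical sketch: write an opaque data BLOB plus a small tarball "manifest"
# that lists what the BLOB contains. This is my own illustration of the concept,
# not Primary Data's actual on-disk format.

import io
import json
import tarfile


def write_blob_with_manifest(blob_path, manifest_path, payload, contents):
    # The BLOB itself stays opaque to the storage endpoint.
    with open(blob_path, "wb") as f:
        f.write(payload)

    # The attached tarball carries a listing of what is inside the BLOB.
    listing = json.dumps(contents, indent=2).encode()
    with tarfile.open(manifest_path, "w") as tar:
        info = tarfile.TarInfo(name="contents.json")
        info.size = len(listing)
        tar.addfile(info, io.BytesIO(listing))


write_blob_with_manifest(
    "vm-disk-42.blob",
    "vm-disk-42.manifest.tar",
    payload=b"\x00" * 1024,  # stand-in for real data
    contents={"files": ["vmdk/flat.vmdk", "vmdk/descriptor.vmdk"]},
)
```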
Pd claims that their platform takes flash wear into consideration. I'm curious as to how they plan to truly achieve this; perhaps some data placement algorithms combined with feedback from the flash arrays? And what about direct-attached flash storage?
True Software-defined Storage?
As Chris Evans and others have rightly put it, nearly every vendor claims to be an SDS vendor, but many fall flat. What is Software-defined Storage? Our attempts to define it yesterday in the first pilot video of the Gestalt IT Podcast were not very successful either. To me (i.e. just a guy among hundreds of thousands of IT professionals), a Software-defined Storage solution should at least fulfill two requirements:
- total abstraction/independence from the hardware layer: the software drives everything
- policy-driven orchestration: policies determine where the data has to be located based on a finite set of imperatives (data placement, compliance constraints, service levels, quality of service, data availability and data protection etc.). Note: orchestration may not be the correct word, but I’m jetlagged and it’s early anyways.
It seems that Primary Data fulfills both these requirements.
The customer feeds a set of "predicates", i.e. requirements such as SLAs, data availability, costs etc., into DataSphere. These then drive how data is placed across the environment.
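To make the notion of predicates a bit more tangible, here is a small hypothetical sketch. The predicate fields and the filtering logic are my own assumptions, not DataSphere's actual policy language, but they show how requirements like SLA, availability, cost and location could narrow down the eligible storage:

```python
# Hypothetical sketch of "predicates" driving data placement.
# The field names and matching rules are illustrative only; Primary Data's
# actual policy language was not shared at this level of detail.

predicates = {
    "max_latency_ms": 5,         # SLA: keep latency under 5 ms
    "min_availability": 0.9999,  # four nines of availability
    "max_cost_per_gb": 0.50,     # cost ceiling
    "allowed_locations": {"EU"}, # compliance: data must stay in the EU
}

candidate_storage = [
    {"name": "flash-array-eu", "latency_ms": 1, "availability": 0.99999,
     "cost_per_gb": 0.80, "location": "EU"},
    {"name": "nas-filer-eu", "latency_ms": 4, "availability": 0.9999,
     "cost_per_gb": 0.25, "location": "EU"},
    {"name": "object-store-us", "latency_ms": 40, "availability": 0.99999,
     "cost_per_gb": 0.02, "location": "US"},
]


def matches(storage, p):
    """Return True if a storage endpoint satisfies every predicate."""
    return (storage["latency_ms"] <= p["max_latency_ms"]
            and storage["availability"] >= p["min_availability"]
            and storage["cost_per_gb"] <= p["max_cost_per_gb"]
            and storage["location"] in p["allowed_locations"])


eligible = [s["name"] for s in candidate_storage if matches(s, predicates)]
print(eligible)  # -> ['nas-filer-eu']: the only endpoint meeting SLA, cost and location
```

In a real platform the interesting part would of course be what happens when no endpoint (or more than one) satisfies the predicates, and how data is moved when conditions change over time.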
However, it's unclear whether this solution (or any other) also delivers on additional requirements. If there is total independence from the hardware layer, how do we take advantage (or not) of array-specific features such as compression, deduplication etc.? Are these handled by Primary Data itself, or does Primary Data poll the endpoint and leverage these capabilities when available?
Currently, Pd proposes this for a single data center, but a case was made for imperatives related to data location (i.e. data must or must not be stored in a given DC / location / country). Per David Flynn, Pd is working on adding such a feature to their product.
Final Thoughts
To me, Pd seems to be an ideal tool for data consolidation projects in large environments with a lot of legacy and complexity, or as a policy-driven data management solution.
Primary Data should, in theory, offer a true single pane of glass to manage a variety of arrays across an enterprise and effectively allow teams to work without the hassle of managing every single array and its specific subset of features.
The advantages for data consolidation are obvious. However, this will require a lot of preparation ahead of time, not only to correctly (hopefully) define all of the predicates that matter to an organization, but also because of the specific way data is stored. Straight migration/conversion of LUNs or integration of existing NFS storage will not be possible, and there will be a lot of back-and-forth work in environments where storage arrays are already well used and mostly provisioned: creating LUNs for Pd, handing those LUNs to Pd, migrating data to those LUNs, vacating the old LUNs, providing more space, and so on. Still, it might be worth the effort if the environment is not made of legacy storage arrays nearing their end of life.
Customers should consider whether their primary driver is data consolidation or policy-driven data management.
If your primary reason is data consolidation, then Pd would be helpful as a tool to help you transition from madness to sanity, i.e. shrinking your installed base (variety of models/vendors/storage types) and moving to one or two platforms, but only on a temporary basis.
If your motivation is, however, to drive your environment by policy, with a wise use of meaningful predicates and a proper "encoding" of all the imperatives, costs etc., then Pd might become a powerful platform to control your costs, ensure you adhere to compliance/data placement policies at all times and guarantee that customers get the performance and service they pay for.
Primary Data seems to be an interesting solution. I am looking forward to seeing how they differentiate themselves from Hedvig, to understand whether these two contenders are playing in the same field and targeting the same use case/offering a similar proposition or not. Nevertheless, as you have seen, Primary Data has piqued my interest and I think it's a platform/company worth following.
Disclosure: this post is a part of my SFD10 post series. I am invited to the Storage Field Day 10 event by Gestalt IT. Gestalt IT will cover travel, accommodation and food during the SFD10 event duration. I will not receive any compensation for participation in this event, and I am also not obliged to blog or produce any kind of content. Any tweets, blog articles or any other form of content I may produce are the exclusive product of my interest in technology and my will to share information with my peers. I will commit to share only my own point of view and analysis of the products and technologies I will be seeing/listening about during this event.