When determining storage capacity, performance and availability requirements for their mission-critical systems, most organisations, once presented with alternatives, try to solve the equation by looking primarily at pricing and squeezing the most out of what they can afford or are prepared to pay. While this approach has the merit of keeping IT costs under control, an essential aspect that is very often not factored into the equation is the cost of downtime for the organisation and how high-availability requirements should be evaluated. This article was inspired by the Dell EMC VMAX architecture presentation delivered at Storage Field Day 14.
About mission-critical systems
First and foremost, to even be able to approach the question of high-availability requirements and downtime, the organisation must be well aware of its critical, core business processes and of the mission-critical systems that support them. Core business processes are those vital for the company to operate and generate revenue, and ideally a profit; the inability to run them smoothly will bring an organisation to its knees and leave it unable to operate.
To illustrate this, think of a global air travel operator whose reservation system is unavailable or severely degraded. The impact is global, affects tens of thousands, if not hundreds of thousands or millions, of customers and causes cascading loss of business (for the airline itself, but also for the business travellers who cannot fulfil their work duties). The most simple, unassuming IT activity can have an unparalleled downstream impact on a company's business, profit and credibility on the market. For example, the May 2017 British Airways IT outage cost the airline an astounding circa 80 million British pounds (GBP), with 75,000 passengers affected and around 670 flights cancelled from Heathrow and Gatwick (source: BBC). Those who travel regularly to London, one of the busiest air travel regions in the world, can appreciate the breadth of the impact and the resulting chaos.
Mission-critical systems differ from organisation to organisation, and they always need to be assessed from an enterprise-wide, business perspective. I call this the « objective criticality » of a system, i.e. its real criticality, with people's feelings and emotions about their « baby application » removed from the context, so that judgment is made on real impact only. The impact of a performance disruption or downtime on mission-critical systems will be massive and potentially widespread, with major adverse consequences for the organisation: loss of revenue, loss of reputation and eventually regulatory scrutiny or sanctions if the organisation operates in a regulated environment (for example the banking or pharmaceutical industries). To give an example from the banking world, a financial institution could go as far as losing its operating licence in certain markets, which can have dire consequences on revenue and reputation, not to mention the downstream impact on customers, and on the stock price if the organisation is listed on one or more stock exchanges. It's also important to point out that mission-critical systems will not necessarily be business applications. The unavailability of authentication/authorisation systems (such as Microsoft Active Directory, for example) can be extremely damaging and lead to loss of revenue as well.
Mission-Critical Architectures for Mission-Critical Systems
Analysing, quantifying and weighing those risks belongs to what is generally called a Business Impact Assessment (BIA). The BIA should be used in architectural discussions to determine non-functional requirements such as availability, reliability, integrity and survivability. It won't come as a surprise that mission-critical systems should ideally be highly available at the application level, but this doesn't exonerate the infrastructure layer from also providing high availability by various means. That part can be split in two: on one side, redundancy of hardware components and the existence of multiple failure domains; on the other, the requirement for the infrastructure to be up and available at all times, even while maintenance or upgrade activities are being performed, because the supported mission-critical system cannot tolerate any downtime without severe disruption to business activities. Infrastructure reliability ultimately depends on the availability and chosen architecture of the hardware components. Reliability is often expressed in the industry as a certain number of « 9's », where the more nines, the higher the yearly uptime (see this handy table). Obviously, the higher the required yearly uptime, the more likely it is to have an impact on the solution price.
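To make the « 9's » a bit more tangible, here is a quick back-of-the-envelope sketch (my own illustration in Python, not something from the presentation) that converts an availability percentage into the yearly downtime it allows, assuming a 365-day year and ignoring planned maintenance windows:

```python
# Back-of-the-envelope sketch: allowed downtime per year for a given
# number of "nines" of availability. Assumes a 365-day year and ignores
# planned maintenance windows.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes

for availability in (0.99, 0.999, 0.9999, 0.99999):
    downtime_minutes = MINUTES_PER_YEAR * (1 - availability)
    print(f"{availability:.3%} availability -> "
          f"~{downtime_minutes:,.1f} minutes of downtime per year")

# 99%     (two nines)  -> ~5,256 minutes  (~3.7 days)
# 99.9%                -> ~525.6 minutes  (~8.8 hours)
# 99.99%               -> ~52.6 minutes
# 99.999% (five nines) -> ~5.3 minutes
```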
Regarding data integrity and survivability, data integrity is such a must-have requirement for any infrastructure system (and especially for storage systems) that it is usually not even mentioned: you just come to expect it (and rightfully so). But in the context of survivability, i.e. the ability of a system to continue operating in the case of a disaster, the integrity of the data is critical. This gives us the opportunity to shift from the abstract concepts explained above to more technical topics. Getting back to our mission-critical systems, we know that they are a critical lifeline of the organisation, and that they must keep running without incurring any downtime, or with the minimal downtime possible. In this case, it's necessary to have a robust Disaster Recovery / Business Continuity strategy that fully takes into account the requirements of the Business Impact Assessment, including RPO (Recovery Point Objective, defining up to what point in time data can be recovered) and RTO (Recovery Time Objective, the time needed to restore the service). Mission-critical systems usually have very stringent, near-zero RPO/RTO requirements (RPO: almost no data loss can be tolerated; RTO: service must be restored almost immediately), and it's no surprise either that the closer to zero the RPO and RTO requirements are, the heftier the price of a solution able to deliver on both aspects.
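As a simple illustration of how these two objectives play out in practice (with purely hypothetical figures, not taken from any real BIA), the sketch below computes the oldest acceptable recovery point and the restoration deadline for a given incident:

```python
# Illustrative sketch with invented RPO/RTO figures: given an incident time,
# derive the oldest acceptable recovery point and the restoration deadline.
from datetime import datetime, timedelta

rpo = timedelta(minutes=15)   # hypothetical: at most 15 minutes of data may be lost
rto = timedelta(hours=1)      # hypothetical: service must be back within 1 hour

incident = datetime(2017, 10, 6, 14, 0)

oldest_recovery_point = incident - rpo   # 13:45 - data written after this may be lost
restoration_deadline = incident + rto    # 15:00 - service must be restored by then

print(f"Recover data from no earlier than {oldest_recovery_point}, "
      f"restore service by {restoration_deadline}")
```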
Meeting stringent RTO and RPO requirements for mission-critical systems, especially on the storage side of things (where enormous amounts of data need to be instantly available and cannot suffer lengthy data transfers), requires infrastructure capable of synchronous replication. Synchronous replication ensures data integrity on one hand (each write on the primary site is also performed on the secondary site, and is acknowledged to the host only once both sites have acknowledged it) and survivability on the other (data is immediately available at the secondary site). Because we are discussing the storage layer, it would of course be desirable in such a highly available mission-critical system to have an application layer that is itself highly available and able to understand the underlying storage layout.
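A minimal sketch of that write path, assuming a simplified model of two sites and a single acknowledgement per site (my own simplification for illustration, not Dell EMC code), looks like this:

```python
# Simplified model of a synchronous-replication write path: the host's write
# is only acknowledged once BOTH the primary and the secondary site have
# committed it, which is what yields an RPO of zero at the secondary site.

class Site:
    def __init__(self, name: str):
        self.name = name
        self.blocks = {}

    def commit(self, lba: int, data: bytes) -> bool:
        self.blocks[lba] = data
        return True  # acknowledgement from this site

def synchronous_write(primary: Site, secondary: Site, lba: int, data: bytes) -> bool:
    ack_primary = primary.commit(lba, data)
    ack_secondary = secondary.commit(lba, data)  # adds one WAN round trip of latency
    # The host only gets its acknowledgement when both sites hold the data.
    return ack_primary and ack_secondary

primary, secondary = Site("site-A"), Site("site-B")
assert synchronous_write(primary, secondary, lba=42, data=b"payload")
```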
Dell EMC VMAX: a proven and modern mission-critical all-flash storage architecture
This lengthy but necessary detour finally brings us to the topic of this article: how the latest Dell EMC VMAX All Flash storage platform is a perfect fit for mission-critical systems. Veterans in the storage industry know that the VMAX can pride itself on a heritage that goes all the way back to the EMC Symmetrix high-end storage arrays that were popular in the early 1990s.
Technically speaking, the VMAX All Flash platform is the third iteration of the VMAX product range, after the initial VMAX launched in 2009 and the VMAX3 platform launched in 2015. The VMAX is a Tier-1, high-end storage array for flash and large-scale use cases, operating at petabyte scale with tens of thousands of LUNs and leveraging 3D NAND flash memory. Dell EMC's stance on the VMAX is very clear: besides providing high performance, the goal of the platform is to offer hardware resilience and to be best in class in that category. With multiple controllers, a 2.5-million-hour MTBF, controls in place to keep data in persistent RAM, online controller upgrades and thorough testing of firmware releases, the VMAX ticks all the boxes. Regarding performance, the top-end configuration is capable of processing up to 6.7 million IOPS at sub-1 ms latency, while claiming a consistent response time of up to 350 μs at massive scale.
The VMAX is currently offered in two versions, the 250F and the 950F. The 250F targets smaller deployments or isolated/dedicated storage pods for x86 architectures, while the 950F is the larger sibling, a perfect fit for large data centers and mixed use cases. The 950F also supports mainframe systems and can operate in mixed mode, delivering storage services to both x86 and mainframe systems, which makes it a great asset for cross-platform storage consolidation. As with other Dell EMC products, two software packages/services can be purchased with a VMAX: the « F » range (standard) or the « FX » range (advanced services, including SRDF). It is also possible to get an « F » model and consume/license services on a per-need basis.
The core component of a VMAX is a V-Brick, made of one engine (each engine has two directors), two DAEs (disk array enclosures), up to 72 Broadwell CPU cores and up to 1 PB of capacity on the 250F (up to 4 PB on the 950F). The 250F scales to 2 V-Bricks, while the 950F scales to 8 V-Bricks, creating a matrix of 8 engines / 16 directors.
But all of these features, no matter how great, pale in comparison with SRDF. SRDF, which stands for « Symmetrix Remote Data Facility », is Dell EMC's crown jewel: its data replication software and the gold standard in the industry for data replication. As Vince Westin from Dell EMC aptly put it at Storage Field Day 14: « Customers buy SRDF and they get a VMAX array with it ». The VMAX is a great Tier-1 solution, but SRDF is unique and you can get it only with a VMAX, because it was designed for the Symmetrix/VMAX architecture. It works reliably and is nurtured and looked after by Dell EMC like the goose that lays the golden eggs, because it is the only product in the industry that offers not just the features but also such a track record, being currently deployed and used at 70% of Fortune 100 companies.
SRDF allows customers to replicate tens of thousands of volumes across up to four locations around the globe, and the version used with VMAX All Flash has been improved to leverage flash optimisations. SRDF works in various modes:
- SRDF Synchronous (SRDF/S), which allows zero-data-loss mirroring between data centers up to 100 km (60 miles) apart (see the latency sketch after this list)
- SRDF Asynchronous (SRDF/A), which allows asynchronous replication across up to three or four data centers, at distances of up to 12,875 km (8,000 miles)
- SRDF/Metro, which allows for active-active data availability, stretched clustering and non-stop access to data, either within a data center or between two data centers up to 100 km (60 miles) apart. SRDF/Metro covers the most demanding mission-critical applications by providing a single namespace (LUN identity) that spans two replicated LUNs, each located at a different site. This allows applications to fail over seamlessly, without any manual intervention.
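To see why synchronous modes such as SRDF/S and SRDF/Metro are distance-limited, here is a back-of-the-envelope estimate (my own, not a Dell EMC figure) of the latency one replication round trip adds to every write, assuming light travels through fibre at roughly 200,000 km/s and ignoring switching, protocol and array-processing overheads:

```python
# Rough estimate of the latency added per write by one synchronous
# replication round trip over fibre. Real figures will be higher, since
# protocols may need more than one round trip and equipment adds overhead.

SPEED_IN_FIBRE_KM_PER_MS = 200.0  # ~2/3 of the speed of light in vacuum

def added_round_trip_ms(distance_km: float, round_trips: int = 1) -> float:
    return round_trips * 2 * distance_km / SPEED_IN_FIBRE_KM_PER_MS

for km in (10, 50, 100):
    print(f"{km:>3} km -> at least {added_round_trip_ms(km):.2f} ms added per write")
# 10 km -> ~0.10 ms, 50 km -> ~0.50 ms, 100 km -> ~1.00 ms per round trip
```

Real-world latencies will be higher, but the order of magnitude helps explain why synchronous replication is typically constrained to metro distances of around 100 km.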
Attempting to explain SRDF in such a short overview does no justice to the product; almost every major mission-critical infrastructure or software provider (such as IBM for Power Systems, or Oracle for their database) supports SRDF and provides lengthy documentation about how to configure it.
I unfortunately do not have pricing figures at hand for the VMAX All Flash 250F and 950F, but the several Dell EMC representatives I've talked with unanimously said that the solution isn't nearly as expensive as it sounds, considering the value of SRDF, the proven resiliency of the VMAX platform and the fact that this is an all-flash system. Perhaps you should engage your Dell EMC representative and ask; you might be pleasantly surprised.
It would be hard to summarise in a single article what all Storage Field Day 14 delegates agreed was a very high-quality presentation session, so I highly recommend watching the VMAX-related video sessions recorded at Storage Field Day 14.
Max’s Opinion
The TCO of operating an infrastructure that serves mission-critical systems may appear high, but when compared with the cost of downtime or degraded performance, and with the reasonable expectation that the infrastructure is managed and maintained properly, the benefits usually far outweigh the CAPEX and OPEX. CIOs, VPs and board members can understand IT concerns about resilience, availability and survivability of mission-critical systems, but they will judge solutions primarily (if not exclusively) on cost versus benefit. The best of them will be very well aware of the mission-critical processes and systems in their organisation, while in other organisations that responsibility (or accountability) will fall more to business line heads or process owners.
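To make the cost-versus-benefit argument concrete, here is a deliberately simplistic sketch with entirely invented figures (the real numbers must come from your own BIA and your own quotes), comparing the downtime cost avoided by a more resilient solution with its extra annualised cost:

```python
# Hypothetical illustration: all figures are invented for the sake of argument.
# Compare the yearly downtime cost avoided by moving to a more resilient
# solution with that solution's extra annualised cost.

HOURS_PER_YEAR = 365 * 24

cost_per_hour_of_downtime = 250_000      # hypothetical figure from the BIA
baseline_availability = 0.999            # hypothetical: ~8.8 hours of downtime/year
resilient_availability = 0.99999         # hypothetical: ~5 minutes of downtime/year
extra_solution_cost_per_year = 400_000   # hypothetical annualised CAPEX + OPEX delta

def expected_downtime_cost(availability: float) -> float:
    return HOURS_PER_YEAR * (1 - availability) * cost_per_hour_of_downtime

avoided = expected_downtime_cost(baseline_availability) - expected_downtime_cost(resilient_availability)
print(f"Avoided downtime cost: ~{avoided:,.0f} per year "
      f"vs extra solution cost: {extra_solution_cost_per_year:,} per year")
```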
In all cases, the non-functional requirements listed above and the architectural decisions deriving from them should emanate from inputs captured in a BIA, and those inputs are essentially business-related. In my opinion, it is IT's responsibility to provide proper architectural guidance, design the best possible solution and highlight any potential adverse outcomes of alternative solutions, especially in terms of systems availability and survivability. It should also be clear to the business and senior management that IT designs against requirements derived from the business's own inputs (the BIA). In that sense, it is critical for designing an efficient solution that the business stakeholders who contribute to the BIA and sign it off have a proper understanding of the mission-critical system and of how its outage may impact enterprise-wide operations and processes. Otherwise, IT might design a solution that is adequate to the requirements stated in the BIA, but that does not reflect the reality of the mission-critical system's uptime requirements and of the operational impact in case of downtime.
Disclosure
This post is part of my SFD14 post series. I was invited to the Storage Field Day 14 event and to Commvault GO by Gestalt IT. Gestalt IT will cover expenses related to the events (travel, accommodation and food) for the duration of Commvault GO and SFD14. I will cover my own accommodation costs outside of the days when the events take place. I will not receive any compensation for participating in this event, nor am I obliged to blog or produce any kind of content. Any tweets, blog articles or any other form of content I may produce are the exclusive product of my interest in technology and my will to share information with my peers. I commit to sharing only my own point of view and analysis of the products and technologies I will see and hear about during this event.