This article was originally published in Czech in the September issue of IT Systems magazine. Text © Massimiliano Mortillaro – thanks to my colleague Tomáš Jirák for reviewing and proofing the original article in Czech.
Virtualization and business continuity?
The transformation of today’s datacenters, and the reign of nearly or fully virtualized architectures, has indisputably brought many benefits to organizations. If we set aside obvious advantages such as increased flexibility and CapEx/OpEx reductions in hardware investment and maintenance, virtualization also brought relatively affordable high-availability mechanisms as an integral part of virtualized infrastructures.
The system provisioning process (from initial customer request to handover) previously took several weeks (ordering hardware, physical installation, patching, OS installation, pen-testing, etc.). Virtualization contributed significantly to speeding up provisioning, thanks to the immediate allocation of compute, memory, storage and network resources, as well as the use of pre-built templates for various applications and systems. This sudden shortening of time-to-deliver resulted, sooner or later, in an almost uncontrolled growth of virtual machines: the business perceives that the flexibility and high density achieved through virtualization offer it nearly endless capacity.

This growth is also often caused by unreasonably high demands on system resources by application owners who do not always take into account the benefits of virtualization, such as physical-core-to-virtual-CPU overcommitment or improved memory management (often due to a lack of knowledge of, or confidence in, virtualization). This leads to capacity overprovisioning, which in turn wastes system resources. Another cause of uncontrolled growth lies in the extensive use of virtual servers for testing and development purposes (not always properly dimensioned, and often left running long after they are no longer needed, consuming additional system resources). Finally, the main reason for this uncontrolled proliferation of VMs is that in many companies, IT departments operate in a reactive mode and are too busy resolving operational problems. They lack the time, the resources and sometimes the tools necessary to gain a comprehensive overview of their environment, and they do not always know exactly which systems are business-critical and which are not.
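To make the overcommitment point concrete, here is a minimal sketch of the arithmetic in Python; the host size and per-VM requests are purely illustrative assumptions, not figures from any real environment.

```python
# Minimal sketch of the vCPU overcommitment arithmetic mentioned above.
# Host specs and per-VM requests are illustrative assumptions only.

physical_cores = 32          # cores available on one host
vms = [
    {"name": "app01",  "vcpus": 4},
    {"name": "app02",  "vcpus": 4},
    {"name": "db01",   "vcpus": 8},
    {"name": "test01", "vcpus": 8},   # oversized test VM
]

allocated_vcpus = sum(vm["vcpus"] for vm in vms)
overcommit_ratio = allocated_vcpus / physical_cores

print(f"Allocated vCPUs: {allocated_vcpus}")
print(f"Overcommit ratio: {overcommit_ratio:.2f}:1")
# Hypervisors time-slice vCPUs onto physical cores, so a ratio above
# 1:1 is normal; sizing every VM as if it owned dedicated cores wastes
# exactly the capacity that virtualization is supposed to recover.
```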
Another factor is that the adoption of virtualization was a progressive process. For many companies, the journey into the world of virtualization began when IT departments decided to virtualize some non-critical IT systems; virtualization was then extended to common applications. Nowadays, a majority of companies are virtualizing, or considering virtualizing, Tier 1 workloads (business-critical applications such as databases, ERP, CRM, etc.). This progressive adoption can lead to situations where priorities are not properly defined, whether in terms of the allocation of system resources or in how business continuity is ensured (for example, a test VM is backed up daily, while a virtualized production database server is not part of the backup plan).
For the reasons mentioned above, how to properly ensure business continuity of IT services was not always given due consideration: which VMs should be backed up, based on which criteria, which SLAs/SLOs should apply, and which servers can be sacrificed if necessary. These reasons are not only process-based: at the beginning of the virtualization era, there were almost no commercial solutions able to ensure business continuity for virtualized architectures, so the quality and efficiency of an implementation depended largely on the knowledge and skills of the IT team.
Ensuring business continuity, however, is a discipline that goes beyond IT systems, processes and people. It is a much larger discipline that encompasses the entire enterprise’s activity and aims to ensure the continuity of business operations; it requires a holistic view of the business as a whole. Let us therefore look at this issue in a broader context, especially with regard to ensuring business continuity with modern virtualization technologies.
Everyone is talking about it, but few are acting on it. This is how we could characterize the status of risk management, and of the supporting processes to restore functionality after the failure of an IT system, in many companies. Many people tend to confuse backup and recovery, or even disaster recovery, with business continuity. To be precise, backup and recovery is a technical process that aims to ensure the availability of business data in the event of damage to, or loss of, one or more IT systems. Disaster recovery is a business (non-IT) process that aims to ensure an acceptably quick recovery of business operations during unplanned downtime, including the recovery of the affected IT systems and the return to a normal state. It is a crisis response with short-term objectives.
Business continuity is the highest and most complex level, and it affects the strategy of the organization. It is a long-term view that aims to ensure the continuity of business activities from all points of view. Its parameters vary according to the industry in which the organization operates.
Let’s consider, for example, a manufacturing company whose factory is destroyed beyond repair by a fire. The Business Continuity Plan should determine whether it is possible to move manufacturing to another plant for a limited amount of time; it should also dictate whether it is economically viable to continue this business activity (by rebuilding the plant or relocating production lines permanently) or whether terminating the activity is preferable from an economic point of view. And we have not even touched on the topic of employees and their fate. Such decisions are certainly not within the competence of the IT department; the Business Continuity Plan therefore usually arises primarily from business requirements. Organizations that practice risk management are usually better prepared to implement business continuity.
But even these organizations cannot shout “Victory!”. The reason is that, in most cases, the measures considered relate to technical matters. Is the accounting system down for three days? Almost every company will deal with it somehow. But suffering an irremediable loss of corporate data? Virtually (pun intended) no company can manage such a catastrophic event without serious effects, often leading to the restriction or termination of operations.
Everything has its pros and cons
Without deeper investigation, one could state that virtualization brings only benefits. In fact, it is not so straightforward. With the consolidation of resources, the VM density per host also increases, so any incident may have – in comparison with a classical architecture – a much more serious impact on a larger number of systems and services. The risk also lies in the interdependence between services and subsystems. In physical architectures, each application relied on its own dedicated services (its database, its backup subsystem, etc.). With the advent of virtualization, several services and systems can be merged; for example, many applications typically store data on one or a few shared database servers. Plans for high availability must therefore take into account the order in which individual services are recovered according to their importance, also considering the interdependencies between systems and subsystems, as the sketch below illustrates. In larger operational architectures, administrators rarely have accurate information on the business requirements related to the availability of services. This is another factor that negatively impacts the proper implementation of business continuity; the remedy is accurate plans based on a business impact analysis. It is therefore also important to work with a partner who can ensure that high availability and business continuity are effectively designed.
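As an illustration of the dependency problem, the following sketch derives a recovery order from a service dependency graph using a topological sort; the service names and their dependencies are hypothetical.

```python
# Hypothetical illustration: derive a recovery order from service
# dependencies with a topological sort (Python 3.9+ standard library).
from graphlib import TopologicalSorter

# Each service maps to the services it depends on (assumed names).
dependencies = {
    "shared-db": set(),
    "auth":      {"shared-db"},
    "erp":       {"shared-db", "auth"},
    "crm":       {"shared-db", "auth"},
    "reporting": {"erp", "crm"},
}

# static_order() yields dependencies before their dependents,
# i.e. the order in which services must be brought back up.
recovery_order = list(TopologicalSorter(dependencies).static_order())
print(recovery_order)
# e.g. ['shared-db', 'auth', 'erp', 'crm', 'reporting']
```

A real plan would additionally weight this ordering by business priority, which is precisely the information a business impact analysis provides.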
It is essential to define spheres of responsibility, i.e. who is responsible for what in the company’s information architecture. While IT has traditionally been in charge of technical matters, including backup and disaster recovery, business continuity usually isn’t part of IT’s responsibilities, and it cannot be fully IT-owned. Business continuity is tied to a company’s top management and becomes de facto part of its strategic plans. It is also essential to take into account all the company’s business processes, as well as the responsible executives (outside of the IT department, of course). The IT department must of course communicate with the rest of the company; otherwise it will not have sufficient information on priorities and potential impacts, and in that case business continuity can be nothing but an elusive dream. Moreover, it is not just about IT: imagine, for example, that a plane carrying all the company’s executive board members to an international conference crashes at sea.
A proper business continuity plan must define contingencies, key people and key locations, and should even account for the permanent physical loss (i.e. death) of key staff members. It is therefore imperative that processes, recovery operations and even system credentials are properly documented and protected, to prevent the loss of information tied to the loss of a key staff member (such as the senior IT administrator who knew all the passwords but never documented them). Business continuity should have an owner (usually the Chief Information Officer, the Chief Risk Officer or the Chief Technical Services Officer), a CxO-level sponsor, one or more project managers and a virtual team that comprises lines of business (LOBs), process owners and IT.
Change in mindset and approach
Virtualization technologies have enabled a major shift from traditional architectures to a software-defined operating environment, usually supported by complex data centers. Individual parameters are defined by software and policies, rather than being set manually and pinned to a specific hardware component.
The question of further developments is also interesting. Please bear in mind that this describes the situation in the Czech Republic – in some countries it is already happening! In the future, it will all be about automation, with the necessary cooperation and preparation from the IT department. Requirements for the operating environment will be specified by the business, based on business needs (resources, priority, availability, SLAs/SLOs, RPO/RTO), and IT departments will provide detailed feedback on the financial impact for the business, and perhaps on further steps to reconcile discrepancies between a LOB’s technical requirements and its financial possibilities. In practice, although this comprehensive approach and a high level of automation are not yet widely used, this is a trend that supports an efficient implementation of business continuity in virtualized environments – as well as in cloud environments – and it will keep growing. A rough sketch of what such policy-driven requests could look like follows below.
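As a closing illustration, the hypothetical sketch below shows how business-stated recovery objectives (RPO/RTO) could be mapped automatically to an availability tier; the tier names, thresholds and fields are assumptions made for this example, not any vendor’s actual API.

```python
# Hypothetical sketch of "policies instead of hand-pinned hardware":
# the business states its requirements, and tooling maps them to a
# service tier. Tier names, thresholds, and fields are assumptions.
from dataclasses import dataclass

@dataclass
class WorkloadRequest:
    name: str
    vcpus: int
    memory_gb: int
    rpo_minutes: int   # maximum tolerable data loss
    rto_minutes: int   # maximum tolerable downtime

def assign_tier(req: WorkloadRequest) -> str:
    """Map recovery objectives to an (assumed) availability tier."""
    if req.rpo_minutes <= 15 and req.rto_minutes <= 60:
        return "tier-1: synchronous replication + automated failover"
    if req.rpo_minutes <= 240:
        return "tier-2: frequent backups + standby capacity"
    return "tier-3: daily backups, best-effort restore"

request = WorkloadRequest("erp-prod", vcpus=8, memory_gb=64,
                          rpo_minutes=15, rto_minutes=60)
print(assign_tier(request))
# tier-1: synchronous replication + automated failover
```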