Executive Report

Business Continuity Planning

Posted December 31, 2002 | Leadership |

"Anything that can go wrong will go wrong."

 

-- Murphy's Law

If you live in California or Alaska, you are well aware that on any given day the earth beneath your feet may suddenly and violently shift. If you live in the Great Plains of the US, dealing with severe thunderstorms and the occasional tornado is a given. If you live on the Gulf of Mexico or Southeast Coast of the US, you know that late summer and early fall may bring with it the heavy rain and high winds of a hurricane. Since we all know that Murphy (of Murphy's Law) is indeed alive and well, we know that unexpected events can impact our lives and our business processes, often in unpredictable and undesirable ways. Attempting to anticipate events that would impact our organization and trying to limit or eliminate the loss that would result from such events is known as business continuity planning (BCP).

In a larger sense, BCP is an outgrowth of the traditional disaster recovery planning and contingency planning processes that conscientious organizations have always done. Instead of limiting the planning process to the requirements needed to recover from a major disaster, BCP broadens the scope to include a wide range of additional issues such as asset inventories, disaster mitigation, and avoidance and disaster awareness training. Regardless of whether you call it business continuity planning, contingency planning, or disaster recovery planning (DRP), the process can generally be thought of as having four basic phases:

  • Phase one: project initiation. The business case for the BCP effort is prepared, and the BCP team is assembled. Existing business continuity plans, disaster recovery plans, contingency plans, and security plans are reviewed.
     
  • Phase two: business impact analysis (BIA). The key business processes and their supporting enterprise assets are identified and prioritized, various failure scenarios are examined, and an impact analysis is performed on the key business processes that would be affected.
     
  • Phase three: contingency planning. Alternative business processes are identified for each significant business process at risk, and a cost-benefit analysis is performed. Plans are then drafted for contingency operations and the eventual return to normal operations. Risk avoidance and mitigation plans are also created.
     
  • Phase four: business continuity management. Contingency and recovery plans are tested, and their results are evaluated for key business processes. Any deficiencies are noted, and the contingency plans are updated and revised. Periodic reviews of the business continuity plan are also done to ensure that it remains current. Awareness training and contingency training are done periodically to keep personnel up to date and ready for potential emergencies.
     

Various BCP methodologies may define the phases differently and shuffle the tasks around somewhat between the phases; in this Executive Report, we present a fair overview of the entire process.

We expect that the average reader is familiar with disaster planning and disaster recovery topics. You most likely work for an enterprise that already has an existing business continuity plan (or at least significant disaster recovery plans and other major portions of a business continuity plan) in place. We hope that this report will provide you with some food for thought for the next update of your plan or that it will prompt you to improve the plan by adding sections or processes that you may have overlooked. If it does nothing more than prompt you to dust off a copy of your last plan and look it over to ensure that it is still valid, it will likely have accomplished its purpose.

WHY DO BUSINESS CONTINUITY PLANNING?

The BCP times are a-changin' -- that's the overwhelming impression you are likely to get if you do any browsing of recent articles or books on the subject. The big change, of course, is linked to one eventful date: September 11. Current writings on the subject would have you believe that the tragic events of September 11 have dramatically raised organizational awareness of the need for robust business continuity plans. The attacks on New York and Washington, DC, are said to have caused nearly all organizations to reexamine their BCP needs and to seriously consider the possibility of a previously unknown type of catastrophe: terrorism. Despite the obvious attention focused on the need for robust business continuity plans due to the attacks, many organizations haven't followed through with the vigorous BCP efforts that were widely predicted. With few exceptions, consulting organizations and other companies marketing disaster recovery solutions have been generally underwhelmed with new business opportunities in the months immediately following September 11.

There are two likely reasons for the lack of follow-through on BCP efforts: the economy and the "500-year-flood" effect. The first reason that BCP efforts have been sluggish can most likely be attributed to the slow US and world economy. As with training, consulting, and other non-revenue-generating activities, planning efforts of all types are typically curtailed during slow economic times. This includes BCP. The other reason BCP efforts haven't picked up may oddly be because of the September 11 attacks: their magnitude and severity may have led many to believe they are very rare events (somewhat akin to a 500-year flood) or that the threat of political terrorism is isolated to overseas or high-visibility targets such as New York or Washington, DC. Organizations that were not directly impacted by the attacks may have concluded that, since they were unaffected this time, they likely would not be affected by similar attacks in the future.

Geopolitical issues aside, the importance of BCP and DRP has not changed in recent years: it has always been an essential component of a robust enterprise's portfolio. Furthermore, acts of terrorism (which, in reality, are just politically motivated acts of sabotage) are also nothing new: disgruntled employees have always been a significant consideration in disaster planning. Although the magnitude of loss caused by the September 11 attacks was certainly unprecedented, it hasn't altered any of the fundamental reasons for doing BCP.

So why should enterprises develop (or revise) their business continuity plans? Though the reasons for doing BCP and some of its economic justifications are similar to those for catastrophic loss insurance, they are not precisely the same. The purpose of most types of insurance is to receive monetary compensation for some loss, such as the destruction of inventory by a flood or the loss of a key individual. And although monetary compensation may recover some of the costs associated with a business disaster, it does nothing to ensure that the organization can continue to do business: if products can't be shipped for several weeks due to the loss of inventory, the company is not likely to survive. Insurance may cover the monetary cost of a loss, but BCP aims to "insure" that the enterprise survives the disaster. One thing that insurance coverage and business continuity plans share is that they increase the value of an organization by reducing its exposure to risk and by increasing confidence that the organization can survive what would otherwise be a crippling blow.

There may be legal mandates that require your organization to have a continuity plan. Healthcare organizations and engineering firms developing nuclear reactors are just two examples of organizations that are mandated to have recovery plans for the data they create.

BCP and Y2K

Many small and medium-sized organizations got their first real taste of BCP in the waning years of the last decade. Y2K forced nearly every enterprise on the planet to evaluate its readiness and to draw up continuity and recovery plans for all manner of potential business interruptions. Since the extent to which Y2K would impact computerized systems was unknown, organizations were forced to plan for events they normally would never have considered -- from the loss of power grids to riots and civil unrest.

And even though most of those Y2K plans were never implemented, the exercise served up many valuable lessons for BCP. Y2K forced organizations to inventory all of their systems and other physical assets; prior to Y2K, most organizations only had a hazy idea of what assets they had or how they might fail. Organizations were also forced to think not in terms of recovering from a particular disaster, but in terms of what was required to maintain minimum operational ability. It forced organizations to think about disaster planning beyond the traditional backup and recovery of data, and it made clear to most that BCP was all about business processes, not just IT processes.

Though it was an extraordinarily valuable exercise for organizations, Y2K planning was quite different from conventional BCP. With Y2K, we knew exactly when the disaster was likely to strike (on 1 January 2000 and a few other critical dates) but had no idea what type of disaster was likely to strike or how much impact it might have. With conventional BCP, the situation is reversed: we can often anticipate the "what," but we know very little about the "when."

PHASE ONE: PROJECT INITIATION

In the project initiation phase, a business case justifying the BCP effort is prepared, and the BCP team is assembled. The project plan and project charter are created. Existing business continuity plans, disaster recovery plans, contingency plans, and security plans are reviewed.

Preparing the Business Case

With any significant BCP effort, a business case should be prepared. This is done principally for two reasons: to justify the expenditure of time and resources needed to complete the project and to secure the backing of one or more executive sponsors. According to Angela Robinson, former chairperson of the Business Continuity Institute (BCI), key issues to consider when preparing your business case are [7]:

  • External influences, such as regulations, legislation, and customer requirements
     
  • Losses: hard costs, such as buildings, equipment, revenue, and increased charges; and soft costs, such as corporate image, goodwill, and market share
     
  • The impacts of a backlog of work
     
  • Avoidance of scare-mongering
     
  • Use of methods relevant to your audience (e.g., a guided tour to point out threats and vulnerabilities)
     
  • Believable and relevant arguments, substantiated with hard facts
     
  • Accurate statistics from a reputable source
     
  • Timing (e.g., just prior to an annual audit, when executives are likely to be most receptive)
     
  • Use of a project sponsor to gain access to the senior managers/board
     
Justifying the BCP Project

As mentioned earlier, the reasons for doing BCP and its justifications are similar in many ways to those required to justify most types of insurance, but they are not precisely the same. The purpose of insurance is to receive monetary compensation to make up for some economic or human loss. Thus, its justification is relatively easy to compute: the return on investment (ROI) for insurance can be calculated by considering the replacement value of the asset versus the likelihood of its loss.

Justification for developing business continuity plans is quite different. To begin with, the objective of the continuity plan is to minimize loss, not replace it. Hence, the value of the plan is rather soft: the cost of the plan versus the potential cost of a loss without the plan. Another objective is to minimize the downtime caused by an event. The purely economic justifications for "minimizing loss" and "minimizing downtime" are even harder to compute -- how much can we justify spending (real dollars) to ensure that a loss does not occur (no dollars)?

Consider the following as an example: according to statistics published by the US Geological Survey (USGS), any given location in my area (metropolitan Kansas City, Missouri, USA) has a statistical probability of having a tornado of "significant strength" pass over it once every 5,000 years on average [4]. Assuming that I want my business to persist in one location for five decades (which is unrealistically high), that means the likelihood of a tornado damaging my facility is about one in 100 over the lifetime of the facility. If I move my facility to the heart of Tornado Alley (which is further west and south), the probability is still only about one in 40. Even assuming that a damaging tornado would result in a complete loss (which isn't usually the case), it is difficult to make an economic case for a disaster recovery plan based on a direct hit by a damaging tornado. And yet nearly every organization of any size in this part of the country has a disaster recovery plan in the (apparently quite unlikely) event of a tornado. Why is this?
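The arithmetic above can be sketched in a few lines. The 5,000-year return period is the USGS figure cited in the text; the 2,000-year return period for the heart of Tornado Alley is an illustrative assumption chosen to reproduce the one-in-40 result, not a published statistic:

```python
def cumulative_probability(return_period_years: float, horizon_years: int) -> float:
    """Chance of at least one event during the horizon, treating each
    year as an independent trial with annual probability 1/return_period."""
    annual_p = 1.0 / return_period_years
    return 1.0 - (1.0 - annual_p) ** horizon_years

# USGS-cited return period for a significant tornado over one location:
print(cumulative_probability(5000, 50))  # roughly 0.01, i.e., about 1 in 100

# Assumed (illustrative) return period in the heart of Tornado Alley:
print(cumulative_probability(2000, 50))  # roughly 0.025, i.e., about 1 in 40
```

For rare events over a 50-year horizon, the compounded probability is close to the simple approximation of horizon divided by return period, which is why the one-in-100 shorthand in the text holds up.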

When developing the business case for a BCP effort, one must look beyond simple ROI for other more relevant justifications. A balanced scorecard-like approach can be more useful, balancing the needs of financial performance with other needs, such as fiduciary responsibilities, customer service needs, legislative mandates, and business vision. In part, the reason that organizations in the Great Plains have a tornado disaster recovery plan is that everyone expects them to have one; not having a plan would result in not only a loss in the unlikely event of a tornado, but also in significant legal liability above and beyond the cost of replacing any damaged property. The biggest reason that we have tornado disaster recovery plans is because organizations are expected to survive a tornado.

Another factor that can help justify the planning effort is that plans don't necessarily have to address specific disasters. Most recovery plans can be generic, in that the actual cause of the disaster is largely irrelevant. Plans for recovering from the loss of a facility may be applicable regardless of whether the facility is destroyed due to a fire, flood, tornado, or act of terrorism. Therefore, it may be possible to justify some disaster planning based on generic disasters, rather than trying to justify the cost of planning based on any single unlikely event. Disaster mitigation planning, however, is less likely to be generic: steps taken to mitigate damage from a tornado, for instance, may be different than steps taken to mitigate damage from a hurricane.

Executive Sponsorship

An essential component required for any BCP effort is executive management support. This includes, but is not limited to, the CFO of the enterprise. The business continuity plan is, by definition, an enterprise-wide plan, not a departmental plan. (Although departmental contingency plans can certainly be developed under the larger framework of an overall business continuity plan, there would be no purpose served in having individual departmental plans that ensure the continuance of one department even if the remainder of the organization does not survive.) Therefore, the CFO is an ideal executive sponsor for the BCP project and is certainly required in any case as a supporter and as a subject matter expert (SME) in evaluating the enterprise-wide financial impact of any given planning scenario.

Proponents of sound BCP efforts may face an uphill battle. In addition to the economic conditions and apathy factors discussed earlier, many organizations still believe that BCP is not an enterprise issue, but rather a business unit or IT function. According to a recent Ernst & Young survey of 459 CIOs and IT directors, 29% of survey respondents say BCP is a business unit expenditure, 45% of respondents say their disaster planning expense is borne by the IT budget, and 49% of respondents report their disaster plans have actually been tested.

The BCP Team

As with all other types of enterprise-wide planning efforts, BCP must be a team effort -- it cannot originate simply within IT or any other single department. In addition to executive management support and participation, the team must also have SMEs versed in the analysis of business processes, IT, and project management. SMEs from other technical business areas can be co-opted as temporary members as required, but the core team should be kept small to keep it effective and agile and to minimize political posturing. Large teams are more difficult to schedule and find meeting space for and are more likely to get involved in political gamesmanship.

Perhaps most important, the core team must be composed of people who command great respect within the organization and who are capable of making executive decisions (or at least obtaining buy-in from executives on matters that benefit one department to the detriment of another). They must have the authority and the vision to develop plans that will ensure the survival of the organization in the event of a major catastrophe. This is by its very nature a controversial proposition: ensuring the survival of the organization does not necessarily mean the survival of all departments within the organization or even all business functions. Departments on the periphery of mission-critical systems, or those whose functions the organization can do without for a significant period of time, are likely to put a great deal of political pressure on plans that leave them out of the mainstream. The team must be able to make such difficult decisions and be able to successfully defend them to executive management.

Project Plans and Project Charter

Once the team is assembled, the project plan and project charter are created and presented to the executive sponsor for sign-off. The plan will detail the overall project schedule and deliverables. The project charter will detail the assumptions under which the project will be developed, the roles and responsibilities of team members and other participants in the project, the success criteria, and the metrics under which the project will be measured and tracked.

Keep in mind that, since this is an enterprise-wide project, its management will require the coordination of multiple departments. By definition, this is much more of a program management effort than a simple project management effort. If your organization has a program management office (PMO) to coordinate projects that cross departmental boundaries, the BCP effort should be managed through the PMO.

Review Existing Continuity Plans

Once established, the team should conduct a detailed review of any existing business continuity plans, disaster recovery plans, contingency plans, and security plans to get an overview of the current BCP state. The state of the current business continuity plan, along with the vision of the organization and the executive management, will dictate the scope of the current BCP project. This is also a good time to dust off the old Y2K plan (if it hasn't been reviewed since "the end of the world as we know it" didn't happen) to see what contribution it might be able to make to the current BCP.

PHASE TWO: BIA

Identification of Critical Systems

The first part of the BIA stage is the identification of essential business systems. To establish what portions of the business should be continued, the BCP team must develop an inventory of essential (frequently referred to as mission-critical) business systems. Once identified, each system is prioritized for criticality. A determination is made for each system concerning how much downtime is acceptable (or conversely, what percentage of uptime availability must be maintained).

It is important to remember that, at this stage, the BCP team is not yet identifying failure modes or contingency procedures, it is simply building an inventory of critical business systems (not simply IT systems, but business systems). An essential business system is one that, if it failed, would cripple the organization or render it incapable of carrying out its principal mission. For a retail organization, for instance, the principal mission or objective is to achieve a sale to the consumer; for a manufacturing organization, the principal mission is to deliver a product to the customer. For a financial institution, such as a bank, a principal mission is to make loans. An extended failure to accomplish these missions would result in the demise of the organization.

As a general rule of thumb, you will find that the business processes that are used most often are usually the ones that are most important to the organization. Business processes that are performed multiple times per day are likely to be more important than those used once a month or once a year.

Follow the Supply Chain

Though many contingency planning efforts begin and end with IT systems, it is crucial in BCP efforts to ensure that the entire supply chain for the organization is protected. Organizations exist because they derive revenue from products or services they provide; hence, any rational BCP effort has to revolve around the survival of the products and services provided by the organization.

Unfortunately, when developing an inventory of essential business systems, it is quite easy to become distracted by a "silo" mentality. Naturally, each department or business unit within the organization knows its own critical systems and its own critical functions, and it is sometimes tempting to inventory essential systems on a department-by-department basis. This strategy is flawed, however, in that it tends to suboptimize the business contingency plan. Plans created in this manner tend to consider only the survival and recovery of components of the organization and may not consider whether the entire organization will survive an interruption event.

It can also be tempting to inventory the organization's essential business systems based on a particular major vendor and not on the business functionality. It is not particularly helpful, for instance, to identify PeopleSoft or SAP as critical systems; it is much more appropriate to identify the business functions that those packages perform for the enterprise, such as payroll, general ledger, or human resources.

A much more appropriate method for discovering key business processes is to examine the enterprise's supply chains (sometimes known as business functions or lines of business). A supply chain is a beginning-to-end process that accomplishes one or more key business activities. Each supply chain usually begins and ends outside the organization and touches several internal departments during the process (see Figure 1). Developing a business systems inventory in this manner will be critical to the continuity of the organization, although it may seem counterintuitive to observers inside individual departments. It may also generate heated political battles; assessing criticality by supply chain may place a high value on systems that a department may consider relatively unimportant and may place a low value on systems that a department may consider critical to its operation (see Figure 2). This is not to say that critical departmental systems are unimportant to the BCP process; however, it does indicate that their priority to the enterprise is less critical than other departmental systems central to the key business activity. Hence, contingency plans may still be developed for critical departmental systems, but only after contingency plans for systems involved in key supply chain activities.

Figure 1 -- A supply chain (partial chain example).

Figure 2 -- Prioritizing systems by department can endanger supply chains. Although each department may prioritize its systems based on its own needs, a failure in a noncritical system in one department may disrupt the flow of one or more critical lines of business.

Extending the Supply Chain

As organizations more tightly integrate their processes into multi-company supply chains, the scope of what is traditionally thought of as BCP increasingly needs to expand. In the past, survival of your own organization was typically the principal objective of the BCP effort. Today, your business continuity plan is likely to include (or at least be coordinated with) continuity plans for your largest customers and your largest suppliers. Organizations have rightly come to recognize that dependence on a small set of critical vendors (on the supply side) or a small set of large customers (on the client side) is a risky proposition. Traditionally, organizations have tried to limit their risk in this area by diversification. In many cases, diversification is simply not possible; depending on the nature of the business, only a few critical vendors or large customers may exist. (How many alternatives are there, for instance, for vendors of desktop computer operating systems? How many large customers are there if your business is creating CPU chips?) This is particularly true for a certain class of vendors that are indispensable to every organization -- utility vendors. Providers of electricity, water, sewer, and communication facilities are often government-sanctioned monopolies within a given geographic area, and no alternative vendors are even available. Thus, it is entirely appropriate to approach those organizations to request a review of their disaster recovery plans and to participate whenever possible in helping them remain operational.

Broadening the scope of your continuity plan to extend outside the organization can present some interesting opportunities and may introduce some thorny political problems. It may be possible, for example, to partner with some of your vendors and/or customers to mitigate risk by sharing backup facilities. Depending on your organization's position and importance in the supply chain, you may find that as a major customer you have additional financial leverage with your vendors; it may be to their advantage to assist your organization in funding your BCP effort or funding some portions of your disaster recovery or contingency operations. The flip side of this opportunity is more problematic: you may find yourself being asked to partner with your competitors in order to mitigate risks for major customers you share.

Essential Systems Inventory

The various organizational assets that support each essential business system or function (e.g., achieve retail sales, manufacture goods, or make loans) should be inventoried. IT systems (both hardware and software) that support the functions must of course be inventoried, as well as any other additional equipment and facilities that are needed to support the function.

The inventory may be documented in a variety of ways. The preferred method would be a database; however, very small organizations could probably make do with something simpler if a database were beyond their means. If your organization has existed for a few years, you can probably resurrect the systems inventory database created for your Y2K effort and bring it up to date (if it hasn't been maintained).
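As a sketch of the database approach, a minimal inventory table might look like the following. The column names, criticality scale, and sample row are illustrative assumptions, not a prescribed schema:

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # a real inventory would use a persistent file
conn.execute("""
    CREATE TABLE systems_inventory (
        system_name        TEXT PRIMARY KEY,
        business_function  TEXT,     -- e.g., payroll, order entry
        supply_chain       TEXT,     -- the line of business it supports
        owner_department   TEXT,
        criticality        INTEGER,  -- 1 = mission-critical ... 5 = deferrable
        max_downtime_hours REAL
    )""")
conn.execute(
    "INSERT INTO systems_inventory VALUES (?, ?, ?, ?, ?, ?)",
    ("order-entry", "order entry", "retail sales", "Sales", 1, 4.0))

# Pull the mission-critical systems for contingency-planning review:
critical = conn.execute(
    "SELECT system_name FROM systems_inventory WHERE criticality = 1").fetchall()
print(critical)  # [('order-entry',)]
```

Tying each system to a supply chain as well as an owning department supports the cross-departmental prioritization discussed earlier.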

Each essential business system component should be evaluated (and the results documented in the inventory) for its uptime availability requirements. We know that certain system components are highly critical and must be available a high percentage of the time. (Quality wonks often have a habit of referring to uptime availability as a function of how many "nines" are required: three nines equate to 99.9% uptime, five nines equate to 99.999% uptime, and so on.) Some systems have a variable criticality: a monthly billing system may be able to tolerate some downtime between billing cycles but must have high availability during billing periods. Some systems can be offline for hours or days without significantly affecting essential business processes. In some cases, it may be useful to document both uptime availability and permissible downtime. A system may have a maximum permissible downtime of five minutes; however, a downtime of five minutes once every month versus five minutes once every hour results in a huge difference in system availability -- 99.99% versus 91.67%.
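The availability figures in the preceding paragraph can be checked with a few lines of arithmetic (a 30-day month is assumed):

```python
def availability(downtime_minutes: float, period_minutes: float) -> float:
    """Fraction of the period during which the system is up."""
    return 1.0 - downtime_minutes / period_minutes

MINUTES_PER_HOUR = 60.0
MINUTES_PER_MONTH = 30 * 24 * 60.0  # 43,200 minutes in a 30-day month

# Five minutes of downtime once a month vs. once an hour:
print(f"{availability(5, MINUTES_PER_MONTH):.4%}")  # 99.9884%, i.e., ~99.99%
print(f"{availability(5, MINUTES_PER_HOUR):.4%}")   # 91.6667%, i.e., ~91.67%
```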

I spoke with a former colleague of mine, Jim Baird, manager of systems services for BV Solutions Group, Inc. (BVSG), a division of the Kansas City-based Black & Veatch engineering firm. Baird is a BCP expert and has developed numerous plans, not only for BVSG's parent organization, a US $2-billion-per-year global engineering firm, but also for other organizations in both the public and private sectors. Baird observes that when doing a BIA, most companies discover that five days is the longest they can afford to have their doors closed. Baird explains, "Five days really seems to be the rule of thumb these days, and that interval is not going to get longer in the future; it is going to shrink." He adds, "If you don't have your systems operational after five days; if you don't have your phones working again so you can take orders; if you don't have your people present to reload operating systems and restore data, then you are probably done as a company."

The Human Factor

One critical component of the asset inventory that is often overlooked is the human resource: the people that are required to operate the essential business systems. It makes little sense to identify key business processes without also identifying the personnel necessary to execute the processes. Every organization has an implicit understanding of the critical people necessary to maintain the health and welfare of the enterprise. A successful business continuity plan requires that those people be explicitly identified and their roles and responsibilities clearly delineated. It further requires that key people know their roles and responsibilities in the event of a disaster. Waiting until after a disaster has occurred is not the time for an organization to discover that its key people don't know where they are supposed to be or what they are supposed to be doing.

We seem to be in an era when many organizations believe that employees are easily replaced commodities; it is no wonder that so little thought is often given to what personnel will be required to execute a business continuity plan or what the needs of the individual employee may be following a major disaster. Identifying key people; training key people; keeping plans in locations where key people can easily find them; and ensuring that the health, welfare, and security needs of key people are met during a major disaster is an element of successful business continuity that simply cannot be overlooked.

In an article for globalcontinuity.com, Keith Miller writes [6]:

Many organizations have invested time and resources producing comprehensive and detailed continuity plans only to find those who wrote them only understand them! Why do so many prove ineffective when they have to be implemented? What has caused the employees, who should be the core of any plan, to become disconnected from them?

Continuity plans are a necessary part of an organization's ability to successfully deal with situations that are not normal. They tend to fall into three main categories:

  1. Those that are so detailed that they are incomprehensible to those suddenly expected to implement them; not flexible enough to deal with rapidly developing situations; and not "bought into" by those affected by them.
     
  2. Those that are a paper exercise. They have not been based on realistic solutions, real sourced resources, or effectively validated.
     
  3. Those that have been developed with the people who are affected by them, bought into by them, rehearsed by them, and are a living process.
     

Baird has the following to offer:

Up until a few years ago, disaster recovery was looked at as mostly an IT problem. My background is technology -- computer systems and IT operations -- but I have learned, in my years of doing business continuity and disaster recovery planning, that a business continuity plan is worthless without the people aspect. The technology side of the plan is the easy part; the hard part is the people side of it. You may have a plan for your IT operations, but do you have plans to deal with 50% absenteeism after a moderate to severe disaster? You may have a recovery plan, but if you don't have people to implement it, you are just as out of business as if you would have been without a data center.

Large organizations generally understand the importance of their people to a business continuity plan. It is even more critical that small and medium-sized businesses understand the impact of people on a business contingency plan. If two people in a five-person business can't get to work because of an ice storm, for instance, that means that 40% of your workforce is unavailable. Baird says, "Small businesses have to understand just what it is that each of those people provide the organization that it absolutely has to have to survive a disaster. Two people in a car going to lunch involved in an accident can shut down the organization, as can an outbreak of the flu where two or three people are home sick, as can people having to stay home with their kids because school has been cancelled for a couple of days due to a snowstorm. All these minor problems can have the same major effect on a small business." You have to understand the organization's exposure to loss of personal knowledge, operational expertise, contractual obligations, and so on to be able to effectively plan around the loss of key people.

Inventory

Another key factor in the discovery of critical business systems involves the product and raw materials inventory for your organization. Since just-in-time manufacturing techniques have become commonplace, having significant stockpiles of inventory to survive a disruption in the flow of goods has become largely a thing of the past. And as we saw numerous times in recent years with automobile manufacturers, a stoppage in the flow of parts from one small manufacturer can bring the entire manufacturing process to a complete halt within a few hours or days. Although short-term outages can be limited by increasing inventory levels in anticipation of an outage, increasing those levels will also affect ordering, cash flow, handling costs, and floor space.

Failure Modes

The impact of catastrophic failures is strongly correlated with the inverse of the frequency with which they are likely to occur. For example, failures that are likely to occur most frequently (i.e., less time between failures) typically have little impact on systems; failures that occur only rarely (i.e., more time between failures) can have a huge impact (see Figure 3). Broadly, there are three basic categories of disasters to consider in contingency/disaster planning: natural disasters, intentional damage, and accidental damage. For the business continuity plan, the team should evaluate the possibility of each of these types of failures and rate the level of business disruption each would be likely to cause.

Figure 3

Figure 3 -- Relationship of frequency versus cost of failures (disasters).

Natural Disasters

Although we may know with some probability where a natural disaster may strike, we don't usually have a clue as to when it might occur. Nor can we know the severity of the event: will it be a minor annoyance or a major catastrophe?

USGS categorizes a variety of natural disasters and publishes maps showing the relative risks for different areas of the country (see [4] to view maps). Keep in mind that there are numerous other hazards such as blizzards, ice storms, hail, epidemics, and drought that are not included in this set of maps. And fires, for instance, resulting from natural disasters, accidents, or sabotage may occur anywhere at any time. Depending on where your facilities are located, you will likely have one or more natural disasters that pose a reasonable threat to your organization. You may also wish to consider natural disasters that pose a threat to the major facilities of your significant vendors and customers as well, since a disruption of those would impact vital supply chains for your organization.

As opposed to most other types of damage that can occur, natural disasters have a disturbing habit of affecting not only your facilities, but also the infrastructures near your facilities. Your building may suffer damage; the utilities providing power, water, and communications ability to your building may be damaged or lost; the roads providing access to your buildings may be impassible; and your personnel may be killed, injured, or otherwise occupied with damage to their homes or loved ones.

Additional statistics published by the US Federal Emergency Management Agency (FEMA) show a rather disturbing trend for natural disasters (see Figure 4 and Table 1). For the period of 1976 to 2001, the number of officially declared FEMA disasters has shown a consistent increase, not just in the cost of declared disasters but in the number of declared disasters. An average of 28.4 disasters were declared per year for the first five years of the period, while the average for the most recent five years is more than 75% higher at 49.8 per year. The causes for this increase in number of declared disasters are not entirely understood but, in part, are surely related to increasing population density.
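The percentage quoted above is easy to verify from the two five-year averages:

```python
# Quick check of the disaster-trend figures quoted above.
early_avg = 28.4   # declared disasters/year, first five years of 1976-2001
recent_avg = 49.8  # declared disasters/year, most recent five years

increase = (recent_avg - early_avg) / early_avg
print(f"Increase in declared disasters: {increase:.1%}")  # roughly 75%
```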

Figure 4

Figure 4 -- Disaster declarations from 1976-2001. (Source: US Federal Emergency Management Agency [FEMA].)

Intentional Damage
 

Table 1 -- FEMA Disaster Costs (in US Dollars)

Year   Declared Disasters   FEMA Funding
1990   38                   $434.1 million
1991   43                   $548.6 million
1992   45                   $2.793 billion
1993   32                   $1.877 billion
1994   36                   $8.221 billion
1995   32                   $1.592 billion
1996   75                   $2.429 billion
1997   44                   $1.914 billion
1998   65                   $4.193 billion
1999   50                   $1.395 billion

Though September 11 may have raised the bar on costs associated with sabotage, malicious acts intended to cause damage to an organization have long been a business contingency factor. And although terrorism may be getting the most press these days, sabotage caused by disgruntled employees, dissatisfied customers, labor unions, organized crime, or even deranged individuals is always a possibility. This is apt to be even truer during a down economy, as there are usually more disgruntled employees and dissatisfied customers during those tough times.

It is important to note that with respect to damage caused by disgruntled employees, those who have been laid off or terminated may be less of a risk than those who still work for the organization. We may have very effective security methods to keep people who no longer work for the organization from damaging our vital assets; however, techniques for preventing or limiting damage that can be caused by people who are supposed to be working with those corporate assets are much more problematic. Although dividing potentially disruptive tasks among two or more employees can help (forcing people to conspire to cause damage or theft is a time-proven loss-prevention method), it is often impractical in a smaller organization or for some security-related functions.

In addition to the risks associated with actual damage, a simple threat of damage can be extremely disruptive. You will recall that following in the footsteps of September 11 came an entirely different assault through the US Postal Service: anthrax. The threat of bioterrorism is likely to remain a concern to many for the foreseeable future. And although the likelihood of an actual bioterrorist attack can be debated (even suicidal fanatics know that releasing a disease or toxin that is likely to affect everyone on the planet isn't particularly useful to their cause), one scenario far more likely to occur is that of a terrorist hoax. One of the contingency plans that must be considered these days is that of a forced evacuation of a building or an entire area in the event that some suspicious substance is discovered. You may already have plans to evacuate a facility for a couple of hours to investigate a bomb threat, but do you have a plan in place that would permit your business to continue if your main location had to be evacuated for 48 hours to determine whether that white powder discovered in the mailroom was anthrax or baby powder?

Accidental Damage

The late Robert H. Courtney Jr., a former security chief at IBM and one of the industry's first gurus in computer security and computer crime prevention, was fond of saying, "People making mistakes are going to remain our single biggest security problem. Crooks can never, ever catch up." Nor can natural disasters. Accidents caused by people -- although typically more localized than natural disasters -- comprise the bulk of incidents that cause loss of productivity and threats to business continuity (more on that in a bit). From coffee spilled in the wrong place to dropped laptops to hazardous material spills to fiber-optic cuts, accidents can interrupt both facilities and infrastructure and can kill or injure key personnel.

One of the largest factors in environmental damage is one of the most ubiquitous substances on the planet: water. Water is far and away the most likely environmental factor to cause damage to computer systems, inventory, and archives. With computer systems, this is often due to our propensity to place data centers either on the top floor of the building (where the roof can leak) or in the basement (where all the water naturally drains in the event of a flood). Fire suppression systems are also surprisingly likely to cause water damage even if there is no fire in the immediate area: leaks are a problem, as is dripping condensation from cold pipes on hot, humid days. (Courtney used to point out that one of the most cost-effective solutions to this problem was a $2 roll of plastic wrap stored close to the computer. If there is a sudden water leak, just throw some plastic over the computer until the leak can be stopped.)

I am continually amazed when visiting data centers, warehouses, and archival storage facilities at the number that still have water sprinkler systems for fire suppression. One archival facility I visited recently was typical: it belonged to a Fortune 50 company whose irreplaceable corporate documents and backup tapes were stored several levels underground (no problem so far, right?). The archive, however, was immediately adjacent to (and well below the water level of) a major body of water, and the sprinkler system that predated the use of the space as a corporate archive was still in place and active. Although I happened to be touring the location for an entirely different project, I did suggest to the director that they consider a liquid-free fire suppression system or at least keep files in big waterproof plastic bags inside the file cabinets.

Another point worth mentioning here is that intentional damage may be exacerbated by accidental or at least unintentional acts. The perfect example is a destructive virus or worm that is innocently opened and unleashed by an unsuspecting end user. Though training, education, airtight firewalls, workstations without a diskette or CD drive, and fanatic e-mail screening methods can certainly help, they can't entirely prevent such problems. It has been observed that one can't make anything foolproof, because fools are too ingenious. One organization I worked with lost its e-mail capability for several days, not because of a worm that snuck through its filters, but because of one attached to a personal Hotmail message opened by one of the senior executives.

Utility loss is another big cause of lost time and lost data. Although many people tend to think of utilities in terms of just electricity or natural gas, water and sewer service may be just as important. If water is needed in order to keep the building air conditioned, the loss of water means that you can't keep the building cool enough to keep people or computers in the building. This is not to mention the fact that if the bathrooms aren't working, people aren't going to be able to use the building for very long.

Disaster Likelihood

So what type of disaster is most likely to disrupt your organization? When looking at one cause of outage -- the loss of data -- we find that man-made losses far outweigh losses due to natural disasters (see Figures 5 and 6). Judging from the statistics published by data recovery services, the two most likely events to cause loss of data are hardware errors (e.g., hard drive failures) and human error (e.g., deleting the wrong file, overwriting data). Malicious acts such as viruses account for only a small fraction of data loss. Natural disasters such as tornados, earthquakes, and floods account for the fewest of all.

Figure 5

Figure 5 -- Root causes of data loss. (Source: Ontrack Data Recovery Services.)

Figure 6

Figure 6 -- Root causes of data loss. (Source: Data-Protectors.com.)

These data loss statistics are likely a bit misleading, however. As seen from the perspective of vendors providing backup and data recovery services, hardware errors may very well seem to be the most common root cause for hard drives that are sent in for data recovery. But ask any network administrator or technical support person about the root cause for the overwhelming majority of lost data and end user downtime, and you'll get a different answer: the end user him- or herself. Disasters caused accidentally by end users overwriting new files with old versions, deleting the wrong file, storing data on a network drive and forgetting where it is, losing passwords to password-protected files, spilling coffee on the laptop, etc. are far and away the most common cause of lost data and lost end-user time. (For a humorous look at some of the trials and tribulations faced by tech support personnel, I heartily recommend the Web site www.techtales.com.) Further complicating the "fumble-finger" data loss problem are two major issues. The first is that for any given organization, there may be twice as much data stored on local hard drives as there is stored on centrally managed servers. The second is that almost half of those local systems are never backed up, and only one-quarter of them are backed up at least weekly [3]. As Baird puts it: "In every business impact analysis I've been involved in, a critical location of data is always 'Susie's C: drive.' In many cases, there is only one copy of this critical data, and it doesn't take a disaster to destroy it, just a bad day and a case of the dumb thumbs."
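A back-of-the-envelope calculation shows how large that unprotected share can be. The ratios are the ones quoted above; the two-thirds/one-third breakdown is the arithmetic consequence of "twice as much data on local drives as on servers":

```python
# Rough estimate of never-backed-up data, using the ratios quoted above:
# local drives hold twice as much data as managed servers, and almost
# half of local systems are never backed up.
server_share = 1 / 3          # servers hold one part of the total...
local_share = 2 / 3           # ...local drives hold two parts
never_backed_up_local = 0.5   # ~half of local systems, per [3]

at_risk = local_share * never_backed_up_local
print(f"Share of all data never backed up: {at_risk:.0%}")  # about a third
```

In other words, roughly a third of the organization's data may exist only on "Susie's C: drive."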

Paranoid senior-level managers can also cause problems: we have anecdotal evidence of at least one CEO who didn't want to risk having his sensitive information available to prying eyes on the network, so he would never let the tech support people hook his laptop up to the network. Consequently, when he lost the computer, all his critical data was also lost since it had never been backed up. Sometimes the interests of security and disaster recovery can be at cross-purposes: keeping copies of data can compromise security; not keeping copies of data will compromise recovery efforts.

Also misleading are the enormous figures often quoted for the damage caused by computer worms and viruses (published, in large part, by security firms selling virus protection solutions). Go out on the Web sometime and search on "virus damage." You will easily find statistics on monetary damage caused by viruses in the hundreds of millions of dollars, sometimes even more. I personally take these numbers with a very large grain of salt. Although viruses are an incredible nuisance and can indeed cause lost time for the end-user community while e-mail systems are repaired and their machines disinfected, much of the "lost time" often included in such inflated damage estimates is attributed to that time required by systems administrators to patch various virus vulnerabilities (raising the question: what are systems administrators supposed to be doing with their time instead?).

Best Practices

Output-oriented analysis (working supply chains backward) -- A useful tool for inventorying critical business assets is to begin with the end result of the supply chain (its output) and work your way back upstream. Keep in mind that most critical supply chains will terminate in transactions to some entity outside the organization (a product shipment, for instance) and will originate with transactions from an outside entity (a customer order, for instance).

Risk maps -- Mark Jablonowski, of Otis Elevator Co., writes in Contingency Planning & Management [5] about a technique he calls "risk maps" as a means of analyzing an organization's exposure to risks (see Figure 7). He recommends using logarithmic scales to rate probability of loss versus the cost of loss, making the scales easier to expand or contract based on the range of planning involved (small losses with high probabilities can be measured on one scale; large losses with less likely possibilities can be rated on another). He writes:

(T)he process of populating a map, exposure by exposure, provides a more thorough understanding of the risks an organization faces. To do so, areas of exposure to risk must be identified, and this is at once a technical and creative exercise. It requires a planner to completely explore the organization and its processes. To develop adequate probability and consequence estimates, a planner must also become familiar with the mechanisms that lead to risk -- this increased familiarity serves the planner well when dealing with risk and its effects. [5]

Figure 7

Figure 7 -- A risk map example.
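Populating such a map can be sketched in a few lines of code: each exposure is rated by annual probability of loss and cost of loss, then binned on logarithmic axes so that small/frequent and large/rare exposures fit on the same picture. The exposures and figures below are hypothetical examples chosen to span several orders of magnitude, not data from the article:

```python
import math

# Hypothetical exposures: (annual probability of loss, cost of loss in dollars)
exposures = {
    "coffee spill on laptop":    (1.0,       5_000),
    "server hard-drive failure": (0.2,      50_000),
    "building fire":             (0.01,  5_000_000),
    "major earthquake":          (0.001, 50_000_000),
}

for name, (prob, cost) in exposures.items():
    # log10 bins: each step along an axis is a tenfold change,
    # which is what lets one map hold both ends of the range
    p_bin = round(math.log10(prob))
    c_bin = round(math.log10(cost))
    print(f"{name:28s} probability bin 10^{p_bin:>3}, cost bin 10^{c_bin}")
```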

PHASE THREE: CONTINGENCY PLANNING

The speed with which interrupted systems must be restored is more important today than ever before. In the past, a business was typically open for only a few hours during any business day and was closed for many hours each night and on weekends. Thus when faced with a business-interrupting event, the organization had some amount of downtime in which to recover. Though this may still be the case for some small retail organizations, most organizations now find themselves as participants in a global economy that never sleeps or takes a holiday. No longer do organizations have the luxury of recovering over a matter of hours or days: the 24x7x365 uptime demanded in today's economy means that for critical systems, downtime must either be repaired very quickly, or contingency measures must be taken to prevent downtime from occurring at all.

In developing business contingency plans, we want to consider modes of failure that are apt to occur with some reasonable frequency: although an asteroid strike big enough to wipe out the dinosaurs does happen from time to time, it doesn't happen often enough to be a practical concern (besides that, you can't practically plan for an event that destroys all your facilities and all your customers and vendors).

Criteria should be established for detecting potential failures and invoking portions of the business continuity plan. The plan must also specify who is authorized to invoke which parts of the plan and the procedures by which key personnel are to be notified. Notification lists, backup personnel, and contingency equipment must all be documented as part of the plan and kept current. Further, a team should be established to monitor the operation of the contingency plan once it has been put into effect; that team should have broad authority to deal with unanticipated problems that may arise.

Disaster Detection

Damage that is caused by some disasters is patently obvious (the damage left behind from an F5 tornado is kind of hard to miss, after all). Some disasters, however, are subtler in nature -- so subtle in fact that it is easy to overlook them when they occur. This is especially true for damage related to computerized systems, communications facilities, and data. After all, if the computer is still sending out data, we rarely stop to wonder whether it is the correct data. If the computer system is slow, we may not know whether there is a problem or whether it's just unusually busy. If the phones aren't ringing at the help desk, we may not know whether the phone system is down or whether all the end users are at lunch. If the heater doesn't come on in the winter, we might not be able to tell whether it's broken or whether it's just an unusually warm day. Detecting failures is the unavoidable first step in disaster recovery.

Disaster Recovery

Once a disaster has occurred (and has been detected), the focus shifts to disaster recovery. Disaster recovery plans are meant to repair the injury to the organization -- restore it to full health and normal operations. By their very nature, disaster recovery practices are not "business as usual" but are meant to bring about "business as usual" as quickly as possible.

Personnel

In the event of a disaster, the first components that must be mobilized are the people who will direct and operate the recovery plan. The plan must detail roles and responsibilities to be followed for the duration of the recovery, as well as contingency plans if key people are unavailable. If the recovery plan depends on personnel from third-party providers, service-level agreements (SLAs) should be put in place well in advance of any disruption. SLAs are basically contractual obligations spelling out what services will be performed, to what level of quality, and with what speed. It is worth noting that most third-party providers will contract with a number of organizations to provide emergency services, based on the assumption that not all of them will require services at the same time. As many organizations unfortunately learned in the September 11 attacks and in the power crisis in California in 2001, an SLA is not a guarantee of service. Organizations may find themselves far down the list when a major catastrophe occurs, so it is worth reading the fine print.

One of the problems organizations face in an emergency situation is making sure all of their people are present and accounted for. Organizations should have a designated area near each company facility where employees know to congregate in the event of an emergency. Ideally, this area should be 200-300 feet away from the facility, with an alternate location available in the event of a more widespread disaster. Also useful is an emergency contact number where employees who are out of the office can call to check in. Routine testing of evacuation and accounting procedures can be done as part of regular fire drills and can prevent panic and confusion when a real emergency occurs.

Facilities and Infrastructure

Depending on the nature of the disruption, the facilities (and infrastructure required to operate the facilities) will be the first priority for restoration. If an alternate hot or warm site is to be used for operations until the primary facility can be restored, some consideration should be given as to the location of the facility. Clearly, you want the location to be close to the primary site, but not so close as to be vulnerable to the same disaster that disrupts the primary. For a fire risk, a few hundred meters is probably sufficient. For a tornado, a few kilometers is probably sufficient (although you don't want to align the two sites on a southwest to northeast line -- the direction most tornados travel in the northern hemisphere). For an earthquake or hurricane, you probably want a secondary location anywhere from 10 to 50 kilometers away, located in an area not susceptible to a similar disaster.

Baird has several thoughts on the issue: "If you locate the secondary center across the parking lot, you'll reduce several cost factors such as communication and security. One fire probably won't take out both sites. But a power outage would likely take out both, as would a tornado, a hazardous material spill, or a hostage situation. On the other hand, you don't want to separate them by too much distance." Prior to September 11, it was popular for many large organizations to have alternate sites located in a remote city close to a major airport -- the thinking was that recovery people could quickly be flown into the location from anywhere in the event of a major disaster. Unfortunately, with planes grounded and airports closed for several days after the attacks, this strategy turned out not to be as viable as once thought. Many organizations found that in such times of extreme crisis, recovery personnel would simply refuse to drive a long distance and stay in a remote location for an unspecified amount of time; many quit to stay home with their families rather than relocate indefinitely. Although the optimum distance for a secondary facility must ultimately depend on the circumstances of the risk, a commonly accepted distance is on the order of 10 kilometers or so (this may be insufficient protection for some widespread natural disasters, however). A good rule of thumb is that those personnel involved in the recovery process will have to be able to go home at night if the recovery period will last for more than a few days. Baird adds that "this is one of the biggest ongoing debates in disaster recovery: how far is far enough?"

Network Infrastructure and Communications

Once the secondary facility is secured, network and communication infrastructure must be restored. If wired communication infrastructure is unavailable, your contingency plans should include the use of cell phones (if the cell phone towers are still standing), line-of-sight optical communication equipment to beam information back and forth to a working facility nearby, or even satellite communication systems.

Data Recovery

Having restored network connections to the facility, it is time to restore the data used by the enterprise. And although having good backups is essential, backups alone may not be sufficient to restore the organization's operational data stores.

Unfortunately, the explosive proliferation of computer systems and applications has complicated data backup and restore capability enormously. In the old days, when all the data was on the mainframe, we could back up a snapshot of all the data during off-peak hours and recover it to that point. In today's environment -- where data resides on client workstations and myriad servers, storage area networks (SANs), and external Web sites -- restoring to a single point in time is next to impossible. As Baird explains:

Each day it seems new systems appear: new computers are added to the network, new servers are added, new storage is added, and new software is installed. Nearly every one of them has their own backup system or their own backup schedule. When people actually try to recover from an interruption, only then do they discover that all the data backed up was assumed to be current to the same point in time. In reality, system A was backed up on Tuesday, system B was also backed up on Tuesday but that tape was bad, so we have to go back to last Thursday's backup ... and now suddenly everyone's data is junk because nothing is in sync.
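The consistency problem Baird describes can be made concrete with a small sketch: given the latest good backup for each interdependent system, the only consistent restore point is as old as the oldest good backup. The system names and timestamps below are hypothetical:

```python
from datetime import datetime

# Latest *good* backup for each interdependent system (hypothetical).
latest_good_backup = {
    "order database":  datetime(2002, 12, 10, 2, 0),  # Tuesday
    "document store":  datetime(2002, 12, 5, 2, 0),   # Thursday (Tuesday's tape was bad)
    "shipping system": datetime(2002, 12, 10, 2, 0),  # Tuesday
}

# The consistent restore point is only as recent as the OLDEST good backup.
recovery_point = min(latest_good_backup.values())
lag = max(latest_good_backup.values()) - recovery_point

print(f"Common recovery point: {recovery_point}")
print(f"Newest systems must discard {lag.days} days of data to stay consistent")
```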

As an example, Baird uses a relatively simple document management system:

Usually the document management system has a rather large database containing an index of the document objects in the system. It also has a separate repository containing the actual document objects. Quite often the actual store of the documents and the database of the index may be spread out over different servers and backed up at different times. When the organization tries to restore the data after a disaster, they discover that the database and the documents do not completely match up. And that's just a simple document management system. If you look at some of the big, enterprise-wide applications where you have order processing on one system, shipping and tracking on another, order fulfillment on another, and supplier information on still another, and each system is backed up differently, you can begin to appreciate the magnitude of the problem.

If you have to archive information for a very long period of time, you have another problem. Not only is some of the magnetic media we store information on not rated for extremely long periods of time, but both the media and the software that understand the information on the media change over time. Therefore, if you want to maintain data for a period of many decades and be able to restore it, you will need a migration plan that will allow you to update and convert all your data from one media format to another as the old media formats become obsolete. And you must have a plan to maintain potentially multiple versions of the software that is able to read the information stored on the media. Baird adds: "All information related to the development of a nuclear power plant, for instance, is required by law to be retrievable for 60 years. There is not a single magnetic media that is rated for that period of time."

Another factor in data restoration is the sheer volume of data organizations have these days. As the cost of disk drives has gone down, and their capacities have ballooned, many people have decided that it's cheaper to keep a copy of all their data online. Instead of just having a few gigabytes of necessary operational data, an organization can easily find itself suddenly having terabytes of end-user data to restore (a medium-sized organization of 50 people, each of whom has 100 gigabytes of data, will require five terabytes of backup). Restoring a few terabytes can't be done in an instant. Restoring from tape can take a very long time, especially if you don't have a lot of tape drives to use to restore.
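A rough calculation shows why. Using the five-terabyte example above, a hypothetical sustained restore rate per tape drive, and a handful of drives working in parallel:

```python
# Back-of-the-envelope tape restore time for the 5-TB example above.
# The per-drive throughput is an assumed round number, not a vendor spec.
data_tb = 5.0
drive_mb_per_sec = 15.0   # assumed sustained restore rate per drive
drives = 4                # restores running in parallel

total_mb = data_tb * 1_000_000
hours = total_mb / (drive_mb_per_sec * drives) / 3600
print(f"Estimated restore time with {drives} drives: {hours:.0f} hours")
```

Even with four drives streaming flat out, the restore takes the better part of a day, and that assumes no bad tapes and no operator time spent mounting media.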

Applications

The final step in restoring systems operations is the restoration of the applications systems required for business functions. As with the restoration of data, this operation is complicated by the presence of multiple versions of software, multiple patches, multiple configurations, and so on. If you want to see the problem firsthand and have a free week or two, back up your local hard drive, wipe the disk, and try to restore it to exactly the same state. (I usually find myself deleting and reinstalling Windows and all my applications about once a year and have gotten fairly good at it. I can usually get everything back to working order in only a couple of days.) Since most people don't routinely back up their applications as well as their data, getting your system to work the way it used to will likely take days, if not weeks.

So if restoring data and applications is so problematic, what is the solution? The solution lies back with your BIA: if done well, it will define the minimum essential functionality that must be restored. This is important; there is a difference between restoring minimal essential functionality (which we need to keep the business in operation) and restoring full functionality to the way it was before the disaster (which can take days, weeks, or months, or which may not be possible at all).

Disaster Mitigation

Although many of the disaster scenarios previously discussed are unavoidable, a large percentage of them can be prevented, and the damage done by those that can't be entirely prevented can be limited. Disaster mitigation -- also known as disaster avoidance or disaster prevention -- aims to eliminate preventable disasters and limit the damage (and business continuity issues) for those that can't be prevented.

Data Replication

One strategy that is being increasingly deployed to provide very high uptime availability is data replication services over a SAN. Creating a SAN over a Fibre Channel ring (or using some of the newer, less expensive IP-SAN capabilities) allows an organization to have its data distributed across a wide geographic area. In effect, the SAN is used as a giant "RAID-like" system that enables data to be distributed across multiple systems in multiple locations. Tied together in a ring topology, this arrangement can provide for extremely high availability systems and is virtually impervious to localized damage. Coupled with either a hot site on the ring or a redundant operations center, SAN-based data replication strategies can provide a very resilient disaster mitigation strategy.

Such a strategy does not come without a cost, however. Depending on how far removed the SAN storage locations are from one another, communication costs can be huge; they can easily represent the largest ongoing expense. Backup and restore procedures can also be complicated, as can administration of the various hardware components that make up the facility.
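A quick sizing exercise illustrates why the communication line tends to dominate the ongoing cost. The daily change rate and replication window below are assumptions chosen for illustration:

```python
# Rough sizing of the replication link between SAN sites.
# Both input figures are illustrative assumptions.
daily_change_gb = 200.0         # data modified per day (assumption)
replication_window_hours = 8.0  # off-peak window in which to catch up

gigabits = daily_change_gb * 8
mbps_needed = gigabits * 1000 / (replication_window_hours * 3600)
print(f"Sustained link speed needed: {mbps_needed:.0f} Mbps")
```

A sustained rate in the tens of megabits per second was, in 2002 terms, a very expensive wide-area circuit to lease month after month, which is exactly the ongoing cost the paragraph above warns about.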

There are some similar options available for those on a somewhat more limited budget. An ASP-like solution can approach the high availability of a SAN solution at a much lower cost. In this solution, application data is replicated on remote servers at two or more locations (with periodic data synchronization services provided). Client systems are then browser-based and can therefore operate from anywhere that has an Internet connection. Of course, sending mission-critical data over the Internet is a concern for many organizations, but encrypting the traffic through a virtual private network can provide a high degree of security.
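The common thread in both the SAN and ASP-style approaches -- replicate the data to several independent locations, then fail over to a surviving copy -- can be sketched in a few lines. Everything below (the class, the site names, the data) is a hypothetical illustration of the principle, not any vendor's API:

```python
# Illustrative sketch only: every write is fanned out to several
# independent "sites," and a read succeeds as long as at least one
# site survives a localized disaster.

class ReplicatedStore:
    def __init__(self, site_names):
        # Each site is modeled as a simple in-memory dictionary.
        self.sites = {name: {} for name in site_names}

    def write(self, key, value):
        # Synchronous fan-out: the write lands at every site.
        for store in self.sites.values():
            store[key] = value

    def lose_site(self, name):
        # Simulate a localized disaster destroying one location.
        del self.sites[name]

    def read(self, key):
        # Any surviving replica can serve the read.
        for store in self.sites.values():
            if key in store:
                return store[key]
        raise KeyError(f"{key!r} lost at all sites")

store = ReplicatedStore(["kansas_city", "denver", "chicago"])
store.write("order-1001", {"customer": "ACME", "total": 1250.00})
store.lose_site("kansas_city")       # localized disaster
print(store.read("order-1001"))      # data survives at the other sites
```

Real SAN replication adds synchronization latency, conflict handling, and the communication costs noted above, but the resilience comes from exactly this redundancy.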

Reserve Systems

Another similar concept developed over the past few years is the reserve system. A reserve system is a minimally functional subset of a full business system that will provide for a minimal level of processing in the event of a disaster. Typically, the reserve system runs on a single workstation and may be located anywhere an Internet connection is available (e.g., at a secondary company site or even an employee's home). Access to the system is via a browser over the Internet.

Although not really feasible until recently, reserve systems can be extremely cost-effective and can provide for minimal business process functionality at very low cost while the normal business systems are being recovered.

Security

Many of the most important mitigation techniques for facilities and information systems fall under the blanket term security. As computer networks have become more interconnected and enterprise business practices more and more entwined in multi-organizational supply chains, the need for physical and computer security systems has risen dramatically.

Though ROI for many disaster mitigation solutions can be elusive, some are easier to justify. A former colleague of mine, Monte Beery -- Director of Business Development for eSecurityOnline LLC, an Ernst & Young LLP Company (www.esecurityonline.com) -- says that tools like his organization's Advisor product are easier to make an economic case for, since the amount of time saved by the tool can be quantified and compared to the time it would take to do the same task manually. Advisor is a security audit tool that automatically scans a corporate network and reports the real and potential vulnerabilities of the hardware and software components it finds. Beery thinks, however, that "the unintended side benefits are the real reason to do risk mitigation. The real value in doing a risk assessment is in examining the processes at risk and understanding them."

Security Versus Efficiency

One of the problems inherent in security planning is that increasing security nearly always decreases usefulness. It is possible, for instance, to make a computer system absolutely secure: you simply remove all input devices and all network connections. The dilemma, of course, is that a completely secure system is also completely unusable. A similar problem occurs in organizations with respect to the location of their people: most organizations naturally tend to group people in the same department together in physical proximity for organizational efficiency. After all, you don't want people who must work together to have to run down the hall or drive across town each time they interact. Most organizations similarly tend to locate all of their senior executives together on a single floor, often in a block of window offices overlooking the entrance to the corporate headquarters. Especially since the Oklahoma City bombing in 1995 and the events of September 11, we have come to realize that convenience of location is at odds with disaster planning and recovery. A single, sudden catastrophic event can wipe out an organization's subject matter experts (SMEs); a well-timed act of sabotage or terrorism may decapitate an organization. Many large organizations have travel policies that prevent a significant number of executives from traveling together on the same flight; fewer have policies that physically distribute the locations of organizational memory and management.

We used to tell individuals and small businesses not to worry too much about securing their information systems from outside intrusion and vandalism. Up until a few years ago, the hacker community was generally uninterested in systems owned by companies that 1) had no secrets worth knowing, 2) had no money or customer account information to steal, or 3) weren't considered a technological challenge to be conquered (breaking into NASA, for instance). This unfortunately is no longer true. Almost all organizations now keep sensitive customer and employee data that can be exploited either for direct economic gain or for identity theft. And poorly secured computer networks can be co-opted by malicious individuals wishing to use them as "zombies" to launch distributed denial-of-service (DDoS) attacks against other, more visible systems. You'll probably recall that the Web sites of Yahoo, Amazon.com, and CNN were generally unavailable for a few hours in early 2000 due to a series of massive DDoS attacks. You may not be aware that the largest attack occurred just recently -- on 21 October 2002, when the 13 "root servers" for the Internet came under a brief but largely unsuccessful DDoS attack. Since September 11, the risk of cyberterrorists using these types of methods has become a concern to many. The era of "safe computing" practices has most definitely arrived.

And although the development of a comprehensive network security plan is beyond the scope of this report, its elements are generally similar to those of an overall business continuity plan:

  • Identify the critical components of the information systems. This would include all key workstations, computers, network components, connectivity devices, and power sources, as well as key software systems such as e-mail servers, authentication methods, encryption tools, transport protocols, and so on.
     
  • Develop and document the risks to key systems, such as physical destruction or loss, intrusion, viruses and worms, unauthorized use, sabotage, and so on.
     
  • Develop and document the measures that will be taken to prevent (mitigate) each of the risks identified. These will include any security policies put in place by the organization as well as training and enforcement procedures.
     
  • Develop and document the measures that should be taken in the event that a security breach occurs.
     

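As a minimal illustration of the documentation steps above, the inventory of critical components, their risks, and the associated measures could be captured in a machine-readable risk register that is checked for completeness automatically. The structure and field names below are our own invention, not part of any standard:

```python
# Hypothetical risk-register entry tying together the planning steps:
# a critical component, its documented risks, and the mitigation and
# response measures for each risk. Field names are illustrative only.

risk_register = [
    {
        "component": "corporate e-mail server",
        "risks": {
            "virus/worm infection": {
                "mitigation": ["gateway scanning", "user awareness training"],
                "response": "isolate server; restore from last clean backup",
            },
            "physical destruction": {
                "mitigation": ["off-site backups", "UPS and fire suppression"],
                "response": "fail over to reserve system at secondary site",
            },
        },
    },
]

# A plan review can then verify that no documented risk is missing
# either its mitigation measures or its breach-response procedure.
for asset in risk_register:
    for risk, measures in asset["risks"].items():
        assert measures["mitigation"] and measures["response"], (
            f"incomplete plan for {asset['component']}: {risk}"
        )
print("all documented risks have mitigation and response measures")
```

Even this toy check catches the most common plan defect: a risk that was identified but never assigned a prevention or response measure.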
Keeping a computer network secure while at the same time keeping it available and useful is a constant challenge. One area in particular that will present enormous challenges in the near future is the growing popularity of wireless network connections. As they become more numerous and increase in range, they will present ideal entry points for malicious users wanting an anonymous remote connection to the system.

Public Relations Planning

Emergencies and disasters will invariably draw attention from the press. Needless to say, it is not appropriate to wait until after a disaster has occurred to start training your people about what to say (and not say) to reporters. If your organization has a hostage situation, bomb threat, anthrax scare, or a catastrophic event, your BCP must include a public relations plan that can be readily implemented.

Every organization, regardless of its size, should designate one person (and a backup) as a media spokesperson to deal with all press inquiries (even if it's just someone who knows how to say "no comment"). Employees should also be trained on the importance of public relations, how to recognize media requests, and how to politely but firmly forward a request to the appropriate people within the organization. Press coverage of a disaster and your organization's reaction to it can directly impact your client base -- public opinion of your measured reaction (or lack thereof) to a major disaster can be a key factor in restoring your operations to a normal level.

Your media contingency plans should include:

  • Identification of a designated media spokesperson (and a backup spokesperson).
     
  • Identification of the types of emergencies or disasters that would likely cause interest from the press. This would include a planned response to natural disasters common to your region, high-visibility threats such as bomb or bioterrorist threats, hostage situations, and other workplace violence incidents.
     
  • Planned responses to general emergencies and topics likely to be of interest to the press in your area, such as reactions to major layoffs or geopolitical events. This is particularly true if your organization is considered by your local media to be an industry or civic leader.
     

Most small and medium-sized businesses will generally fall "below the radar" of the press for most emergencies in which they are not directly involved; larger organizations are apt to draw the most attention after an event occurs to a similar organization elsewhere.

Business Resumption

If your contingency plans and disaster recovery plans operate correctly during an emergency, your key business functions will survive the failure. The final portion of the business continuity plan should be instructions for a return to normal business operations at the conclusion of the emergency. This portion of the plan is often lacking or missing entirely. The plan should detail the transition from operation in failure mode back to normal operations in an orderly manner. This business resumption portion of the plan may also include lingering mop-up operations; for instance, manually captured information will need to be entered into the restored system, and data entered into a backup system during the failure may need to be converted and/or reconciled against the normal operational data.

Best Practices

Baird has some excellent cost-effective advice for best practices in contingency planning:

1. Keep building floor plans in a nearby alternate location. One of the lessons learned from the 1999 Columbine school shootings in Colorado, USA, and other workplace violence incidents is that it is useful to have a copy of your building floor plans offsite but nearby. You don't want to wait until a workplace violence event or a fire or a bomb threat to discover that you don't have any floor plans except for the ones sitting in the boiler room in the basement of the building that is affected. The fire department, police department, SWAT teams, and other emergency responders will need that information quickly, and it's an inexpensive thing to do. The only expenses you have are a few dollars to make backup copies of the floor plans and perhaps a plastic box to store them in. It just might save the life of that key person that you need to recover your systems.

2. Have outside experts review contingency and security plans whenever possible. Just as organizations routinely have outside auditors review their financial statements, it is desirable to have an impartial outsider review your business contingency and security plans. It may be a painful process for a lot of organizations, particularly for sensitive aspects of the security plans, but like an outside financial audit, it's something that organizations should probably get used to. An outside consultant may notice things that people inside the organization don't (or can't) see. And, in addition to reviewing the contingency plans, it is often useful to have the contingency auditor physically inspect critical areas of your organization. He or she might point out that having your network communication lines run through the utility room isn't such a great idea, for instance. One automobile manufacturing facility Baird visited kept all of its fire equipment in the warehouse right next to its inventory of air bag detonators; in the event of a fire, employees might not want to run down the aisle where the highly explosive components are stored in order to get to the fire extinguishers. Although you want internal people to develop and embrace contingency and security plans, it's very healthy to have someone from the outside (with an impartial view) come in and look them over.

3. Work closely with your local emergency management agencies. Enterprises can get a lot of benefit out of contacting their local emergency management organization. Most local fire departments and emergency management departments like to work with companies and know that working with them on disaster planning benefits both the company and the community. Many local emergency management agencies have a lot of information available on the Web for personal disaster recovery plans -- things that individuals can do to keep themselves prepared for disasters. In working with your local emergency management agencies, you get not only the benefit of their past experience with other organizations, but also an opportunity for them to become familiar with your company. In the event of an emergency, they will be able to work with you more closely than they would be able to otherwise.

PHASE FOUR: BUSINESS CONTINUITY MANAGEMENT

The final phase of BCP is business continuity management. It isn't sufficient to simply create a BCP; it must be tested to ensure that it will work, and it needs to be maintained as the business, economy, and technologies change.

Testing the Plans

Testing a disaster recovery plan can be an expensive proposition for most large organizations in more ways than one. Organizing a full-blown disaster recovery drill, preparing the hot site, executing the failover procedures, and then restoring normal operations requires a significant amount of time and effort. Furthermore, the conditions under which such drills are executed may be so artificial and controlled as to render their value in a real disaster somewhat questionable. Worse still, your disaster test itself may cause a major failure: it is quite possible to crash the production system while attempting to fail over to the alternate site, leaving neither one in working order. Disaster recovery testing is, perhaps not so surprisingly, one of the more common causes of lost data.

As with the development of business continuity plans following September 11, disaster recovery testing has also been in decline. A recent worldwide survey by globalcontinuity.com indicates that nearly a third -- 29.3% -- of the organizations responding had conducted no testing in the year following September 11, compared with 21.1% in the prior year [8]. The trend was even worse for small companies: 52% of organizations with fewer than 50 employees, 40.6% of those with 50-499, and 44.7% of those with 500-999 employees did no testing of contingency or disaster recovery plans in the year following September 11.

Short of a full-blown disaster recovery drill, there are still some cost-effective tests that can be done to ensure a measure of disaster recovery capability. If you back up to tape in order to be able to restore data, you should periodically verify that there is actually something on the tape and that you can indeed restore from it if you have to. Horror stories abound concerning people who have libraries of blank tapes or who have backed up just one small part of their data over and over again rather than the entire data store.
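The principle behind that periodic check -- actually restore a copy and compare it to the original, rather than trusting that the backup job ran -- can be sketched as follows. Ordinary file copies stand in for a real tape system here; the file names and paths are hypothetical:

```python
# Restore-and-verify sketch: back a file up, restore it to a scratch
# location, and prove via checksum that the restored copy matches the
# original. This is the check that catches blank tapes and partial backups.
import hashlib
import shutil
import tempfile
from pathlib import Path

def sha256(path):
    """Checksum used to prove the restored copy matches the original."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "payroll.dat"
    original.write_bytes(b"mission-critical data")

    backup = Path(tmp) / "backup" / "payroll.dat"      # stand-in for the tape
    backup.parent.mkdir()
    shutil.copy2(original, backup)

    restored = Path(tmp) / "restore-test" / "payroll.dat"
    restored.parent.mkdir()
    shutil.copy2(backup, restored)

    verified = sha256(restored) == sha256(original)
    assert verified, "restore verification failed -- backup is not usable!"
    print("backup verified: restored copy matches original")
```

Run against a real backup system, the same restore-and-compare loop turns "we think we have backups" into a tested recovery capability.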

Another "virtual" approach to testing is also becoming popular. No doubt you have run across one or more of Joshua Piven and David Borgenicht's popular Worst-Case Scenario Survival Handbooks, in which they relate a wide variety of expert advice on how to survive a host of unlikely events, from the serious (taken hostage by terrorists) to the absurd (abducted by aliens). Though marketed as "humorous" guides to surviving disastrous situations, they illustrate a very useful tool for developing disaster recovery plans: "tabletop" scenario sessions. In a tabletop session, various disaster scenarios are played out with a group of senior executives and contingency planning experts in order to run through different failure possibilities. These exercises are relatively inexpensive and can be useful for testing certain aspects of the business continuity plan, particularly those related to emergency business processes and recovery procedures. Though more popular and economical than actual recovery tests, tabletop exercises were also, somewhat surprisingly, in decline in the year after September 11. The globalcontinuity.com survey mentioned above notes that 45% of organizations had conducted at least one tabletop exercise for at least a portion of BCP (compared with 57.9% in the prior year), while only 23.6% of organizations ran a tabletop exercise of the whole plan (compared with 33% in the prior year) [8].

Review and Update the Business Continuity Plan

Dwight D. Eisenhower is quoted as saying, "Plans are nothing. Planning is everything." As such, the business continuity plans created represent our best effort at developing a response to all of the risks to our organization. Unfortunately, both the risks and the organization are not static; they change from time to time. The technological environment in which our organizations live will also change. The business continuity plan should be reviewed on a periodic basis; most experts recommend at least an annual review. The plans should also be reevaluated and updated whenever significant disasters occur and when significant changes to the organization are made. If a competitor suffers a massive hacker intrusion, for instance, you should review your security plans (and your public relations plans) to ensure that you are prepared for a similar attack. If your organization acquires (or is acquired by) another organization, the plans for each organization should be reviewed and consolidated. If you add a new class of computer or communications equipment to your network, the security plan should be updated accordingly.

It is more or less a cliche among those of us who do planning for a living that "plans should be a living document." Perhaps the largest danger many organizations face is that their plan may very well become "shelfware" if not periodically reviewed and updated as conditions change. Having a plan that is inadequate or hopelessly out of date is potentially just as damaging to the organization as having no plan at all.

Ongoing Awareness and Preparedness Training

It bears repeating once again that the human element is the most important component in the execution of a business continuity plan. Therefore, ongoing disaster awareness training and contingency training should be done periodically to keep personnel up to date and ready for potential emergencies. Fire drills, tornado drills, earthquake drills, and so on should be conducted at least quarterly. The plans must be readily available to the key people who will implement them and in a form that can survive an expected outage (plans kept only online won't be terribly useful if the network is unavailable or if people can't get to their computers; likewise, a fire in the library where the only copy of the plan is stored will present a problem).

Having a plan, training people on their roles and responsibilities under the plan, and taking a few inexpensive steps to limit damage caused by a disaster can mean the difference not only between the organization's survival or demise, but the literal difference between life and death for you or your personnel. One of the lessons learned at NASA is that repeatedly training individuals to respond to simulated failures and emergencies leaves them well prepared to handle real emergencies without panic and without causing additional damage in an uncontrolled effort to respond.

Best Practices: Tabletop Exercises

Baird highly recommends the tabletop exercise technique for testing contingency and recovery plans. Baird has conducted several of these tabletop exercises at different organizations and offers some valuable advice: "People can get very emotional even though they have never left the conference room, and you are only talking about a disaster," he says. "Tabletop exercises can be done at a corporate level, or each business unit can do their own version." Baird also finds the exercises provoke more thought if you throw in "injects" -- various little "gotchas" developed to completely disrupt the orderly recovery from a disaster. "Invariably, you'll get people who say 'I never even thought about that.' Executives get a much better awareness of the overall organization, they get a better understanding of what is important to the company, and they may realize that they can make one tiny little change that would prevent a catastrophe from occurring in the first place."

CONCLUSION

The costs of being out of business add up rapidly. According to recent statistics published by Contingency Planning Research, in conjunction with Contingency Planning Management, 46% of companies responding to their survey indicate that each hour of downtime would cost their company up to $50,000; 28% say each hour would cost between $51,000 and $250,000; 18% say each hour would cost between $251,000 and $1 million; and 8% say it would cost their companies more than $1 million per hour [2]. Depending on the business sector your organization is in, the costs can be even greater. According to Contingency Planning Research's 1996 survey, retail brokerage organizations could average a loss of $6.45 million per hour; energy organizations could lose $2.8 million per hour; credit card sales authorization organizations could lose $2.6 million per hour; telecommunications organizations could lose $2 million per hour; and retail organizations could lose $1 million per hour [1]. Keep in mind that these figures reflect downtime damage estimates prior to the explosion of the Internet as a business exchange medium. Perhaps most damaging in today's 24x7x365 era of online business transactions may be the damage inflicted on the organization's credibility: being offline for any significant portion of time may not only result in lost sales, but may also do irreparable harm to the reputation of the organization.
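To translate hourly figures like these into an annual exposure, multiply the downtime implied by your availability target by your hourly cost. The inputs below are hypothetical, using the top of the survey's lowest cost band and a "three nines" uptime target:

```python
# Back-of-the-envelope downtime exposure. Both inputs are assumptions
# for illustration; substitute your own hourly cost and uptime target.

hourly_cost = 50_000         # US$ per hour (top of the lowest survey band)
availability = 0.999         # "three nines" uptime target
hours_per_year = 24 * 365    # 8,760 hours

expected_downtime = hours_per_year * (1 - availability)  # ~8.76 hours/year
expected_loss = expected_downtime * hourly_cost

print(f"expected downtime: {expected_downtime:.2f} hours/year")
print(f"expected loss:     ${expected_loss:,.0f}/year")
```

Even at the lowest cost band, three nines of availability leaves roughly $438,000 per year of exposure -- a useful yardstick when weighing the price of hot sites, replication, and drills.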

In conclusion, we would be remiss if we did not remind you that some of the largest failures in corporate history have occurred in the complete absence of an external disaster. Table 2 provides a handful of the more visible major American corporations that have filed for bankruptcy protection in the last few months.

Table 2 -- Major US Corporations that Recently Filed for Bankruptcy

Company            Filed for Bankruptcy on    Total Assets
WorldCom           21 July 2002               $103,914,000,000
Enron              2 December 2001            $63,392,000,000
Conseco            18 December 2002           $61,392,000,000
Global Crossing    28 January 2002            $30,185,000,000
UAL Corp.          9 December 2002            $25,197,000,000
Adelphia           25 June 2002               $21,499,000,000
Kmart Corp.        22 January 2002            $14,600,000,000

Although September 11 may have contributed in no small part to the poor economic climate of late, you'll notice that with the exception of UAL Corp. (parent of United Airlines), not a single one of these large companies suffered any appreciable damage in the attacks. And this list represents just the tip of the iceberg for business failures in the past year. According to ABI World, a provider of US bankruptcy statistics, a total of 38,916 businesses have filed for bankruptcy protection in the US in the previous four quarters (fourth quarter 2001 to third quarter 2002) [9]; this makes 2002 approximately in line with the total of 40,099 filings for all of 2001 in terms of numbers. But in terms of sheer size, 2002 is ahead on points. According to BankruptcyData.com, 186 public companies with a staggering $368 billion in debt filed for bankruptcy in 2002, exceeding 2001's record $259 billion. The reasons why these organizations have failed are many and varied, but it is interesting to note that some of the largest failures were in large part due to scandals and economic troubles of their own making. In addition to those companies that we have seen forced out of business due to questionable accounting methods, scandals involving lavish corporate bonuses and buyouts for disgraced or discredited senior executives continue to make headlines around the US. And although it has not yet filed for bankruptcy as of this writing, let us not forget the trials, tribulations, and criminal conviction of Arthur Andersen LLP in the past year for its involvement in the Enron mess.

None of these statistics bode well for business continuity planners. Though it may be possible to plan for and survive a direct hit from a Category 5 hurricane or an 8.0 earthquake, it is understandably difficult to plan around greed and corruption at the highest levels of your organization. It is even more difficult to develop a crisis management plan that makes such behavior seem acceptable.

We may yet have to update the late Robert Courtney's maxim that "crooks can never catch up" to damage caused by accidents. Greed and lapses in ethical behavior, though they may not be exactly criminal, have certainly taken their toll lately on organizations large and small. Perhaps future business continuity plans should include regular courses on business ethics and generally accepted accounting principles.

BCP CERTIFICATION PROGRAMS

In the US, DRI International (www.drii.org) offers certification programs for BCP professionals. DRII offers three levels of certification: Associate Business Continuity Planner (ABCP), Certified Business Continuity Professional (CBCP), and Master Business Continuity Professional (MBCP). In the UK, the Business Continuity Institute (www.thebci.org) also offers professional BCP certification programs. BCI has five levels of certification: Student, Affiliate of the Business Continuity Institute, Associate of the Business Continuity Institute (ABCI), Member of the Business Continuity Institute (MBCI), and Fellow of the Business Continuity Institute (FBCI).

DRII has approximately 2,500 members in 15 countries, while BCI has approximately 1,100 members in 31 countries.

In 1997, DRII together with BCI published the Professional Practices for Business Continuity Professionals. The document can be found on DRII's Web site at www.drii.org/displaycommon.cfm?an=2.

ISO 17799

ISO 17799 is a detailed security standard that was published in December 2000. It grew out of a 1993 publication from the Department of Trade and Industry in the UK and evolved into BS7799 in 1995. It is organized into 10 major sections, each of which covers a different area of security practice; one of these sections is devoted to business continuity management.

If you are interested in investigating ISO 17799, any Web search engine will happily direct you to a variety of sites that will provide you with more information.

ACKNOWLEDGMENT

I would like to take this opportunity to thank my former colleague, Jim Baird, manager of systems services for BVSG, for taking time out of his busy schedule to talk with me at length about his area of expertise. The fact that Black & Veatch -- a $2-billion-per-year global engineering firm that has been in business since 1915 -- trusts its future business continuity to Baird and his group is a huge testament to his skills and ability. BV Solutions Group, Inc. can be found at www.bvsg.com.

ABOUT THE AUTHOR

Dave Higgins has been a student of systems development and improvement methods since 1975. Together with Cutter Business Technology Council Fellow Ken Orr and the late Jean-Dominique Warnier, Mr. Higgins was one of the principal architects of the Data Structured Software Development methodology that was widely used in the late 1970s and early 1980s.

In his capacity as a software engineering evangelist, Mr. Higgins has performed hundreds of seminars on a wide variety of topics from program design and modification, to systems and database design, requirements definition, planning, and project management. As a consultant, he has advised many top organizations in both the public and private sector on technology planning and implementation. In the last few years, he has been specializing in knowledge management issues and strategic technology planning.

Mr. Higgins is also the author of five books on various aspects of software engineering. His first book, Program Design and Construction, published back in 1979, was perhaps the first on developing quality software for personal computers and was translated into more than a dozen languages. His book Data Structured Software Maintenance remains one of the few to address the practical application of structured concepts to the modification of existing programs. Mr. Higgins is the coauthor of Data Structured Program Design Workshop with Dave Scott and Duh-2000: The Stupidest Things Said About the Year 2000 Problem with Ken Orr. Mr. Higgins can be reached at dave@davehigginsconsulting.com.

REFERENCES

1. "1996 Cost of Downtime Study." Contingency Planning Research and Contingency Planning & Management (www.contingencyplanningresearch.com/cod.htm).

2. "2001 Cost of Downtime Online Survey." Contingency Planning Research and Contingency Planning & Management (www.contingencyplanningresearch.com/2001%20Survey.pdf).

3. "Data Protection Overview." SANSERVE (www.sanserve.com/SANSERVE.cfm?Page=DataProtect.htm).

4. "Geographic Distribution of Major Hazards in the US." US Department of the Interior, US Geological Survey (www.usgs.gov/themes/hazards.html).

5. Jablonowski, Mark. "Tutorial: Drawing Risk Maps to Improve Your Vulnerability Assessments." Contingency Planning & Management, September 2002, pp. 28-30.

6. Miller, Keith. "Continuity Plans -- The Staff Disconnect." globalcontinuity.com, 3 October 2002 (www.globalcontinuity.com/Article.asp?id=45000&ArtId=9756).

7. Robinson, Angela. "Project Initiation and Management." Continuity (The Journal of the Business Continuity Institute), Volume 1, Issue 2.

8. "Testing Goes into Reverse." globalcontinuity.com, 20 September 2002 (www.globalcontinuity.com/Article.asp?id=45000&ArtId=9676).

9. "US Bankruptcy Filing Statistics." ABI World (www.abiworld.org/stats/newstatsfront.html).

RECOMMENDED WEB SITES

General Information on BCP, Contingency Planning, and DRP

Contingency Planning and Business Continuity World (www.business-continuity-world.com).

Contingency Planning & Management Online (www.contingencyplanning.com).

Disaster Recovery Journal (www.drj.com).

Disaster Recovery World (www.disasterrecoveryworld.com).

globalcontinuity.com (www.globalcontinuity.com).

Survive (www.survive.com).

BCP Templates and Sample Plans

"Business Continuity Planning/Disaster Recovery Planning: An Online Guide" (www.yourwindow.to/business-continuity).

Canadian Centre for Emergency Preparedness (www.ccep.ca/ccepbcp6.html).

Contingency Planning Exchange, Inc. (www.cpeworld.org/projects/cpetoc.htm).

Disaster Recovery Journal (www.drj.com/new2dr/samples.htm).

State of Kansas sample contingency planning outline (http://da.state.ks.us/disc/bcpoutline.htm).

US Department of Agriculture, Natural Resources Conservation Service (http://policy.nrcs.usda.gov/scripts/lpsiis.dll/H/H_270_608.htm).
