When you hear the word “disaster,” your mind immediately conjures up images of massive floods or storms. While the IT wing of an organization certainly counts floods and storms as disasters, the term disaster in IT speak is not limited to those.
Common causes of downtime:
Hardware failures – 45%
Power outages – 35%
Software failures – 34%
Data corruption – 24%
External security breaches – 23%
Accidental user errors – 20%
Electrical failures, power surges, accidental deletion of source files, deliberate injection of malware into the systems by a hacker or a disgruntled employee: anything that affects the organization’s IT infrastructure and leads to severe data loss, outage, or downtime is termed a disaster.
Disasters of any kind, including the human-induced ones, are unpredictable, potentially life-threatening, and costly. The aftermath (strictly from an IT perspective) is downtime lasting up to several days and the loss of data and equipment, all leading to billions of dollars in losses. Indirect costs are even higher: loss of business continuity due to a disaster can damage the company’s reputation and affect its business prospects.
Disaster Recovery and Network Downtime
Network downtime is one of the chief consequences of a disaster, and sometimes even its root cause. A University of Chicago study of more than 500 unplanned IT outages showed that networking failures are the second-most-common root cause. Network failure is also among the hardest to recover from; digital businesses that depend entirely on their network are the worst hit. For every minute that the network is down, a company loses more than $5,000.
How costly is network downtime?
- The average cost of network downtime is $5,600 a minute, or roughly $300K an hour.*
- 93% of companies that experience downtime of 10 days or more due to a disaster file for bankruptcy within 12 months of the event.**
- The airline industry alone has experienced several high-profile outages, including a Southwest Airlines router failure that led to an estimated $54 million to $82 million in lost revenue and increased costs.
- Hosting company Peak Web suffered a 10-hour network outage, ultimately leading to its bankruptcy.
*source: the U.S. National Archives & Records Administration.
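To put the per-minute figure in perspective, the arithmetic can be sketched in a few lines of Python. This is only an illustration: it assumes the $5,600 average quoted above applies uniformly to every minute of an outage, which will not hold for any particular business, and the function name is made up for this example. Note that straight multiplication gives $336,000 an hour, so the $300K figure above is best read as a rounded estimate.

```python
# Illustrative only: the per-minute figure is the average quoted in the
# article, not a measurement for any particular business.
COST_PER_MINUTE_USD = 5600


def downtime_cost(minutes, cost_per_minute=COST_PER_MINUTE_USD):
    """Return the estimated cost (USD) of an outage lasting `minutes`."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


hourly = downtime_cost(60)          # one hour at the average rate
ten_hours = downtime_cost(10 * 60)  # a 10-hour outage at the average rate

print(f"1 hour:   ${hourly:,.0f}")
print(f"10 hours: ${ten_hours:,.0f}")
```

Replacing the default rate with your own revenue-per-minute estimate gives a rough first number to bring into a DR budget discussion.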
To mitigate the impact of a disaster and to recover from it with minimum losses, it is essential for businesses to have a sound DR (disaster recovery) plan, one that takes into account the data, systems, and personnel. Disaster recovery is a crucial aspect of business continuity planning (BCP), and should be formulated with utmost care. Ideally, a DR plan should have the following action items:
Key Ingredients in a DR Plan:
Inventory all assets: Most enterprises inventory their hardware and software periodically, but for a DR plan, it has to be more thorough. Assets inventoried should include applications and services, virtual machines, and all the IT and network components along with the contact details of the vendors’ technical support teams.
Take regular backups: Back up enterprise data, core device configurations, and other critical files, and store them preferably in the cloud or on magnetic backup tapes kept offsite. The backed-up information, when restored, will help you quickly get back on your feet. Most large enterprises have a backup data center (a replica of the primary data center, with data syncing continuously between the two) at a location far away from the primary. In case of a disaster, all operations quickly fail over from the primary to the backup data center, preventing significant losses.
Analyze your tolerance to downtime: If yours is an enterprise that does millions of transactions a day, you certainly cannot afford to be down for more than a few seconds; any more than that would be financially and morally debilitating. It’s important that you define your RTO (Recovery Time Objective: the targeted time within which a company’s business processes must be restored) and RPO (Recovery Point Objective: the maximum admissible interval between the last backup and the time of disaster, which bounds how much data can be irreversibly lost), and negotiate the SLAs (Service Level Agreements) with both your vendors and customers accordingly.
Identify mission-critical applications and services: There are bound to be some applications and services that are the backbone of your business – these are the ones that need to be made available at the earliest following a disaster. Identifying such services will help you structure your DR plan by prioritizing those activities involved in bringing them online.
Have a clear protocol for personnel safety and communication: It’s not easy to stay calm and make sensible decisions when things are burning around you. This is why it’s crucial to devise a proper coordination strategy with clearly defined roles and responsibilities for the personnel. The protocol for a DR plan should include things like where to assemble, whom to contact, and the communication channels to be used, in case of an emergency.
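The RTO, RPO, and prioritization items above lend themselves to a quick sanity check. The sketch below is a minimal illustration, not a prescribed method: the `Service` record, the function names, and every number in the sample inventory are hypothetical. It rests on one simple assumption, namely that with periodic backups the worst-case data loss is one full backup interval, so a service whose backup interval is longer than its RPO cannot meet that RPO.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class Service:
    name: str
    rto: timedelta              # target time to restore the service
    rpo: timedelta              # maximum tolerable window of lost data
    backup_interval: timedelta  # time between consecutive backups


def rpo_gaps(services):
    """Return names of services whose backup interval exceeds their RPO.

    With periodic backups, the worst-case data loss is one full interval
    (a disaster striking just before the next backup runs).
    """
    return [s.name for s in services if s.backup_interval > s.rpo]


def recovery_order(services):
    """Order services by RTO, tightest first, as a simple priority list."""
    return [s.name for s in sorted(services, key=lambda s: s.rto)]


# Hypothetical inventory; names and numbers are made up for illustration.
services = [
    Service("reporting", rto=timedelta(hours=24), rpo=timedelta(hours=24),
            backup_interval=timedelta(hours=12)),
    Service("payments", rto=timedelta(minutes=5), rpo=timedelta(minutes=1),
            backup_interval=timedelta(minutes=15)),
    Service("email", rto=timedelta(hours=4), rpo=timedelta(hours=1),
            backup_interval=timedelta(hours=1)),
]

print(rpo_gaps(services))        # ['payments']
print(recovery_order(services))  # ['payments', 'email', 'reporting']
```

Here the made-up payments service is flagged because a 15-minute backup interval cannot satisfy a 1-minute RPO; in practice such a service would need continuous replication rather than periodic backups.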
Testing the DR Plan
96% of companies with a tried-and-tested Disaster Recovery Plan came through ransomware attacks unhurt.
DR plans are useful only as long as they work. The effectiveness of a DR plan can be established by conducting routine DR drills in a controlled environment. While frequent DR drills can take a toll on business productivity, not doing one at all can have major repercussions on the business in the event of a disaster. It’s recommended that enterprises conduct a minimum of two DR drills a year. The DR drills help:
- Poke and plug holes – potential failures, shortcomings, and threats in the DR plan
- Identify the actual disaster recovery metrics – RPOs, RTOs, and MTTR (Mean Time To Recovery) and compare them with the targeted metrics to improve the plan
- Test the level of data and device security during the restoration process
- Test and improve the readiness of personnel in case of an actual disaster
- Audit and meet regulatory compliance policies
- Constantly update the DR plan based on business changes
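Comparing measured drill results against targets, as the second item in the list above suggests, can be as simple as the following sketch. All figures are hypothetical, the function names are invented for this example, and taking the worst observed restore time as the value to hold against the RTO is one reasonable convention among several.

```python
from statistics import mean


def mttr(recovery_minutes):
    """Mean Time To Recovery: average of observed recovery durations."""
    if not recovery_minutes:
        raise ValueError("need at least one recovery duration")
    return mean(recovery_minutes)


def compare_to_targets(measured, targets):
    """Return {metric: (measured, target, met)} for each target metric.

    A metric counts as met when the measured value does not exceed
    the target.
    """
    return {
        name: (measured[name], target, measured[name] <= target)
        for name, target in targets.items()
    }


# Hypothetical drill results, all in minutes.
drill_recovery_times = [42, 55, 38]    # time to restore in each drill
measured = {
    "RTO": max(drill_recovery_times),  # worst observed restore time
    "RPO": 20,                         # data-loss window in the worst drill
    "MTTR": mttr(drill_recovery_times),
}
targets = {"RTO": 60, "RPO": 15, "MTTR": 45}

for name, (got, want, met) in compare_to_targets(measured, targets).items():
    status = "met" if met else "MISSED"
    print(f"{name}: measured {got}, target {want} -> {status}")
```

With these sample numbers the restore-time and MTTR targets are met while the RPO target is missed, which is exactly the kind of gap a drill is meant to surface before a real disaster does.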
However, for many companies, their first DR drill is during an actual disaster. A study shows that 36% of the companies don’t test their DR plan at all, and while the rest do, most of them aren’t very thorough in their disaster recovery drills. Below are a few reasons for the lack of frequent DR drills:
- A lot of the processes involved in a DR drill are manual. Simulating conditions of a disaster, like disabling applications and taking systems and devices offline, taking backups of data and device configurations, and failing over to the backup data center – all of these activities require tremendous human effort.
- DR drills interfere with everyday business. Business and IT personnel are generally reluctant to participate in DR drills as they temporarily stop them from doing their jobs.
- DR drills involve major coordination between management and the application, network, security, and IT teams, who usually work in silos. Bringing them all onto the same page is in itself a herculean task.
Here are some things that further complicate DR drills:
- Lack of visibility into the IT infrastructure
- Backing up and restoring existing configurations of IT and network devices that span multiple vendors
- Too much dependence on network teams for application-related processes
- A slow and inefficient change management system
- Difficulty in ensuring security and compliance all along the process
The AUTOMATION+ Answer to a Quick and Easy DR Drill
AppViewX AUTOMATION+ is an IT management platform that specializes in DevOps-centric automation and orchestration of on-premise, hybrid-cloud, and multi-cloud networks.
AUTOMATION+ helps you flawlessly plan and exercise a disaster recovery scenario by letting you:
- Take inventory of all your IT components by adding them to the platform, which then monitors them in real-time
- Gain complete visibility into the IT infrastructure with the help of detailed InfraMaps
- Take manual or scheduled backups of all device configurations, irrespective of the vendor or environment
- Offer self-service application disable and enable actions to the application teams, thereby reducing dependence on the network teams and delays caused by back-and-forth communication
- Fast-track change management with a streamlined issue tracking system
- Empower teams to collaborate freely over integrated ChatOps tools
- Perform automated failovers with event-driven workflows, context-aware troubleshooting, and auto-remediation (in case of snags during failover), almost completely eliminating manual processes and errors
- Enforce security and compliance all along the DR test with policy-based automation
Use Case – Here’s how one of our customers restored their IT infrastructure in a matter of minutes after experiencing an outage due to a hurricane:
- A leading financial technology company used AppViewX to get their 300+ VoIP phones up and running in a little over 30 minutes, following a power outage caused by a hurricane.
- The network engineers built a workflow to automatically execute the CLI commands for rebooting the phones using the platform’s visual workflow studio, which they then shared with a number of teams.
- The teams could self-service the restoration workflows, and could therefore push several phones online simultaneously, eliminating overheads.
- Manually executing the restoration would’ve taken the network engineers well over 40 minutes for each phone, taking the back-and-forth exchange of tickets into account, and it would’ve been hours before all the phones were restored.
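The gain in this use case comes from running the same restoration step against many devices at once instead of one at a time. The sketch below illustrates that idea in generic Python; it is not the AppViewX workflow studio or its API. The `reboot_phone` function is a placeholder that only pretends to succeed, and the addresses are hypothetical values from the 192.0.2.0/24 documentation range.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def reboot_phone(address):
    """Placeholder for whatever actually reboots one phone.

    In a real setup this would run the vendor's CLI commands against the
    device; here it only pretends to succeed so the sketch is runnable.
    """
    return True


def restore_all(addresses, max_workers=20):
    """Reboot phones concurrently; return (succeeded, failed) address lists."""
    succeeded, failed = [], []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {pool.submit(reboot_phone, a): a for a in addresses}
        for future in as_completed(futures):
            address = futures[future]
            try:
                ok = future.result()
            except Exception:
                ok = False  # treat any error as a failed reboot
            (succeeded if ok else failed).append(address)
    return succeeded, failed


# Hypothetical addresses from the documentation range 192.0.2.0/24.
phones = [f"192.0.2.{i}" for i in range(1, 11)]
ok, bad = restore_all(phones)
print(f"{len(ok)} restored, {len(bad)} failed")
```

Collecting failures separately matters in practice: the devices that did not come back are the ones that still need a human to look at them.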