management: Disaster Recovery in Depth: More than Offsite Recovery
When I talk about disaster recovery (DR) at client sites, the conversation quickly moves to offsite disaster recovery.
We talk about their plans to recover the entire site in the case of a disaster, or if there are major hardware or environmental problems. And this is a really important part of any overall disaster recovery plan. But there's more to DR than offsite recovery.
For a truly resilient system, we need recovery in depth.
When I talk about 'holistic' disaster recovery, I mean having disaster recovery options and plans in depth. To explain this, I divide DR into different 'rings.' The final diagram looks like this:
Let's work our way from the inside out: starting with tasks.
In any operating system, we have different threads (tasks or SRBs in z/OS). Examples include a simple JCL step, a TSO command, a CICS transaction, or a z/OS UNIX thread. If one of these fails, we need to do some recovery: open files need to be closed, obtained memory released, incomplete units of work backed out, and diagnostic data output so we know what happened and can fix it.
Most people don't think too much about task recovery, because there are features that do it automatically. CICS automatically recovers from a failing CICS transaction. z/OS and JES do the same when a batch step fails. z/OS UNIX manages a failing UNIX thread, and TSO handles the TSO commands.
Our problem may not be a failure. Our task may go into a loop. In this case, CICS has features to abend such tasks, as do some monitors like Tivoli Omegamon and CA SYSVIEW.
But task recovery can't be taken for granted. Anyone doing new development will be thinking about task recovery. For example, let's take a CICS transaction. When is a unit of work finished? Should we take syncpoints, or let CICS syncpoint at the end of the transaction? There may be some failures that we want to handle ourselves using EXEC CICS HANDLEs. We may want to roll back units of work when a DB2 request fails, or write messages to a VSAM file when we can't use an MQ queue.
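The two design choices at the end of that paragraph (roll back when a database request fails, and fall back to a file when a queue is unavailable) can be sketched in a few lines. This is only an illustration of the pattern in Python, not CICS, DB2 or MQ code: the UnitOfWork class, process_order, and the db_request, queue_put and file_write callables are all hypothetical stand-ins.

```python
class UnitOfWork:
    """Hypothetical stand-in for a recoverable unit of work (not a CICS API)."""

    def __init__(self, store):
        self.store = store   # committed state
        self.pending = {}    # changes made since the last syncpoint

    def update(self, key, value):
        self.pending[key] = value

    def syncpoint(self):
        """Commit all pending changes."""
        self.store.update(self.pending)
        self.pending = {}

    def rollback(self):
        """Discard pending changes; committed state is untouched."""
        self.pending = {}


def send_with_fallback(message, primary, fallback):
    """Try the primary sink (think: a queue); on failure use the fallback (think: a file)."""
    try:
        primary(message)
        return "primary"
    except ConnectionError:
        fallback(message)
        return "fallback"


def process_order(store, order, db_request, queue_put, file_write):
    """Process one order as a single unit of work."""
    uow = UnitOfWork(store)
    uow.update(order["id"], order["qty"])
    try:
        db_request(order)
    except RuntimeError:
        # The database request failed: back out our own changes too.
        uow.rollback()
        return "rolled back"
    uow.syncpoint()
    return send_with_fallback(f"order {order['id']} done", queue_put, file_write)
```

The point of the sketch is that both decisions are made by the developer at design time: where the unit of work ends, and what happens when a dependent resource is unavailable.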
Batch jobs are another issue. If we're creating a batch job, what should we do when it fails? Is it a big problem that must be addressed immediately? If we want to restart the job, should we restart it from the first step, or some other step? Do we need to recover files that the job updated before retrying the job?
The next step out in our diagram is process recovery: recovery of address spaces or subsystems. For example, if DB2, CICS or MQ fails.
In this case, we need to recover the process. If we're lucky, there may be another redundant process that can pick up our workload (think CICSPlex, IMSPlex, MQ Queue Sharing Group or DB2 Data Sharing Group). Ideally, we'll have some automation that will automatically recover the failed process: z/OS ARM, or products like Tivoli System Automation.
In a perfect world, our automation will recover everything quickly. But of course, there will always be other problems we need to fix ourselves. For example, we may have a catalog or VSAM file error, network issue, security issue or more.
Most processes will have their own recovery procedures when they restart. For example, CICS automatically backs out any in-doubt units of work, and 'settles up' with other resource managers like DB2 and WebSphere MQ.
In many sites I visit, these manual recovery processes are rarely tested. Most sites haven't recovered a catalog in years, or tested their RACF database recovery procedures. Some won't have tried to log on to TSO with an 'emergency' TSO procedure, or tested their emergency VTAM startup that starts the bare minimum needed to log on.
Many won't have done any resilience testing for a long time: for example, crashing an MQ subsystem and seeing how the CICS applications handle it, or forcing a CICS MAXTASKS condition and confirming that automation handles the situation.
Next on our journey outwards is a system failure: z/OS fails. This is rarely clean, and can take many forms. Again, it would be nice to have a second z/OS system to take over the workload. We'll hopefully have automation that will automatically perform any standalone dump and re-IPL the failing system (like the z/OS AutoIPL feature).
We'll have a plan in case we can't bring up any z/OS system: a single-pack system, a standalone editor, or well-tested standalone restore utilities like standalone ICKDSF and DFSMSdss. Even better, we'll have tested all of these utilities, and have procedures available that explain how to use them.
Hardware failures are rare, but can happen. And it may not be the hardware's fault. A support engineer or administrator may make an error, or surrounding infrastructure like electrical or cooling may have problems.
Again, hopefully we have redundant data centre equipment and systems: maybe a second z/OS CEC or DASD subsystem with HyperSwap. Maybe two separate disaster zones with separate electricity, cooling and telecommunications. We may have an uninterruptible power supply to cover for power failures, and a generator for longer outages.
A location failure is where your data centre is gone. It has flooded, had a fire, or some major issue has taken out essential infrastructure. This is a major event, and our data centre isn't coming back for hours, or even days. This is when you recover to your offsite recovery zone.
Recovery Time and Impact
This diagram is simple, but highlights some interesting facts. First up, the time to recovery, or RTO. As we go towards the outside of the wheel, the time to recover increases. A task can fail and be recovered inside a second. A process failure may take a minute or two. A z/OS failure may take 30 minutes to recover. And it goes on.
Another thing it shows is the cost and impact of the recovery. As we move towards the outside, the number of systems and applications impacted by our recovery increases. If we IPL, we impact all processes on the z/OS system, even if they are working fine. If we restart our CICS region, we impact all the tasks running in the region, even if they are working fine.
So, it makes sense that we want the recovery to be as close to the centre as possible. We don't want to IPL if a single CICS region fails. We don't want to restart our CICS region if a task abends. So, our recovery plans and procedures should aim to recover at the lowest level: the closest to the centre of our diagram.
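One way to picture this 'recover at the lowest level' rule is as an escalation loop: try the innermost ring that can handle the failure, and only move outwards when that recovery doesn't work. The Python below is a rough sketch under assumptions. The ring names follow this article, the recovery times are only the approximate figures quoted above (no figures are given for the two outer rings, so none are listed), and the recover and impacted functions and the action callables are hypothetical illustrations, not part of any product.

```python
# Rings from the inside out, as described in this article.
RINGS = ["task", "process", "system", "hardware", "location"]

# Rough recovery times quoted in the text, in seconds. The article gives
# no specific figures for the outer rings, so they are omitted here.
TYPICAL_RECOVERY_SECONDS = {"task": 1, "process": 120, "system": 1800}


def impacted(ring):
    """Rings disturbed by recovering at `ring`: that ring and everything inside it."""
    return RINGS[: RINGS.index(ring) + 1]


def recover(failed_ring, actions):
    """Attempt recovery starting at the failed ring, escalating outwards.

    `actions` maps a ring name to a callable that returns True when the
    recovery at that ring succeeded. Returns the ring that recovered the
    failure, or None if no ring could.
    """
    start = RINGS.index(failed_ring)
    for ring in RINGS[start:]:
        action = actions.get(ring)
        if action is not None and action():
            return ring
    return None
```

For example, if a task-level recovery fails but restarting the region works, recover("task", actions) returns "process", and impacted("process") shows that only the task and process rings were disturbed, not the whole z/OS system.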
Recovery in Depth
Another thing the diagram shows us is that our offsite recovery is used in a location failure: only one of five rings of recovery. Performing offsite testing is essential, and may also provide some testing for the other rings. It may also be required for compliance and customer trust. However, there are four other rings that give us our recovery in depth. These other rings are far more likely to be used, and should not be forgotten.