longpelaexpertise.com.au/ezine/SampleDR.php?ezinemode=printfriend

LongEx Mainframe Quarterly - November 2009

technical: A Sample z/OS Disaster Recovery Plan

Mainframes are important. Very important. Critical, in fact. So what do you do when they become unavailable – when there's a disaster? This disaster could be anything from a disk failure, an application 'running amok' and corrupting data, or the entire data centre blowing up. In fact, the number of potential disasters is huge.

Every computer site (mainframe or otherwise) has to have a plan for such disasters. This disaster recovery (DR or business continuity) plan needs to be more than just a pile of paper. It must include detailed recovery steps - steps that have actually been tested, and recently. In this article, I'm going to lay out a sample plan to recover a z/OS system. Based on a few different plans I've seen at various sites, it will try to cover all potential disasters by preparing for the worst – where the entire data centre is unavailable.

What I'm covering is only a small part of an overall DR plan, which would include issues such as recovering networks, databases, production data, email, telephone services and other computing platforms.

Our Systems and DR Site

Let's look first at our example production z/OS system. We will have a single z/OS image using RACF, DFDSS (or DFSMSdss to be formal), and DFHSM (DFSMShsm). We will have tape management software such as CA-1, DFSMSrmm or BMC Control-M/Tape.

We will be restoring our z/OS system on hardware at a remote DR site – a 'hot site'. This site will have all the hardware we need – processor, DASD, and tape drives (but no ATL). This is what most DR site providers will offer.

However this DR site won't have any connection to our production site (no remote copy or PPRC), and won't be running our z/OS. So in a disaster, we need to restore everything from our backups.

Preparations

Any plan is based around backups. We will have two types of backups: weekly and daily.

Weekly (every Sunday)

Full pack backup of critical system packs – a DFDSS physical backup.
Full pack backup of DR One Pack System – a DFDSS physical backup.
Job to check that all critical datasets have been backed up. Any datasets not backed up are backed up using a DFDSS logical backup.
Printout list of backup datasets, and which tape volume serial number they are on. The tape management software will have a standard report for this.
Printout a list of all system datasets, and which tape volume serial number they are on. Again, the tape management software will have a report for this.
All tapes from the above backups are removed from the Automated Tape Library (ATL) and sent to a safe location offsite.

So what we're doing is backing up all our critical system packs – sysres, catalog, TSO & ISPF datasets, the works. These packs will all be in one storage group. We're also backing up our one pack system – an emergency z/OS system that sits on one volume.

We have a job that checks all critical datasets have been backed up. This job runs an assembler program that checks for any datasets allocated by system tasks that are not in the system storage group. Any found are backed up using logical DFDSS – additional protection in case system datasets are placed incorrectly.

Daily (every weekday evening)

Backup of databases, journals and other production data (database backup utilities or logical DFDSS).
Backup of all datasets (DFHSM daily backup - these stay onsite).
Backup of selected system datasets:
- RACF databases (DFDSS logical dump)
- HSM control datasets (DFHSM BACKVOL CDS command)
- Automated Job Scheduler datasets (job scheduler utility)
- Tape management catalog (tape managment utility)
- ICF Catalogs (IDCAMS or DFDSS, and DFHSM – just to be sure)
Produce a hardcopy list of offsite tapes, and what is on each. We need this to restore our system datasets, and in case we can't recovery our tape catalog.
Produce a list of critical datasets, what DASD volumes they were on, and where they are backed ups
All tapes from the above backups are removed from the Automated Tape Library (ATL) and sent to the offsite location.

So we're doing daily backup of volatile systems datasets, as well as normal production and database backups. Take a look at the order in which we're working. Production and database backups first, followed by volatile system datasets, then the ICF catalogs (with all the backups we've just done), and finally the tape management catalog (with all the tapes we've just used for our backups). Once all this is done, the reports are prepared. This order is important, we want our ICF catalogs and tape management catalog to hold all the latest information - including the latest backups.

At the Offsite Storage Location

We have a vault at some, safe offsite location – preferable not too close to our production data centre. At this location we will have:

All our offsite backups
A tape with standalone ICKDSF and DFDSS.
A copy of the latest DR documentation.
IPL parms documentation with latest sysparm information.
A list of all tapes, and what's on them on paper or CD.
A list of all system and production datasets, and what tape they're backed up on. Also on paper or CD. We won't have a tape management system until our z/OS and tape management catalogs are restored.
Copies of all product manuals (IBM and ISV) on CD – covering our software versions.

To speed things up, we will have a couple of boxes kept separate, with the DR documentation, standalone ICKDSF and DFDSS, one pack system backups, and system storage group backups. This lets the systems programmers start restoring fast, without wading through hundreds of tapes.

Our DR documentation will be reviewed and updated during every DR test. This will include step-by-step instructions on how to restore the z/OS system. Ideally, a systems programmer with no knowledge of our systems should be able to follow these instructions and restore our system.

Restoring Our System

So we're now at our hot site with a whole lot of tapes. Here's what we do:

One pack system recovered from weekly backup (standalone restore). This gives us a z/OS system we can use to restore all the system datasets, and do a few other things. Some hot sites may provide a z/OS system for you.
One pack system IPLed.
System storage group restored from weekly full pack backup
Any critical system datasets not on the systems packs restored from logical DFDSS backup.
Volatile system datasets restored from logical DFDSS backup.
JES spool re-formatted. It doesn't make sense to backup the JES spool, so we will initialise volumes, and recreate the spool.
Page datasets re-defined – if they were not on the system packs.
Production system IPLed. Latest IPL parms should be documented somewhere, together with IODF etc.
JES COLD start with SPOOL re-format.
All necessary systems started tasks (including job scheduler, automation software, tape management software, and DFHSM) re-started.
Basic functionality checked

From here, we should have our production z/OS up and running again. We can now get our network up and running, restore databases and production data, run batch streams, and get our online systems up again.

Note that whoever restores z/OS will need a RACF administrator logon to make any RACF changes necessary, and to restore datasets they may not normally have access to.

Testing the DR Plan

We will perform a full test of the plan once a year. This means going to the hot site and restoring our systems. Before each test we will check:

That we have a copy of the DR site IODF on our one pack DR system and production system.
All ISV products with a CPUID dependent license code will work at the DR site.
Cryptographic master keys in cryptographic engines at DR site are OK.
Tape drives and available DASD at DR site are still suitable and sufficient.

Regularly conducting DR tests with no notice is a great way of testing our plans in a more real environment.

The Benefits of this Plan

This plan provides a way of recovering our z/OS system at a remote hot site. But it also provides facilities to recover from all sorts of disasters, including:

Catalog failure. We have daily catalog backups (both DFDSS and DFHSM) to restore from. We will also having something like ICFRU or T-REX to forward recovery our catalogs from SMF records.
Tape catalog failure. We have hardcopy reports that let us find out latest tape catalog backups.
Critical dataset or volume failure. From RACF databases to JES SPOOL, we have facilities to restore or recreate all system datasets, and a one pack system from where we can do it.
ISPF dataset error. All it takes is for one critical ISPF dataset to go missing to stop anyone from using TSO/ISPF. We will have a 'bare bones' ISPF profile with the minimum to edit datasets, daily backups with HSM so a system command will restore any problem dataset, and our one pack system to IPL if all else fails.
Security error. I remember one case where a security administrator removed all access to sys1.vtamlst, stopping the whole system. Switching to the backup RACF database, or restoring it from backups fixes the problem.
Production dataset corruption or error. We can use the daily DFHSM backups, or the offsite logical DFDSS backups to restore individual datasets.

This certainly isn't a comprehensive DR plan, nor is it the only way to do things. But it shows one way that it can be done, and introduces some of the issues to consider when creating your own DR plan.

David Stephens