A Sample z/OS Disaster Recovery Plan
Mainframes are important. Very important. Critical, in fact. So what do you
do when a disaster makes them unavailable? That disaster could be anything
from a disk failure, to an application 'running amok' and corrupting data,
to the entire data centre blowing up. The number of potential disasters is
huge.
Every computer site (mainframe or otherwise) has to have a plan for such disasters.
This disaster recovery (DR or business continuity) plan needs to be more than
just a pile of paper. It must include detailed recovery steps - steps that have
actually been tested, and recently. In this article, I'm going to lay out a
sample plan to recover a z/OS system. Based on a few different plans I've seen
at various sites, it will try to cover all potential disasters by preparing
for the worst case: the entire data centre being unavailable.
What I'm covering is only a small part of an overall DR plan, which would include
issues such as recovering networks, databases, production data, email, telephone
services and other computing platforms.
Our Systems and DR Site
Let's look first at our example production z/OS system. We will have a single
z/OS image using RACF, DFDSS (or DFSMSdss to be formal), and DFHSM (DFSMShsm).
We will have tape management software such as CA-1, DFSMSrmm or BMC Control-M/Tape.
We will be restoring our z/OS system on hardware at a remote DR site: a
'hot site'. This site will have all the hardware we need: processor,
DASD, and tape drives (but no Automated Tape Library, or ATL). This is what
most DR site providers will offer.
However, this DR site won't have any connection to our production site (no
remote copy or PPRC), and won't be running our z/OS. So in a disaster, we need
to restore everything from our backups.
Preparations
Any plan is based around backups. We will have two types of backups: weekly
and daily.
Weekly (every Sunday)
- Full pack backup of critical system packs: a DFDSS physical backup.
- Full pack backup of the DR one pack system: a DFDSS physical backup.
- Job to check that all critical datasets have been backed up. Any datasets
not backed up are backed up using a DFDSS logical backup.
- Print a list of backup datasets, and which tape volume serial number they
are on. The tape management software will have a standard report for this.
- Print a list of all system datasets, and which tape volume serial number
they are on. Again, the tape management software will have a report for this.
- All tapes from the above backups are removed from the ATL and sent to a
safe location offsite.
So what we're doing is backing up all our critical system packs: sysres,
catalog, TSO & ISPF datasets, the works. These packs will all be in one
storage group. We're also backing up our one pack
system: an emergency z/OS system that sits on one volume.
We have a job that checks that all critical datasets have been backed up. This
job runs an assembler program that checks for any datasets allocated by system
tasks that are not in the system storage group. Any found are backed up using
logical DFDSS: additional protection in case system datasets have been placed
incorrectly.
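To make this concrete, here is a minimal sketch of the sort of DFDSS (ADRDSSU)
job we might run each Sunday. The job, dataset, volume and filter names
(DRWKFULL, DR.WEEKLY.SYS001, SYS001, SYS1.**) are invented for illustration; a
real job would cover every volume in the system storage group and follow your
own site's JCL standards.

//DRWKFULL JOB (ACCT),'WEEKLY DR DUMP',CLASS=A,MSGCLASS=X
//* Physical full pack dump of one system volume to tape
//DUMPFULL EXEC PGM=ADRDSSU
//SYSPRINT DD SYSOUT=*
//DASD1    DD UNIT=3390,VOL=SER=SYS001,DISP=OLD
//TAPE1    DD DSN=DR.WEEKLY.SYS001,UNIT=TAPE,
//            DISP=(NEW,CATLG),LABEL=(1,SL)
//SYSIN    DD *
  DUMP FULL INDDNAME(DASD1) OUTDDNAME(TAPE1) -
       ALLDATA(*) ALLEXCP
/*
//* Logical dump of critical datasets found outside the system
//* storage group (the filter is illustrative only)
//DUMPLOG  EXEC PGM=ADRDSSU
//SYSPRINT DD SYSOUT=*
//TAPE2    DD DSN=DR.WEEKLY.EXTRAS,UNIT=TAPE,
//            DISP=(NEW,CATLG),LABEL=(1,SL)
//SYSIN    DD *
  DUMP DATASET(INCLUDE(SYS1.**)) -
       OUTDDNAME(TAPE2) -
       TOLERATE(ENQFAILURE)
/*

ALLDATA(*) and ALLEXCP make sure allocated but unused space information is
dumped as well, and TOLERATE(ENQFAILURE) lets the logical dump keep going if
a dataset happens to be in use.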
Daily (every weekday evening)
- Backup of databases, journals and other production data (database backup
utilities or logical DFDSS).
- Backup of all datasets (DFHSM daily backup - these stay onsite).
- Backup of selected system datasets:
- RACF databases (DFDSS logical dump)
- HSM control datasets (DFHSM BACKVOL CDS command)
- Automated Job Scheduler datasets (job scheduler utility)
- Tape management catalog (tape management utility)
- ICF Catalogs (IDCAMS or DFDSS, and DFHSM just to be sure)
- Produce a hardcopy list of offsite tapes, and what is on each. We need this
to restore our system datasets, and in case we can't recover our tape catalog.
- Produce a list of critical datasets, what DASD volumes they were on, and
where they are backed up.
- All tapes from the above backups are removed from the Automated Tape Library
(ATL) and sent to the offsite location.
So we're doing daily backups of volatile system datasets, as well as normal
production and database backups. Take a look at the order in which we're working.
Production and database backups first, followed by volatile system datasets,
then the ICF catalogs (with all the backups we've just done), and finally the
tape management catalog (with all the tapes we've just used for our backups).
Once all this is done, the reports are prepared. This order is important: we
want our ICF catalogs and tape management catalog to hold all the latest
information, including the latest backups.
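As an illustration of the 'selected system datasets' step, here's a sketch of
a daily job covering the RACF databases, the DFHSM control datasets, and one
ICF user catalog. Every job, dataset and catalog name here (DRDAILY,
SYS1.RACF.**, UCAT.PROD01, and so on) is an assumption; use your own names,
and add steps for the job scheduler and tape management catalog backups.

//DRDAILY  JOB (ACCT),'DAILY DR BACKUPS',CLASS=A,MSGCLASS=X
//* Logical dump of the RACF primary and backup databases
//RACFDMP  EXEC PGM=ADRDSSU
//SYSPRINT DD SYSOUT=*
//TAPE1    DD DSN=DR.DAILY.RACF,UNIT=TAPE,
//            DISP=(NEW,CATLG),LABEL=(1,SL)
//SYSIN    DD *
  DUMP DATASET(INCLUDE(SYS1.RACF.**)) -
       OUTDDNAME(TAPE1) -
       TOLERATE(ENQFAILURE)
/*
//* Back up the DFHSM control datasets from a batch TSO step
//HSMCDS   EXEC PGM=IKJEFT01
//SYSTSPRT DD SYSOUT=*
//SYSTSIN  DD *
  HSENDCMD BACKVOL CDS
/*
//* Export one ICF user catalog; TEMPORARY leaves the catalog
//* in place and usable after the export
//CATEXP   EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//TAPE2    DD DSN=DR.DAILY.UCAT1,UNIT=TAPE,
//            DISP=(NEW,CATLG),LABEL=(1,SL)
//SYSIN    DD *
  EXPORT UCAT.PROD01 -
         OUTFILE(TAPE2) -
         TEMPORARY
/*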
At the Offsite Storage Location
We have a vault at a safe offsite location, preferably not too close to our
production data centre. At this location we will have:
- All our offsite backups
- A tape with standalone ICKDSF and DFDSS.
- A copy of the latest DR documentation.
- IPL parms documentation with latest sysparm information.
- A list of all tapes, and what's on them, on paper or CD.
- A list of all system and production datasets, and what tape they're backed
up on. Also on paper or CD. We won't have a tape management system until our
z/OS and tape management catalogs are restored.
- Copies of all product manuals (IBM and ISV) on CD covering our software
versions.
To speed things up, we will have a couple of boxes kept separate, with the
DR documentation, standalone ICKDSF and DFDSS, one pack system backups, and
system storage group backups. This lets the systems programmers start restoring
fast, without wading through hundreds of tapes.
Our DR documentation will be reviewed and updated during every DR test. This
will include step-by-step instructions on how to restore the z/OS system. Ideally,
a systems programmer with no knowledge of our systems should be able to follow
these instructions and restore our system.
Restoring Our System
So we're now at our hot site with a whole lot of tapes. Here's what we do:
- One pack system recovered from weekly backup (standalone restore). This
gives us a z/OS system we can use to restore all the system datasets, and
do a few other things. Some hot sites may provide a z/OS system for you.
- One pack system IPLed.
- System storage group restored from weekly full pack backup (see the restore sketch after this list).
- Any critical system datasets not on the systems packs restored from logical
DFDSS backup.
- Volatile system datasets restored from logical DFDSS backup.
- JES spool re-formatted. It doesn't make sense to back up the JES spool,
so we will initialise volumes, and recreate the spool.
- Page datasets re-defined if they were not on the system packs.
- Production system IPLed. The latest IPL parms should be in the offsite IPL
parms documentation, together with the IODF and so on.
- JES COLD start with SPOOL re-format.
- All necessary system started tasks (including job scheduler, automation
software, tape management software, and DFHSM) re-started.
- Basic functionality checked.
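For the restore side, a minimal sketch of the kind of job we'd run from the
one pack system: a full pack restore of one system volume, and a re-define of
a local page dataset. The tape and dataset names match the invented ones in
the weekly dump sketch above, and the page dataset name and size are also
invented. COPYVOLID puts the original volume serial back on the target pack,
and PURGE lets DFDSS overwrite anything already on it.

//DRREST   JOB (ACCT),'RESTORE SYSTEM PACKS',CLASS=A,MSGCLASS=X
//* Full pack restore of one system volume from the weekly dump
//RESTFULL EXEC PGM=ADRDSSU
//SYSPRINT DD SYSOUT=*
//TAPE1    DD DSN=DR.WEEKLY.SYS001,DISP=OLD,UNIT=TAPE
//DASD1    DD UNIT=3390,VOL=SER=WORK01,DISP=OLD
//SYSIN    DD *
  RESTORE FULL INDDNAME(TAPE1) OUTDDNAME(DASD1) -
       COPYVOLID PURGE
/*
//* Re-define a local page dataset if it wasn't on the system packs
//DEFPAGE  EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  DEFINE PAGESPACE(NAME(SYS1.LOCALA.PAGE) -
         CYLINDERS(300) -
         VOLUME(PAGE01))
/*

The JES2 cold start is then a matter of replying to the JES2 options prompt
(or coding the start PARM) with something like COLD,FORMAT, though the exact
options depend on your own JES2 setup.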
From here, we should have our production z/OS up and running again. We can
now get our network up and running, restore databases and production data, run
batch streams, and get our online systems up again.
Note that whoever restores z/OS will need a RACF administrator logon to make
any RACF changes necessary, and to restore datasets they may not normally have
access to.
Testing the DR Plan
We will perform a full test of the plan once a year. This means going to the
hot site and restoring our systems. Before each test we will check:
- That we have a copy of the DR site IODF on our one pack DR system and production
system.
- That all ISV products with CPUID-dependent license codes will work at the
DR site.
- That cryptographic master keys in the cryptographic engines at the DR site are OK.
- That tape drives and available DASD at the DR site are still suitable and sufficient.
Regularly conducting DR tests with no notice is a great way of testing our plans
in a more realistic environment.
The Benefits of this Plan
This plan provides a way of recovering our z/OS system at a remote hot site.
But it also provides facilities to recover from all sorts of disasters, including:
- Catalog failure. We have daily catalog backups (both DFDSS and DFHSM) to
restore from. We will also have something like ICFRU or T-REX to forward
recover our catalogs from SMF records.
- Tape catalog failure. We have hardcopy reports that let us find our latest
tape catalog backups.
- Critical dataset or volume failure. From RACF databases to the JES spool, we
have facilities to restore or recreate all system datasets, and a one pack
system from which we can do it.
- ISPF dataset error. All it takes is one critical ISPF dataset going
missing to stop anyone from using TSO/ISPF. We will have a 'bare bones' ISPF
profile with the minimum needed to edit datasets, daily backups with HSM so a
single TSO command will restore any problem dataset, and our one pack system
to IPL if all else fails.
- Security error. I remember one case where a security administrator removed
all access to SYS1.VTAMLST, stopping the whole system. Switching to the backup
RACF database, or restoring it from a backup, fixes the problem.
- Production dataset corruption or error. We can use the daily DFHSM backups,
or the offsite logical DFDSS backups, to restore individual datasets (a small
example follows this list).
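To show how little is involved in the individual dataset and RACF database
cases once the backups exist, two illustrative TSO commands (the dataset name
is invented):

  HRECOVER 'PROD.PAYROLL.MASTER' REPLACE

asks DFHSM to recover the most recent backup copy over the damaged dataset, and

  RVARY SWITCH

swaps the primary and backup RACF databases (the operator has to approve an
RVARY with a password at the console).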
This certainly isn't a comprehensive DR plan, nor is it the only way to do
things. But it shows one way that it can be done, and introduces some of the
issues to consider when creating your own DR plan.
David Stephens