LongEx Mainframe Quarterly - November 2009
Mainframes are important. Very important. Critical, in fact. So what do you do when they become unavailable – when there's a disaster? This disaster could be anything from a disk failure or an application 'running amok' and corrupting data, to the entire data centre blowing up. In fact, the number of potential disasters is huge. Every computer site (mainframe or otherwise) has to have a plan for such disasters. This disaster recovery (DR, or business continuity) plan needs to be more than just a pile of paper. It must include detailed recovery steps - steps that have actually been tested, and tested recently.

In this article, I'm going to lay out a sample plan to recover a z/OS system. Based on a few different plans I've seen at various sites, it tries to cover all potential disasters by preparing for the worst – where the entire data centre is unavailable. What I'm covering is only a small part of an overall DR plan, which would also cover issues such as recovering networks, databases, production data, email, telephone services and other computing platforms.

Our Systems and DR Site

Let's look first at our example production z/OS system. We will have a single z/OS image using RACF, DFDSS (or DFSMSdss to be formal), and DFHSM (DFSMShsm). We will have tape management software such as CA-1, DFSMSrmm or BMC Control-M/Tape. We will be restoring our z/OS system on hardware at a remote DR site – a 'hot site'. This site will have all the hardware we need – processor, DASD and tape drives (but no ATL). This is what most DR site providers will offer. However, this DR site won't have any connection to our production site (no remote copy or PPRC), and won't be running our z/OS. So in a disaster, we need to restore everything from our backups.

Preparations

Any plan is based around backups. We will have two types of backups: weekly and daily.

Weekly (every Sunday)
So what we're doing is backing up all our critical system packs – sysres, catalog, TSO and ISPF datasets, the works. These packs will all be in one storage group. We're also backing up our one pack system – an emergency z/OS system that sits on a single volume. We also have a job that checks that all critical datasets have been backed up. This job runs an assembler program that looks for any datasets allocated by system tasks that are not in the system storage group. Any found are backed up using a logical DFDSS dump – additional protection in case system datasets have been placed on the wrong volumes.
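To make this more concrete, a weekly full-volume dump of one of these system packs might look something like the DFSMSdss job below. This is only a sketch: the job name, dataset names, volser and tape unit are my own assumptions, not part of the plan itself.

//DRWKDMP  JOB (ACCT),'WEEKLY DR DUMP',CLASS=A,MSGCLASS=X
//* Full-volume DFSMSdss dump of one system pack to a tape that
//* will be sent to the offsite storage location
//DUMP     EXEC PGM=ADRDSSU
//SYSPRINT DD SYSOUT=*
//DASD     DD UNIT=SYSALLDA,VOL=SER=SYSR01,DISP=OLD
//TAPE     DD DSN=DR.WEEKLY.SYSR01.DUMP,UNIT=TAPE,
//            DISP=(NEW,CATLG),LABEL=(1,SL)
//SYSIN    DD *
  DUMP FULL INDDNAME(DASD) OUTDDNAME(TAPE) -
       ALLDATA(*) ALLEXCP COMPRESS
/*

A similar job would run for each volume in the system storage group, while a logical (dataset-level) DUMP would be used for any stray system datasets found outside it.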
Daily (every weekday evening)

So we're doing a daily backup of volatile system datasets, as well as the normal production and database backups. Take a look at the order in which we're working: production and database backups first, followed by the volatile system datasets, then the ICF catalogs (with all the backups we've just done), and finally the tape management catalog (with all the tapes we've just used for our backups). Once all this is done, the reports are prepared. This order is important: we want our ICF catalogs and tape management catalog to hold all the latest information, including the latest backups.
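As an illustration of one of these steps, the daily backup of an ICF user catalog could be an IDCAMS EXPORT along the lines of the job below. Again, the catalog and dataset names are hypothetical.

//DRCATBK  JOB (ACCT),'DAILY CATALOG BACKUP',CLASS=A,MSGCLASS=X
//* Export an ICF user catalog after the production and system
//* backups have run, so it knows about all of today's backups
//EXPORT   EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//CATOUT   DD DSN=DR.DAILY.UCAT01.EXPORT,UNIT=TAPE,
//            DISP=(NEW,CATLG),LABEL=(1,SL)
//SYSIN    DD *
  EXPORT UCAT.PROD01 -
         OUTFILE(CATOUT) -
         TEMPORARY
/*

The TEMPORARY keyword leaves the catalog in place and usable; the export is purely a backup copy.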
At the Offsite Storage Location

We have a vault at a safe offsite location – preferably not too close to our production data centre. At this location we will keep our offsite backup tapes and our DR documentation. To speed things up, we will have a couple of boxes kept separate, with the DR documentation, standalone ICKDSF and DFDSS, the one pack system backups, and the system storage group backups. This lets the systems programmers start restoring quickly, without wading through hundreds of tapes. Our DR documentation will be reviewed and updated during every DR test. It will include step-by-step instructions on how to restore the z/OS system. Ideally, a systems programmer with no knowledge of our systems should be able to follow these instructions and restore our system.

Restoring Our System

So we're now at our hot site with a whole lot of tapes. Here's what we do:

1. Use the standalone ICKDSF to initialise and label the DASD volumes we need.
2. Use the standalone DFDSS to restore the one pack system, and IPL it.
3. From the one pack system, restore the system storage group volumes from the weekly backups.
4. Restore the ICF catalogs and the tape management catalog from the latest daily backups.
5. Restore any remaining system datasets from the logical backups.
6. IPL our production z/OS system.
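To show roughly what one of the volume restores might look like, here is a hypothetical DFSMSdss full-volume restore of a system pack from its weekly dump tape. The dataset name and volsers are assumptions; in practice the job would come straight from the DR documentation.

//DRRESTOR JOB (ACCT),'DR VOLUME RESTORE',CLASS=A,MSGCLASS=X
//* Restore one system pack from its weekly full-volume dump.
//* COPYVOLID relabels the DR volume with the original volser;
//* PURGE allows unexpired datasets on the target to be overwritten.
//RESTORE  EXEC PGM=ADRDSSU
//SYSPRINT DD SYSOUT=*
//TAPE     DD DSN=DR.WEEKLY.SYSR01.DUMP,UNIT=TAPE,DISP=OLD,
//            LABEL=(1,SL)
//DASD     DD UNIT=SYSALLDA,VOL=SER=DRV001,DISP=OLD
//SYSIN    DD *
  RESTORE FULL INDDNAME(TAPE) OUTDDNAME(DASD) -
          COPYVOLID PURGE
/*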
From here, we should have our production z/OS up and running again. We can now bring up our network, restore databases and production data, run batch streams, and get our online systems going again. Note that whoever restores z/OS will need a RACF administrator logon, both to make any RACF changes necessary and to restore datasets they may not normally have access to.
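As a hypothetical example of that last point, the DR systems programmer's userid (invented here as DRSYSP1) could be given the necessary authority with a batch TSO job like this:

//DRRACF   JOB (ACCT),'DR RACF SETUP',CLASS=A,MSGCLASS=X
//* SPECIAL allows RACF administration changes; OPERATIONS allows
//* datasets to be restored regardless of normal profile access
//TSO      EXEC PGM=IKJEFT01
//SYSTSPRT DD SYSOUT=*
//SYSTSIN  DD *
  ALTUSER DRSYSP1 SPECIAL OPERATIONS
/*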
Testing the DR Plan

We will perform a full test of the plan once a year. This means going to the hot site and restoring our systems. Before each test we will check that our offsite backups are complete and current, and that our DR documentation is up to date.
The Benefits of this Plan

This plan provides a way of recovering our z/OS system at a remote hot site. But it also provides facilities to recover from all sorts of smaller disasters, including the loss of a single disk volume, system datasets that have been corrupted or placed on the wrong volumes, and a production z/OS system that won't IPL (which is where the one pack system earns its keep).
This certainly isn't a comprehensive DR plan, nor is it the only way to do things. But it shows one way it can be done, and introduces some of the issues to consider when creating your own DR plan.