technical: Five Things That Stopped or Crashed z/OS
Today I'm going to tell some 'ghost stories': stories from my past when I saw a z/OS system crash, or at least be brought to its knees. But I'm going to do more than try to scare you. I'm also going to discuss how each problem was solved, including any steps that didn't work. And I'm not going to stop there. These ghost stories are from my past as a z/OS systems programmer, and so are quite a few years old. So I'm also going to talk about whether they are still relevant today.
I must admit that some of these problems were my own fault: when they were, I'm owning up. You'll also see that I sometimes use the term 'I', meaning that I personally did the recovery. But I didn't do all of them: in other examples, I use 'we' to indicate that someone else did the recovery, and I watched or helped.
Error 1: User Catalog Failure
I've actually had to recover a production user catalog three times in my past: twice from a DASD failure, and once where the user catalog was accidentally deleted.
The problem with losing a user catalog is that so many things rely on it - especially TSO/ISPF sessions. So in all my examples, normal TSO/ISPF functionality was lost as sessions tried to allocate datasets cataloged in the missing user catalog. Luckily, z/OS itself can still function from the master catalog, though many applications didn't take too well to a missing catalog. Interestingly, many online subsystems continued to operate, as they weren't trying to allocate anything new.
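To see why, it helps to remember how z/OS finds a user catalog: an alias entry in the master catalog relates a high-level qualifier to that catalog. A minimal IDCAMS sketch (the qualifier and catalog names are invented for illustration):

//DEFALIAS EXEC PGM=IDCAMS
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  /* Relate high-level qualifier PROD to user catalog UCAT.PROD01. */
  /* Every PROD.* dataset is then located via UCAT.PROD01 - so if  */
  /* that catalog is lost, allocations of those datasets fail.     */
  DEFINE ALIAS(NAME(PROD) RELATE(UCAT.PROD01))
  /* Show which catalog an alias points to */
  LISTCAT ENTRIES(PROD) ALL
/*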
To recover the user catalog in each case, I:
- Recovered the user catalog from daily catalog backups. We were a little paranoid (as you would be if you'd lost a couple of user catalogs in the past), and backed them up three times: once in a batch job using IDCAMS EXPORT, once in a batch job using IBM DFSMSdss, and once using DFSMShsm automatic backup. Dataset names and tape volume serial numbers of all these backups were printed out every night. So we could find the dataset name and tape volume, and code a restore job with DSN and VOL=SER (a sketch of such a job appears after this list).
- Forward-recovered the catalog from SMF records. We used what is now Rocket Software's Catalog Solution. However, IBM provides ICFRU free with z/OS, and there are other products such as CA Crews and Dino-Software's T-REX.
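As an illustration of the first option, the restore job from the IDCAMS EXPORT backup looked something like this sketch - the job name, dataset names, catalog name and volume serial are all placeholders rather than the real ones:

//RESTCAT  JOB (ACCT),'RESTORE UCAT',CLASS=A,MSGCLASS=X
//* Restore a user catalog from its IDCAMS EXPORT backup tape.
//* DSN and VOL=SER come from the nightly printout of backups.
//IMPORT   EXEC PGM=IDCAMS
//BACKUP   DD DSN=BACKUP.UCAT.PROD01,DISP=OLD,
//            UNIT=TAPE,VOL=SER=T00123
//SYSPRINT DD SYSOUT=*
//SYSIN    DD *
  /* IMPORT rebuilds the catalog; ALIAS redefines its aliases */
  IMPORT INFILE(BACKUP) -
         OUTDATASET(UCAT.PROD01) -
         ALIAS
/*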
I know what you're thinking: how did we logon to submit the restore job when TSO/ISPF was down? I'll cover that soon.
This sort of problem is unlikely to happen today: DASD is more reliable. However, it's not impossible (one of my recoveries was needed when a catalog was accidentally deleted - by me). So catalog recovery is still important, and the recovery steps are exactly the same.
Error 2: Deleted ISPF Dataset
One of our systems programmers updated the TSO logon procedure JCL, changing a dataset name hard-coded in the JCL. The new dataset name was misspelled, so any new logons failed. Unfortunately, this proc was used by a lot of people, including systems programmers and operations staff. So although this didn't kill z/OS, it stopped any new TSO users from logging on. At the time we used IBM Infoman as our core problem management system (TSO based), and were ramping up our electronic storage of reports using IBM RMDS (TSO/ISPF based). So it wasn't just technical staff that were affected.
Of course the change was done early in the morning, maximising the damage as people tried to logon at the beginning of their working day.
Fortunately, the solution wasn't too difficult. We had another 'emergency' TSO logon procedure for use in this situation: one that used only the bare minimum of datasets needed to get TSO running. From there we could manually allocate ISPF datasets, and get ISPF up and running to fix the problem. TSO EDIT would have been another option, but fortunately we were able to get ISPF working.
All datasets used in this logon procedure were cataloged in the master catalog. That's how we were able to submit our user catalog recovery job mentioned previously.
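As a sketch (the procedure and dataset names here are invented), such an emergency logon procedure can be as simple as:

//TSOEMERG PROC
//* Bare-bones TSO logon procedure: no ISPF libraries, nothing
//* SMS-managed, and every dataset cataloged in the master catalog.
//TSO      EXEC PGM=IKJEFT01,DYNAMNBR=100,TIME=1440
//SYSPROC  DD DSN=SYS1.EMERG.CLIST,DISP=SHR
//SYSHELP  DD DSN=SYS1.HELP,DISP=SHR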
So, we logged onto the emergency logon procedure, and fixed the problem.
Today, most sites try to avoid hard-coding datasets in the TSO logon procedure, dynamically allocating them in a REXX exec or CLIST instead. However, this can still cause problems if an ISPF library is unavailable, preventing a user from accessing ISPF. Some sites are smart enough to have code to handle this kind of abend, so even if datasets are missing you can still get into ISPF. Still, this problem could happen today, and the recovery steps are unchanged.
SMS is another interesting issue: if the SMS configuration itself is broken, SMS-managed datasets can't be allocated. So it's wise to have a TSO logon procedure that doesn't need SMS-managed datasets. This allows you to log on and fix SMS configuration or other problems.
Error 3: Security Rule
At one site we were using CA-ACF2, and a security administrator changed a security rule (not my fault this time). Unfortunately, this rule covered SYS1.VTAMLST, and so VTAM lost its access privileges to this dataset. And crashed. When we lost VTAM, most things crashed around it - it was ugly.
CA-ACF2 and RACF can both duplicate security databases. So if one fails, the alternate can be used. We used the CA-ACF2 z/OS operator SWITCH command (RVARY does the same thing in RACF) to manually switch to the alternate database, hoping that the rule hadn't already been copied to the alternate. It had.
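For the RACF equivalent, the switch is something like the following, issued from TSO by a suitably authorised user (the operator then has to approve it by replying with the RVARY password):

RVARY LIST       (show the status of the primary and backup RACF databases)
RVARY SWITCH     (deactivate the primary and switch to the backup)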
Our next step was to restore the security database. We IPLed our one-pack emergency z/OS system and restored the database back to the previous evening. We then re-IPLed the system, and all was good. The security administrator who made the original error kept their job after some 'counselling.'
This kind of user error is still possible, and the recovery steps and options remain the same. However, many sites don't have a one-pack emergency system. There are other options, including a started task that can be started from the z/OS console to restore the database, or (in the case of RACF) an emergency ICHRDSNT pointing to an emergency RACF database to allow logons.
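As a sketch of the started-task option (all dataset names below are made up): a DFSMSdss restore procedure that operations can start from the console with 'S RACFREST', with its control statements kept in a dataset so nothing depends on TSO being available.

//RACFREST PROC
//* Restore the security database from its latest DFSMSdss backup.
//* Started from the z/OS console with: S RACFREST
//RESTORE  EXEC PGM=ADRDSSU
//BACKUP   DD DSN=BACKUP.SECURITY.DB,DISP=OLD
//SYSPRINT DD SYSOUT=*
//SYSIN    DD DSN=SYS1.EMERG.CNTL(RACFREST),DISP=SHR
//*
//* SYS1.EMERG.CNTL(RACFREST) would hold something like:
//*   RESTORE DATASET(INCLUDE(SYS1.RACFDS)) INDDNAME(BACKUP) REPLACE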
Error 4: Stop LLA
This is a classic error. One of our systems programmers couldn't remember the command to refresh LLA after a linklist change, so decided to stop and restart LLA instead. We've all heard that LLA is important and should always be running. But seeing it in action is another thing. System performance slowed almost to a stop, and TSO/ISPF users were locked out.
The solution was simple: start LLA again. Unfortunately, this took about 30 minutes, so we effectively had a 30-minute outage of our z/OS system.
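For the record, the relevant z/OS commands are:

F LLA,REFRESH      (refresh LLA's directory after a linklist library change - the one to remember)
F LLA,UPDATE=xx    (selective refresh using the CSVLLAxx parmlib member)
P LLA              (stop LLA - the one to avoid)
S LLA,SUB=MSTR     (restart LLA)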
Again, this type of error is still possible, though hopefully most operators now use automated operations products rather than issuing z/OS commands directly to start and stop subsystems. The solution is the same.
Error 5: Parmlib Error
OK, this one was my fault. I did a SYS1.PARMLIB change. But I made an error in the IEASYSxx member - a syntax error. So when we tried to IPL, the system would not come up, stating that there was an error with SYS1.PARMLIB.
This sounds worse than it is. We were IPLing during our normal change window, which had enough time built in to handle such IPL issues. Our site had a policy of incrementing parmlib member suffixes whenever there was a change. So to change IEASYS13, I created a new IEASYS14 member, copied IEASYS13 into it, and made my changes. The system was IPLed using IEASYS14, and failed. So I IPLed the system from IEASYS13, backing out the change.
We also had emergency parameter members (IEASYS99 and others), which included the absolute minimum needed to bring up z/OS, TSO and ISPF. So if there were other failures, we could IPL using IEASYS99, fix any problems and re-IPL.
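As a sketch of how this fits together (the suffixes are simply our own convention): the IEASYSxx suffix normally comes from the SYSPARM statement in the LOADxx parmlib member, and if the load parameter requests operator prompting, it can be overridden at the 'IEA101A SPECIFY SYSTEM PARAMETERS' prompt.

SYSPARM  14          (LOADxx statement: a normal IPL uses IEASYS14)

R 00,SYSP=13         (reply at the prompt: back out to IEASYS13)
R 00,SYSP=99         (reply at the prompt: minimal emergency system)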
Today z/OS is more forgiving of parmlib syntax errors, and gives you a chance to fix them during the IPL process. However, other errors (not necessarily syntax errors) could still occur. Every site should have a mechanism to back out parmlib changes, and also (hopefully) emergency parmlib members like our IEASYS99.
Conclusion
At one site where I've been working, the disaster recovery procedures consist of regular offsite recoveries - they bring up their systems at their hot site. However, you can see from these examples that a full offsite recovery isn't a 'fix-all' solution, and in many cases isn't necessary. So it's important to also have procedures in place to deal with these 'smaller' problems. Ideally, they will be tested regularly.
David Stephens