opinion: Seven Best Practices for z/OS Availability
As z/OS systems programmers, we're lucky enough to work on the most stable computing platform and operating system available. We have more control, more information, and more ways to minimise downtime and maximise availability. This is all great, but errors and problems can still occur. The good news is that there are ways to minimise these errors and problems, and the downtime they cause.
So here are my seven top ways to maximise z/OS availability and minimise downtime. Most of these are best practices that many systems programmers and shops already follow.
1. Health Checker
Implementing the IBM Health Checker that comes free with z/OS is a no-brainer. It's easy to set up, easy to use, and consumes little CPU or other resources. The first time you switch it on, it almost always finds some areas of your z/OS system you hadn't thought of. If you haven't enabled it yet, you will with z/OS 2.1 - it's started by default.
An add-on to the Health Checker, Predictive Failure Analysis (PFA) is another freebie with z/OS that helps detect soft errors - errors that by themselves may not be a problem, but may indicate or become one. Think of your JES spool filling up, or running out of CSA. Again, it's easy to set up and easy to use, with little CPU or resource overhead.
2. Six Months Behind Up-to-Date
Keeping current with software maintenance is key to maximising uptime. However, I don't like to be too up-to-date. My rule of thumb is to apply HIPERs as soon as possible, and other PTFs six months after they're released. This lets others find any bugs in those PTFs first. If I'm installing new software, I like it to be 6-12 months old before installing it - the bleeding edge is not where I want to be.
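As a rough illustration, the rule of thumb above could be written down as a small check. This is only a sketch: the `Ptf` record, its fields, the sample PTF identifiers and the 182-day approximation of "six months" are all assumptions made for the example, not anything SMP/E provides.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumption: "six months" is approximated as 182 days.
HOLD_PERIOD = timedelta(days=182)


@dataclass(frozen=True)
class Ptf:
    """Hypothetical record describing one PTF."""
    ptf_id: str      # made-up identifier, for illustration only
    released: date   # date the PTF became available
    hiper: bool      # True if the PTF is flagged HIPER


def ready_to_apply(ptf: Ptf, today: date) -> bool:
    """HIPERs qualify immediately; other PTFs only after the hold period."""
    if ptf.hiper:
        return True
    return today - ptf.released >= HOLD_PERIOD


if __name__ == "__main__":
    today = date(2014, 7, 1)
    candidates = [
        Ptf("UA00001", date(2014, 6, 25), hiper=True),
        Ptf("UA00002", date(2014, 3, 1), hiper=False),
        Ptf("UA00003", date(2013, 12, 1), hiper=False),
    ]
    for ptf in candidates:
        status = "apply" if ready_to_apply(ptf, today) else "wait"
        print(ptf.ptf_id, status)
```

The hold period is a single constant, so a shop that prefers three months, or twelve, only changes one line.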
3. Use PROGxx
Over the years, IBM has introduced the PROGxx parmlib members to create linklists, add libraries to the APF list, and add exits. However, many sites still use the old LNKLSTxx, IEAAPFxx and EXITxx parmlib members. Using PROGxx has many advantages. One is that if there is a problem with the member or a statement, it doesn't stop the IPL. The PROGxx statement fails, and everything moves on. Put as much as you can into PROGxx to reduce the chances of a parmlib or dataset change stopping an IPL in its tracks.
4. Always have a z/OS System
I hate not having a z/OS system up and running. If I IPL a system, I want another z/OS system up in case of problems. For systems in a Parallel Sysplex, this is possible most of the time. However, hardware upgrades, DASD moves and other maintenance may require the entire Sysplex to be brought down. This is where my One Pack System gives me peace of mind. If all else fails, my regularly tested One Pack System comes to the rescue.
5. No Super Logons for Everyday Use
Systems programmers often need to do things that require a high-powered logon, as do DASD administrators, security administrators and others. However, these high-powered logons should only be used when needed. For day-to-day use, a separate, normal logon should be used. This prevents accidents with high-powered logons.
6. Always Have a Plan B
What do you do if someone modifies all your TSO logon procs, stopping anyone from logging on? Or if a pack holding a dataset used in all TSO logon procs fails? How about if a user catalog needed for any TSO logon fails? Or if you've shut down all your z/OS systems, and none will come up again - parmlib failure, master JCL failure, VTAM failure?
If you don't know what you would do, then you need to do some serious disaster recovery planning. When many think about disaster recovery, they think about backups, hot sites, GDPS and similar. However there are many other potential problems that could occur. If they do, you need a Plan B. Let's look at some of my favourite Plan Bs.
A TSO logon proc with the absolute minimum dataset allocations is very handy to get ISPF up and running if other datasets become unavailable.
Duplex JES2 checkpoints, and a spare SPOOL volume I can add if SPOOL suddenly fills up.
Spare page datasets I can add if the normal ones fill up.
A minimal VTAM and TCPIP configuration that can be used in emergencies.
An emergency parmlib LOADxx and friends to start up a minimal z/OS system.
A RACF (or CA ACF2 or CA Top Secret) backout plan in case someone removes RACF access to SYS1.VTAMLST (I've seen this happen) or similar.
And there are so many more. But you get the idea. I like to try and think of problems, and see if I can figure out how to fix them.
Plan B also applies to any changes. What if the change doesn't work? How do you undo it? What if the change breaks something else - how do you fix it?
7. Review and Change Control
Look back through your problem management systems, and you'll see that most of the problems are from human error. And we've all done it - entered a command on system PROD rather than system DEV, or deleted dataset A rather than dataset B. Perhaps the best way to minimise human error is to have someone else check your changes. Set the changes up, include instructions, and then have someone else check them.
When you think about it, add this to a Plan B for changes, and you have a good chunk of a change control system. And change control is the best way to minimise damage and outages from changes. I've worked at sites with and without change control, and although I can get frustrated with change control, sites with it are better off.
Oh yes, and that means no 'on the fly' changes.
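To make the idea concrete, the two ingredients above (a second pair of eyes and a Plan B) can be sketched as a pre-implementation check. This is a minimal illustration only: the `ChangeRequest` record, its field names and the wording of the messages are hypothetical, and a real change control system will track far more.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ChangeRequest:
    """Hypothetical change record; field names are illustrative only."""
    summary: str
    author: str
    instructions: str = ""   # how to implement the change
    backout_plan: str = ""   # the Plan B: how to undo it
    reviewer: str = ""       # who checked the change


def blocking_issues(change: ChangeRequest) -> List[str]:
    """Return the reasons a change should not go ahead yet (empty if none)."""
    issues = []
    if not change.instructions.strip():
        issues.append("no implementation instructions")
    if not change.backout_plan.strip():
        issues.append("no backout plan")
    if not change.reviewer.strip():
        issues.append("not reviewed")
    elif change.reviewer == change.author:
        issues.append("reviewer is the author")
    return issues


if __name__ == "__main__":
    change = ChangeRequest(
        summary="Add library to linklist",
        author="alice",
        instructions="Update PROGxx and activate the new linklist set",
        backout_plan="",
        reviewer="alice",
    )
    for issue in blocking_issues(change):
        print("BLOCKED:", issue)
```

A change made 'on the fly' has none of these fields filled in, which is why it would never pass a check like this.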