Opinion: Reliability is in the Details
Let's talk about reliability. Everyone will tell you that z/OS is the most reliable platform in the world. Just keeps on chugging along. What they may not tell you is that much of this reliability is in the details.
When I Arrive At a Site
When I start at a new site, one of the first things I do is look at the z/OS syslog and logs of other major subsystems like IMS and CICS. I do this for all projects: be it to reduce CPU, fix a problem, improve performance, or something else.
These logs tell me a lot about the system. And I'm not talking about the content of the messages themselves.
Let's take an example. One site had a CICS system experiencing a handful of storage violations every day. Now, I don't like storage violations – they mean that some storage has been overwritten, which scares me. But the client didn't see this as a big problem: the violations had been occurring for a long time.
Another site had the message
DFHSO0124 The MAXSOCKETS system initialization parameter has a value of
32,000 which exceeds the MAXFILEPROC value of 12,000.
The MAXSOCKETS value has been set to the lower value
every time CICS started up. This message isn't important: CICS simply lowers MAXSOCKETS to the maximum value that MAXFILEPROC allows. It can be left alone.
Another site had the message
DFS0403W IMS REGISTER CALL TO MVS ARM FAILED - RETURN CODE= 0C,
REASON CODE= 0160
at every IMS startup. In this case, IMS is attempting to register with z/OS Automatic Restart Manager (ARM), but ARM has not been set up on this z/OS system. No real problem: the client used an automation package to automate the restart of IMS.
I actually go a bit further. I always look at the Health Checker output to see if any checks are in EXCEPTION status.
I know what you're thinking: "these don't seem like big problems." And you're right. The site with the storage violations had been running fine with them for years. And the CICS and IMS messages don't indicate a problem: they're not affecting anything. Most Health Checker exceptions I've seen are small, and don't affect how the system runs. So what's my problem?
I Like Clean Systems
Like many older systems programmers, I was taught from the beginning to look after the details. When installing software, the installation should be perfect. If it isn't, the errors, no matter how small, should be found and fixed.
If changing configuration parameters, each parameter should be well understood before any change. And any change had to be tested.
If writing code, the code should compile, bind and execute without errors. What's more, there should be error handling to recover from errors, or at least output messages with enough information to fix them.
In those days, the details were important. z/OS (well, MVS in those days) wasn't as solid as it is today, and this was how we got our reliability.
You can probably see where I'm going here. Yes, I like a z/OS system with no error messages, no matter how small or unimportant. I like CICS systems with no storage violations, and batch jobs that end with a zero return code unless there is a problem. I like Health Checker to be completely clean: no exceptions.
I believe that if you have such a system, you'll avoid problems and improve resilience. You'll also spot any problem that does occur a lot more easily. This is just as true today as when I started out as an MVS/XA systems programmer in the 1980s.
The number of these 'low impact' messages and exception health checks that I find also tells me a lot about the site I'm working at. Let's be realistic. Few sites today will have my perfect z/OS system. IT groups are busy, and there's rarely spare time to go looking for problems. In fact, if I see such a perfect site (and I have), then I know that the IT staff are really switched on, and have excellent control of their systems.
And that's the thing. I believe a site with many of these 'niggly' problems is one where the IT staff aren't dealing with the details. I'm sure that these sites are not as robust. That may be because the 'niggly' problem isn't that niggly. Or it may be because new error messages are harder to find amongst many other 'normal' error messages. Or it may be because the site takes less care when implementing new features or changes.
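To make the point about new messages hiding among the 'normal' ones concrete, here is a minimal sketch in Python. It assumes the log has already been extracted to plain text lines, and it uses a rough pattern for message IDs that is my own approximation, not an official format. The sample lines are simplified, and XYZ0001E is a made-up message ID used only for illustration.

```python
import re
from collections import Counter

# Rough, assumed shape of a message ID: three letters, up to three more
# letters or digits, three or four digits, and an optional severity letter.
# This is an approximation for the sketch, not an official definition.
MSG_ID = re.compile(r"\b([A-Z]{3}[A-Z0-9]{0,3}\d{3,4}[A-Z]?)\b")


def count_message_ids(lines):
    """Count the first message ID found on each line."""
    counts = Counter()
    for line in lines:
        match = MSG_ID.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def split_new_and_recurring(counts, baseline_ids):
    """Separate IDs never seen before from the known background noise."""
    new = {mid: n for mid, n in counts.items() if mid not in baseline_ids}
    recurring = {mid: n for mid, n in counts.items() if mid in baseline_ids}
    return new, recurring


# Simplified sample lines; a real syslog extract carries extra prefix fields.
sample = [
    "DFHSO0124 The MAXSOCKETS system initialization parameter has a value of",
    "DFS0403W IMS REGISTER CALL TO MVS ARM FAILED - RETURN CODE= 0C,",
    "DFHSO0124 The MAXSOCKETS system initialization parameter has a value of",
    "XYZ0001E hypothetical new message, invented for this sketch",
]

# IDs the site has grown used to seeing at every startup.
baseline = {"DFHSO0124", "DFS0403W"}

counts = count_message_ids(sample)
new, recurring = split_new_and_recurring(counts, baseline)

print("New message IDs:", new)
print("Recurring message IDs:", recurring)
```

The longer the baseline list grows, the more a site depends on something like this to notice anything new. On a clean system the baseline is empty, and every message that appears deserves a look.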
Think of a z/OS system like a ship at sea. If you go around and all ropes are neatly coiled, everything is put away properly, and the decks are freshly painted, then you're confident that the ship's captain has good control of the ship. But if there are ropes everywhere, the crew regularly can't find the tools they need, and there is rust all over the decks, you'll want to know where your life jacket is.
In our examples, I'd investigate and fix every storage violation (yes, I know these are hard to fix). I'd change the CICS SIT MAXSOCKETS parameter to 12,000, or increase the z/OS UNIX MAXFILEPROC parameter (set in BPXPRMxx). I'd set the IMS execution parameter ARMRST=N so that IMS didn't try to register with ARM. And I'd resolve every Health Checker check that is in EXCEPTION status, or inactivate the checks that aren't relevant.
The key to high reliability is in the details.