
LongEx Mainframe Quarterly - May 2019

Opinion: Reliability is in the Details

Let's talk about reliability. Everyone will tell you that z/OS is the most reliable platform in the world. Just keeps on chugging along. What they may not tell you is that much of this reliability is in the details.

When I Arrive At a Site

When I start at a new site, one of the first things I do is look at the z/OS syslog and logs of other major subsystems like IMS and CICS. I do this for all projects: be it to reduce CPU, fix a problem, improve performance, or something else.

These logs tell me a lot about the system. And I'm not talking about the content of the messages themselves.

Let's take an example. One site had a CICS system experiencing a handful of storage violations every day. Now, I don't like storage violations: they mean that some storage has been overwritten, which scares me. But the client didn't see this as a big problem, because the violations had been occurring for a long time.

Another site had the message

DFHSO0124 The MAXSOCKETS system initialization parameter has a value of 
32,000 which exceeds the MAXFILEPROC value of 12,000.
The MAXSOCKETS value has been set to the lower value
every time CICS started up. This message isn't important: CICS simply sets MAXSOCKETS to the highest value it can use. It can be left alone.

Another site had a message issued at every IMS startup. In this case, IMS is attempting to register with z/OS Automatic Restart Manager (ARM), but ARM has not been set up on this z/OS system. No real problem: the client used an automation package to automate the restart of IMS.

I actually go a bit further. I always look at the Health Checker output to see if any checks are in EXCEPTION status.

I know what you're thinking: "these don't seem like big problems." And you're right. The site with the storage violations had been running fine with them for years. And the CICS and IMS messages don't indicate a problem: they're not affecting anything. Most Health Checker exceptions I've seen are small, and don't affect how the system runs. So what's my problem?

I Like Clean Systems

Like many older systems programmers, I was taught from the beginning to look after the details. When installing software, the installation should be perfect. If not, errors, no matter how small, should be found and fixed.

If changing configuration parameters, each parameter should be well understood before any change. And any change had to be tested.

If writing code, the code should compile, bind and execute without errors. What's more, there should be error handling to recover from errors, or at least output messages with enough information to fix them.

In those days, the details were important. z/OS (well, MVS in those days) wasn't as solid as it is today, and this was how we got our reliability.

You can probably see where I'm going here. Yes, I like a z/OS system with no error messages, no matter how small or unimportant. I like CICS systems with no storage violations, and batch jobs that end with a zero return code unless there is a problem. I like Health Checker to be completely clean: no exceptions.

I believe that if you have such a system, you'll avoid problems and improve resilience. You'll also spot any problem that does occur a lot more easily. This is just as true today as when I started out as an MVS/XA systems programmer in the 1980s.

The number of these 'low impact' messages and exception health checks that I find also tells me a lot about the site I'm working at. Let's be realistic. Few sites today will have my perfect z/OS system. IT groups are busy, and there's rarely spare time to go looking for problems. In fact, if I see such a perfect site (and I have), then I know that the IT staff are really switched on, and have excellent control of their systems.

And that's the thing. I believe a site with many of these 'niggly' problems is one where the IT staff aren't dealing with the details. I'm sure that these sites are not as robust. That may be because the 'niggly' problem isn't that niggly. Or it may be because new error messages are harder to find amongst many other 'normal' error messages. Or it may be because the site takes less care when implementing new features or changes.
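To illustrate the point about new messages hiding amongst the 'normal' ones, here is a minimal Python sketch. It tallies message IDs in a syslog extract and lists the IDs that do not appear in a baseline extract. The function names, the message ID pattern, and the sample lines are all assumptions made for illustration: real syslog layouts vary, and the regular expression is a simplification that will not match every message ID.

```python
import re
from collections import Counter

# Assumed, simplified shape of a message ID: 3-5 uppercase letters,
# 3-5 digits, and an optional trailing severity letter (e.g. DFHSO0124).
MSG_ID = re.compile(r"\b[A-Z]{3,5}\d{3,5}[A-Z]?\b")


def tally_message_ids(lines):
    """Count the first message ID found on each line; skip lines without one."""
    counts = Counter()
    for line in lines:
        match = MSG_ID.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts


def new_message_ids(today, baseline):
    """Return the message IDs seen today that never appear in the baseline."""
    return sorted(set(today) - set(baseline))


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real system.
    baseline_lines = [
        "DFHSO0124 (message text omitted)",
        "DFHSO0124 (message text omitted)",
        "IEF403I JOBA - STARTED",
    ]
    today_lines = [
        "DFHSO0124 (message text omitted)",
        "XYZ1234E (a message not seen before)",
    ]
    baseline = tally_message_ids(baseline_lines)
    today = tally_message_ids(today_lines)
    print("Noise level in baseline:", dict(baseline))
    print("New message IDs today:", new_message_ids(today, baseline))
```

The fewer 'normal' messages there are in the baseline, the shorter and more meaningful the list of new IDs becomes, which is the argument for cleaning up the niggly ones.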


Think of a z/OS system like a ship at sea. If you go around and all the ropes are neatly coiled, everything is put away properly, and the decks are freshly painted, then you're confident that the ship's captain has good control of the ship. But if there are ropes everywhere, the crew regularly can't find the tools they need, and there is rust all over the decks, you'll want to know where your life jacket is.

In our examples, I'd investigate and fix every storage violation (yes, I know these are hard to fix). I'd change the CICS SIT MAXSOCKETS parameter to 12,000, or increase the z/OS UNIX MAXFILEPROC parameter (set in BPXPRMxx). I'd set the IMS execution parameter ARMRST=N so that IMS doesn't try to register with ARM. And I'd resolve every Health Checker check that is in EXCEPTION status, or deactivate any that aren't relevant.
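The MAXSOCKETS case can be sketched as a simple consistency check. The Python below is only an illustration of the behaviour described in the DFHSO0124 message quoted earlier (the lower of the two values is used); the function name and the returned fields are hypothetical, and the values are the ones from that message.

```python
def check_maxsockets(maxsockets: int, maxfileproc: int) -> dict:
    """Compare a CICS MAXSOCKETS setting with the MAXFILEPROC limit.

    Based on the DFHSO0124 text: when MAXSOCKETS exceeds MAXFILEPROC,
    the lower value is the one that takes effect.
    """
    return {
        # The value that takes effect is the lower of the two.
        "effective": min(maxsockets, maxfileproc),
        # "clean" means the startup message would not be expected.
        "clean": maxsockets <= maxfileproc,
    }


if __name__ == "__main__":
    # Values from the message in this article.
    print(check_maxsockets(32000, 12000))
    # Either fix gives a clean result: lower MAXSOCKETS, or raise MAXFILEPROC.
    print(check_maxsockets(12000, 12000))
    print(check_maxsockets(32000, 32000))
```

Both fixes leave the effective socket limit unchanged or higher. The difference is that the startup message goes away, so there is one less 'normal' message to read past.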

The key to high reliability is in the details.

David Stephens

LongEx Quarterly is a quarterly eZine produced by Longpela Expertise. It provides Mainframe articles for management and technical experts. It is published every November, February, May and August.

The opinions in this article are solely those of the author, and do not necessarily represent the opinions of any other person or organisation. All trademarks, trade names, service marks and logos referenced in these articles belong to their respective companies.

Although Longpela Expertise may be paid by organisations reprinting our articles, all articles are independent. Longpela Expertise has not been paid money by any vendor or company to write any articles appearing in our e-zine.

© Copyright 2019 Longpela Expertise  |  ABN 55 072 652 147