
LongEx Mainframe Quarterly - November 2016

opinion: Reliability and Resilience: Only Important After Outages?

While working on a performance project recently, my client asked for some help with assessing and improving the reliability, or resilience, of their application. This is interesting because it's the first time a client has ever asked me for this.

As a consultant, I do what I'm paid to do. So if someone wants to reduce CPU, that's what I do. Or if they want their online transactions to run faster, or their batch schedule to end sooner, that's what I do. However, I'm always looking for 'value-add' - extras that I can give my clients. So if I'm reducing CPU to reduce costs, I may talk to them about other ways of reducing costs: reviewing the software products they have, looking at their invoices, considering capping. This helps the client, and may lead to further opportunities for my firm.

So if I'm at a site and I see something that may impact their availability, I'll talk to them about it. For example, at one client I saw thousands of CICS application abends (and a couple of storage violations) every week. Something to talk about. At another, I saw that they hadn't activated the RACF TEMPDSN class, a small security exposure. Something to talk about.

But before now, I've never been paid to help improve reliability or resilience. And in this case, this only happened after a production outage - a big one. But why?

Let's look at the good news. Mainframes, z/OS and the associated subsystems (CICS, IMS, DB2, and WebSphere MQ) are very, very reliable. So reliable that it's a major shock if we see an error. So with such robust infrastructure, it's easy to say "it's all reliable, nothing to do here."

There's more good news: most mainframe applications are very stable and reliable. They've been working for decades, so at most sites the mainframe applications are the most stable of all, quietly humming along with few, if any, issues. So again, it's easy to say "if it ain't broke, don't fix it."

But let's take a reality check. Firstly, things change. Take workloads. I've been working with a team on an application's performance. This application was written when there wasn't much workload. But as the workloads have jumped up and up, the application is starting to struggle. Today we're seeing response times of over seven seconds (not milliseconds, seconds).

Applications also change, and with change comes risk. At one client, I regularly report on the number of CICS transaction abends, by application. This application is regularly updated to keep up with business, compliance, and yes, performance requirements. So I regularly see some transactions with increased abend counts - many caused by recent application changes.
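As a rough illustration of this kind of report, here is a minimal Python sketch. It assumes the abend counts have already been extracted into simple (application, week, count) records; the record layout, the function names and the 1.5x threshold are all hypothetical choices for this example, not the tooling actually used at the client.

```python
from collections import defaultdict


def abends_by_application(records):
    """Sum abend counts per application and week.

    records: iterable of (application, week, count) tuples.
    This layout is an assumption made for the example.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for app, week, count in records:
        totals[app][week] += count
    return totals


def flag_increases(totals, previous_week, current_week, ratio=1.5):
    """Return (application, before, now) for applications whose abend
    count in current_week exceeds ratio times the previous_week count.

    The default ratio of 1.5 is an arbitrary illustrative threshold.
    """
    flagged = []
    for app, weeks in sorted(totals.items()):
        before = weeks.get(previous_week, 0)
        now = weeks.get(current_week, 0)
        if now > 0 and now > before * ratio:
            flagged.append((app, before, now))
    return flagged
```

A report like this only points at where to look. Matching a flagged application against its recent change history is still a manual step.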

Systems also change. Hardware is upgraded, operating systems are upgraded. These upgrades may cause problems, or even 'uncover' existing problems that only appeared with the new features or performance of an upgrade. In one example, I saw a DASD upgrade that actually decreased performance, affecting the response times of an application I was studying.

So even with mainframes, reliability isn't a given. What's worse, reliability and resilience are a hard nut to crack. It's expensive to improve reliability and resilience, and it can be difficult to get budget approval for something that returns nothing other than 'business as usual.' In many cases, resilience and reliability only get the attention they deserve after some major outages. That's when the true value of resilience is seen.

My client will get a lot of benefit from having me look at resilience. I'll highlight abends, locate potential bottlenecks, and review infrastructure - from WebSphere MQ pageset and queue depth settings to z/OS CPU provisioning. I'll work with their local teams to better understand the application structure and workloads, and look for areas that may need more work. In the end, their systems will be better prepared to quickly process current workloads, and the workloads of the future.

David Stephens

LongEx Quarterly is a quarterly eZine produced by Longpela Expertise. It provides Mainframe articles for management and technical experts. It is published every November, February, May and August.

The opinions in this article are solely those of the author, and do not necessarily represent the opinions of any other person or organisation. All trademarks, trade names, service marks and logos referenced in these articles belong to their respective companies.

Although Longpela Expertise may be paid by organisations reprinting our articles, all articles are independent. Longpela Expertise has not been paid money by any vendor or company to write any articles appearing in our e-zine.

© Copyright 2016 Longpela Expertise  |  ABN 55 072 652 147