opinion: Reliability and Resilience: Only Important After Outages?
While I was working on a performance project recently, my client asked for some help with assessing and improving the reliability, or resilience, of their application. This is interesting because it's the first time a client has ever asked me for this.
As a consultant, I do what I'm paid to do. So if someone wants to reduce CPU, that's what I do. Or if they want their online transactions to run faster, or their batch schedule to end sooner, that's what I do. However, I'm always looking for 'value-add' - extras that I can give my clients. So if I'm reducing CPU to reduce costs, I may talk to them about other ways of reducing costs: reviewing the software products they have, looking at their invoices, considering capping. This helps the client, and may lead to further opportunities for my firm.
So if I'm at a site and I see something that may impact their availability, I'll talk to them about it. For example, at one client I saw thousands of CICS application abends (and a couple of storage violations) every week. Something to talk about. At another, I saw that they hadn't activated the RACF TEMPDSN profile, a small security exposure. Something to talk about.
But before now, I've never been paid to help improve reliability or resilience. And in this case, this only happened after a production outage - a big one. But why?
Let's look at the good news. Mainframes, z/OS and the associated subsystems (CICS, IMS, DB2, and WebSphere MQ) are very, very reliable. So reliable that it's a major shock if we see an error. So with such robust infrastructure, it's easy to say "it's all reliable, nothing to do here."
There's more good news: most mainframe applications are very stable and reliable. They've been working for decades, so at most sites the mainframe applications are the most stable of all, quietly humming along with few, if any, issues. So again, it's easy to say "if it ain't broke, don't fix it."
But let's take a reality check. Firstly, things change. Take workloads. I've been working with a team on an application's performance. This application was written when there wasn't much workload. But as the workloads have jumped up and up, the application is starting to struggle. Today we're seeing response times of over seven seconds (not milliseconds, seconds).
Applications also change, and with change comes risk. At one client, I regularly report on the number of CICS transaction abends, by application. This application is regularly updated to keep up with business, compliance, and yes, performance requirements. So I regularly see some transactions with increased abend counts - many caused by recent application changes.
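To make that kind of report concrete, here is a minimal sketch of how week-on-week abend counts could be compared. It is not the tooling I use at the client: the record layout (`week`, `application`, `transaction`), the function names, and the 1.5x threshold are all assumptions for illustration, and the abend records would first need to be extracted from whatever monitoring or log data a site actually keeps.

```python
from collections import Counter


def abend_counts(records):
    """Count abends per (week, application, transaction).

    Each record is a dict with hypothetical keys 'week', 'application'
    and 'transaction'; one record represents one abend.
    """
    return Counter(
        (r["week"], r["application"], r["transaction"]) for r in records
    )


def flag_increases(records, previous_week, current_week, min_ratio=1.5):
    """Return transactions whose abend count grew by more than min_ratio.

    min_ratio is an assumed threshold, not a recommended value.
    Transactions with no abends in the previous week are flagged as soon
    as they have any abend in the current week.
    """
    counts = abend_counts(records)
    # Only transactions that abended in the current week can be flagged.
    keys = {(app, tran) for (week, app, tran) in counts if week == current_week}
    flagged = []
    for app, tran in sorted(keys):
        before = counts[(previous_week, app, tran)]  # Counter gives 0 if absent
        now = counts[(current_week, app, tran)]
        if now > before * min_ratio:
            flagged.append((app, tran, before, now))
    return flagged
```

A list like this is only a starting point for a conversation: a flagged transaction still needs someone to check whether a recent application change, a workload change, or something else is behind the increase.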
Systems also change. Hardware is upgraded, operating systems are upgraded. These upgrades may cause problems, or even 'uncover' existing problems that only appeared with the new features or performance of an upgrade. In one example, I saw a DASD upgrade that actually decreased performance, affecting the response times of an application I was studying.
So even with mainframes, reliability isn't a given. What's worse, reliability and resilience are a hard nut to crack. It's expensive to improve reliability and resilience, and it can be difficult to get budget approval for something that returns nothing other than 'business as usual.' In many cases, resilience and reliability only get the attention they deserve after some major outages. That's when the true value of resilience is seen.
My client will get a lot of benefit from having me look at resilience. I'll highlight abends, locate potential bottlenecks, and review infrastructure - from WebSphere MQ pageset and queue depth settings to z/OS CPU provisioning. I'll work with their local teams to better understand the application structure and workloads, and look for areas that may need more work. In the end, their systems will be better prepared to quickly process current workloads, and the workloads of the future.