
LongEx Mainframe Quarterly - August 2014

opinion: Stop Firefighting and Start Tuning

Recently I was at a site with an over-worked performance team. This isn't unusual, as performance is a large, time-consuming task. This small group looked after many z/OS systems in different Parallel Sysplexes. They were not only responsible for z/OS systems performance, but were also pulled into any performance issue: batch overruns, CICS response times, WebSphere MQ delivery times. However, the big problem was that they spent all their time dealing with problems: they were fire-fighting, not tuning.

Let's take an example. An application team rings up: "Our WebSphere MQ response time between 10:00 and 10:30 this morning was 1.5 seconds. This is more than the 1 second specified in the Service Level Agreement. Tell us why and fix it!" So let's see what our performance group needs to do now:

  1. They need to confirm what the application team has said. So they must go into the monitoring tools they have and take a look at the response times. Hopefully, they're familiar with the SLA, and can quickly confirm that the response time is too high
  2. They must confirm that this is unusual, by looking at response times at similar times in the past. If the WebSphere MQ response time is always higher than the SLA, then there should already be a project working on this
  3. If they confirm there's a problem, then they need to find out why. So they'll look at the z/OS and related performance during that period to find out what happened. They'll also look to see what, if anything, has changed

All this is a lot of work. That 10-second phone call has taken a few hours out of a performance staff member's day. If this happens regularly, then that is all our performance team will be able to do: full-time fire-fighting.

A much better approach would be to monitor performance proactively. So let's look at the perfect scenario:

Our performance team has set up automated monitoring systems. Performance tools have been configured with SLAs and expected performance, so screens quickly show when things are outside of normal. Automated notifications (like emails) are sent to performance staff when something doesn't perform as it should.
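To make this concrete, here's a minimal sketch in Python of the kind of automated check and notification described above. The details are my assumptions, not something from any particular product or this site: interval response times have already been extracted from SMF into a CSV file called mq_response_times.csv, the SLA is 1 second, and an internal mail relay is reachable as smtp.example.com. A real site would hook this into its own monitoring and SMF extraction tools.

  # Hedged sketch: check extracted MQ response times against the SLA and
  # email the performance team when an interval breaches it.
  import csv
  import smtplib
  from email.message import EmailMessage

  SLA_SECONDS = 1.0                      # from the Service Level Agreement
  CSV_FILE = "mq_response_times.csv"     # hypothetical extract of SMF data
  SMTP_HOST = "smtp.example.com"         # hypothetical internal mail relay
  TO_ADDR = "performance-team@example.com"

  def find_breaches(path):
      """Return the intervals whose average response time exceeds the SLA."""
      breaches = []
      with open(path, newline="") as f:
          for row in csv.DictReader(f):
              if float(row["avg_resp_secs"]) > SLA_SECONDS:
                  breaches.append(row)
      return breaches

  def notify(breaches):
      """Email the performance team a summary of the SLA breaches."""
      msg = EmailMessage()
      msg["Subject"] = f"MQ SLA breach: {len(breaches)} interval(s) over {SLA_SECONDS}s"
      msg["From"] = "perf-monitor@example.com"
      msg["To"] = TO_ADDR
      lines = [f"{b['interval_start']}  {b['queue_manager']}  {b['avg_resp_secs']}s"
               for b in breaches]
      msg.set_content("Intervals exceeding the SLA:\n" + "\n".join(lines))
      with smtplib.SMTP(SMTP_HOST) as smtp:
          smtp.send_message(msg)

  if __name__ == "__main__":
      breaches = find_breaches(CSV_FILE)
      if breaches:          # only disturb the team when something is abnormal
          notify(breaches)

Run on a schedule (or after each SMF interval is extracted), this is the "email when something doesn't perform as it should" step: nobody looks at anything unless a threshold is actually crossed.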

Daily batch jobs analyse SMF records, and produce performance reports that are archived. Our performance team can quickly look at the past performance of critical systems. Trends can be seen, and potential issues addressed before they become problems.
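As an illustration of the trend-spotting step, here's a rough Python sketch. Again, the specifics are assumptions for the example: each daily batch job appends a one-line summary to an archive called daily_summary.csv with columns date and avg_mq_resp_secs, and we flag the metric when a simple linear trend over the last 30 days would cross the 1-second SLA within the next 90 days.

  # Hedged sketch: fit a straight line through recent daily averages and warn
  # if the trend would cross the SLA soon. Requires Python 3.10+ for
  # statistics.linear_regression.
  import csv
  from statistics import linear_regression

  SLA_SECONDS = 1.0
  HISTORY_FILE = "daily_summary.csv"    # hypothetical archive of daily reports
  WINDOW_DAYS = 30
  HORIZON_DAYS = 90

  def load_recent(path, window):
      """Return the last `window` daily averages, oldest first."""
      with open(path, newline="") as f:
          rows = list(csv.DictReader(f))
      return [float(r["avg_mq_resp_secs"]) for r in rows[-window:]]

  def days_until_breach(values, sla, horizon):
      """Fit value = slope*day + intercept; return the days until the trend
      crosses the SLA, or None if it stays under the SLA within the horizon."""
      days = list(range(len(values)))
      slope, intercept = linear_regression(days, values)
      if slope <= 0:
          return None                        # flat or improving trend
      crossing = (sla - intercept) / slope   # day index where the line hits the SLA
      remaining = max(crossing - days[-1], 0)
      return int(remaining) if remaining <= horizon else None

  if __name__ == "__main__":
      history = load_recent(HISTORY_FILE, WINDOW_DAYS)
      eta = days_until_breach(history, SLA_SECONDS, HORIZON_DAYS)
      if eta is not None:
          print(f"Warning: MQ response time trending towards the {SLA_SECONDS}s "
                f"SLA in roughly {eta} days - investigate before it breaches.")

A straight-line fit is crude, but it is enough to turn an archive of daily reports into an early warning: the team investigates a creeping response time weeks before anyone rings up about an SLA breach.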

If all this were the case, then our scenario above would be a little different:

  1. The performance team is notified by automated systems that WebSphere MQ response times between 10:00 and 10:30 exceeded the SLA
  2. Automated systems also notify the performance team that a CICS transaction was looping at the same time
  3. By the time the application team rings, the performance team has already confirmed that the looping CICS transaction consumed excessive CPU, starving WebSphere MQ. The problem transaction was terminated, and the relevant application team notified

Our perfect approach took less than 30 minutes, and our performance team were fixing the problem before it was reported.

The problem is that I rarely see our perfect scenario. Simply put: performance isn't seen as important by management until there's a problem.

Setting up the automated procedures and configuring monitoring tools takes time, and is an ongoing process as things change. Many performance groups are too busy fire-fighting to set up and maintain this monitoring.

Performance monitoring and management is a long-term effort. An investment in creating and maintaining the infrastructure for effective, ongoing, automated monitoring will pay off again, and again, and again.


David Stephens