LongEx Mainframe Quarterly - August 2014
Recently I was at a site with an over-worked performance team. This isn't unusual, as performance is a large, time-consuming task. This small group was responsible for many z/OS systems in different Parallel Sysplexes. They were not only responsible for z/OS systems performance, but were also involved in any performance issue: batch overruns, CICS response times, WebSphere MQ delivery times. However, the big problem was that they spent all their time dealing with problems: they were fire-fighting, not tuning.

Let's take an example. An application team rings up: "Our WebSphere MQ performance between 10:00 and 10:30 this morning was 1.5 seconds. This is more than the 1 second specified in the Service Level Agreement. Tell us why and fix it!" So let's see what our performance group needs to do now:
All this is a lot of work. That 10 second phone call has taken a few hours out of a performance staff member's day. If this happens regularly, then that is all our performance team will be able to do. Full-time fire-fighting.

A much better approach would be to monitor performance. So let's look at the perfect scenario: Our performance team has set up automated monitoring systems. Performance tools have been configured with SLAs and expected performance, so screens quickly show when things are outside of normal. Automated notifications (like emails) are sent to performance staff when something doesn't perform as it should. Daily batch jobs analyse SMF records, and produce performance reports that are archived. Our performance team can quickly look at the past performance of critical systems. Trends can be seen, and potential issues addressed before they become problems.

If all this were the case, then our scenario above would be a little different:
Our perfect approach took less than 30 minutes, and our performance team were fixing the problem before it was reported.

The problem is that I rarely see our perfect scenario. Simply put: performance isn't seen as important by management until there's a problem. Setting up the automated procedures and configuring monitoring tools takes time, and is an ongoing process as things change. Many performance groups are too busy fire-fighting to set up and maintain this monitoring.

Performance monitoring and management is long-term. An investment to create and maintain infrastructure for effective, ongoing, automated monitoring will pay off again, and again, and again.
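
To make the automated SLA checking described in the perfect scenario a little more concrete, here is a minimal Python sketch. It assumes response times have already been extracted from SMF records (or a monitoring tool) into simple samples. The names used here (Sample, sla_breaches, format_alert, and the "MQ-delivery" service label) are hypothetical and only for illustration; they are not part of any real product. A real setup would more likely use the site's existing performance tools and reporting jobs.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    """One measured response time (hypothetical record layout)."""
    service: str     # e.g. "MQ-delivery" (illustrative label)
    interval: str    # start of the reporting interval, e.g. "10:00"
    seconds: float   # measured response or delivery time


def sla_breaches(samples, sla_seconds):
    """Return (service, interval, average) for every interval whose
    average response time exceeds the SLA for that service."""
    grouped = {}
    for s in samples:
        grouped.setdefault((s.service, s.interval), []).append(s.seconds)

    breaches = []
    for (service, interval), values in sorted(grouped.items()):
        limit = sla_seconds.get(service)
        if limit is None:
            continue  # no SLA configured for this service
        average = mean(values)
        if average > limit:
            breaches.append((service, interval, average))
    return breaches


def format_alert(service, interval, average, limit):
    """Build the text of a notification (for example, an email body)."""
    return (f"SLA breach: {service} averaged {average:.2f}s "
            f"in the interval starting {interval} (SLA {limit:.2f}s)")


if __name__ == "__main__":
    # Assumed SLA: 1 second for MQ delivery, as in the example above.
    slas = {"MQ-delivery": 1.0}
    samples = [
        Sample("MQ-delivery", "09:30", 0.8),
        Sample("MQ-delivery", "09:30", 0.9),
        Sample("MQ-delivery", "10:00", 1.4),
        Sample("MQ-delivery", "10:00", 1.6),
    ]
    for service, interval, average in sla_breaches(samples, slas):
        print(format_alert(service, interval, average, slas[service]))
```

Run as a scheduled job after each reporting interval, a check like this would flag the 10:00 interval well before the application team picks up the phone. Archiving the per-interval averages as well would give the historical view needed to spot trends.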