LongEx Mainframe Quarterly - May 2022
Here’s a question. How can you show the performance of a z/OS system? Is the performance good, or not good? If your job is to do exactly that, the chances are that you’re ready to start talking about the different workloads, response time goals for each, and really get into the nitty gritty. Different workloads will have different performance goals: so, your cash IMS transactions may have different performance goals to your customer CICS transactions. And these may change throughout the day: so those IMS transactions could run slower overnight.
WLM as a Performance Tool?
You know about z/OS Workload Manager (or WLM to its friends): the z/OS component that tries to ensure that the right things get z/OS resources when they need them. Let’s step back and think about WLM for a bit. In many cases, its job is pretty easy. If there are enough resources (particularly CPU) to go around, then there’s not much for WLM to do. However, when the CPU usage reaches 100%, WLM comes into its own. It makes sure that the important workloads get the CPU they need to achieve their performance goals. The less important workloads can wait. WLM is brilliant at making the most of a z/OS system with limited resources.
The interesting thing is that it knows what the performance goals are, because z/OS systems administrators have put these into the WLM configuration. IMS transactions, Db2 workloads and batch jobs have been divided into groups called service classes. These service classes have been divided into periods, and each has been set performance goals. For example, 90% of our billing CICS transactions (those beginning with ‘A’ in CICS region PCICS1) should complete in 0.5 seconds. Our hot batch is very important, and should have a velocity (the percentage of time spent executing rather than waiting for resources) of 50. Each has also been set an importance number: we now know which are more important, and which aren’t.
But here’s something interesting. We’ve already told WLM our performance objectives. So, we should be able to find out from WLM if workloads are achieving their objectives. And we can.
Performance Index
WLM assigns a number to every service class and period: a Performance Index (PI). You can hop to the IBM manuals to get the details about PI, but the basics are: a PI of 1 means the workload is exactly meeting its goal, a PI below 1 means it is beating its goal, and a PI above 1 means it is missing its goal.
Hmmm. So WLM already has a number that tells us if our workloads are achieving their performance goals. Even better, WLM records this PI periodically in the SMF Type 72 records. You can also see the PI using monitors like RMF Monitor III, IBM Omegamon, Broadcom SYSVIEW and more.
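To make the PI a little less abstract, here is a minimal Python sketch of how it is commonly calculated for two goal types: an execution velocity goal (goal divided by actual) and an average response time goal (actual divided by goal). The function names are purely illustrative and are not part of any IBM interface. Percentile response time goals, which WLM derives from the response time distribution, are left out.

```python
def velocity_pi(goal_velocity: float, actual_velocity: float) -> float:
    """PI for an execution velocity goal: goal divided by actual."""
    if actual_velocity <= 0:
        raise ValueError("actual velocity must be positive")
    return goal_velocity / actual_velocity


def avg_response_pi(goal_seconds: float, actual_seconds: float) -> float:
    """PI for an average response time goal: actual divided by goal."""
    if goal_seconds <= 0:
        raise ValueError("goal response time must be positive")
    return actual_seconds / goal_seconds


def describe(pi: float) -> str:
    """Translate a PI into plain words."""
    if pi > 1.0:
        return "missing goal"
    if pi < 1.0:
        return "beating goal"
    return "meeting goal"


# Velocity goals and actuals copied from the RMF Monitor III screen
# shown later in this article.
for name, goal, actual in [("BATPROD", 35, 96), ("BATTEST", 35, 88), ("ONLTASK1", 60, 81)]:
    pi = velocity_pi(goal, actual)
    print(f"{name}: PI {pi:.2f} ({describe(pi)})")
```

With the goal and actual velocities from the RMF screen in this article, the sketch gives 0.36, 0.40 and 0.74, which match the Perf Indx column for those three service classes.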
So, how does this help?
A z/OS Performance Dashboard
Whenever I go to a new site, I want to get a quick look at the performance without knowing anything about the workloads. Is everything peachy, or are there some problems? If there are problems, are they only for some workloads, or everything? Are they during certain periods of the day, or all through the day? I sometimes use monitors to get an initial feel. Here’s an example from IBM RMF Monitor III:

                 RMF V2R3   Sysplex Summary - PLEXA4           Line 1 of 70
Command ===>                                                Scroll ===> CSR
WLM Samples: 1199   Systems: 3   Date: 05/26/22  Time: 23.39.00  Range: 300 Sec
                    >>>>>>>>XXXXXXXXXXXXXXXXXX<<<<<<<<
Service Definition: STANDARD          Installed at: 02/08/20, 14.38.14
     Active Policy: STANDARD          Activated at: 02/08/20, 14.38.26

              ------- Goals versus Actuals --------  Trans --Avg. Resp. Time-
               Exec Vel  --- Response Time --- Perf  Ended  WAIT EXECUT ACTUAL
Name     T I   Goal  Act ---Goal--- --Actual-- Indx   Rate  Time   Time   Time
BATCH    W            88                             0.057  1718   7694   9289
BATPROD  S 4     35   96                       0.36  0.010  2638   6928   8866
BATTEST  S 5     35   88                       0.40  0.047  1521   7858   9379
ONLINE   W            68                             56.80 0.483  890.1  25.33
CICS     S 1          79 200 90%    99%        0.50  55.88 0.000  17.80  17.80
ONLTASK1 S 1     60   81                       0.74  0.000 0.000  0.000  0.000
DEVELOP  S 4          56 2500 85%   N/A         N/A  0.000 0.000  439.8  0.000
DEVTASKS S 4     20   56                       0.36  0.013
ONLINE   S 1         0.0 100 90%    96%        0.50  0.263 0.000  10.78  13.32
ONLTASKS S 1     60   93                       0.65  0.630
WEBTASK  S 1          67 2000 90%   100%       0.50  0.010 0.000  12.76  12.76
SYSTEM   W            90                             0.000 0.000  0.000  0.000
SYSSTC   S      N/A   88                        N/A  0.000 0.000  0.000  0.000

But these only provide a snapshot. I’m more interested in the history: particularly during the online day, or areas where CPU usage or performance are a problem. For this, I grab some SMF Type 72 records (ideally, 6 weeks, excluding weekends and holidays), and create a heat chart that looks like this:
[Figure: heat chart of average PI for each service class and period]
At the left we have each service class and period. I also like to see the importance level (1 = highest, 5 = lowest). 
I exclude discretionary workloads: they aren't (or shouldn't be) important. The rest shows the average PI for each class and period. If it’s 1, the cell is green (we’re happy). It gets redder as the PI increases (and we are less happy). If it is far less than 1, that could also be a problem: I make these blue. A couple of things to note:
So, what can I see from this heat chart? A few things:
An Easy Dashboard
If I were in charge of z/OS performance, I’d be running regular jobs to create this sort of heat chart, and publishing it. Immediately, anyone would be able to get a view of how the performance of the z/OS system is going. More importantly, performance problems can be quickly identified, and then isolated.
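As a rough sketch of what such a regular job might do, the Python below averages PI values into heat chart cells and picks a color for each cell. It assumes the PI values have already been pulled out of the SMF Type 72 records by some other tool. The PiSample layout, the grouping by hour of day, and the color thresholds (0.5 and 2.0) are assumptions made for illustration: neither WLM nor RMF defines them. The sample rows are made-up data.

```python
from collections import defaultdict
from statistics import mean
from typing import NamedTuple, Optional


class PiSample(NamedTuple):
    """One PI observation (hypothetical layout, not an SMF record format)."""
    service_class: str
    period: int
    importance: Optional[int]  # None marks a discretionary period
    hour: int                  # hour of day the interval fell in
    pi: float


def average_pi(samples):
    """Average PI per (service class, period, importance, hour) cell.

    Discretionary periods are skipped, as in the article.
    """
    cells = defaultdict(list)
    for s in samples:
        if s.importance is None:
            continue
        cells[(s.service_class, s.period, s.importance, s.hour)].append(s.pi)
    return {key: mean(values) for key, values in cells.items()}


def cell_color(pi, low=0.5, high=2.0):
    """Green around 1, redder as PI rises, blue when far below 1.

    The low and high thresholds are assumed values; tune them per site.
    """
    if pi < low:
        return "blue"
    if pi <= 1.0:
        return "green"
    if pi < high:
        return "light red"
    return "red"


# Made-up sample rows for illustration only.
samples = [
    PiSample("CICS", 1, 1, 9, 0.5),
    PiSample("CICS", 1, 1, 9, 1.5),
    PiSample("BATPROD", 1, 4, 9, 0.36),
    PiSample("BATTEST", 1, 5, 22, 2.5),
    PiSample("DISCBAT", 1, None, 9, 0.8),
]

chart = average_pi(samples)
for (name, period, importance, hour), pi in sorted(chart.items()):
    print(f"{name}.{period} imp={importance} hour={hour:02d} "
          f"PI={pi:.2f} {cell_color(pi)}")
```

Keeping the averaging separate from the color choice means the same cell values could feed a spreadsheet, an HTML table or a plotting library, whichever is easiest to publish at a given site.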