How To Show a 10,000 Foot View Of z/OS Performance
Here’s a question: how do you show the performance of a z/OS system? Is the performance good, or not? If your job is to do exactly that, the chances are you’re ready to start talking about the different workloads and the response time goals for each, and really get into the nitty-gritty. Different workloads will have different performance goals: your cash IMS transactions may have different performance goals to your customer CICS transactions. And these may change throughout the day: those IMS transactions could run slower overnight.
To show all this performance, you’ll probably have a lot of graphs, tables and reports. You may have some showing response times for groups of CICS and IMS transactions, and others showing elapsed times for batch jobs or batch streams.
But, there’s an easier way to look at the overall performance of a z/OS system. It comes free with z/OS. And it’s running on your z/OS system right now.
WLM as a Performance Tool?
You know about z/OS Workload Manager (or WLM to its friends): the z/OS component that tries to ensure that the right work gets z/OS resources when it needs them.
Let’s step back and think about WLM for a bit. In many cases, its job is pretty easy. If there are enough resources (particularly CPU) to go around, there’s not much for WLM to do. However, when CPU usage reaches 100%, WLM comes into its own. It makes sure that the important workloads get the CPU they need to achieve their performance goals; the less important workloads can wait.
WLM is brilliant at making the most of a z/OS system with limited resources. The interesting thing is that it knows what the performance goals are, because z/OS systems administrators have put them into the WLM configuration. IMS transactions, Db2 workloads and batch jobs have been divided into groups called service classes. These service classes have been divided into periods, and each has been set a performance goal. For example, 90% of our billing CICS transactions (those beginning with ‘A’ in CICS region PCICS1) should complete in 0.5 seconds. Our hot batch is very important, and should have a velocity (roughly, the percentage of time spent executing rather than waiting) of 50. Each has also been set an importance number: so we know which workloads are more important, and which aren’t.
But here’s something interesting. We’ve already told WLM our performance objectives. So, we should be able to find out if workloads are achieving their objectives from WLM. And we can.
WLM assigns a number to every service class and period: a Performance Index (PI). You can hop to the IBM manuals to get the details about PI, but the basics are:
PI = 1: exactly achieving the performance objective
PI > 1: not achieving the performance objective (the higher the number, the worse the performance)
PI < 1: exceeding the performance objective
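The arithmetic behind the PI is straightforward for the two simplest goal types. Here is a minimal Python sketch covering velocity goals and average response-time goals (percentile goals need the response-time distribution, so they're left out):

```python
def velocity_pi(goal_velocity, actual_velocity):
    """PI for an execution-velocity goal: goal divided by actual."""
    return goal_velocity / actual_velocity

def response_time_pi(goal_seconds, actual_seconds):
    """PI for an average response-time goal: actual divided by goal."""
    return actual_seconds / goal_seconds

# A velocity goal of 50 with an actual velocity of 40 is being missed:
print(velocity_pi(50, 40))           # 1.25
# A 0.5 second goal with an actual of 0.4 seconds is being beaten:
print(response_time_pi(0.5, 0.4))    # 0.8
```

Either way, 1 is the break-even point: the direction of the division is chosen so that bigger always means worse.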
Hmmm. So WLM already has a number that tells us if our workloads are achieving their performance goals.
Even better: WLM records this PI periodically in the SMF Type 72 records. You can also see the PI using monitors like RMF Monitor III, IBM OMEGAMON, Broadcom SYSVIEW and more. So, how does this help?
A z/OS Performance Dashboard
Whenever I go to a new site, I want to get a quick look at the performance without knowing anything about the workloads. Is everything peachy, or are there some problems? If there are problems, are they only for some workloads, or everything? Are they during certain periods of the day, or all through the day?
I sometimes use monitors to get an initial feel. Here’s an example from IBM RMF Monitor III:
RMF V2R3 Sysplex Summary - PLEXA4 Line 1 of 70
Command ===> Scroll ===> CSR
WLM Samples: 1199 Systems: 3 Date: 05/26/22 Time: 23.39.00 Range: 300 Sec
Service Definition: STANDARD Installed at: 02/08/20, 14.38.14
Active Policy: STANDARD Activated at: 02/08/20, 14.38.26
------- Goals versus Actuals -------- Trans --Avg. Resp. Time-
Exec Vel --- Response Time --- Perf Ended WAIT EXECUT ACTUAL
Name T I Goal Act ---Goal--- --Actual-- Indx Rate Time Time Time
BATCH W 88 0.057 1718 7694 9289
BATPROD S 4 35 96 0.36 0.010 2638 6928 8866
BATTEST S 5 35 88 0.40 0.047 1521 7858 9379
ONLINE W 68 56.80 0.483 890.1 25.33
CICS S 1 79 200 90% 99% 0.50 55.88 0.000 17.80 17.80
ONLTASK1 S 1 60 81 0.74 0.000 0.000 0.000 0.000
DEVELOP S 4 56 2500 85% N/A N/A 0.000 0.000 439.8 0.000
DEVTASKS S 4 20 56 0.36 0.013
ONLINE S 1 0.0 100 90% 96% 0.50 0.263 0.000 10.78 13.32
ONLTASKS S 1 60 93 0.65 0.630
WEBTASK S 1 67 2000 90% 100% 0.50 0.010 0.000 12.76 12.76
SYSTEM W 90 0.000 0.000 0.000 0.000
SYSSTC S N/A 88 N/A 0.000 0.000 0.000 0.000
But these only provide a snapshot. I’m more interested in the history: particularly during the online day, or areas where CPU usage or performance are a problem. For this, I grab some SMF Type 72 records (ideally, 6 weeks, excluding weekends and holidays), and create a heat chart that looks like this:
At the left we have each service class and period. I also like to see the importance level (1 = highest, 5 = lowest). I exclude discretionary workloads: they aren't (or shouldn't be) important.
The rest shows the average PI for each class and period. If it’s 1, the cell is green (we’re happy). It gets redder as the PI increases (and we are less happy). If it is far less than 1, that could also be a problem: I make these blue.
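That colouring rule can be captured in a few lines. The thresholds below are my own illustrative choices, not anything WLM defines; a real chart would shade continuously rather than in three bands:

```python
def pi_cell_colour(avg_pi, blue_below=0.5, red_above=1.1):
    """Map an hourly average PI to a heat-chart cell colour.

    Thresholds are illustrative only: greener near 1, redder as the
    PI climbs above it, blue when the goal is beaten by a wide margin.
    """
    if avg_pi > red_above:
        return "red"     # missing the goal; the higher the PI, the worse
    if avg_pi < blue_below:
        return "blue"    # far exceeding the goal; goal may be too loose
    return "green"       # at or around the goal

print(pi_cell_colour(1.6))   # red
print(pi_cell_colour(0.3))   # blue
print(pi_cell_colour(0.95))  # green
```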
A couple of things to note:
- I remove the SYSTEM and SYSSTC service classes: there's nothing we can do with these
- I use the average PI for each hour over the period
- SMF Type 72 records don’t actually have a PI field: we need to calculate it from other fields
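That last calculation can be sketched as follows. This assumes the Type 72 subtype 3 records have already been parsed into Python dictionaries by some SMF tool; the field names here are invented for illustration (real field names depend on your parser), and only velocity goals are handled:

```python
from collections import defaultdict

# Hypothetical, already-parsed SMF Type 72 interval records. Field names
# are made up for illustration; percentile goals would also need the
# response-time distribution buckets from the record.
records = [
    {"svc_class": "BATPROD", "period": 1, "hour": 10,
     "goal_velocity": 35, "using_samples": 500, "delay_samples": 500},
    {"svc_class": "BATPROD", "period": 1, "hour": 11,
     "goal_velocity": 35, "using_samples": 200, "delay_samples": 800},
]

def actual_velocity(rec):
    """Execution velocity: using / (using + delay), as a percentage."""
    total = rec["using_samples"] + rec["delay_samples"]
    return 100.0 * rec["using_samples"] / total if total else 0.0

def velocity_pi(rec):
    """PI for a velocity goal: goal velocity / actual velocity."""
    return rec["goal_velocity"] / actual_velocity(rec)

# One heat-chart cell per (service class, period, hour): the average PI
cells = defaultdict(list)
for rec in records:
    cells[(rec["svc_class"], rec["period"], rec["hour"])].append(velocity_pi(rec))

for key, pis in sorted(cells.items()):
    # Hour 10 is beating its goal (PI 0.7); hour 11 is missing it (PI 1.75)
    print(key, round(sum(pis) / len(pis), 2))
```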
So, what can I see from this heat chart? A few things:
- Something is happening at 11:00. The PI for all service classes jumps beyond 1
- Outside of 11:00, UNIX period 1 is exceeding its performance goals. The performance goals may need to be changed
- UNIX period 2 is regularly not meeting its performance goals. Either there’s a performance problem, or the performance goal isn’t right
- Something is happening to the ONLINE service class after lunch: its PI increases
- Other than this, overall performance is around 1: performance is good
An Easy Dashboard
If I were in charge of z/OS performance, I’d be running regular jobs to create this sort of heat chart, and publishing it. Anyone would immediately be able to get a view of how the z/OS system is performing. More importantly, performance problems can be quickly identified, and then isolated.