
LongEx Mainframe Quarterly - May 2022

technical: How To Show a 10,000 Foot View Of z/OS Performance

Here’s a question. How can you show the performance of a z/OS system? Is the performance good, or not good? If your job is to do exactly that, the chances are that you’re ready to start talking about the different workloads, response time goals for each, and really get into the nitty gritty. Different workloads will have different performance goals: so, your cash IMS transactions may have different performance goals to your customer CICS transactions. And these may change throughout the day: so those IMS transactions could run slower overnight.

To show all this performance, you’ll probably have a lot of graphs, tables and reports. You may have some showing response times for groups of CICS and IMS transactions, and others showing elapsed times for batch jobs or batch streams.

But, there’s an easier way to look at the overall performance of a z/OS system. It comes free with z/OS. And it’s running on your z/OS system right now.

WLM as a Performance Tool?

You know about z/OS Workload Manager (or WLM to its friends): the z/OS component that tries to ensure that the right things get z/OS resources when they need them.

Let’s step back and think about WLM for a bit. In many cases, its job is pretty easy. If there are enough resources (particularly CPU) to go around, then there’s not much for WLM to do. However, when CPU usage reaches 100%, WLM comes into its own. It makes sure that the important workloads get the CPU they need to achieve their performance goals. The less important workloads can wait.

WLM is brilliant at making the most of a z/OS system with limited resources. The interesting thing is that it knows what the performance goals are, because z/OS systems administrators have put these into the WLM configuration. IMS transactions, Db2 workloads and batch jobs have been divided into groups called service classes. These service classes have been divided into periods, and each has been set performance goals. For example, 90% of our billing CICS transactions (those beginning with ‘A’ in CICS region PCICS1) should complete in 0.5 seconds. Our hot batch is very important, and should have a velocity (time executing vs time waiting) of 50. Each has also been set an importance number: we now know which are more important, and which aren’t.
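
To make those goal types concrete, here’s a minimal sketch (in Python, with hypothetical names and values, not anything taken from WLM itself) of how the two examples above might be represented. The execution velocity calculation is the standard one: using samples divided by using plus delay samples, times 100.

    from dataclasses import dataclass

    @dataclass
    class ResponseTimeGoal:
        """e.g. 90% of billing CICS transactions should complete in 0.5 seconds."""
        percentile: float   # 90.0
        seconds: float      # 0.5
        importance: int     # 1 = most important ... 5 = least important

    @dataclass
    class VelocityGoal:
        """e.g. hot batch should achieve an execution velocity of 50."""
        velocity: float     # 50
        importance: int

    def execution_velocity(using_samples: int, delay_samples: int) -> float:
        """Percentage of samples spent using resources rather than waiting for them."""
        total = using_samples + delay_samples
        return 100.0 * using_samples / total if total else 0.0

    # Hypothetical service class periods matching the examples in the text
    billing_cics = ResponseTimeGoal(percentile=90.0, seconds=0.5, importance=1)
    hot_batch = VelocityGoal(velocity=50, importance=1)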

But here’s something interesting. We’ve already told WLM our performance objectives. So, we should be able to find out if workloads are achieving their objectives from WLM. And we can.

Performance Index

WLM assigns a number to every service class and period: a Performance Index (PI). You can hop to the IBM manuals to get the details about PI, but the basics are:

  • PI = 1: exactly achieving the performance objective
  • PI > 1: not achieving the performance objective (the higher the number, the worse the performance)
  • PI < 1: exceeding the performance objective
Hmmm. So WLM already has a number that tells us if our workloads are achieving their performance goals.

Even better: WLM records this PI periodically in the SMF Type 72 records. You can also see the PI using monitors like RMF Monitor III, IBM Omegamon, Broadcom SYSVIEW and more. So, how does this help?
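
As a rough illustration (my simplification, not WLM’s exact algorithm): for an execution velocity goal the PI is the goal velocity divided by the achieved velocity, and for an average response time goal it is the achieved time divided by the goal. Percentile response time goals use WLM’s response time distribution, which I won’t reproduce here.

    def velocity_pi(goal_velocity: float, actual_velocity: float) -> float:
        """PI for an execution velocity goal: goal divided by what was achieved."""
        return goal_velocity / actual_velocity if actual_velocity else float("inf")

    def avg_response_time_pi(actual_seconds: float, goal_seconds: float) -> float:
        """PI for an average response time goal: achieved divided by the goal."""
        return actual_seconds / goal_seconds

    # From the RMF screen below: BATPROD has a velocity goal of 35 and achieved 96
    print(round(velocity_pi(35, 96), 2))   # 0.36 - comfortably exceeding its goal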

A z/OS Performance Dashboard

Whenever I go to a new site, I want to get a quick look at the performance without knowing anything about the workloads. Is everything peachy, or are there some problems? If there are problems, are they only for some workloads, or everything? Are they during certain periods of the day, or all through the day?

I sometimes use monitors to get an initial feel. Here’s an example from IBM RMF Monitor III:

                          RMF V2R3   Sysplex Summary - PLEXA4           Line 1 of 70
     Command ===>                                                  Scroll ===> CSR
    
     WLM Samples: 1199    Systems: 3  Date: 05/26/22 Time: 23.39.00 Range: 300   Sec
    
                           >>>>>>>>XXXXXXXXXXXXXXXXXX<<<<<<<<
    
     Service Definition: STANDARD              Installed at: 02/08/20, 14.38.14
          Active Policy: STANDARD              Activated at: 02/08/20, 14.38.26
    
                    ------- Goals versus Actuals --------  Trans --Avg. Resp. Time-
                    Exec Vel  --- Response Time ---  Perf  Ended  WAIT EXECUT ACTUAL
     Name     T  I  Goal Act  ---Goal--- --Actual--  Indx  Rate   Time   Time   Time
    
     BATCH    W           88                               0.057  1718   7694   9289
     BATPROD  S  4    35  96                         0.36  0.010  2638   6928   8866
     BATTEST  S  5    35  88                         0.40  0.047  1521   7858   9379
     ONLINE   W           68                               56.80 0.483  890.1  25.33
     CICS     S  1        79    200 90%         99%  0.50  55.88 0.000  17.80  17.80
     ONLTASK1 S  1    60  81                         0.74  0.000 0.000  0.000  0.000
     DEVELOP  S  4        56   2500 85%         N/A   N/A  0.000 0.000  439.8  0.000
     DEVTASKS S  4    20  56                         0.36  0.013
     ONLINE   S  1       0.0    100 90%         96%  0.50  0.263 0.000  10.78  13.32
     ONLTASKS S  1    60  93                         0.65  0.630
     WEBTASK  S  1        67   2000 90%        100%  0.50  0.010 0.000  12.76  12.76
     SYSTEM   W           90                               0.000 0.000  0.000  0.000
     SYSSTC   S      N/A  88    N/A                        0.000 0.000  0.000  0.000

But these only provide a snapshot. I’m more interested in the history: particularly during the online day, or in areas where CPU usage or performance are a problem. For this, I grab some SMF Type 72 records (ideally, 6 weeks, excluding weekends and holidays), and create a heat chart that looks like this:

[Heat chart: for each service class and period, the average Performance Index for each hour of the day]

At the left we have each service class and period. I also like to see the importance level (1 = highest, 5 = lowest). I exclude discretionary workloads: they aren't (or shouldn't be) important.

The rest shows the average PI for each class and period. If it’s 1, the cell is green (we’re happy). It gets redder as the PI increases (and we are less happy). If it is far less than 1, that could also be a problem: I make these blue.

A couple of things to note:

• I remove the SYSTEM and SYSSTC service classes: nothing we can do with these
• I use the average PI for each hour over the period
• SMF Type 72 records don’t actually have a PI field: we need to calculate it from other fields (a sketch of the whole heat chart build follows this list)
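
Here’s a minimal sketch of that heat chart build in Python, assuming the Type 72 data has already been extracted (by whatever SMF reporting tool you use) into a CSV with hypothetical columns: a timestamp, the service class, period and importance, and a PI already calculated per interval as above.

    import pandas as pd
    import matplotlib.pyplot as plt
    from matplotlib.colors import LinearSegmentedColormap

    # Hypothetical CSV: one row per service class period per SMF interval,
    # with the PI already calculated from the Type 72 goal and actual fields.
    df = pd.read_csv("smf72_pi.csv", parse_dates=["timestamp"])

    # Exclude weekends, and the SYSTEM/SYSSTC service classes
    df = df[df["timestamp"].dt.dayofweek < 5]
    df = df[~df["service_class"].isin(["SYSTEM", "SYSSTC"])]

    # One row per service class period (with importance), one column per hour
    # of day, holding the average PI for that hour over the whole period
    df["hour"] = df["timestamp"].dt.hour
    df["label"] = (df["service_class"] + " P" + df["period"].astype(str)
                   + " (imp " + df["importance"].astype(str) + ")")
    heat = df.pivot_table(index="label", columns="hour", values="pi", aggfunc="mean")

    # Blue well below 1 (over-achieving), green around 1, red above 1 (missing goals)
    cmap = LinearSegmentedColormap.from_list("pi", ["blue", "green", "red"])
    fig, ax = plt.subplots(figsize=(12, 6))
    im = ax.imshow(heat.values, cmap=cmap, vmin=0.0, vmax=2.0, aspect="auto")
    ax.set_xticks(range(len(heat.columns)))
    ax.set_xticklabels(heat.columns)
    ax.set_yticks(range(len(heat.index)))
    ax.set_yticklabels(heat.index)
    fig.colorbar(im, ax=ax, label="Average Performance Index")
    fig.savefig("pi_heatmap.png", bbox_inches="tight")

The linear colour map puts green at a PI of 1 (the midpoint between 0 and 2); anything above 2 is clipped to full red.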

So, what can I see from this heat chart? A few things:

• Something is happening at 11:00. The PI for all service classes jumps beyond 1
• Outside of 11:00, UNIX period 1 is exceeding its performance goals. The performance goals may need to be changed
• UNIX period 2 is regularly not meeting its performance goals. Either there’s a performance problem, or the performance goal isn’t right
• Something is happening to the ONLINE service class after lunch: its PI increases
• Other than this, the average PI is around 1: performance is good

An Easy Dashboard

If I were in charge of z/OS performance, I’d be running regular jobs to create this sort of heat chart and publish it. At a glance, anyone can see how the z/OS system is performing. More importantly, performance problems can be quickly identified, and then isolated.


David Stephens