Adventures with Resilience: Looking at Resilience as an External Consultant
In the past few years, I've been asked more regularly by clients to look for ways to improve resilience. Often this isn't a primary goal of the project, but an add-on. In some cases, the entire project has been to improve resilience.
So how do I, as an external consultant, look for resilience issues at a client site?
First stop: error messages. z/OS and major subsystems are great at identifying potential problems, and telling you about them.
So, I'll start with the z/OS console, and work my way through the major subsystems (IMS, CICS, Db2, MQ etc.) to see if there are regularly occurring error messages. I may even use a REXX exec to scan through the z/OS syslog and isolate messages.
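That kind of scan is simple to sketch. Here's a minimal version in Python rather than REXX (the message-ID pattern and sample lines are simplified assumptions, not the real syslog record layout):

```python
import re
from collections import Counter

def count_message_ids(syslog_lines):
    """Count occurrences of each message ID (e.g. IEF450I, DFHSM0102)
    in a list of syslog lines. The pattern is a simplification: real
    syslog records also carry timestamps, system names and routing codes."""
    msg_id = re.compile(r"\b([A-Z]{3,6}\d{3,5}[A-Z]?)\b")
    counts = Counter()
    for line in syslog_lines:
        m = msg_id.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts

lines = [
    "DFHSM0102 APPL1 A storage violation has been detected",
    "DFHSM0102 APPL1 A storage violation has been detected",
    "IEF450I JOB1 STEP1 - ABEND",
]
print(count_message_ids(lines).most_common())
# → [('DFHSM0102', 2), ('IEF450I', 1)]
```

Sorting the counts highest-first quickly surfaces the messages worth digging into.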
Let's look at some examples:
- CICS storage violations: in a couple of sites I've seen regular storage violations from CICS error messages. Such storage violations scare me, and I consider them a high resilience risk.
- VSAM Shareoptions: CICS is great, and outputs messages for a range of potential VSAM issues. For example, a VSAM dataset with shareoptions (3,3), or an ESDS with RECOVERY(ALL).
- Disk Space: DFSMS now produces error messages when the space used for a storage group exceeds a threshold. These sometimes pop up.
- MQ Queue Index: MQ produces a message if it detects an application accessing a queue by correlation ID or message ID, and the queue has no index. Not unheard of.
My next step is to see what abends are occurring, and how frequently. I'll start with SMF Type 30 step-end records, then move on to CICS SMF 110 records for CICS transactions. I may also look at the EREP software records.
These will tell me if there are any significant issues. Some examples:
- CICS ASRA abends: I've seen a couple of applications that regularly have ASRA (S0C4) abends. This is a concern.
- X37 abends: In another article, I claim that sites should have no space-related abends. If they do, there are some things that can be done to fix them.
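The frequency part of this step can be sketched in a few lines. Here, simplified dicts stand in for decoded SMF Type 30 records (decoding the real binary records is a separate job), and the threshold is an arbitrary illustration:

```python
from collections import Counter

def recurring_abends(step_records, threshold=3):
    """Return (jobname, abend code) pairs seen at least `threshold` times.
    Each record is a simplified dict standing in for a decoded SMF
    Type 30 (step-end) record; a missing/None abend means a clean step."""
    counts = Counter(
        (r["jobname"], r["abend"]) for r in step_records if r.get("abend")
    )
    return {pair: n for pair, n in counts.items() if n >= threshold}

records = [
    {"jobname": "PAYROLL1", "abend": "S0C4"},
    {"jobname": "PAYROLL1", "abend": "S0C4"},
    {"jobname": "PAYROLL1", "abend": "S0C4"},
    {"jobname": "BACKUP01", "abend": "SB37"},
    {"jobname": "DAILYRPT", "abend": None},
]
print(recurring_abends(records))
# → {('PAYROLL1', 'S0C4'): 3}
```

A one-off SB37 may be noise; the same job abending the same way every night is the pattern I'm looking for.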
z/OS Health Checker
My next port of call is to see what the z/OS Health Checker reports. I confirm that checks are enabled, see if checks for some third-party products are enabled, and see if there are any errors. I firmly believe that every site should have a 'clean' z/OS Health Checker result, and am regularly surprised by sites that don't. Some examples:
- At one site, some channels were defined with a slower speed than they could achieve. This could have affected performance if workloads increased.
- At another site, they still had catalogs with IMBED and REPLICATE specified. IBM recommends that these not be specified, as they could impact catalog resilience.
- At a third site, their page datasets were quite full, indicating that the page dataset sizes were not large enough, and the site probably needed more memory.
- In many sites, I've seen checks disabled. I'm a big believer in using every feature available to improve resilience.
My next step is to look at performance. I like to create a heat chart of the performance index of every WLM service class period, and see if any are regularly higher than 1 (indicating that goals are not being met). This heat chart may look something like this:
I'll then dig deeper into service class periods with a performance index that is regularly higher than 1, or even those that are regularly far below 1.
In one site, I found many service class periods that never met their goals. In this case, the actual performance was OK: the goals were unrealistic, so it was the WLM definitions that needed to be modified.
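The heat chart itself is easy to mock up. A minimal text version (the service class names, PI values and bucket thresholds here are illustrative; real values would come from SMF Type 72 / RMF workload activity data):

```python
def pi_heat_row(label, pi_values):
    """One row of a text heat chart: '.' where the performance index
    is <= 1 (goal met), '+' for 1 < PI <= 2, '#' for PI > 2."""
    def cell(pi):
        if pi <= 1.0:
            return "."
        return "+" if pi <= 2.0 else "#"
    return f"{label:<14}" + "".join(cell(v) for v in pi_values)

# One row per service class period, one cell per interval (e.g. hourly).
print(pi_heat_row("ONLINE.1", [0.6, 0.8, 1.3, 0.9]))
print(pi_heat_row("BATCHLO.2", [1.4, 2.5, 2.1, 1.8]))
```

A row full of '#' characters jumps out immediately, which is the whole point of the chart.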
I'll often do a very quick 'capacity check'. This starts with a quick look at the CPU capacity used in each z/OS system using the SMF Type 70 records. I'll also look at the SMF Type 72 records to see which subsystems are using the most CPU. This can help determine if CPU provisioning is an issue, and if something needs to be done about it.
I'll check the page datasets to see if they're being used, and if so, how often. I'll also try to get SMF70 and SMF72 statistics for a year, and look at workload growth.
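A first pass at the growth question can be as simple as a least-squares slope over monthly averages. A hand-rolled sketch (the numbers are made up; in practice the monthly averages would be summarised from a year of SMF Type 70 records):

```python
def monthly_growth_rate(cpu_busy):
    """Estimate average month-on-month growth (least-squares slope)
    from a list of monthly average CPU-busy percentages."""
    n = len(cpu_busy)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(cpu_busy) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cpu_busy))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den

# Twelve months of average CPU-busy percentages (illustrative only).
busy = [62, 63, 65, 64, 67, 68, 70, 69, 72, 73, 75, 76]
print(f"~{monthly_growth_rate(busy):.2f} percentage points per month")
```

Even a rough slope like this tells you whether the box has years of headroom or months.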
So, what are some examples?
- I've found some sites where the CPU growth has been a concern. In some cases, I've found ways of reducing this. In others, the client has had to buy more CPU.
- In one site, I found that CICS online workloads consumed 80% of the CPU, and had a high priority. So lower priority workloads would regularly wait. Although we found some workloads that could have a lower priority, there wasn't much else we could do. The CICS workloads needed the higher priority, so the client had to look at buying more CPU.
Everyone forgets about coupling facilities, so I'll do a quick check. In particular, I'll use RMF Monitor III to check that they're not sharing CPs (or if they are, that they're using thin interrupts), that coupling facility memory usage is around 50% or lower, and that coupling facility CPU usage is well below 50%.
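A checklist like that is easy to encode. A sketch, with made-up field names standing in for values you'd read off RMF Monitor III CF reports:

```python
def cf_health(cf):
    """Flag coupling facility risks against the rule-of-thumb thresholds
    described above. `cf` is a simplified dict with illustrative keys."""
    issues = []
    if cf["shared_cps"] and not cf["thin_interrupts"]:
        issues.append("shared CPs without thin interrupts")
    if cf["storage_used_pct"] > 50:
        issues.append("storage use above 50%")
    if cf["cpu_busy_pct"] >= 50:
        issues.append("CPU busy not well below 50%")
    return issues

print(cf_health({"shared_cps": True, "thin_interrupts": False,
                 "storage_used_pct": 60, "cpu_busy_pct": 20}))
# → ['shared CPs without thin interrupts', 'storage use above 50%']
```

An empty list means the quick check passed; anything else goes on the list for deeper analysis.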
Often, I don't have a lot of time to look at resilience. So, I've needed checks that quickly provide initial information about resilience. These can then guide any further digging or analysis.
The above steps have worked well.