LongEx Mainframe Quarterly - November 2020
Error Messages
First stop: error messages. z/OS and major subsystems are great at identifying potential problems, and telling you about them. So, I'll start with the z/OS console, and work my way to the major subsystems (IMS, CICS, Db2, MQ, etc.) to see if there are regularly occurring error messages. I may even use a REXX to scan through the z/OS syslog and isolate messages. Let's look at some examples:
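As a rough illustration of the syslog scan described above, here is a minimal sketch in Python rather than REXX. It assumes the syslog has been exported as plain text lines. The `MESSAGE_ID` pattern, `count_message_ids` and `top_messages` are hypothetical names introduced for this sketch. The pattern is only an approximation of how message IDs look, and IDs that do not end in a type-code letter would need a different pattern.

```python
import re
from collections import Counter

# Assumed shape of a message ID: a 3-letter component prefix, up to three
# more letters or digits, a 3-4 digit number, and a type-code letter.
# This is an approximation for illustration, not an official definition.
MESSAGE_ID = re.compile(r"\b[A-Z]{3}[A-Z0-9]{0,3}\d{3,4}[ADEISWX]\b")


def count_message_ids(lines):
    """Count the first message ID found on each syslog line."""
    counts = Counter()
    for line in lines:
        match = MESSAGE_ID.search(line)
        if match:
            counts[match.group(0)] += 1
    return counts


def top_messages(lines, limit=10):
    """Return the most frequent message IDs as (id, count) pairs."""
    return count_message_ids(lines).most_common(limit)


if __name__ == "__main__":
    # Made-up sample lines in a simplified layout (time, job ID, message).
    sample = [
        "10:15:01 JOB12345 IEF450I PAYROLL STEP1 - ABEND=S0C7 U0000",
        "10:15:02 JOB12345 IEA995I SYMPTOM DUMP OUTPUT",
        "10:20:11 JOB12399 IEF450I BILLING STEP2 - ABEND=S0C4 U0000",
        "10:21:00 STC00017 DSNT501I RESOURCE UNAVAILABLE",
    ]
    for message_id, count in top_messages(sample):
        print(f"{message_id:10} {count}")
```

Sorting by frequency puts the regularly occurring messages at the top, which is usually where I'd start looking. A real syslog has a different layout from the sample lines, so the pattern may need tuning to avoid false matches.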
Abends
My next step is to see what abends are occurring, and how frequently. I'll start with SMF Type 30 step end records. I'll then move on to CICS SMF Type 110 records for CICS transactions. I may also look at the EREP software records. These will tell me if there are any significant issues. Some examples:

z/OS Health Checker
My next port of call is to see what the z/OS Health Checker reports. I confirm that checks are enabled, see if checks for some third-party products are enabled, and see if there are any errors. I firmly believe that every site should have a 'clean' z/OS Health Checker result, and am regularly surprised by sites that don't. Some examples:
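The Health Checker review above can be sketched in a few lines of Python. This assumes the check list has already been exported into records with an owner, name, state and status. The `Check` class, `needs_attention` and `review` are hypothetical names for this sketch, and the state and status strings it tests for are assumptions about how the export looks, so adjust them to match your own output.

```python
from dataclasses import dataclass


@dataclass
class Check:
    owner: str
    name: str
    state: str   # assumed to look like "ACTIVE(ENABLED)" or "INACTIVE(DISABLED)"
    status: str  # assumed to look like "SUCCESSFUL" or "EXCEPTION-MED"


def needs_attention(check):
    """True if the check is not running, or is reporting an exception."""
    state = check.state.upper()
    status = check.status.upper()
    not_running = "INACTIVE" in state or "DISABLED" in state
    return not_running or status.startswith("EXCEPTION")


def review(checks):
    """Return only the checks that stop the result from being 'clean'."""
    return [check for check in checks if needs_attention(check)]


if __name__ == "__main__":
    # Illustrative sample data, not taken from a real system.
    checks = [
        Check("IBMRSM", "RSM_MEMLIMIT", "ACTIVE(ENABLED)", "SUCCESSFUL"),
        Check("IBMXCF", "XCF_CDS_SPOF", "ACTIVE(ENABLED)", "EXCEPTION-MED"),
        Check("IBMUSS", "USS_PARMLIB", "INACTIVE(DISABLED)", "INACTIVE"),
    ]
    for check in review(checks):
        print(f"{check.owner:8} {check.name:20} {check.state:20} {check.status}")
```

An empty result from `review` is the 'clean' outcome; anything it lists is either a check that isn't running or one raising an exception.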
Performance
My next step is to look at performance. I like to create a heat chart of the performance index of every WLM service class period, and see if any are regularly higher than 1 (indicating that goals are not being met). This heat chart may look something like this:

I'll then dig deeper into service class periods with a performance index that is regularly higher than 1, or even those that are regularly far below 1. At one site, I found many service classes and periods that never met their goals. In that case, the actual performance was OK, and it was the WLM definitions that needed to be modified.

Capacity
I'll often do a very quick 'capacity check'. This starts with a quick look at the CPU capacity used by the z/OS system using the SMF Type 70 records. I'll also look at the SMF Type 72 records to see which subsystems are using the most CPU. This can help determine if CPU provisioning is an issue, and if something needs to be done about it. I'll check the page datasets to see if they're being used, and if so, how often. I'll also try to get SMF Type 70 and Type 72 statistics for a year, and look at workload growth. So, what are some examples?
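One way to look at workload growth over a year is a simple straight-line fit through monthly figures. The sketch below is an assumption about how that could be done, not a description of any particular tool. It takes monthly average CPU busy percentages that have already been summarised from the SMF data, and the names `fit_trend` and `months_to_reach` are hypothetical.

```python
def fit_trend(values):
    """Least-squares line through evenly spaced monthly values.

    Returns (slope, intercept), where slope is the change per month.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two monthly values")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    slope = numerator / denominator
    intercept = mean_y - slope * mean_x
    return slope, intercept


def months_to_reach(values, ceiling):
    """Months after the last sample until the fitted line reaches ceiling.

    Returns None if the trend is flat or falling.
    """
    slope, intercept = fit_trend(values)
    if slope <= 0:
        return None
    months_from_start = (ceiling - intercept) / slope
    return max(0.0, months_from_start - (len(values) - 1))


if __name__ == "__main__":
    # Made-up monthly average CPU busy percentages.
    cpu_busy = [50, 52, 54, 56]
    slope, _ = fit_trend(cpu_busy)
    print(f"growth per month: {slope:.1f} points")
    print(f"months until 90% busy: {months_to_reach(cpu_busy, 90)}")
```

A straight line is a crude model, and seasonal peaks or one-off migrations will distort it, but it is quick and is often enough to show whether capacity deserves a closer look.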
Coupling Facility
Everyone forgets about coupling facilities, so I'll do a quick check. In particular, I'll use RMF Monitor III to check that they're not sharing CPs (or if they are, that they're using thin interrupts), that coupling facility memory usage is around 50% or lower, and that coupling facility CPU usage is well below 50%.

Conclusion
Often, I don't have a lot of time to look at resilience. So, I've needed checks that quickly provide initial information about resilience. These can then guide any further digging or analysis. The above steps have worked well.