technical: Five z/OS Problem Prevention and Analysis Tools You Probably Aren't Using
z/OS continues to be the availability leader when compared with any other operating system. Over the years, z/OS has added software and features to manage or prevent problems and further extend this leadership - many of which you're probably not using now.
Runtime Diagnostics
Runtime Diagnostics takes the hard work out of z/OS problem determination. Start it and it will search the z/OS OPERLOG for messages or message combinations that indicate a problem - all within a minute. It will also check things like ENQ and latch contention, address spaces using too much CPU or holding locks, looping tasks, and more. Rather than running continuously, it is designed to be started when a problem has been detected and you want a place to start looking. In many ways it does the initial problem diagnosis a Systems Programmer would do, but faster.
- What It Is: A z/OS-based feature to quickly identify potential problems on a z/OS system.
- Why it's cool: Simple to set up, simple to use, and free with z/OS. Simple reports quickly indicate if a problem was found.
- Why it's not: It burns CPU while it runs, so it is only used when there is a problem, to speed up diagnosis.
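To make "searching the log for messages or message combinations" a little more concrete, here is a minimal Python sketch of the idea. It is not how Runtime Diagnostics is implemented. The rule names, the message IDs (all made up), and the assumption that the message ID is the first token on each log line are purely illustrative.

```python
from typing import Dict, FrozenSet, Iterable, List, Set

# Hypothetical rules: each finding needs ALL of its message IDs to be present.
# These message IDs are invented for illustration; they are not real z/OS messages.
RULES: Dict[str, FrozenSet[str]] = {
    "buffer shortage": frozenset({"XYZ100E"}),
    "component restart loop": frozenset({"XYZ200I", "XYZ201E"}),
}


def message_ids(log_lines: Iterable[str]) -> Set[str]:
    """Collect the first token of each non-empty line (assumed to be the message ID)."""
    ids: Set[str] = set()
    for line in log_lines:
        tokens = line.split()
        if tokens:
            ids.add(tokens[0])
    return ids


def scan(log_lines: Iterable[str], rules: Dict[str, FrozenSet[str]] = RULES) -> List[str]:
    """Return the names of all rules whose message IDs all appear in the log lines."""
    seen = message_ids(log_lines)
    return sorted(name for name, needed in rules.items() if needed <= seen)
```

Calling `scan()` on a window of log lines returns the names of any rules that matched, which is roughly the "place to start looking" the report gives you.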
Health Checker
"Ahah!", you say. "We've already implemented Health Checker." And you're probably right. A 2014 SHARE survey showed 77% of sites had implemented Health Checker. And from z/OS 2.1, Health Checker will be started by default, which will take care of the rest.
However, by 'implemented', most sites mean it is switched on with the default settings. Which is great - IBM provides a suite of checks that are brilliant, and are automatically implemented with little or no extra configuration effort. So far, so good.
The problem is that most sites consider Health Checker a handy tool for Systems Programmers. So there's no automation to detect Health Checker alerts and notify people or raise problem tickets when they occur. Nor are there procedures or requirements to regularly check and resolve Health Checker alerts. This can result in issues being overlooked for some time.
- What It Is: Regularly and automatically check for problems or configurations that do not adhere to best practices.
- Why it's cool: Easy to implement, easy to use, almost zero CPU overhead, free with z/OS.
- Why it's not: No reasons. From z/OS 2.1, it will be started by default.
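The missing automation does not have to be elaborate. The Python sketch below picks Health Checker exception messages out of a stream of log lines and turns them into simple ticket records. It rests on assumptions you should verify against your own messages documentation: that exceptions are reported as HZS0001I (low), HZS0002E (medium) and HZS0003E (high), and that each is followed by `CHECK(owner,name)`. The ticket record itself is hypothetical; a real site would hand it to its own automation or ticketing product.

```python
import re
from typing import Dict, Iterable, List

# Assumed mapping of exception message IDs to severity; verify before relying on it.
SEVERITY: Dict[str, str] = {"HZS0001I": "low", "HZS0002E": "medium", "HZS0003E": "high"}

# Assumed layout: "<msgid> CHECK(<owner>,<name>)" somewhere in the line.
PATTERN = re.compile(r"\b(HZS000[123][IE])\s+CHECK\(([^,]+),([^)]+)\)")


def find_exceptions(log_lines: Iterable[str]) -> List[Dict[str, str]]:
    """Return one hypothetical ticket record per Health Checker exception message."""
    tickets: List[Dict[str, str]] = []
    for line in log_lines:
        match = PATTERN.search(line)
        if match and match.group(1) in SEVERITY:
            tickets.append({
                "severity": SEVERITY[match.group(1)],
                "owner": match.group(2),
                "check": match.group(3),
            })
    return tickets
```

Sorting or filtering the returned records by `severity` is then enough to decide who gets paged and what simply goes into a weekly review.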
Predictive Failure Analysis
Predictive Failure Analysis (PFA) piggy-backs on the z/OS Health Checker infrastructure to add an extra layer of problem detection: soft failures. The classic example is a z/OS system running out of CSA memory. Many tasks use CSA, but you get little warning before it runs out. When this happens, the solution is often an IPL, followed by working back to find the cause.
PFA can detect this problem before it becomes critical. It analyses historical and current data (thanks to Health Checker facilities), saving historical data in a z/OS UNIX file system. From this it will raise Health Checker alerts when it thinks there is a problem. Areas that PFA currently checks include CSA, memory usage, LOGREC record rates, SMF record rates, JES SPOOL usage and console message rates. IBM has indicated that more will be added in the future.
This all sounds great, but how much CPU does it burn? Most of PFA is Java, so offloading to zIIP is on the cards.
- What It Is: Enhance Health Checker to detect soft failures.
- Why it's cool: Easy to implement, easy to use, free with z/OS.
- Why it's not: Some small CPU overhead, though much can be offloaded to zIIP.
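As a rough illustration of what "detecting a soft failure before it becomes critical" can mean, the Python sketch below fits a straight line to equally spaced usage samples (CSA usage, for example) and projects when the trend would reach a limit. This is a swapped-in, textbook least-squares trend, not PFA's actual modelling, which this article does not describe. The function name, the sample interval and the limit are all hypothetical.

```python
from typing import Optional, Sequence


def hours_until_limit(samples: Sequence[float], limit: float,
                      interval_hours: float = 1.0) -> Optional[float]:
    """Project when a least-squares trend through `samples` reaches `limit`.

    Samples are assumed to be equally spaced, `interval_hours` apart.
    Returns None when there are fewer than two samples or usage is flat or falling.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope = sxy / sxx
    if slope <= 0:
        return None
    intercept = mean_y - slope * mean_x
    current = intercept + slope * (n - 1)  # fitted value at the latest sample
    if current >= limit:
        return 0.0
    return (limit - current) / slope * interval_hours
```

An alert could then be raised whenever the projected time falls below whatever lead time you need to react, which is the kind of early warning an IPL-at-3am scenario lacks.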
zAware
First released in 2013 with the EC12, zAware is an extra partition on your System z mainframe. It analyses console message traffic patterns, learning what is normal over a 90-day period. Using some nifty analysis techniques, it identifies patterns that may indicate a problem. With GUI screens for problem analysis, it's a nice tool to detect problems as they occur, and simplify problem determination.
The flipside is that it's an extra partition on your mainframe. It uses ½ to 2 IFL or GP processors (2 partial CPs recommended), so it will reduce the CPU available. It will also slightly increase the CPU usage of every z/OS system monitored - say, doubling the System Logger CPU. The zAware partition also needs around 500 GB of DASD and 4 GB or more of memory, and adds load to your OSA or HiperSockets.
Bottom line - a solution for larger shops where availability is critical.
- What It Is: Advanced Workload Analysis Reporter (zAware) - an additional LPAR installed on a System z mainframe to monitor OPERLOG messages. Uses analytics to identify console message traffic patterns that may indicate a problem. Can work with software such as NetView, z/OSMF, BMC MAINVIEW AutoOperator and OMEGAMON XE to automate monitoring of anomalies.
- Why it's cool: Finds problems other monitors may miss, and possibly before they impact service. Nice GUI for problem determination.
- Why it's not: It's an extra partition on your mainframe. It is firmware supplied with hardware, with quite a few configuration tasks required. It adds a bit of z/OS CPU, and the partition needs its own CPU, DASD and memory. And one more thing: it isn't free.
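The "learn what is normal, then flag what isn't" idea can be sketched in a few lines of Python. This is not zAware's analytics, which are far more sophisticated and not described here. It is a simple frequency comparison: learn the average count of each message ID per interval, then flag IDs that were never seen during training or that appear far more often than usual. The function names and the `ratio` threshold are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List


def learn_baseline(intervals: Iterable[Iterable[str]]) -> Dict[str, float]:
    """Average count per interval for each message ID seen in the training intervals."""
    totals: Counter = Counter()
    n = 0
    for ids in intervals:
        totals.update(ids)
        n += 1
    if n == 0:
        return {}
    return {msg: count / n for msg, count in totals.items()}


def anomalies(baseline: Dict[str, float], current_ids: Iterable[str],
              ratio: float = 5.0) -> List[str]:
    """Flag message IDs that are new, or more than `ratio` times their usual count."""
    flagged = []
    for msg, count in Counter(current_ids).items():
        expected = baseline.get(msg)
        if expected is None or count > ratio * expected:
            flagged.append(msg)
    return sorted(flagged)
```

Even this crude version shows why a long training period matters: the baseline has to see month-end and quarter-end processing before it can stop flagging them as unusual.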
SmartCloud Analytics Log Analysis
In 2013 IBM released the Linux-based SmartCloud Analytics Log Analysis (SCA-LA) software to quickly analyse log data from many different sources. It included GUI screens to simplify message searching and analysis. The z/OS Insight Packs extend this to z/OS, covering z/OS syslog and WebSphere Application Server logs.
- What It Is: Software to index, search and analyse messages from multiple systems from a single GUI.
- Why it's cool: Quickly search and process messages from multiple systems: z/OS and other.
- Why it's not: It isn't free. It is an extra Linux box, with message collectors on each z/OS system - so some z/OS CPU overhead.
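To show what "index and search messages from multiple systems" boils down to, here is a small Python sketch of an inverted index: each word points to the messages containing it, so a search is a set intersection rather than a scan of every log. It is a generic technique, not the product's implementation, and the class name, method names and sample messages are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple


class LogIndex:
    """Toy inverted index over (system, message) records."""

    def __init__(self) -> None:
        self._records: List[Tuple[str, str]] = []
        self._index: Dict[str, Set[int]] = defaultdict(set)

    def add(self, system: str, message: str) -> None:
        """Store a message and index each whitespace-separated word, case-insensitively."""
        record_id = len(self._records)
        self._records.append((system, message))
        for token in message.lower().split():
            self._index[token].add(record_id)

    def search(self, *terms: str) -> List[Tuple[str, str]]:
        """Return records containing ALL the given terms, in the order they were added."""
        if not terms:
            return []
        ids: Optional[Set[int]] = None
        for term in terms:
            hits = self._index.get(term.lower(), set())
            ids = hits if ids is None else ids & hits
        return [self._records[i] for i in sorted(ids or set())]
```

For example, after adding messages from a z/OS system and a Linux server, `index.search("job1")` returns matching lines from both, which is the cross-platform view that a single-system SYSLOG browse cannot give you.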