LongEx Mainframe Quarterly - February 2015
z/OS continues to be the availability leader when compared with any other operating system. Over the past years z/OS has included software and features to manage or prevent problems and further extend this leadership - many of which you're probably not using now.

Runtime Diagnostics

Runtime Diagnostics takes the hard work out of z/OS problem determination. Set it off and it will search the z/OS OPERLOG for messages or message combinations that indicate a problem - all within a minute. It will also check things like ENQ and latch contention, address spaces using too much CPU or too many locks, looping tasks, and more. Rather than running continuously, it is designed to be started when a problem has been detected and you want a place to start looking. In many ways it does the initial problem diagnosis a Systems Programmer would do, but faster.

Health Checker

"Aha!", you say. "We've already implemented Health Checker." And you're probably right. A 2014 SHARE survey showed 77% of sites had implemented Health Checker. And from z/OS 2.1, Health Checker will be started by default, which will take care of the rest. However, for most sites 'implemented' means switched on with the default settings. Which is great - IBM provide a suite of checks that are brilliant, and are automatically implemented with little or no extra configuration effort. So far, so good.

The problem is that most sites consider Health Checker a handy tool for Systems Programmers. So there's no automation to detect Health Checker alerts and notify people or raise problem tickets when they occur. Nor are there procedures or requirements to regularly check and resolve Health Checker alerts. This can result in issues being overlooked for some time.
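
To give a feel for how little automation this needs, here is a minimal Python sketch that scans log lines for Health Checker exception messages and picks out the ones worth a ticket. It assumes the exception messages HZS0001I (low severity), HZS0002E (medium) and HZS0003E (high) appear in the log in the form `HZS0003E CHECK(owner,name):`. The layout of the rest of each log line, the function names and the ticket threshold are all illustrative assumptions, and most sites would do this in their automation product rather than a script.

```python
import re

# Health Checker exception message IDs and the severity each one signals.
SEVERITY = {"HZS0001I": "low", "HZS0002E": "medium", "HZS0003E": "high"}
ORDER = ["low", "medium", "high"]

# Assumed message layout, e.g. "HZS0003E CHECK(IBMXCF,XCF_CDS_SEPARATION):"
PATTERN = re.compile(r"\b(HZS000[123][IE])\s+CHECK\(([^,()]+),([^()]+)\)")


def find_exceptions(log_lines):
    """Return one dict per Health Checker exception message found."""
    found = []
    for line in log_lines:
        match = PATTERN.search(line)
        if match and match.group(1) in SEVERITY:
            found.append({
                "msgid": match.group(1),
                "severity": SEVERITY[match.group(1)],
                "owner": match.group(2).strip(),
                "check": match.group(3).strip(),
            })
    return found


def tickets_needed(exceptions, min_severity="medium"):
    """Keep exceptions at or above min_severity, one per (owner, check)."""
    threshold = ORDER.index(min_severity)
    seen = set()
    tickets = []
    for exc in exceptions:
        key = (exc["owner"], exc["check"])
        if ORDER.index(exc["severity"]) >= threshold and key not in seen:
            seen.add(key)
            tickets.append(exc)
    return tickets
```

Feeding `tickets_needed(find_exceptions(lines))` into whatever raises your problem tickets means a repeated exception for the same check produces a single ticket rather than a flood.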

Predictive Failure Analysis

Predictive Failure Analysis (PFA) piggy-backs on the z/OS Health Checker infrastructure to add an extra layer of problem detection: soft failures. The classic example is a z/OS system running out of CSA memory. Many tasks use CSA, but you get little warning before it runs out. When this happens, your solution is often an IPL, and only afterwards can you go back to find out the cause.

PFA can detect this problem before it becomes critical. It analyses historical and current data (thanks to Health Checker facilities), saving historical data in a z/OS UNIX file system. From this it will raise Health Checker alerts when it thinks there is a problem. Areas PFA currently checks include CSA, memory usage, LOGREC record rates, SMF record rates, JES SPOOL usage and console message rates. IBM has indicated that more will be added in the future.

This all sounds great, but how much CPU does it burn? Most of PFA is Java, so offloading to zIIP is on the cards.
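
The idea of catching a soft failure from historical data can be illustrated with a few lines of Python. This is not PFA's actual algorithm, which IBM does not reduce to anything this simple. It is only a sketch of the general technique, fitting a straight line to past usage samples and estimating when the trend would reach capacity. The function name, the sample format and the units are assumptions made for the example.

```python
def hours_until_exhaustion(samples, capacity):
    """Estimate hours after the last sample until usage reaches capacity.

    samples  -- list of (hour, used) pairs, oldest first
    capacity -- total amount available, in the same units as 'used'
    Returns None when there is too little data or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None

    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None

    # Least-squares slope: growth in usage per hour.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None

    intercept = mean_u - slope * mean_t
    hour_full = (capacity - intercept) / slope
    return max(0.0, hour_full - samples[-1][0])
```

With hourly samples of common storage usage, a result of a few hours is the kind of early warning that lets you find the offending task before an IPL becomes the only option.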

zAware

First released in 2013 with the EC12, zAware is an extra partition on your System z mainframe. It analyses console message traffic patterns, learning what is normal over a 90-day period. Using some nifty analysis techniques, it identifies patterns that may indicate a problem. With GUI screens for problem analysis, it's a nice tool to detect problems as they occur, and simplify problem determination.

The flipside is that it's an extra piece of hardware on your mainframe. It uses ½ to 2 IFL or GP processors (2 partial CPs recommended), so it will reduce the CPU available. It will also slightly increase the CPU usage of every z/OS system monitored - say doubling the System Logger CPU. The zAware partition also needs around 500 GB of DASD and 4 GB or more of memory, and adds load to your OSA or HiperSockets. Bottom line - a solution for larger shops where availability is critical.

SCA-LA

In 2013 IBM released the Linux-based SmartCloud Analytics Log Analysis (SCA-LA) software to quickly analyse log data from many different sources. It included GUI screens to simplify message searching and analysis. The z/OS Insight Packs expand this to z/OS: z/OS syslog and WebSphere Application Server logs.