Management: How to Measure Reliability
In October 2015, Australia's St George Bank experienced a mainframe failure that affected its customers and partner banks for some days. This isn't the only high-profile mainframe outage of the past few years, but it still comes as a surprise. Mainframe applications and systems are expected to be reliable – very reliable. And they usually are. But how can you really measure how reliable a mainframe application or system is? And does anyone really do this?
Whenever anyone thinks of reliability, Mean Time Between Failure (MTBF) is probably one of the first metrics they think of. A measurement of the average time between 'failures', it's a common metric used with equipment. A manufacturer can measure the time between failures of, say, 10,000 units, and get an idea of the reliability of that piece of equipment. They can also look at the reliability as the equipment ages, and come up with recommendations on when the equipment should be replaced.
But MTBF is less useful with computing. Equipment such as televisions or industrial machines usually works for a period of time, then (probably) fails – something has worn out. So an MTBF value for a TV of 10 years sounds good: use a TV for ten years and then replace it.
However, a computer program isn't physical. For online transactions, many thousands of transactions could execute in a few minutes. Hopefully most will succeed, and only a small number will fail. So if I have 1000 transactions running a second and 10 or 100 of these fail, the MTBF is still less than a second. You could still measure this in milliseconds (or microseconds), but the figure is losing its relevance.
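To put numbers on this, here is a minimal sketch of the arithmetic. It assumes failures are spread evenly across the measurement window, which is a simplification, and the function name is only illustrative.

```python
def mtbf_seconds(window_seconds: float, failures: int) -> float:
    """Average time between failures, assuming they are evenly spread over the window."""
    if failures <= 0:
        raise ValueError("MTBF is undefined when there are no failures")
    return window_seconds / failures


# 1000 transactions run in one second; 10 or 100 of them fail.
for failed in (10, 100):
    millis = mtbf_seconds(1.0, failed) * 1000
    print(f"{failed} failures per second -> MTBF of about {millis:.0f} ms")
```

Either way the result is a fraction of a second, which says little about how reliable the transaction really is.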
MTBF looks more interesting for batch applications. But a batch job may run daily, weekly or monthly. I've seen one batch program that ran 10,000 times in one hour. So MTBF is only useful when comparing batch jobs that run at the same frequency.
As far back as 1978, a NASA report (The Determination of Measures of Software Reliability by FD Maxwell) talked about two metrics:
- Failure ratio: F/N: number of failures (F) in a number of runs or executions in a given calendar period (N)
- Failure rate: f/t: number of failures (f) during the total CPU time in a given calendar period (t)
These look better, though the original NASA report was talking more about failures during testing – not during production (which for NASA can't be good).
A more common number I've used is simply the number of failures in a period. For example, the number of abends in a week, or the number of times a batch job failed in a week. Provided you are comparing figures for the same period of time (a week in our examples), this works well.
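The two NASA metrics can be sketched in a few lines. This is only an illustration: the function names and sample figures are made up, and the inputs are assumed to have been collected already for one calendar period.

```python
def failure_ratio(failures: int, runs: int) -> float:
    """F/N: failures divided by the number of runs or executions in the period."""
    if runs <= 0:
        raise ValueError("need at least one run in the period")
    return failures / runs


def failure_rate(failures: int, cpu_seconds: float) -> float:
    """f/t: failures divided by the total CPU time used in the period."""
    if cpu_seconds <= 0:
        raise ValueError("need some CPU time in the period")
    return failures / cpu_seconds


# Illustrative figures only: 1 failure in 100 runs, 3 failures in 7200 CPU seconds.
print(f"failure ratio: {failure_ratio(1, 100):.2%} of runs")
print(f"failure rate:  {failure_rate(3, 7200.0) * 3600:.1f} failures per CPU hour")
```

Unlike MTBF, both figures stay comparable between a job that runs once a month and one that runs thousands of times an hour.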
To get some decent failure ratios or rates, one of the first issues is to measure the number of failures. The good news is that users of z/OS have SMF records that help out here.
Perhaps the first place to go is the SMF records that detect abends – mainframe crashes. Mainframe systems are great, and abends are a good indicator of an application failure. We can even get some statistics about abends that were handled by the application (and so may not appear in logs). There are a few sources of information here:
- SMF Type 30 records show whether a job step abended, and the abend code
- SMF Type 110 records include similar information for a CICS transaction
- IMS Logs have similar information for IMS programs
- SMF Type 101 records include abends for each DB2 stored procedure
- z/OS Logrec records many software abends
So we can get some good statistics. For example, a CICS transaction may have 10 abends a week, or a batch job may abend once every 100 executions.
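A weekly abend count like this is easy to produce once the abend details have been extracted. The sketch below assumes that step is already done and works from a simple list of tuples; it does not read SMF records itself, and the job name and sample data are made up.

```python
from collections import Counter
from datetime import date

# Made-up sample data: (date of abend, job or transaction name, abend code).
SAMPLE_ABENDS = [
    (date(2015, 10, 5), "JOBA", "S0C4"),
    (date(2015, 10, 7), "JOBA", "S0C4"),
    (date(2015, 10, 14), "JOBA", "S222"),
]


def abends_per_week(records):
    """Count abends per (ISO year, ISO week, job or transaction name)."""
    counts = Counter()
    for day, name, _code in records:
        iso = day.isocalendar()  # (ISO year, ISO week number, ISO weekday)
        counts[(iso[0], iso[1], name)] += 1
    return counts


for (year, week, name), total in sorted(abends_per_week(SAMPLE_ABENDS).items()):
    print(f"{year}-W{week:02d} {name}: {total} abend(s)")
```

Dividing each weekly count by the number of executions in the same week gives the failure ratio discussed earlier.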
There are other statistics if you want to go down further. For example, WebSphere MQ can record the number of messages sent to a dead letter queue, and CICS records the number of short-on-storage alerts.
When looking at reliability, abend statistics are one of the first areas I look at. Interestingly, I've found many sites that don't perform this analysis on a regular basis.
So far we've talked about crashes – when an application or system abends. But of course there are other possible failures. One of the most obvious is error messages. These may indicate a problem that affects the operation of the system or application, but doesn't crash it.
When I'm looking at a subsystem for resilience, error messages are amongst the first things I look for. I'll get the error messages from the syslog, and use REXX or Splunk to process them. Here's an example from a WebSphere MQ channel initiator address space:
Now, some of these aren't very important. For example, CSQX209E occurs when a client ends the connection without a clean close – nice to clean up, but not a big issue. However, look at CSQX548E – this occurs when messages are sent to the dead letter queue (DLQ). This is potentially more serious. So a useful metric could be the number of DLQ messages per week.
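Counting messages such as these takes only a few lines of script. The sketch below uses Python rather than REXX or Splunk, and the sample lines are made-up placeholders, not real syslog output. The pattern assumes channel initiator message IDs have the form CSQX, three digits and a severity letter, as in the two IDs mentioned above.

```python
import re
from collections import Counter

# Assumed ID shape: "CSQX" + three digits + one severity letter (e.g. CSQX548E).
MSG_ID = re.compile(r"\b(CSQX\d{3}[A-Z])\b")

# Placeholder lines for illustration only; real syslog lines look different.
SAMPLE_LINES = [
    "CSQX209E (message text omitted)",
    "CSQX548E (message text omitted)",
    "CSQX209E (message text omitted)",
    "a line with no channel initiator message",
]


def count_message_ids(lines):
    """Count how often each channel initiator message ID appears."""
    counts = Counter()
    for line in lines:
        match = MSG_ID.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


for msg_id, total in count_message_ids(SAMPLE_LINES).most_common():
    print(msg_id, total)
```

Run over a week of syslog, the CSQX548E count gives the DLQ metric directly.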
The problem with all the metrics to date is that they assume that all failures are of equal importance. Take our channel initiator messages above: CSQX209E isn't important, CSQX548E is. Similarly, some abends (like an S222: operator cancelled) are less important than others (like an S0C4: program memory address exception). So prioritisation is important. Of course, this means that someone has to prioritise each error, and keep this priority table up to date.
Some of this may be done for you. For example, automation products are set up to suppress unimportant error messages, and highlight important ones. Many of my clients have set up automatic problem reporting from this automation. So problem records may be more useful in determining the rate of 'important' abends or error messages.
Prioritisation can provide some more relevant metrics. For example, the number of high severity abends per hour or high severity messages per week.
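A priority table can be as simple as a lookup from code to severity. The sketch below uses the four codes discussed above; the 'high' and 'low' labels, the function name and the sample list are illustrative assumptions. Any code missing from the table is reported as 'unclassified' so that someone notices the table needs updating.

```python
from collections import Counter

# Illustrative priority table built from the examples in the text.
SEVERITY = {
    "S0C4": "high",      # program memory address exception
    "S222": "low",       # operator cancelled
    "CSQX548E": "high",  # messages sent to the dead letter queue
    "CSQX209E": "low",   # client ended the connection without a clean close
}


def count_by_severity(codes, table=SEVERITY):
    """Count observed abend codes or message IDs by severity."""
    return Counter(table.get(code, "unclassified") for code in codes)


# Made-up observations for one period; S0C7 is deliberately not in the table.
observed = ["S0C4", "S222", "S222", "CSQX548E", "CSQX209E", "S0C7"]
print(count_by_severity(observed))
```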
Up to now we've been looking at errors: abends and error messages. But performance is also important. For example, users may expect a response time from a CICS transaction of less than one second. So if a transaction takes 5 seconds, this is probably an error. Similarly, a batch process running later than a set time may be an error.
Again, the above SMF records provide this information for CICS and IMS transactions, DB2 operations and batch jobs. So a high severity error may be a CICS transaction taking 150% of the time specified in a service level agreement.
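As a sketch, the 150% rule is a one-line comparison. The function and parameter names are hypothetical, and the response times are assumed to have been extracted from the SMF records already.

```python
def breaches_sla(response_seconds: float, sla_seconds: float, factor: float = 1.5) -> bool:
    """True if the response time is at least `factor` times the SLA target."""
    return response_seconds >= sla_seconds * factor


# Illustrative response times against a one-second SLA target.
for seconds in (0.4, 1.2, 1.5, 5.0):
    flag = "high severity" if breaches_sla(seconds, 1.0) else "ok"
    print(f"{seconds:.1f}s -> {flag}")
```

Counting the flagged transactions per week turns response times into the same kind of failure metric as abends.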
WebSphere MQ related applications provide a different perspective. These are often 'near' real-time, meaning that MQ messages must be processed within a few seconds or minutes. Measuring these can be more difficult when one transaction processes multiple input messages or requests. SMF 116 subtype 3 records can provide detailed information on a queue-level basis (maximum queue depth, queue ageing etc.), but won't provide information on each individual MQ message. I faced this issue in a recent project. Given that the receiving CICS transactions were working normally, my client used queue depths as a measurement of performance. MQ monitoring was used to generate alerts if the queue depths exceeded a set percentage of the maximum queue depth.
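The queue depth check itself is simple. The sketch below shows only the comparison; the function name, the 80% threshold and the sample depths are assumptions, and it does not call any MQ interface to obtain the depths.

```python
def depth_alert(current_depth: int, max_depth: int, threshold_percent: float = 80.0) -> bool:
    """True if the queue depth exceeds the given percentage of its maximum depth."""
    if max_depth <= 0:
        raise ValueError("maximum queue depth must be positive")
    return current_depth / max_depth * 100.0 > threshold_percent


# Illustrative depths for a queue with a maximum depth of 5000 messages.
for depth in (1200, 4000, 4500):
    status = "ALERT" if depth_alert(depth, 5000) else "ok"
    print(f"depth {depth} of 5000 -> {status}")
```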
You'll have noticed that I haven't mentioned a basic statistic used by many: downtime. Most service levels specify a figure for the maximum unscheduled downtime, or downtime during peak or business periods. And this is important.
However, in my experience, mainframe systems are reliable, and outages (meaning a CICS, WebSphere MQ or IMS system being down or unavailable) are rare. More often, problems are 'less' severe: a CICS transaction abending, a WebSphere MQ queue filling up, a batch job abending or running late.
Who Uses Reliability Statistics
So who actually uses reliability statistics such as those we've discussed? In my experience, clients generally only measure reliability statistics as they relate to service levels: CICS transaction response times, batch schedule cut-off times, uptime/downtime. Few today have the time or resources to add other monitoring of statistics as an 'early warning' of potential problems. So when I produce reports such as the number of abends or error messages, they often tell clients things they didn't know.