technical: Limiting the Damage from Loops and Spikes
It was a shock when I first found the loop. I was at a client site to reduce their CPU usage when I noticed that they were burning far more CPU than normal. And it had been going on for days, using almost one entire processor engine. This was really going to hurt their software licensing costs.
Spikes and loops can be nasty creatures. They're symptoms of problems or errors, which in themselves are a concern. But often worse, they affect other systems and applications around them. From increased costs to reduced performance from other systems or applications, spikes and loops can seriously affect processing. So how can you limit the damage they cause?
What are Loops and Spikes?
Any long-term application programmer has seen a looping program - a program that executes a loop an excessive number of times, sometimes infinitely. Put such a loop into a batch job and you have a batch job using excessive CPU until it is cancelled. Bad, but not catastrophic. Put it into an online CICS transaction, and things get worse. Other CICS applications can be starved of CPU, affecting their response time. If the transaction accesses a database inside the loop, the database workload jumps up. If the application continues to acquire storage inside the loop, then CICS storage can become constrained. This can cause short-on-storage conditions and seriously impact CICS processing.
This sounds bad, but it could be even worse. Put an abend inside the loop, and you have an application that seriously stresses the CICS region and its dump handling. Put that loop in z/OS SRB code, and you can lose a processor. Put it into a z/OS exit and you may lose your z/OS system. Loops can be bad.
But loops aren't limited to application programs and assembler routines. They can appear in REXX execs, be triggered from automated job schedules (think of two jobs automatically scheduling each other), automated operations rules, or even DFSMS ACS routines.
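The case of two jobs scheduling each other can be checked for before a schedule goes live. The following is a minimal sketch in Python, assuming the schedule can be exported as a mapping from each job name to the jobs it submits on completion. The function name, the job names, and the export format are all hypothetical; no particular scheduling product is implied.

```python
def find_trigger_cycle(triggers):
    """Return one cycle of job names as a list, or None if there is none.

    `triggers` (hypothetical format) maps a job name to the list of jobs
    it submits when it completes.
    """
    UNSEEN, IN_PROGRESS, DONE = 0, 1, 2
    state = {}
    path = []  # jobs on the current depth-first search path

    def visit(job):
        state[job] = IN_PROGRESS
        path.append(job)
        for nxt in triggers.get(job, ()):
            nxt_state = state.get(nxt, UNSEEN)
            if nxt_state == IN_PROGRESS:
                # nxt is already on the path: the path from nxt back to nxt is a loop
                return path[path.index(nxt):] + [nxt]
            if nxt_state == UNSEEN:
                cycle = visit(nxt)
                if cycle:
                    return cycle
        path.pop()
        state[job] = DONE
        return None

    for job in triggers:
        if state.get(job, UNSEEN) == UNSEEN:
            cycle = visit(job)
            if cycle:
                return cycle
    return None


# Hypothetical schedule: JOBA submits JOBB, and JOBB submits JOBA again.
looping_schedule = {"JOBA": ["JOBB"], "JOBB": ["JOBA"]}
clean_schedule = {"JOBA": ["JOBB"], "JOBB": ["JOBC"]}
```

Calling find_trigger_cycle(looping_schedule) returns the chain of jobs that forms the loop, while the clean schedule returns None.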
Loops are the most obvious cause of a spike - a short-term, unexpected, large increase in CPU usage. But they're not the only cause. Consider a user accidentally running a transaction that scans all transactions for the past 12 months rather than the past 30 minutes. Or a business analyst setting a Business Rule that accidentally increases processing ten-fold. Maybe even an automated batch schedule that starts a batch suite every day when it is only needed monthly.
When to Limit the Damage
Most sites concentrate on minimising loops and spikes, and this is an excellent idea. Proper testing and change management go a long way. However, many think less about ways of minimising the damage should one occur. One of the major reasons for this is that it is hard.
For starters, it is difficult to find the line: to determine when something is operating normally and when it is a loop/spike. Some systems may very well double their CPU usage at irregular intervals.
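One way to picture where "the line" sits is to compare each CPU sample against a baseline built from the samples just before it. The sketch below is only an illustration in Python: the function name, the window size, and the multiplying factor are assumptions, and the right values depend entirely on the workload. A system that legitimately doubles its CPU usage would need a factor well above 2.

```python
from statistics import median


def flag_spikes(samples, window=6, factor=3.0):
    """Return the indexes of samples that exceed `factor` times the
    median of the preceding `window` samples.

    `samples` is a list of CPU usage figures taken at regular intervals.
    """
    flagged = []
    for i in range(window, len(samples)):
        baseline = median(samples[i - window:i])
        if baseline > 0 and samples[i] > factor * baseline:
            flagged.append(i)
    return flagged


# Hypothetical CPU samples: a doubling at index 6, a large spike at index 8.
cpu_samples = [10, 12, 11, 9, 10, 11, 22, 10, 95, 11]
tolerant = flag_spikes(cpu_samples, window=6, factor=3.0)
strict = flag_spikes(cpu_samples, window=6, factor=2.0)
```

With a factor of 3 only the large spike is flagged; with a factor of 2 the doubling is flagged as well, which may or may not be a false alarm for a given system.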
Another problem is detection. A looping CICS transaction or batch job may be obvious. But a looping DFSMS ACS routine may only increase the CPU usage of the z/OS SMS address space - an address space rarely reported or monitored. Similarly, a scheduling error may not register as a problem because all jobs are completing within their parameters: the problem may be the number of jobs submitted.
How to Limit the Damage
The key to limiting damage from loops and spikes is early warning. And this cannot be done without prior preparation. There are two approaches that can be used:
- Kill the offending task
- Raise an alert
Kill the Offending Task
It may sound harsh, but a tactic used for decades on mainframes is to kill or cancel tasks using excessive CPU or memory. In the old days this was essential to prevent mainframes from crashing. Today it is unlikely that a looping or spiking task will crash z/OS; however, the same principles apply. Killing an offending task achieves a few things:
- It stops the offending task from causing further damage
- It often produces a dump that can be used for problem determination
- It activates automatic notification procedures for abends, so the relevant technical staff will know about the problem quickly
So how can this be done? Let's take a few examples:
- Batch job - Parameters on the JCL JOB and EXEC statements can be used to limit the resources a job or step can consume. For example, REGION and MEMLIMIT can limit the memory used - a useful fuse for a looping application continually acquiring memory. Smart sites limit who can specify REGION=0M (i.e. no limit).
Similarly, the TIME parameter specifies the maximum CPU time a job or step can use, and the PAGES parameter limits the amount of output sent to JES spool. Again, smart sites limit who can specify TIME=1440 (no limit). Many set reasonable defaults for these parameters using features such as z/OS SMF exits (for example, IEFUSI for REGION and MEMLIMIT).
- CICS - The CICS ICVR value specifies the runaway task time for all transactions. This can be overridden for individual transactions in the RDO Transaction resource RUNAWAY value.
- IMS - The IMS LOCKMAX parameter in the PSB or JCL override can limit the number of locks an application can acquire. The SEGNO parameter of a TRANSACT definition can limit the number of segments a message-driven transaction can acquire. Unfortunately, there is no SEGNO default - many IMS systems programmers use an IMS exit to set a default here.
- DB2 Stored Procedures - The ASUTIME parameter in the stored procedure definition specifies the maximum CPU time the procedure can use.
- WebSphere Application Server - There are a few timeout values that can be specified for threads, together with a CPU time limit.
- SRBs - Few will need to code an SRB, but SRBs can and should be defined with a set CPU limit when scheduled.
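All of the limits above work like a fuse: measure what the task has consumed, and stop it once a preset limit is passed. The Python sketch below shows that principle only. It is not how z/OS, CICS, IMS or DB2 enforce their limits (they do so from outside the application), and the names run_with_cpu_limit and CpuLimitExceeded are hypothetical. The check here is cooperative: it runs between units of work, so a loop inside a single unit would not be caught.

```python
import time


class CpuLimitExceeded(RuntimeError):
    """Raised when the work has used more CPU time than allowed."""


def run_with_cpu_limit(work_units, cpu_limit_seconds, clock=time.process_time):
    """Run each callable in `work_units` in order.

    After every unit, compare the CPU time consumed so far against
    `cpu_limit_seconds` and raise CpuLimitExceeded if it has been passed.
    `clock` defaults to the process CPU clock; it can be replaced for testing.
    Returns the number of units completed.
    """
    start = clock()
    completed = 0
    for unit in work_units:
        unit()
        completed += 1
        used = clock() - start
        if used > cpu_limit_seconds:
            raise CpuLimitExceeded(
                f"used {used:.2f}s CPU after {completed} units "
                f"(limit {cpu_limit_seconds}s)"
            )
    return completed
```

The exception plays the role of the abend: it stops further damage and gives support staff something to investigate.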
In addition to these facilities, some monitors can be used to terminate tasks when set criteria are met. For example, OMEGAMON XE for CICS can cancel CICS transactions that consume more than a specified quantity of memory. However, these facilities can be expensive in terms of CPU usage.
Raise an Alert
Most z/OS sites should have functions in place to automatically send alerts when a situation could impact z/OS processing. For example, alerts when system memory is low, WTO buffers are low, CPU usage is high, or free DASD space is low. However, many sites have only a minimum of these alerts, if any.
Most sites have several monitoring tools for various systems and products. Often these can be configured to send alerts on set conditions. For example, TMON for CICS could be set to generate alerts when a CICS transaction exceeds a set CPU usage. Similarly, automated operations can be used to generate alerts for all sorts of conditions: from batch processing missing windows to started tasks failing.
z/OS and related systems also include many facilities that can help. For example, the z/OS Runtime Diagnostics feature can generate an alert when an address space is using more than 95% of the capacity of a single CPU, or performing consistent repetitive processing (i.e. looping). Exits can also help. For example, the IMS DFSQSPC0/DFSQSSP0 exit can be used to report on IMS queue space issues - IBM provides samples.
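An alert rule of this kind can be as simple as a threshold that must be exceeded for several intervals in a row, so a single busy interval does not page anyone. The Python sketch below illustrates such a rule. It is not the Runtime Diagnostics implementation, and the function name, the address space names, and the sample figures are hypothetical.

```python
def sustained_high_cpu(samples, threshold=95.0, min_consecutive=3):
    """Return the names of address spaces that have at least
    `min_consecutive` consecutive samples above `threshold`.

    `samples` maps an address space name to a list of readings, each the
    percentage of a single CPU used during one interval.
    """
    offenders = []
    for name, values in samples.items():
        run = 0  # length of the current run of samples above the threshold
        for value in values:
            run = run + 1 if value > threshold else 0
            if run >= min_consecutive:
                offenders.append(name)
                break
    return offenders


# Hypothetical readings for three address spaces.
readings = {
    "CICSPRD1": [40, 97, 98, 99, 96],  # sustained: should alert
    "SMS": [96, 50, 97, 98, 20],       # brief bursts only
    "BATCH01": [10, 20, 15],
}
alerts = sustained_high_cpu(readings)
```

Only the address space with a sustained run above the threshold is returned; the two short bursts in the second address space are ignored.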
The looping problem I found was caused by an output archiving product. It had a report definition that created an infinite loop, and it had been doing so for over a week. Luckily it didn't impact any other processing, but my client's software licensing costs took a hit.
Looping or spiking tasks can hurt, and they're not just in applications. Prevention is an important step in reducing such tasks; however, it's wise to supplement this with procedures and functions to limit the damage should they occur. Correctly setting up your z/OS and related systems, together with their monitors, can stop these tasks before they do too much damage, or at least notify support staff as soon as they occur.