
LongEx Mainframe Quarterly - August 2021

technical: Diagnosing C Storage Overlays on z/OS

In our article The Problem With z/OS Strings and C, we showed how easy it is to create storage overlays when working with strings in C. One of the big reasons is that much z/OS data is fixed-length, while many C string functions assume that every string ends with a null character. But this isn't the only way overlays can happen.

A strong feature of C is the use of pointers. However, the strength and functionality of pointers is also a weakness: get a pointer wrong, and you can easily be working with the wrong storage.

Diagnosing and resolving any storage overlay is difficult. Fortunately, on z/OS there are some tools and features that can help. Let's look at some of them.

Module Preparation

The ideal solution is to prevent storage overlays before they happen. In practice, this can be very difficult. In our article The Problem With z/OS Strings and C, we talk about some programming options that can reduce the chance of storage overlays.

If prevention doesn't work, the strategy is to detect any storage overlay, with diagnostic information, as soon as possible. In other words, abend as soon as possible after an overlay. Storage overlays often occur some time before an abend or other symptom is detected, making it harder to track down the cause. If we can get a dump or other indication as soon as possible, diagnosis is easier.

The IBM XL C/C++ compiler option STACKPROTECT generates extra code to protect against storage overlays affecting the stack. Although this may decrease performance, it may provide an earlier indication of a storage overlay.

Another way of increasing our chances of early detection is to create a re-entrant module. This can be done purely by programming, or by specifying the RENT compiler option. If we then bind the module as re-entrant, some environments will load our program code into protected memory. This includes APF authorised modules, modules executed in z/OS UNIX, and CICS modules if the CICS SIT parameter RENTPGM=PROTECT is set. Then, if our program attempts to write to this protected area, it will abend.

HEAPCHK

C and C++ use z/OS Language Environment (LE) to manage storage. So, we can use LE features to diagnose our problems. HEAPCHK is one of these options.

Consider the following program:

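#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char *str1;

int main(void) {
  str1 = malloc(15);                               /* only 15 bytes */
  strcpy(str1,"This is a string of length 30!");  /* overlay: 31 bytes copied */
  printf("%s", str1);
  free(str1);
  return 0;
}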

C programmers will instantly see our problem: we're copying a 30-character string (31 bytes with its terminating null) into a 15-byte heap allocation. When we run this in a batch job, we get a user abend from Language Environment:

+CEE3798I ATTEMPTING TO TAKE A DUMP FOR ABEND U4094 TO DATA SET: DZS.D174
IGD100I 03DE ALLOCATED TO DDNAME SYS00001 DATACLAS (        )
IEA822I COMPLETE TRANSACTION DUMP WRITTEN TO DZS.D174
+CEE3797I LANGUAGE ENVIRONMENT HAS DYNAMICALLY CREATED A DUMP.
IEA995I SYMPTOM DUMP OUTPUT  031
  USER COMPLETION CODE=4094 REASON CODE=0000002C
 TIME=01.01.42  SEQ=00141  CPU=0000  ASID=0036
 PSW AT TIME OF ERROR  078D1400   85DDF9A6  ILC 2  INTC 0D
   NO ACTIVE MODULE FOUND
   NAME=UNKNOWN
   DATA AT PSW  05DDF9A0 - 00181610  0A0DA7F4  001C1811

Language Environment has detected that some of its control information has been overwritten, and abended with a dump. The SYSOUT DD doesn't tell us much more:

CEE0802C Heap storage control information was damaged.
         The traceback information could not be determined.

Let's add the following #pragma statement to the top:

#pragma runopts(HEAPCHK(ON,1,0,10,10,10,0,0,0))

Now, when we run our program, we still get an abend, but a U4042 rather than our original U4094. So, how does this help us? If we look at our SYSOUT DD, we see the following messages:

CEE3701W Heap damage found by HEAPCHK run-time option.
CEE3707I Left  pointer is bad in the free tree at 1AF39D30 in the heap 
segment beginning at 1AF39018.
1AF39D10: 00000000 00000000 1AF39018 00000018  E38889A2 4089A240 8140A385
 A2A340A2
|.........3......This is a test s|
1AF39D30: A3998995 874B0000 00000000 00000000  00000000 00000000 00000000
 00000000
|tring...........................|
CEE3707I Right pointer is bad in the free tree at 1AF39D30 in the heap 
segment beginning at 1AF39018.
1AF39D10: 00000000 00000000 1AF39018 00000018  E38889A2 4089A240 8140A385
 A2A340A2
|.........3......This is a test s|
1AF39D30: A3998995 874B0000 00000000 00000000  00000000 00000000 00000000
 00000000
|tring...........................|
CEE3702S Program terminating due to heap damage.

HEAPCHK tells LE to regularly check all heap storage for storage overlays. In our example, it detected the same storage overlay twice. This is because we told HEAPCHK to check the heap after every LE call. It doesn't point to the exact place in our program where the problem occurred: we'll still need to look at the dump to track this down. But it's a good place to start. More importantly, it is an early warning of our problem. Often, we don't see a dump from a storage overlay until well after the actual overlay has occurred. Tracking these down then becomes a real problem.

The HEAPCHK parameters allow programmers to determine how often the heap should be checked, and other features. We've used a C #pragma directive to turn it on. But there are other ways to specify this LE option.

The disadvantage to HEAPCHK is that it is heavy, and will impact performance. A lot. So, this should only be implemented when it is really needed. One option may be to use HEAPCHK when performing QA testing. Or in production for one batch step that is causing problems.

HEAPZONES

HEAPCHK is a very heavy option. From z/OS 2.1, HEAPZONES provides an alternative. Rather than check the heap regularly, HEAPZONES only checks when heap storage is freed. It allocates an extra piece of storage, called a check zone, at the end of each heap allocation, and checks this to see if it has been overwritten.

We cover HEAPZONES in more detail in our article Using HEAPZONES to Fix C Storage Overlays on z/OS.

CICS

CICS manages its own storage, though it still uses LE. So, there are some extra features we can use within CICS.

CICS programs can be defined with an execution key of either CICS or User. If the CICS SIT parameter STGPROT is set to YES, CICS-key programs can overwrite CICS programs and control blocks, but User-key programs cannot. Ensuring C programs are User-key increases the chance of detecting a storage overlay if the program attempts to write to CICS system storage.

This early detection is enhanced by specifying the CICS SIT parameter TRANISO=YES. This ensures that a C program in one transaction cannot overwrite storage in another transaction. TRANISO may affect performance, or increase CPU usage.

When a storage overlay is detected by CICS, it is called a storage violation. The CICS SIT parameter CHKSTRM enables regular checks for violations of terminal input/output areas (TIOAs). The CHKSTSK parameter enables regular checking of task storage for violations. CHKSTRM and CHKSTSK will also impact performance, and increase CPU usage.

When a storage violation is detected, CICS produces a dump. There are procedures for analysing these dumps. Scott McClure from IBM spelt these out in a webcast he presented in November 2015.

Tools for Storage Overlays

If you can't stop storage overlays, then you're in for some work. Fortunately, there are some tools on z/OS that can help. These will give you some good information, but won't lead you directly to the offending line of code. You'll still need to dive into some dumps.


David Stephens