LongEx Mainframe Quarterly - August 2021
In our article The Problem With z/OS Strings and C, we showed how easy it is to create storage overlays when working with strings in C. One of the big reasons is that much z/OS data is fixed-length, while many C string functions assume that every string ends in a null terminator. But this isn't the only way overlays can happen.

Module Preparation

The ideal solution is to prevent storage overlays before they happen. In practice, this can be very difficult. In our article The Problem With z/OS Strings and C, we talk about some programming options that can reduce the chance of storage overlays.

If prevention doesn't work, the strategy is to detect any storage overlay, with diagnostic information, as soon as possible. In other words, abend as soon as possible after an overlay. Storage overlays often occur some time before an abend or other symptom is detected, making it harder to track down the cause. If we can get a dump or other indication as soon as possible, diagnosis is easier.

The IBM XL C/C++ compiler option STACKPROTECT generates extra code to protect against storage overlays affecting the stack. Although this may decrease performance, it may provide an earlier indication of a storage overlay.

Another way of increasing our chances of early detection is to create a re-entrant module. This can be done purely by programming, or by specifying the RENT compiler option. If we then bind the module as re-entrant, some environments will load our program into protected memory. This includes APF authorised modules, modules executed in z/OS UNIX, and CICS modules if the CICS SIT parameter RENTPGM=PROTECT is set. Then, if our program attempts to write to this protected area, it will abend.

HEAPCHK

C and C++ use z/OS Language Environment (LE) to manage storage, so we can use LE features to diagnose our problems. HEAPCHK is one of these options. Consider the following program:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    char *str1;

    int main(void)
    {
        str1 = malloc(15);                                /* 15-byte area */
        strcpy(str1, "This is a string of length 30!");   /* deliberate overlay */
        printf("%s", str1);
        free(str1);
        return 0;
    }

C programmers will instantly see our problem: we're copying a 30-character string (31 bytes with its terminating null) into a 15-byte area obtained from malloc().
When we run this program in a batch job, we get a user abend from Language Environment:

    +CEE3798I ATTEMPTING TO TAKE A DUMP FOR ABEND U4094 TO DATA SET: DZS.D174
    IGD100I 03DE ALLOCATED TO DDNAME SYS00001 DATACLAS ( )
    IEA822I COMPLETE TRANSACTION DUMP WRITTEN TO DZS.D174
    +CEE3797I LANGUAGE ENVIRONMENT HAS DYNAMICALLY CREATED A DUMP.
    IEA995I SYMPTOM DUMP OUTPUT 031
    USER COMPLETION CODE=4094 REASON CODE=0000002C
    TIME=01.01.42 SEQ=00141 CPU=0000 ASID=0036
    PSW AT TIME OF ERROR 078D1400 85DDF9A6 ILC 2 INTC 0D
    NO ACTIVE MODULE FOUND
    NAME=UNKNOWN
    DATA AT PSW 05DDF9A0 - 00181610 0A0DA7F4 001C1811

Language Environment has detected that some of its control information has been overwritten, and abended with a dump. The SYSOUT DD doesn't tell us much more:

    CEE0802C Heap storage control information was damaged.
    The traceback information could not be determined.

Let's add the following #pragma statement to the top of the program:

    #pragma runopts(HEAPCHK(ON,1,0,10,10,10,0,0,0))

Now, when we run our program, we still get an abend, but a U4042 rather than our original U4094. So, how does this help us? If we look at our SYSOUT DD, we see the following messages:

    CEE3701W Heap damage found by HEAPCHK run-time option.
    CEE3707I Left pointer is bad in the free tree at 1AF39D30 in the heap segment beginning at 1AF39018.
    1AF39D10: 00000000 00000000 1AF39018 00000018 E38889A2 4089A240 8140A385 A2A340A2 |.........3......This is a test s|
    1AF39D30: A3998995 874B0000 00000000 00000000 00000000 00000000 00000000 00000000 |tring...........................|
    CEE3707I Right pointer is bad in the free tree at 1AF39D30 in the heap segment beginning at 1AF39018.
    1AF39D10: 00000000 00000000 1AF39018 00000018 E38889A2 4089A240 8140A385 A2A340A2 |.........3......This is a test s|
    1AF39D30: A3998995 874B0000 00000000 00000000 00000000 00000000 00000000 00000000 |tring...........................|
    CEE3702S Program terminating due to heap damage.

HEAPCHK tells LE to regularly check all heap storage for storage overlays.
In our example, it detected the same storage overlay twice. This is because we told HEAPCHK to check the heap after every LE call. It doesn't point exactly to the place in our program where the problem occurred: we'll still need to look at the dump to track this down. But it's a good place to start.

More importantly, it is an early warning of our problem. Often, we don't see a dump from a storage overlay until well after the actual overlay has occurred. Tracking these down then becomes a real problem.

The HEAPCHK parameters allow programmers to control how often the heap is checked, along with other features. We've used a C #pragma directive to turn it on, but there are other ways to specify this LE option, for example through a CEEOPTS DD statement in the batch JCL.

The disadvantage of HEAPCHK is that it is heavy, and will impact performance. A lot. So it should only be used when it is really needed. One option may be to use HEAPCHK during QA testing, or in production for a single batch step that is causing problems.

HEAPZONES

HEAPCHK is a very heavy option. From z/OS 2.1, HEAPZONES provides an alternative. Rather than checking the heap regularly, HEAPZONES only checks when heap storage is freed. It also allocates an extra piece of storage (a check zone) at the end of each heap allocation, and checks this to see if it has been overwritten. We cover HEAPZONES in more detail in our article Using HEAPZONES to Fix C Storage Overlays on z/OS.

CICS

CICS manages its own storage, though it still uses LE. So there are some extra features we can use within CICS.

CICS programs can be defined with an execution key of either CICS or User. If the CICS SIT parameter STGPROT is set to YES, CICS-key programs can overwrite CICS programs and control blocks, but User-key programs cannot. Ensuring C programs are User-key will increase the chance of detecting a storage overlay if the program attempts to write to CICS system storage. This early detection is enhanced by specifying the CICS SIT parameter TRANISO=YES.
This ensures that a C program in one transaction cannot overwrite storage belonging to another transaction. TRANISO may affect performance or increase CPU usage.

When a storage overlay is detected by CICS, it is called a storage violation. The CICS SIT parameter CHKSTRM enables regular checks for violations of a control block called the TIOA. The CHKSTSK parameter enables regular checking of all storage for violations. CHKSTRM and CHKSTSK will also impact performance and increase CPU usage.

When a storage violation is detected, CICS produces a dump. There are procedures for analysing these dumps. Scott McClure from IBM spelt these out in a webcast he presented in November 2015.

Tools for Storage Overlays

If you can't stop storage overlays, then you're in for some work. Fortunately, there are some tools on z/OS that can help. These will give you some good information, but won't lead you directly to the offending line of code. You'll still need to dive into some dumps.