LongEx Mainframe Quarterly - February 2024
One of the best things about z/OS, and about any product running on z/OS, is the messages they create. Really. When there is an error, these messages are brilliant for working out what went wrong and fixing it. They're just as good for finding out what is happening, and even for confirming that what should happen does. In this article, we are going to see how to decipher these messages, and how we can use them to fix our problems.

z/OS Messages

Let's look at an example. I have a batch job that has failed with a return code 16:

    -                      -----TIMINGS (MINS.)------
    -STEPNAME PROCSTEP    RC   EXCP   CONN    TCB    SRB  CLOCK
    -STEP1                16    203      0    .00    .00     .0
    IEF404I DSTEPHED - ENDED - TIME=17.41.05
    -DSTEPHED ENDED.  NAME-DESC  TOTAL TCB CPU TIME=   .00
    $HASP395 DSTEPHED ENDED - RC=0016

The first thing I do is to look for error messages, and I find one:

    IEC031I D37-04,IFG0554P,DSTEPHED,STEP1,OUT1,0AC7,VOL007,
            DSTEPHE.DCOLL2

Let's look at this message a bit closer. The first thing we see is a message ID: IEC031I. Every z/OS message has a message ID, and it tells us a lot. The first three characters tell us the software product or z/OS component issuing the message. I talk more about this, and how to find the 'owner' of these three characters, in another article. In our case, the message begins with 'IEC': it is issued by the z/OS component DFSMSdfp (commonly just called 'DFP'), and in particular by the "Basic (non-VSAM) access methods (BAM)." So, we already know that there's a problem with a non-VSAM dataset.

The last letter of our message ID (I) indicates the 'type' of message, and is documented in z/OS documentation. Common types include:
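To make the structure of a message ID concrete, here is a minimal Python sketch that splits an ID into its three-character prefix and its trailing type letter. The type-code table reflects the codes commonly listed in IBM's z/OS messages documentation (A, D, E, I, S, W); treat it as an assumption to check against the manual for your release. The names `TYPE_CODES`, `PREFIX_OWNERS` and `split_message_id` are made up for illustration, and the owner table holds only the one prefix discussed so far.

```python
# Type codes as commonly listed in IBM's z/OS messages documentation.
# Assumption: verify against the manual; individual products may differ.
TYPE_CODES = {
    "A": "Action: the operator must perform a specific action",
    "D": "Decision: the operator must choose an alternative",
    "E": "Eventual action: action is needed when time is available",
    "I": "Information: no operator action is required",
    "S": "Severe error: usually for a system programmer",
    "W": "Wait: processing stops until a required action is performed",
}

# Illustrative only: the single prefix identified in this article.
PREFIX_OWNERS = {"IEC": "DFSMSdfp (DFP)"}


def split_message_id(message_id):
    """Split a message ID into prefix, owner, type code and type meaning."""
    prefix = message_id[:3]
    last = message_id[-1]
    # Not every message ID ends in a type letter; some end in a digit.
    type_code = last if last.isalpha() else None
    return {
        "prefix": prefix,
        "owner": PREFIX_OWNERS.get(prefix, "unknown (look up the prefix)"),
        "type_code": type_code,
        "type_meaning": TYPE_CODES.get(type_code, "no type letter recognised"),
    }


if __name__ == "__main__":
    for key, value in split_message_id("IEC031I").items():
        print(f"{key:13} {value}")
```

For IEC031I this reports the prefix IEC, the owner DFSMSdfp (DFP) and the type code I.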
So, our message is informational: there's no immediate action required to keep z/OS going.

There's a lot of information after the message ID, and this differs from message to message. The good news is that every message is documented in IBM documentation. The IBM documentation for our IEC031I message looks like this:

[Image: IBM documentation entry for message IEC031I]

So, we know that there was a system D37 abend: you've probably already figured this out (if you don't know what a D37 abend is, IBM documentation tells all). But bear with me, there's more good stuff in this message. Looking at this documentation and our error message, we know:
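One way to line the message text up with the documentation is a small parser that names each comma-separated field. This is only a sketch: the field names below follow my reading of the IEC031I layout and the example in this article, and they are assumptions. IBM's documented layout also allows optional fields (such as a diagnostic code) that this sketch does not handle. `IEC031I_FIELDS` and `parse_iec031i` are hypothetical names.

```python
# Assumed field order, matching the eight fields in the article's example.
IEC031I_FIELDS = [
    "abend_and_return_code",
    "module",
    "job_name",
    "step_name",
    "dd_name",
    "device_number",
    "volume_serial",
    "data_set_name",
]


def parse_iec031i(text):
    """Split an IEC031I message into named fields (sketch, not exhaustive)."""
    message_id, _, rest = text.strip().partition(" ")
    if message_id != "IEC031I":
        raise ValueError(f"not an IEC031I message: {message_id!r}")

    # The job log wraps the message, so fields may carry stray blanks.
    values = [field.strip() for field in rest.split(",")]
    if len(values) != len(IEC031I_FIELDS):
        raise ValueError(f"expected {len(IEC031I_FIELDS)} fields, got {len(values)}")

    parsed = dict(zip(IEC031I_FIELDS, values))
    # "D37-04" is the system abend code followed by the return code.
    abend, _, return_code = parsed.pop("abend_and_return_code").partition("-")
    parsed["abend_code"] = abend
    parsed["return_code"] = return_code
    return parsed


message = (
    "IEC031I D37-04,IFG0554P,DSTEPHED,STEP1,OUT1,0AC7,VOL007, DSTEPHE.DCOLL2"
)

if __name__ == "__main__":
    for name, value in parse_iec031i(message).items():
        print(f"{name:15} {value}")
```

Read this way, the message names the failing job (DSTEPHED), step (STEP1), DD statement (OUT1), device (0AC7), volume (VOL007) and data set (DSTEPHE.DCOLL2), along with abend D37 and return code 04.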
Further down in the IBM documentation for our message, it explains return code 4: "A data set opened for output used all the primary space, and no secondary space was requested. Change the JCL specifying a larger primary quantity or add a secondary quantity to the space parameter on the DD statement." So, the manual even tells us how to fix the problem. Brilliant!

CICS Messages

Let's look at another message:

    DFHFC0307 PRODCICS I/O error on file FILE001, component code X'01'. File is temporarily disabled.

Our message ID (DFHFC0307) begins with 'DFH': this is a CICS message. CICS messages often tell you the CICS component (or 'domain') that issues the message in the 4th and 5th characters of the message ID. In our case, it is 'FC'. CICS documentation tells us that FC means CICS 'file control.'

Again, CICS documentation tells us all about the message:

[Image: CICS documentation entry for message DFHFC0307]

So, we know that there was an I/O error on the base cluster of the CICS file FILE001 in the CICS region PRODCICS. On the same page, the documentation tells us what CICS has done ("Activity against the file is stopped. The file is closed, then reopened in an attempt to release the VSAM output buffers") and gives us some ideas on how to fix the problem ("follow the standard procedure for I/O errors").

If the documentation isn't enough help, a message ID is a great search string for IBM support: there may be an informational APAR or something else to help. Here's a search of IBM support for this CICS message:

[Image: IBM support search results for DFHFC0307]

WebSphere Application Server

It's not just 'traditional' z/OS software that outputs and documents messages. Consider the following message:

    CWWKG0104W: The ID attribute with value ${izu.ssl.config} for the configuration element ssl must have a fixed value. Variables in the ID attribute will not be replaced.

This message (beginning with CWW) is issued by a WebSphere Application Server Liberty server. It looks like there's an issue with a configuration file.
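The two IDs in this part of the article can be pulled apart in a similar way. The sketch below is hedged: the CICS layout (DFH, a two-character domain, then four digits) comes from the description above, while the Liberty layout (a CWW-based letter prefix, four digits, and a trailing type letter) is inferred from this single example and is an assumption. `CICS_DOMAINS` holds only the one domain named in the article, and `describe` is a hypothetical helper.

```python
import re

# CICS: "DFH", two-character domain, four-digit number (as described above).
# Assumption: CICS messages without a domain will simply not match.
CICS_ID = re.compile(r"^DFH([A-Z][A-Z0-9])(\d{4})$")

# Liberty: layout inferred from the single example CWWKG0104W (assumption).
LIBERTY_ID = re.compile(r"^(CWW[A-Z]{2})(\d{4})([A-Z])$")

# Only the domain named in the article; extend from the CICS documentation.
CICS_DOMAINS = {"FC": "file control"}


def describe(message_id):
    """Return the parts of a CICS or Liberty message ID, or None."""
    message_id = message_id.rstrip(":")  # Liberty IDs are followed by a colon

    match = CICS_ID.match(message_id)
    if match:
        domain, number = match.groups()
        return {
            "product": "CICS",
            "domain": domain,
            "domain_name": CICS_DOMAINS.get(domain, "unknown"),
            "number": number,
        }

    match = LIBERTY_ID.match(message_id)
    if match:
        prefix, number, type_code = match.groups()
        return {
            "product": "WebSphere Liberty",
            "prefix": prefix,
            "number": number,
            "type_code": type_code,
        }

    return None


if __name__ == "__main__":
    for message_id in ("DFHFC0307", "CWWKG0104W:"):
        print(message_id, describe(message_id))
```

Anything the patterns do not recognise returns `None`, which is a cue to go to the vendor's documentation, not to guess.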
The IBM documentation tells us more:

[Image: IBM documentation entry for message CWWKG0104W]

Again, it tells us how to fix the problem: don't use variables in the ID attribute of a configuration element.

Non-IBM Products

Messages aren't just for IBM products. Most vendors of mainframe software issue messages with a message ID and provide documentation to help us figure out the problem. For example, Broadcom document their CA ACF2 messages:

[Image: Broadcom documentation for CA ACF2 messages]

Read the Message!

When I was a young systems programmer, every time I went to one of the veterans with a question, they would answer 'RTFM' (read the manual). Wise words I still try to follow today.

So, whenever I am trying to figure out an error, the first, the very first, thing I do is to look at the error messages. If I am in any doubt about the error message, I go straight to the documentation to find out more. This has helped me again and again to solve problems faster.