LongEx Mainframe Quarterly - August 2011
In the first article of a series of three, we look at EBCDIC code pages - what they are, why they're used, and what this means.

From a very early age, most of us are taught about ASCII, and how computers use it to convert single-byte numbers to the characters we see on our screens. So an 'a' is really 97 as far as the computer is concerned. So imagine my surprise when I found out that mainframes don't use ASCII, but EBCDIC. I remember my reaction: "You've got to be kidding! Didn't EBCDIC die out years ago?"

Nope. And it's not just z/OS that uses it. IBM i, Fujitsu BS2000/OSD, Unisys MCP, z/VSE, z/VM and z/TPF all happily continue to use EBCDIC today. To them an 'a' is really 129, not 97.

This all worked out fine for many years. In fact EBCDIC was the most popular encoding system in the world until the Personal Computer revolution brought ASCII into the limelight. But EBCDIC falls down when we need to display languages other than English. Words like "på" (Swedish) and "brève" (French) need special characters not necessarily available in the standard EBCDIC table. What's worse, there's no way that all the special characters for all the languages in the world are going to fit into the 256 places that an eight-bit number has. To get around this, IBM created code pages.

EBCDIC Code Pages

Today there's no such thing as a single EBCDIC code table. You can find a few websites that claim to convert from ASCII to EBCDIC. But the chances are that they're really converting from ASCII to EBCDIC code page 37, or EBCDIC 0037. EBCDIC 0037 is the default code page used by the United States and other English-speaking countries when working with MVS: the traditional side of z/OS. It has all the normal a-z, A-Z and 0-9 characters, and other symbols like +, () and *. It also includes a few foreign characters for when we've borrowed foreign words like "resumé". However if you're in France, the chances are that you'll be using EBCDIC 0297.
In EBCDIC 0297, the standard a-z, A-Z and 0-9 characters are the same as in EBCDIC 0037. But to show French words, some of the other numbers are assigned to different characters. For example, 177 is a pound sign (£) in EBCDIC 0037, and a hash (#) in EBCDIC 0297. Our free EBCDIC code converter tool shows common EBCDIC code pages.

There are many different code pages for all the different regions, from Spain and Iceland to Thailand and Japan. This is not a lot different to ASCII, which has gone from the original 7-bit ASCII code to ISO8859 with its different sub-definitions. For example ISO8859-1 is the standard 'Extended ASCII' that we all love, ISO8859-2 is better for Eastern Europe, and ISO8859-4 for countries like Latvia, Lithuania, Estonia and Greenland.

Every EBCDIC code page uses the same numbers for the standard a-z, A-Z and 0-9 characters, along with a few other standard symbols. So you can see COBOL and JCL code, read standard z/OS messages, and update SYS1.PARMLIB the same way, regardless of the code page used. Code pages mess with characters that aren't normally needed when programming or administering the mainframe (remember this statement - we'll talk about it more in a moment).

Of course having characters that move around can be awkward. For example lawyers always want a copyright symbol © displayed on screens. But a © on your screen could be another character entirely in a different character set. This is why you'll often see a (c) rather than © on 3270 screens.

IBM controls these EBCDIC code pages, and assigns an ID to them called the Coded Character Set Identifier (CCSID). The CCSID for EBCDIC 0037 is, you guessed it, 37. IBM has also set CCSIDs for other character sets - CCSID 1208 is Unicode (UTF-8). You can see them all on IBM's website.

Here's an interesting fact: the original IBM System/360 mainframe could run in either EBCDIC or ASCII. Unfortunately, the operating system couldn't, and this feature was dropped for the System/370.
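The idea that letters and digits stay put while other characters move can be checked with a few lines of Python. This is only a sketch: Python's standard codecs include EBCDIC 0037 (cp037), 0273, 0500 and 1140, but not 0297, so the moving character shown here is byte 74 rather than the 177 example from the text. The helper names are made up for illustration.

```python
import string

# EBCDIC code pages available in Python's standard codecs.
# Assumption: cp297 is not among them, so it is not compared here.
PAGES = ["cp037", "cp273", "cp500", "cp1140"]


def letters_and_digits_match(pages):
    """True if a-z, A-Z and 0-9 encode to the same bytes in every page."""
    sample = string.ascii_letters + string.digits
    reference = sample.encode(pages[0])
    return all(sample.encode(page) == reference for page in pages)


def decode_byte(value, page):
    """Return the character that a single byte value shows as in a page."""
    return bytes([value]).decode(page)


if __name__ == "__main__":
    # 'a' is 97 in ASCII but 129 in EBCDIC.
    print("a".encode("ascii")[0], "a".encode("cp037")[0])

    # Letters and digits do not move between these pages.
    print(letters_and_digits_match(PAGES))

    # Byte 177 is the pound sign in EBCDIC 0037.
    print(decode_byte(177, "cp037"))

    # Byte 74 is one of the characters that changes from page to page.
    for page in PAGES:
        print(page, decode_byte(74, page))
```

Running it lists what each page shows for the same byte, which is exactly the problem a terminal emulator's code page setting has to solve.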
Today there's still no switch or definition to tell z/OS what code page it is using - it happily uses the standard characters that don't change, and leaves you to figure out how you want to view the rest. For a TN3270 user, this is done in the emulation software. Below is the screen for changing the code page in Giant Software's QWS3270.

How C and REXX Mess Things Up

Earlier I said that code pages only change characters that aren't needed when
programming or administering the mainframe. And that works for traditional z/OS
features, and programming languages such as COBOL and PL/1. But there are a
couple of exceptions, and you can safely bet that more will follow.

REXX is the first. Take this line, which uses the || concatenation operator:

    Say 'z/OS version:' zos_ver || '.' || zos_rel

The bad news is that the vertical bar character '|' isn't one of those 'standard' EBCDIC characters. So a vertical bar in EBCDIC 0037 looks like an exclamation mark (!) in Sweden. The REXX interpreter doesn't care what this character looks like, as long as it has a code of 79. So if you're in Sweden and using EBCDIC 0278, the above line becomes:

    Say 'z/OS version:' zos_ver !! '.' !! zos_rel

C is another problem child. It uses funky characters like the square brackets '[]', curly brackets '{}' and broken vertical bar '¦'. These move around (or disappear) depending on your code page. But with C there's another catch: it's designed to use EBCDIC 1047, not EBCDIC 0037. So if you're using arrays in C, the line:

    char cvtstuff[140];

is fine if you're using IBM1047. For IBM0037, it becomes:

    char cvtstuffÝ140¨;

If you're using EBCDIC 0500, another common EBCDIC code page, it becomes:

    char cvtstuffÝ140¨;

How USS Doesn't Follow the Rules

So why is C designed for EBCDIC 1047? Because z/OS UNIX System Services (USS) is also designed for it. When IBM created USS for z/OS, it made sense that it had to work in EBCDIC. The POSIX standard for UNIX doesn't require the use of ASCII, and z/OS is an EBCDIC operating system. IBM really didn't have a choice.

The problem is that UNIX, and its core programming language C, rely on characters that don't exist in some EBCDIC code pages. EBCDIC 1047 is designed to include all the characters USS needs - effectively all the characters from Extended ASCII: ISO8859-1.

So EBCDIC 1047 is the default EBCDIC code page used in USS. All parameter and help files are usually supplied in EBCDIC 1047, the C compiler expects code in EBCDIC 1047, and all UNIX file contents default to EBCDIC 1047. If you decide to use something else, they may look a little funny.
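To see how the same bytes read differently under different code pages, here is a rough Python sketch. It rests on two assumptions, because Python's standard codecs have neither EBCDIC 1047 nor EBCDIC 0278: the 1047 encoding is approximated by taking cp037 and patching only the two square bracket positions (173 and 189), and cp500 stands in for the Swedish page when viewing the REXX line. The helper encode_1047 is hypothetical and only safe for text whose other characters match EBCDIC 0037.

```python
# Assumption: only the two bracket positions are patched. The real
# EBCDIC 1047 table differs from 0037 in a few other places as well.
BRACKETS_1047 = {"[": 0xAD, "]": 0xBD}


def encode_1047(text):
    """Approximate EBCDIC 1047 encoding: cp037 plus the 1047 brackets."""
    return bytes(
        BRACKETS_1047[ch] if ch in BRACKETS_1047 else ch.encode("cp037")[0]
        for ch in text
    )


# A C source line stored as EBCDIC 1047 bytes...
c_line = encode_1047("char cvtstuff[140];")

# ...viewed through two other code pages.
c_as_0037 = c_line.decode("cp037")
c_as_0500 = c_line.decode("cp500")

# A REXX expression typed under EBCDIC 0037...
rexx = "zos_ver || '.' || zos_rel".encode("cp037")

# ...viewed through a page that shows byte 79 as something else.
rexx_elsewhere = rexx.decode("cp500")

if __name__ == "__main__":
    print(c_as_0037)
    print(c_as_0500)
    print("|".encode("cp037")[0])  # the byte value REXX actually looks for
    print(rexx_elsewhere)
```

The bytes never change in either case. Only the picture drawn for them does, which is why the interpreter and compiler are happy while the person at the terminal is confused.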
Dealing With Larger Alphabets

Up to now, our character sets have had one byte per character: Single Byte Character Sets (SBCS). However anyone who speaks Japanese, Chinese or Korean will laugh at the idea of their character set fitting into a mere 256 places. The EBCDIC solution: using two bytes for each character - Double Byte Character Sets (DBCS). In reality, the chances are that these character sets will use a combination for better performance: Latin characters as single-byte, Asian as double-byte. This is called a Mixed Byte Character Set (MBCS).

DBCS support is still very limited, and isn't enabled by default. DB2 supports DBCS, as do TCP/IP and many TCP/IP applications such as FTP and SMTP. Unicode Services supports DBCS, and includes some DBCS conversion tables. WebSphere MQ is also up for DBCS.
What This Means

For many years I was ignorant of code pages and language issues. My first wake-up call was when I tried to program in C from ISPF Edit. Another wake-up call happened when I first needed to convert from EBCDIC to UTF-8, a conversion that needs to know the EBCDIC code page used. The fact is that there are some basic rules that aren't written down, but are very important:
This first article has taken a quick look at EBCDIC code pages, and what they mean. In the next part of this series of three I'll look at how ASCII, Unicode and EBCDIC work together on z/OS.