Longpela Expertise

LongEx Mainframe Quarterly - August 2011

technical: Lost in Translation 1 - EBCDIC Code Pages

In the first article of a series of three, we look at EBCDIC code pages - what they are, why they're used, and what this means.

From a very early age, most of us are taught about ASCII, and how this is used by computers to convert single byte numbers to the characters we see on our screen. So an 'a' is really 97 as far as the computer is concerned. So imagine my surprise when I found out that mainframes don't use ASCII, but EBCDIC. I remember my reaction: "You've got to be kidding! Didn't EBCDIC die out years ago?"

Nope. And it's not just z/OS that uses it. IBM i, Fujitsu BS2000/OSD, Unisys MCP, z/VSE, z/VM and z/TPF all happily continue to use EBCDIC today. To them an 'a' is really 129, not 97.
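
The two numbers can be checked with a few lines of Python. This is only a sketch: it assumes Python's built-in `cp037` codec as a stand-in for "EBCDIC", which is one particular EBCDIC table rather than the only one.

```python
# Look up the one-byte code for 'a' in ASCII and in one EBCDIC table.
# Assumption: Python's "cp037" codec stands in for EBCDIC here.
ascii_code = "a".encode("ascii")[0]
ebcdic_code = "a".encode("cp037")[0]

print(ascii_code)   # 97
print(ebcdic_code)  # 129
```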

This all worked out fine for many years. In fact EBCDIC was the most popular encoding system in the world until the Personal Computer revolution brought ASCII to the limelight. But EBCDIC falls down when we need to display languages other than English. Words like "på" (Swedish) and "brève" (French) need special characters not necessarily available in the standard EBCDIC table. What's worse, there's no way that all these special characters for all the languages in the world are going to fit into the 256 places that an eight-bit number has. To get around this, IBM created code pages.

EBCDIC Code Pages

Today there's no such thing as a single EBCDIC code table. You can find a few websites that claim to convert from ASCII to EBCDIC. But the chances are that they're really converting from ASCII to EBCDIC code page 37, or EBCDIC 0037. EBCDIC 0037 is the default code page used by the United States and other English speaking countries when working with MVS: the traditional side of z/OS. It has all the normal a-z, A-Z, 0-9 characters, and other symbols like +, () and *. It also includes a few of the foreign characters for when we've borrowed foreign words like "resumé".
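
A converter of this kind could be sketched as follows. The function name `ascii_to_ebcdic` is made up for illustration, and the default of code page 37 is exactly the hidden assumption described above, made visible as a parameter.

```python
def ascii_to_ebcdic(text: str, codepage: str = "cp037") -> bytes:
    """Convert ASCII text to EBCDIC bytes for the named code page.

    Hypothetical helper: the default code page (37) is an assumption,
    not something the input text can tell us.
    """
    text.encode("ascii")  # raises UnicodeEncodeError if text is not pure ASCII
    return text.encode(codepage)


converted = ascii_to_ebcdic("Hello")
print(converted.hex())  # c885939396
```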

However if you're in France, the chances are that you'll be using EBCDIC 0297. In EBCDIC 0297, the standard a-z, A-Z and 0-9 characters are the same as in EBCDIC 0037. But to show French words, some of the other numbers are assigned to different characters. For example, 177 is a pound sign (£) in EBCDIC 0037, and a cross-hash (#) in EBCDIC 0297. Our free EBCDIC code converter tool shows common EBCDIC code pages.

There are many different code pages for all the different regions, from Spain and Iceland to Thailand and Japan. This is not a lot different to ASCII, which has gone from the original 7-bit ASCII code to ISO8859 with its different sub-definitions. For example ISO8859-1 is the standard 'Extended ASCII' that we all love, ISO8859-2 is better for Eastern Europe, and ISO8859-4 for countries like Latvia, Lithuania, Estonia and Greenland.

Every EBCDIC code page uses the same numbers for the standard a-z, A-Z, and 0-9 characters, along with a few other standard symbols. So you can see COBOL and JCL code, standard z/OS messages, and update SYS1.PARMLIB the same way, regardless of the code page used. Code pages mess with characters that aren't normally needed when programming or administering the mainframe (remember this statement - we'll talk about it more in a moment).
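
One way to see this for yourself is to compare a few code pages byte by byte. The sketch below is limited to four EBCDIC code pages that happen to ship with Python (`cp037`, `cp273`, `cp500`, `cp1140`); that sample, and the helper name `differing_bytes`, are choices made for illustration.

```python
import string

# A small sample of EBCDIC code pages available as Python codecs (an assumption
# of this sketch; many other EBCDIC code pages have no Python codec).
CODE_PAGES = ["cp037", "cp273", "cp500", "cp1140"]
STANDARD = string.ascii_letters + string.digits


def differing_bytes(page_a: str, page_b: str) -> list:
    """Return the byte values that decode to different characters."""
    return [b for b in range(256)
            if bytes([b]).decode(page_a) != bytes([b]).decode(page_b)]


# The letters and digits encode to the same bytes in every sampled code page.
same_everywhere = all(
    STANDARD.encode(page) == STANDARD.encode("cp037") for page in CODE_PAGES
)
print(same_everywhere)

# The bytes that do move between code pages 37 and 500.
moved = differing_bytes("cp037", "cp500")
print([hex(b) for b in moved])
```

None of the bytes reported in `moved` belongs to a letter or a digit, which is what lets JCL and COBOL source look the same everywhere.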

Of course having characters that move around can be awkward. For example lawyers always want a copyright symbol © displayed on screens. But a © on your screen could be another character entirely in a different character set. This is why you'll often see a (c) rather than © on 3270 screens.

IBM controls these EBCDIC code pages, and assigns an ID to them called the Coded Character Set Identifier (CCSID). The CCSID for EBCDIC 0037 is, you guessed it, 37. IBM has also set CCSIDs for other character sets - CCSID 1208 is Unicode (UTF-8). You can see them all on IBM's website.

Here's an interesting fact: the original IBM System/360 mainframe could run in either EBCDIC or ASCII. Unfortunately, the operating system couldn't, and this feature was dropped for the System/370. Today there's still no switch or definition to tell z/OS what code page it is using - it happily uses the standard characters that don't change. It leaves you to figure out how you want to view the rest. For a TN3270 user, this is done in the emulation software: Giant Software's QWS3270, for example, has a settings screen for changing the code page.

How C and REXX Mess Things Up

Earlier I said that code pages only change characters that aren't needed when programming or administering the mainframe. And that works for traditional z/OS features, and programming languages such as COBOL and PL/I. But there are a couple of exceptions, and you can safely bet that more will follow.

Take REXX for example. A common thing to do in REXX is concatenate two strings. This is done using two vertical bars ('||'). For example, our sample REXX to get z/OS information has the line:

    Say 'z/OS version:' zos_ver || '.' || zos_rel

The bad news is that the vertical bar character '|' isn't one of those 'standard' EBCDIC characters. The code that shows as a vertical bar in EBCDIC 0037 shows as an exclamation mark (!) in Sweden. The REXX interpreter doesn't care what this character looks like, as long as it has a code of 79. So if you're in Sweden and using EBCDIC 0278, the above line becomes:

    Say 'z/OS version:' zos_ver !! '.' !! zos_rel

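
The effect can be reproduced in a few lines of Python. One loud assumption: Python has no codec for EBCDIC 0278, so this sketch uses `cp500` (the international code page), which also shows code 79 as an exclamation mark.

```python
# The line as typed by someone whose terminal uses code page 37.
rexx_line = "Say 'z/OS version:' zos_ver || '.' || zos_rel"

# What is actually stored on the mainframe: just bytes.
stored = rexx_line.encode("cp037")
print(stored.count(79))  # 4 (each '|' is stored as code 79)

# The same bytes viewed through a different code page.
# Assumption: cp500 stands in for EBCDIC 0278, which Python lacks.
as_seen_elsewhere = stored.decode("cp500")
print(as_seen_elsewhere)  # Say 'z/OS version:' zos_ver !! '.' !! zos_rel
```

The bytes never change; only the picture drawn for code 79 does.
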
C is another problem child. It uses funky characters like the square brackets '[]', curly brackets '{}' and broken vertical bar '¦'. These move around (or disappear) depending on your code page. But with C there's another catch: it's designed to use EBCDIC 1047, not EBCDIC 0037. So if you're using arrays in C, the line:

    char cvtstuff[140];

is fine if you're using IBM1047. For IBM0037, it becomes:

    char cvtstuffÝ140¨;

If you're using EBCDIC 0050, another common EBCDIC code page, it becomes:

    char cvtstuffÝ140";
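
The same kind of movement can be shown with Python's built-in codecs. Python has no codec for EBCDIC 1047, so this sketch swaps in a different pair, code pages 37 and 500, as an assumption; the glyphs it produces are therefore not the ones shown above, but the mechanism is the same.

```python
c_line = "char cvtstuff[140];"

# Where each code page puts the square brackets.
for page in ("cp037", "cp500"):
    left = "[".encode(page)[0]
    right = "]".encode(page)[0]
    print(page, hex(left), hex(right))

# Type the line with a code page 37 terminal, view it with code page 500.
typed = c_line.encode("cp037")
viewed = typed.decode("cp500")
print(viewed)
```

Everything except the two brackets survives the trip, including the semicolon, because only the brackets sit on codes that differ between the two pages.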

How USS Doesn't Follow the Rules

So why is C designed for EBCDIC 1047? Because z/OS UNIX System Services (USS) is also designed for it.

When IBM created USS for z/OS, it made sense that it had to work in EBCDIC. The POSIX standard for UNIX doesn't require the use of ASCII, and z/OS is an EBCDIC operating system. IBM really didn't have a choice.

The problem is that UNIX, and its core programming language C, rely on characters that don't exist in some EBCDIC codepages. EBCDIC 1047 is designed to include all the characters USS needs - effectively all the characters from Extended ASCII: ISO8859-1. So EBCDIC 1047 is the default EBCDIC codepage used in USS. All parameter and help files are usually supplied in EBCDIC 1047, the C compiler expects code in EBCDIC 1047, and all UNIX file contents default to EBCDIC 1047. If you decide to use something else, they may look a little funny.

Dealing With Larger Alphabets

Up to now, our character sets have had one byte per character: Single Byte Character Sets (SBCS). However anyone who speaks Japanese, Chinese or Korean will laugh at the idea of their character set fitting into a mere 256 places. The EBCDIC solution: using two bytes for each character - Double Byte Character Sets (DBCS). In reality, the chances are that these character sets will use a combination for better performance: Latin characters as single-byte, Asian as double-byte. This is called a Mixed Byte Character Set (MBCS).
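
The mixed-byte idea can be illustrated in Python, with a caveat: Python ships no EBCDIC DBCS codecs, so this sketch uses Shift JIS, an ASCII-based mixed-byte encoding, purely as an analogue. EBCDIC mixed-byte data works differently in detail (it brackets double-byte runs with shift-out and shift-in control bytes), which this example does not show.

```python
# "A" followed by the two kanji that spell "Nihon" (Japan).
text = "A\u65e5\u672c"

# Assumption: Shift JIS is only an analogue for an EBCDIC MBCS code page.
encoded = text.encode("shift_jis")
widths = [len(ch.encode("shift_jis")) for ch in text]

print(widths)        # [1, 2, 2]
print(len(encoded))  # 5
```

Three characters take five bytes: the Latin letter stays at one byte, and each kanji needs two.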

DBCS support is still very limited, and isn't enabled by default. DB2 supports DBCS, as does TCP/IP and many TCP/IP applications such as FTP and SMTP. Unicode Services supports DBCS, and includes some DBCS conversion tables. WebSphere MQ is also up for DBCS.

What This Means

For many years I was ignorant of code pages and language issues. My first wakeup call was when I tried to program in C from ISPF Edit. Another wakeup call happened when I first needed to convert from EBCDIC to UTF-8, a conversion that needs to know the EBCDIC code page used.

The fact is that there are some basic rules that aren't written down, but are very important:

  1. Only use 'standard' EBCDIC characters if your dataset may be used somewhere else, and you want it to look the same. For example, any product parameter datasets should keep to 'a-z', 'A-Z' or '0-9'.
  2. Be aware that the standard EBCDIC code pages may be different between TSO/ISPF and USS.
  3. If you are in different countries, be aware of the character set your emulator uses.
  4. Know your EBCDIC code page whenever converting between EBCDIC and another character set like ASCII or UTF-8.
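
Rule 4 can be made concrete with a short sketch. The helper name `ebcdic_to_utf8` is made up for illustration, and the code pages are limited to two that Python happens to ship (`cp037` and `cp500`).

```python
def ebcdic_to_utf8(data: bytes, codepage: str) -> bytes:
    """Convert EBCDIC bytes to UTF-8. The caller must name the code page:
    nothing in the bytes themselves says which one was used."""
    return data.decode(codepage).encode("utf-8")


raw = bytes([79])
print(ebcdic_to_utf8(raw, "cp037"))  # b'|'
print(ebcdic_to_utf8(raw, "cp500"))  # b'!'

# One EBCDIC byte can also become two UTF-8 bytes.
pound = ebcdic_to_utf8(bytes([177]), "cp037")
print(pound)  # b'\xc2\xa3'
```

The same input byte gives different UTF-8 output depending on the code page named, so a wrong guess silently corrupts the data rather than failing.
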

A basic understanding of code pages is more than handy: it's important, particularly today, when computers supply information to users worldwide and information must be converted between the mainframe and other computers.

This first article has taken a quick look at EBCDIC code pages, and what they mean. In the next part of this series of three I'll look at how ASCII, Unicode and EBCDIC work together on z/OS.

David Stephens

LongEx Quarterly is a quarterly eZine produced by Longpela Expertise. It provides Mainframe articles for management and technical experts. It is published every November, February, May and August.

The opinions in this article are solely those of the author, and do not necessarily represent the opinions of any other person or organisation. All trademarks, trade names, service marks and logos referenced in these articles belong to their respective companies.

Although Longpela Expertise may be paid by organisations reprinting our articles, all articles are independent. Longpela Expertise has not been paid money by any vendor or company to write any articles appearing in our e-zine.


Longpela Expertise are mainframe technical experts: from coding and administration to management, problem solving and training. Contact us to get your own mainframe expert.
© Copyright 2011 Longpela Expertise  |  ABN 55 072 652 147