technical: Lost in Translation 1 - EBCDIC Code Pages
In the first article of a series of three, we look at EBCDIC code pages
- what they are, why they're used, and what this means.
From a very early age, most of us are taught about ASCII, and how this is used
by computers to convert single byte numbers to the characters we see on our
screen. So an 'a' is really 97 as far as the computer is concerned. So imagine
my surprise when I found out that mainframes don't use ASCII, but EBCDIC. I
remember my reaction: "You've got to be kidding! Didn't EBCDIC die out
years ago?"
Nope. And it's not just z/OS that uses it. IBM i, Fujitsu BS2000/OSD, Unisys
MCP, z/VSE, z/VM and z/TPF all happily continue to use EBCDIC today. To them
an 'a' is really 129, not 97.
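The difference is easy to see by encoding the same letter both ways. This is a minimal Python sketch that assumes Python's bundled `cp037` codec as a representative EBCDIC table:

```python
# Encode the letter 'a' in ASCII and in one EBCDIC table (cp037).
ascii_value = "a".encode("ascii")[0]
ebcdic_value = "a".encode("cp037")[0]

print(ascii_value)   # 97
print(ebcdic_value)  # 129
```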
This all worked out fine for many years. In fact EBCDIC was the most popular
encoding system in the world until the Personal Computer revolution brought
ASCII into the limelight. But EBCDIC falls down when we need to display languages
other than English. Words like "på" (Swedish) and "brève"
(French) need special characters not necessarily available in the standard EBCDIC
table. What's worse, there's no way that all these special characters for all
the languages in the world are going to fit into the 256 places that an eight
bit number has. To get around this, IBM created code pages.
EBCDIC Code Pages
Today there's no such thing as a single EBCDIC code table. You can find a
few websites that claim to convert from ASCII to EBCDIC. But the chances are
that they're really converting from ASCII to EBCDIC code page 37, or EBCDIC
0037. EBCDIC 0037 is the default code page used by the United States and other
English speaking countries when working with MVS: the traditional side of z/OS.
It has all the normal a-z, A-Z, 0-9 characters, and other symbols like +, ()
and *. It also includes a few of the foreign characters for when we've borrowed
foreign words like "resumé".
However if you're in France, the chances are that you'll be using EBCDIC 0297.
In EBCDIC 0297, the standard a-z, A-Z and 0-9 characters are the same as in
EBCDIC 0037. But to support French, some of the remaining numbers map to different characters.
For example, 177 is a pound sign (£) in EBCDIC 0037, and a hash
(#) in EBCDIC 0297. Our free
EBCDIC code converter tool shows common EBCDIC code pages.
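To see which numbers move between two code pages, the tables can be compared byte by byte. The sketch below is only an illustration: it compares `cp037` with `cp500` (the international EBCDIC page) because both codecs ship with Python. The French page is not used here, and the helper name `differing_bytes` is made up for this example.

```python
def differing_bytes(page_a, page_b):
    """Return {byte value: (char in page_a, char in page_b)} where the pages disagree."""
    diffs = {}
    for value in range(256):
        raw = bytes([value])
        char_a = raw.decode(page_a, errors="replace")
        char_b = raw.decode(page_b, errors="replace")
        if char_a != char_b:
            diffs[value] = (char_a, char_b)
    return diffs


diffs = differing_bytes("cp037", "cp500")
for value, (us_char, intl_char) in sorted(diffs.items()):
    print(f"{value:3d} (0x{value:02X}): cp037 {us_char!r}  cp500 {intl_char!r}")
```

Only a handful of positions differ between these two pages, and none of them are letters or digits.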
There are many different code pages for all the different regions, from Spain
and Iceland to Thailand and Japan. This is not much different from ASCII, which
has gone from the original 7-bit ASCII code to ISO8859 with its different sub-definitions.
For example ISO8859-1 is the standard 'Extended ASCII' that we all love, ISO8859-2
is better for Eastern Europe, and ISO8859-4 for countries like Latvia, Lithuania,
Estonia and Greenland.
Every EBCDIC code page uses the same numbers for the standard a-z, A-Z, and
0-9 characters, along with a few other standard symbols. So you can see COBOL
and JCL code, standard z/OS messages, and update SYS1.PARMLIB the same way,
regardless of the code page used. Code pages mess with characters that aren't
normally needed when programming or administering the mainframe (remember this
statement - we'll talk about it more in a moment).
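One rough way to check which characters stay put is to encode each one under several code pages and see whether the byte ever changes. This sketch assumes four EBCDIC codecs bundled with Python (`cp037`, `cp273`, `cp500`, `cp1140`). That is a small sample, not every code page, and `is_invariant` is a hypothetical helper name.

```python
import string

# EBCDIC codecs bundled with Python; an assumption, chosen only for illustration.
PAGES = ["cp037", "cp273", "cp500", "cp1140"]


def is_invariant(char, pages=PAGES):
    """True if char encodes to the same byte in every listed code page."""
    encodings = set()
    for page in pages:
        try:
            encodings.add(char.encode(page))
        except UnicodeEncodeError:
            return False  # character missing from this page entirely
    return len(encodings) == 1


alnum = string.ascii_letters + string.digits
print(all(is_invariant(c) for c in alnum))   # letters and digits never move

# Punctuation that lands on different bytes in at least one of the pages.
movers = [c for c in string.punctuation if not is_invariant(c)]
print(movers)
```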
Of course having characters that move around can be awkward. For example lawyers
always want a copyright symbol © displayed on screens. But a © on
your screen could be another character entirely in a different character set.
This is why you'll often see a (c) rather than © on 3270 screens.
IBM controls these EBCDIC code pages, and assigns an ID to them called the
Coded Character Set Identifier (CCSID). The CCSID for EBCDIC 0037 is, you guessed
it, 37. IBM has also set CCSIDs for other character sets - CCSID 1208 is Unicode
(UTF-8). You can see them all on IBM's
website.
Here's an interesting fact: the original IBM System/360 mainframe could run
in either EBCDIC or ASCII. Unfortunately, the operating system couldn't, and
this feature was dropped for the System/370. Today there's still no switch or
definition to tell z/OS what code page it is using - it happily uses the standard
characters that don't change. It leaves you to figure out how you want to view
the rest. For a TN3270 user, this is done in the emulation software. Below shows
the screen to change the codepage for Giant Software's QWS3270.
How C and REXX Mess Things Up
Earlier I said that code pages only change characters that aren't needed when
programming or administering the mainframe. And that works for traditional z/OS
features, and programming languages such as COBOL and PL/I. But there are a
couple of exceptions, and you can safely bet that more will follow.
Take REXX for example. A common thing to do in REXX is concatenate two strings.
This is done using the two vertical bars ('||'). For example, our sample REXX
to get z/OS information has the line:
Say 'z/OS version:' zos_ver || '.' || zos_rel
The bad news is that the vertical bar character '|' isn't one of those 'standard'
EBCDIC characters. So a vertical bar in EBCDIC 0037 looks like an exclamation
mark (!) in Sweden. The REXX interpreter doesn't care what this character looks
like, as long as it has a code of 79. So if you're in Sweden and using EBCDIC
0278, the above line becomes:
Say 'z/OS version:' zos_ver !! '.' !! zos_rel
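The same effect can be reproduced off the mainframe. In this sketch the REXX line is encoded with Python's `cp037` codec and the resulting bytes are then viewed through `cp500`. The `cp500` codec is used only as a stand-in that ships with Python and also shows code 79 as an exclamation mark; it is not the Swedish code page itself.

```python
line = "Say 'z/OS version:' zos_ver || '.' || zos_rel"

raw = line.encode("cp037")       # the bytes as stored under EBCDIC 0037
bar_code = raw[line.index("|")]  # numeric code of the first vertical bar

print(bar_code)                  # 79
print(raw.decode("cp500"))       # same bytes viewed through another code page
```

The bytes never change; only the character drawn for code 79 does.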
C is another problem child. It uses funky characters like the square brackets
'[]', curly brackets '{}' and broken vertical bar '¦'. These move around
(or disappear) depending on your code page. But with C there's another catch:
it's designed to use EBCDIC 1047, not EBCDIC 0037. So if you're using arrays in
C, the line:
char cvtstuff[140];
is fine if you're using IBM1047. For IBM0037, it becomes:
char cvtstuffÝ140¨;
If you're using EBCDIC 0500, another common EBCDIC code page, it becomes:
char cvtstuffÝ140¨;
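The bracket problem can be approximated in Python too, with some loudly stated assumptions. This sketch does not rely on Python providing a 1047 codec. Instead it assumes that EBCDIC 1047 stores '[' as byte 0xAD and ']' as byte 0xBD, and builds the 1047 bytes by swapping those two positions in the `cp037` encoding.

```python
# Assumption: in EBCDIC 1047, '[' is byte 0xAD and ']' is byte 0xBD.
# cp037 stores them at 0xBA and 0xBB, so swap just those two bytes.
CP037_TO_1047_BRACKETS = bytes.maketrans(b"\xba\xbb", b"\xad\xbd")

source = "char cvtstuff[140];"
raw_1047 = source.encode("cp037").translate(CP037_TO_1047_BRACKETS)

# View the 1047 bytes through two other code pages.
print(raw_1047.decode("cp037"))
print(raw_1047.decode("cp500"))
```

In Python's tables, neither view shows square brackets: the two bracket bytes come out as an accented Y and a diaeresis.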
How USS Doesn't Follow the Rules
So why is C designed for EBCDIC 1047? Because z/OS UNIX System Services (USS)
is also designed for it.
When IBM created USS for z/OS, it made sense that it had to work in EBCDIC.
The POSIX standard for UNIX doesn't require the use of ASCII, and z/OS is an
EBCDIC operating system. IBM really didn't have a choice.
The problem is that UNIX, and its core programming language C, rely on characters
that don't exist in some EBCDIC code pages. EBCDIC 1047 is designed to include
all the characters USS needs - effectively all the characters from Extended
ASCII: ISO8859-1. So EBCDIC 1047 is the default EBCDIC code page used in USS.
All parameter and help files are usually supplied in EBCDIC 1047, the C compiler
expects code in EBCDIC 1047, and all UNIX file contents default to EBCDIC 1047.
If you decide to use something else, they may look a little funny.
Dealing With Larger Alphabets
Up to now, our character sets have had one byte per character: Single Byte
Character Sets (SBCS). However anyone who speaks Japanese, Chinese or Korean
will laugh at the idea of their character set fitting into a mere 256 places.
The EBCDIC solution: using two bytes for each character - Double Byte Character
Sets (DBCS). In reality, the chances are that these character sets will use
a combination for better performance: Latin characters as single-byte, Asian
as double-byte. This is called a Mixed Byte Character Set (MBCS).
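The article does not say how a program tells single-byte and double-byte sections apart. The sketch below assumes the shift-out (0x0E) and shift-in (0x0F) control bytes commonly used by mixed EBCDIC data to bracket double-byte runs. The function name and the sample bytes are made up for illustration and do not come from a real code page.

```python
SHIFT_OUT = 0x0E  # assumed marker: double-byte characters follow
SHIFT_IN = 0x0F   # assumed marker: back to single-byte characters


def split_mixed(data):
    """Split a mixed-byte buffer into ('SBCS' or 'DBCS', bytes) runs."""
    runs = []
    mode = "SBCS"
    current = bytearray()
    for byte in data:
        if byte == SHIFT_OUT and mode == "SBCS":
            if current:
                runs.append((mode, bytes(current)))
            mode, current = "DBCS", bytearray()
        elif byte == SHIFT_IN and mode == "DBCS":
            if current:
                runs.append((mode, bytes(current)))
            mode, current = "SBCS", bytearray()
        else:
            current.append(byte)
    if current:
        runs.append((mode, bytes(current)))
    return runs


# Placeholder data: two single-byte characters, two double-byte, one single-byte.
sample = b"\xc1\xc2" + b"\x0e\x45\x81\x45\x82\x0f" + b"\xc3"
runs = split_mixed(sample)
for mode, chunk in runs:
    width = 2 if mode == "DBCS" else 1
    print(mode, len(chunk) // width, "character(s)")
```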
DBCS support is still very limited, and isn't enabled by default. DB2 supports
DBCS, as does TCP/IP and many TCP/IP applications such as FTP and SMTP. Unicode
Services supports DBCS, and includes some DBCS conversion tables. WebSphere
MQ is also up for DBCS.
What This Means
For many years I was ignorant of code pages and language issues. My first
wakeup call was when I tried to program in C from ISPF Edit. Another wakeup
call happened when I first needed to convert from EBCDIC to UTF-8, a conversion
that needs to know the EBCDIC code page used.
The fact is that there are some basic rules that aren't written down, but
are very important:
- Only use 'standard' EBCDIC characters if your dataset may be used somewhere
else, and you want it to look the same. For example, any product parameter
datasets should keep it to 'a-z', 'A-Z' or '0-9'.
- Be aware that the standard EBCDIC code pages may be different between TSO/ISPF
and USS.
- If you are in different countries, be aware of the character set your emulator
uses.
- Know your EBCDIC code page whenever converting between EBCDIC and another
character set like ASCII or UTF-8.
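The last rule is the easiest one to demonstrate. This sketch converts the same EBCDIC bytes to UTF-8 twice, once with the right code page and once with the wrong one. It assumes Python's bundled `cp037` and `cp500` codecs, and `ebcdic_to_utf8` is a hypothetical helper rather than a z/OS service.

```python
def ebcdic_to_utf8(data, code_page):
    """Decode EBCDIC bytes with the named code page and re-encode as UTF-8."""
    return data.decode(code_page).encode("utf-8")


record = "items[0]".encode("cp037")   # bytes written under EBCDIC 0037

right = ebcdic_to_utf8(record, "cp037")
wrong = ebcdic_to_utf8(record, "cp500")

print(right)   # the original text survives
print(wrong)   # the brackets turn into other characters
```

Neither conversion raises an error, so a wrong code page tends to go unnoticed until someone reads the output.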
A basic understanding of code pages is more than handy, it's important. Particularly
today where computers supply information to users worldwide, and where information
must be converted between the mainframe and other computers.
This first article has taken a quick look at EBCDIC code pages, and what they
mean. In the next part of this series of three I'll look at how ASCII, Unicode
and EBCDIC work together on z/OS.
David Stephens