Characters, Character Sets and Encodings

  1. https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
  2. https://medium.com/@apiltamang/unicode-utf-8-and-ascii-encodings-made-easy-5bfbe3a1c45a

All that stuff about “plain text = ascii = characters are 8 bits” is not only wrong, it’s hopelessly wrong.

EBCDIC

EBCDIC is not relevant to our life. We don’t have to go that far back in time.

ASCII

American Standard Code for Information Interchange

2^8 (1 byte) = 256 unique values per byte, of which ASCII itself only defines 128.

This is the character set classically assumed by C's one-byte char strings.

ASCII was able to represent every character using a number between 32 and 127. Space was 32, the letter “A” was 65, etc. This could conveniently be stored in 7 bits.

That covered every letter, digit, and symbol that mattered (a-z, A-Z, 0–9, +, -, /, ", !, etc.).

Most computers in those days used 8-bit bytes, which meant each byte could store 2^8 = 256 distinct values (0 through 255).

127 - 32 = 95 printable characters (codes 32 through 126, with DEL at 127).

256 - 128 = 128 spare values (codes 128 through 255).

So each byte (or unit of storage) had more than enough space to store the basic set of English characters.

Codes below 32 were used for control characters, like 7 which made your computer beep and 12 which caused the current page of paper to go flying out of the printer and a new one to be fed in.
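
To make those numbers concrete, here is a minimal sketch. Java is used for the examples in this note purely as an illustration language (the original articles don't prescribe one); casting a char to an int reveals its code, and every ASCII code fits in 7 bits.

  public class AsciiDemo {
      public static void main(String[] args) {
          // All of these codes fall between 32 and 127, so 7 bits are enough.
          for (char c : new char[] { ' ', 'A', 'a', '0', '!' }) {
              System.out.printf("'%c' -> %d (binary %s)%n", c, (int) c, Integer.toBinaryString(c));
          }
          // Code 7 is the BEL control character, the one that makes the computer beep.
          System.out.println("Ringing the bell: " + (char) 7);
      }
  }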

You had a whole bit to spare (codes 128–255), which could be used for various purposes.

Lots of people tried to use this extra set for different purposes.

Life was good (assuming you were an English speaker). Heck they even sent people to the moon!

Trouble with this

In order to accommodate non-English characters, people started going a little crazy with the numbers from 128 to 255 still available in a single byte.

There were a lot of issues with internationalization. See Joel’s blog post.

Different people would use different characters for the same numbers. Not only was it the wild wild west, but it quickly dawned on everyone that the extra available numbers could not even come close to representing the complete set of characters for some languages.
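
As a sketch of that mess (the three code page names below are an assumption about what the running JDK provides, not something from the note), the very same byte decodes to a different letter depending on which legacy code page you pick:

  import java.nio.charset.Charset;

  public class CodePageDemo {
      public static void main(String[] args) {
          byte[] data = { (byte) 0xE9 }; // a single byte with value 233
          for (String name : new String[] { "windows-1252", "windows-1251", "ISO-8859-7" }) {
              // The same number is é (Western European), й (Cyrillic) or ι (Greek),
              // depending entirely on which table you assume.
              System.out.println(name + " -> " + new String(data, Charset.forName(name)));
          }
      }
  }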

Unicode

The troubles with the ASCII system required a paradigm shift in how characters are interpreted. In this new paradigm, each character is an idealized, abstract entity. Rather than mapping straight to a number stored in a byte, each character is assigned a code point, written like U+0639, where U+ stands for 'Unicode' and the digits are hexadecimal.

Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet.

Code points run from U+0000 to U+10FFFF, which is 1,114,112 possible values (and a full 4-byte integer, 2^32 = 4,294,967,296 values, holds any of them with room to spare).

More than enough to account for every character in every language.

Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct.

Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

Until now, we’ve assumed that a letter maps to some bits which you can store on disk or in memory:

A -> 0100 0001

In Unicode, a letter maps to something called a code point.

Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means “Unicode” and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041. You can find them all using the charmap utility on Windows 2000/XP or visiting the Unicode web site.

OK, so say we have a string:

Hello

which, in Unicode, corresponds to these five code points:

U+0048 U+0065 U+006C U+006C U+006F.
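
A tiny sketch of that mapping (an illustrative Java snippet, nothing more) prints each code point of the string in the U+ notation used above:

  public class CodePointDemo {
      public static void main(String[] args) {
          // Prints: U+0048 U+0065 U+006C U+006C U+006F
          "Hello".codePoints().forEach(cp -> System.out.printf("U+%04X ", cp));
          System.out.println();
      }
  }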

UTF-8

(up to 4 bytes)

  1. stores Unicode code points as binary (bytes)

  2. If a character needs 1 byte that’s all it will use.

  3. If a character needs 4 bytes it will use 4 bytes.

  4. variable length encoding = efficient memory use

  5. common characters like “C” take 8 bits,

  6. rare characters like “💕” take 32 bits.

  7. Other characters take 16 or 24 bits.

  8. https://blog.hubspot.com/website/what-is-utf-8

  9. https://deliciousbrains.com/how-unicode-works/

The final piece we're missing at this point is a system for storing and representing these code points. This is what encoding standards provide. After a few false starts, the UTF-8 encoding standard was born.

In UTF-8, every code point from 0–127 is stored in a single byte. Code points of 128 and above are stored using 2, 3, or 4 bytes (the original design allowed up to 6, but UTF-8 is now capped at 4).

Each byte consists of 8 bits, and the number of representable values grows exponentially with the bits used for storage. In a multi-byte sequence, though, some of those bits are spent marking the structure of the sequence itself: the old 6-byte form topped out at 2^31 code points, and today's 4-byte form covers everything up to U+10FFFF. Still enough for a very (very) large Scrabble game!

UTF-8 has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don’t even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you’ll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings).

UTF-8 saves space. In UTF-8, common characters like “C” take 8 bits, while rare characters like “💕” take 32 bits. Other characters take 16 or 24 bits. A mostly-English blog post takes about a quarter of the space in UTF-8 that it would in UTF-32, so it loads about four times faster.
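
Here is a sketch of that variable-length behaviour, assuming the source file is saved and compiled as UTF-8 and a JDK new enough to have java.util.HexFormat (Java 17+); neither assumption comes from the note:

  import java.nio.charset.StandardCharsets;
  import java.util.HexFormat;

  public class Utf8Demo {
      public static void main(String[] args) {
          // Common characters use 1 byte; rarer ones use 2, 3 or 4 bytes.
          for (String s : new String[] { "C", "é", "€", "💕" }) {
              System.out.println(s + " -> " + s.getBytes(StandardCharsets.UTF_8).length + " byte(s)");
          }
          // "Hello" encodes to exactly the bytes ASCII would have used.
          HexFormat hex = HexFormat.of().withDelimiter(" ").withUpperCase();
          System.out.println(hex.formatHex("Hello".getBytes(StandardCharsets.UTF_8))); // 48 65 6C 6C 6F
      }
  }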

UTF-16

Now that we know what UTF-8 is, extrapolating our understanding to UTF-16 should be fairly straightforward. UTF-8 is named for how it uses a minimum of 8 bits (or 1 byte) to store Unicode code points. Remember that it can still use more bits, but does so only if it needs to.

UTF-16 uses a minimum of 16 bits (or 2 bytes) to encode Unicode characters.

Java natively uses this encoding. A Java char is thus a UTF-16 code unit and occupies a memory space of 2 bytes; a character outside the Basic Multilingual Plane takes a pair of chars (a surrogate pair). The numerical value of a char ranges from

  1. 0x0000 (numerical value: 0) to
  2. 0xFFFF (numerical value: 65,535)

This also means UTF-16 is NOT backwards compatible with ASCII. Remember, ASCII only used 1 byte (8 bits, of which only 7 were needed).
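
A short sketch of what this looks like from Java (the 💕 example character is an assumption chosen here; the rest follows the note):

  import java.nio.charset.StandardCharsets;

  public class Utf16Demo {
      public static void main(String[] args) {
          String heart = "💕"; // code point U+1F495, outside the Basic Multilingual Plane
          System.out.println(heart.length());                          // 2 chars (a surrogate pair)
          System.out.println(heart.codePointCount(0, heart.length())); // 1 code point

          // UTF-16 is not ASCII compatible: every character takes at least 2 bytes
          // (plus a 2-byte byte-order mark with Java's generic UTF-16 charset).
          System.out.println("Hello".getBytes(StandardCharsets.UTF_16).length);   // 12
          System.out.println("Hello".getBytes(StandardCharsets.US_ASCII).length); // 5
      }
  }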

The Single Most Important Fact About Encodings

It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that “plain” text is ASCII.

There Ain’t No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

You will occasionally encounter a header of the form Content-Type: text/plain; charset="UTF-8". This tells the browser (or any text parser) the specific encoding being used.
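
As a closing sketch (the word "naïve" is just an example string picked here, not something from the note), the same bytes read back under the declared charset versus a wrong guess:

  import java.nio.charset.StandardCharsets;

  public class CharsetDemo {
      public static void main(String[] args) {
          byte[] bytes = "naïve".getBytes(StandardCharsets.UTF_8);

          // Decoded the way charset="UTF-8" tells the parser to:
          System.out.println(new String(bytes, StandardCharsets.UTF_8));      // naïve

          // Decoded with the wrong assumption (Latin-1): classic mojibake.
          System.out.println(new String(bytes, StandardCharsets.ISO_8859_1)); // naÃ¯ve
      }
  }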

