May 24, 2011 / Rohit

Unicode: You should care.

Like many ignorant programmers, I too never bothered to look at Unicode closely. For me, characters were 1 byte long and were in ASCII. When I moved to languages such as Java and C#, characters became 2 bytes long, and that was that. I knew the reason was Unicode, but I never bothered to look deeper or think about its implications. In my simplistic view, all the world’s characters could be accommodated in those 16 bits. How wrong I was!

Unicode

Unicode is essentially a way to represent characters. If you are familiar with ASCII, you know that its characters are represented as the numbers 0-127. In the case of 8-bit extensions such as ISO-8859-1, the range is 0-255. ASCII became the standard for representing English characters, but what about other languages? People started developing their own encoding schemes for their languages, and this became a nightmare for interoperability. What would happen if someone sent an email in one encoding to someone whose software assumed another? Garbled text.

Thus a need was felt for a universal character representation scheme, and Unicode was invented. Each character in Unicode is represented as a code point, which is nothing but a number. The Unicode code space currently has room for 1,114,112 code points (U+0000 through U+10FFFF), though not all of them are assigned to characters.
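A quick way to see that a code point is just a number is to ask Python for it. This is a small sketch; the helper name describe and the sample characters are my own choices.

```python
import sys

def describe(ch: str) -> str:
    """Return the conventional U+XXXX notation for a single character."""
    return f"U+{ord(ch):04X}"

# ord() maps a character to its code point, chr() maps it back.
for ch in ["A", "é", "अ", "😀"]:
    assert chr(ord(ch)) == ch
    print(ch, describe(ch), ord(ch))

# The highest code point is U+10FFFF, so the code space holds this many values:
MAX_CODE_POINTS = sys.maxunicode + 1
print(MAX_CODE_POINTS)  # 1114112
```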

Not 16 bits!

If you read the last sentence of the previous paragraph, you will realize that Unicode characters cannot all be represented in 16 bits.

16 bits = 2^16 = 65,536 ≠ 1,114,112

Remember that! Unicode characters cannot all be represented in 16 bits. This was a bit confusing to me at first, because when I learned C# and Java, I was told that characters are 2 bytes long, but now they aren’t? Was I being lied to? Well, yes and no. Initially it was thought that all characters could be represented in 16 bits, and a fixed-width 16-bit encoding (UCS-2, the predecessor of UTF-16) was supposed to take care of that. Each code point would be translated into 2 bytes, there might be a byte order mark at the beginning of the file (to distinguish between big-endian and little-endian byte order), and that was that. But soon the Unicode people found that 65,536 code points were not enough and had to extend the code space beyond 16 bits. This meant that not all Unicode characters could be represented in 16 bits. In UTF-16, the new characters require 4 bytes: a pair of 16-bit values called a surrogate pair.


So in the UTF-16 scheme, a character may be either 2 bytes or 4 bytes long. It is wrong to assume that all characters in UTF-16 are 2 bytes long. This can lead to subtle bugs, wherein a program works on the basic character set but fails on characters whose code points are 2^16 or greater.
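Here is a rough Python sketch of how UTF-16 splits a code point of 2^16 or above into two 16-bit units. The function names to_surrogates and utf16_units are mine, and the result is checked against Python's built-in utf-16-be codec.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a code point in U+10000..U+10FFFF into a UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("code point fits in one 16-bit unit or is out of range")
    v = cp - 0x10000              # 20 bits remain after removing the offset
    high = 0xD800 + (v >> 10)     # top 10 bits go into the high surrogate
    low = 0xDC00 + (v & 0x3FF)    # bottom 10 bits go into the low surrogate
    return high, low

def utf16_units(s: str) -> list[int]:
    """Return the 16-bit code units Python's codec produces (big-endian, no BOM)."""
    data = s.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

print([hex(u) for u in utf16_units("A")])    # one unit
print([hex(u) for u in utf16_units("😀")])   # two units
assert utf16_units("😀") == list(to_surrogates(ord("😀")))
```

A program that counts 16-bit units and calls the result "the number of characters" will report 2 for the single emoji above, which is exactly the kind of subtle bug described here.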

One way to solve this problem is to expand the encoding scheme to represent every code point with 4 bytes, and that’s exactly what UTF-32 does. The problem, though, is the obvious waste of space. For text that is all in English, the space taken will be twice as much as in UTF-16 and four times as much as in ASCII. It should be noted that the default wide characters (wchar_t) in C and C++ (on Linux at least) are 32 bits per character.
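The space cost is easy to measure. In this small sketch the helper name sizes and the sample string are my own, and I use the little-endian codec variants so that no byte order mark is added to the counts.

```python
def sizes(text: str) -> dict[str, int]:
    """Return the encoded length in bytes of text under three encodings."""
    return {enc: len(text.encode(enc)) for enc in ("ascii", "utf-16-le", "utf-32-le")}

result = sizes("Hello, world")
print(result)  # {'ascii': 12, 'utf-16-le': 24, 'utf-32-le': 48}
```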

UTF-8

Along came the Unix gods, Rob Pike and Ken Thompson, and solved the space bloat problem for us by giving us UTF-8. It’s a brilliantly devised encoding scheme and is described in detail in the original paper. Basically, the UTF-8 representation needs only the same amount of space as ASCII to store ASCII characters. This is because any byte in UTF-8 whose most significant bit is 0 maps to the same character as its remaining 7 bits would in ASCII. For other characters, the encoding uses two to four bytes, each with its most significant bit set to 1. The beauty of this is that it results in backwards compatibility: any valid ASCII text is already valid UTF-8, so programs that were written to work with ASCII won’t break on that text if you switch to Unicode with UTF-8.
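To show the shifts and masks involved, here is a hand-rolled UTF-8 encoder in Python. It is only a sketch: the name utf8_encode is mine, it does not reject surrogate code points the way a real codec does, and in practice you would simply call str.encode("utf-8").

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point as UTF-8 (sketch: surrogates are not rejected)."""
    if cp < 0x80:          # 7 bits:  0xxxxxxx, identical to ASCII
        return bytes([cp])
    if cp < 0x800:         # 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:       # 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:     # 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code point out of range")

# Compare with the built-in codec on characters of each length.
for ch in "Aéअ😀":
    encoded = utf8_encode(ord(ch))
    assert encoded == ch.encode("utf-8")
    print(ch, len(encoded), encoded.hex())
```

The first branch is the backwards-compatibility property: a code point below 0x80 is written as the very same single byte that ASCII would use.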

As far as I know most of the common utilities on Linux support UTF-8. The Go programming language (which was partly designed by Pike and Thompson) supports UTF-8 natively.

UTF-8 has some disadvantages too, though. As Tim Bray points out:

Let’s address the problem first: UTF-8 is kind of racist. It allows us round-eye paleface anglophone types to tuck our characters neatly into one byte, lets most people whose languages are headquartered west of the Indus river get away with two bytes per, and penalizes India and points east by requiring them to use three bytes per character.

This is a serious problem, but it’s not a technical problem. All that bit-twiddling turns out to be easy to implement in very efficient code; I’ve done it a few times, basically reading the rules and composing all the shifts and masks and so on, and gotten it pretty well right first time, each time. In fact, processing UTF-8 characters sequentially is about as efficient, for practical purposes, as any other encoding.

There is one exception: you can’t easily index into a buffer. If you need the 27th character, you’re going to have to run through the previous twenty-six characters to figure out where it starts. Of course, UTF-16 has this problem too, unless you’re willing to bet your future on never having to use astral-plane characters and pretend that Unicode characters are 16 bits, which they are (except when they’re not).
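The indexing point in the quote above can be sketched in a few lines of Python. The function name char_offset and the sample string are my own; the idea is that every byte of the form 10xxxxxx is a continuation byte, so character starts can only be found by scanning from the beginning.

```python
def char_offset(buf: bytes, n: int) -> int:
    """Return the byte offset where the n-th (0-based) code point of a UTF-8 buffer starts."""
    count = -1
    for i, b in enumerate(buf):
        if (b & 0xC0) != 0x80:   # not a continuation byte, so a new code point starts here
            count += 1
            if count == n:
                return i
    raise IndexError("buffer has fewer code points than requested")

buf = "aé😀z".encode("utf-8")    # byte lengths per character: 1, 2, 4, 1
offsets = [char_offset(buf, k) for k in range(4)]
print(offsets)  # [0, 1, 3, 7]
```

The cost is linear in the offset, unlike the constant-time index into a fixed-width encoding such as UTF-32. The sketch assumes the buffer is valid UTF-8 and does no validation.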

Conclusion

If you didn’t read anything else, here are a few things to take away:

  1. Other than UTF-32, none of the Unicode encoding schemes guarantees a fixed number of bytes per character.
  2. I am biased towards the UTF-8 encoding scheme and feel that everyone should use it over other alternatives.
  3. You should care about Unicode!

Further Reading

  1. http://www.joelonsoftware.com/articles/Unicode.html
  2. http://www.tbray.org/ongoing/When/200x/2003/04/06/Unicode
  3. http://www.tbray.org/ongoing/When/200x/2003/04/26/UTF
  4. http://www.tbray.org/ongoing/When/200x/2003/04/30/JavaStrings
  5. http://benlynn.blogspot.com/2011/02/utf-8-good-utf-16-bad.html
  6. http://research.swtch.com/2010/03/utf-8-bits-bytes-and-benefits.html