ASCII, Unicode, UTF-8, UTF-16, Latin-1 and why they matter

In iOS, you have a string like “hello world”. Most of the time you just assign it to textLabel.text, call uppercaseString on it, or append it to another string with stringByAppendingString:.

If you stop to look under the hood of an NSString, you find a deep, complicated world with a rich history, full of tales of international conflict, competing standards and a conservation movement.

Why does this matter?

If you have ever imported a weird Word document and seen a page with boxes and ? instead of accented letters, that was because of encoding.

If you want your website to show up in other languages, you need to understand encoding.

 

Under the hood of NSString (in Objective-C)

If you want the first letter of an NSString, you can use characterAtIndex:

NSString *exampleString = @"Hello world";
unichar c = [exampleString characterAtIndex:0];

You get this unichar.

What’s a unichar?

typedef unsigned short unichar;

It’s a number. In the example above, c is 72, which represents “H” when your computer uses UTF-16 to translate between numbers and symbols.
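To check that number outside Xcode, here is a minimal sketch in Python rather than Objective-C (chosen only because it is quick to run; the variable names are mine). It encodes the string as UTF-16 and reads the first 16-bit code unit, which is roughly what characterAtIndex: hands back.

```python
import struct

example_string = "Hello world"

# Encode as big-endian UTF-16 without a byte-order mark.
utf16_bytes = example_string.encode("utf-16-be")

# Read the first two bytes as one unsigned 16-bit integer (like a unichar).
(first_unit,) = struct.unpack(">H", utf16_bytes[:2])

print(first_unit)       # 72
print(hex(first_unit))  # 0x48
print(chr(first_unit))  # H
```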

What is UTF-16?

Unicode Transformation Format 16-bit

UTF-16 translates between a number and the Unicode symbol that it represents. Each of its code units is 16 bits. Some characters use one code unit, some need two.
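A rough way to see the "one unit or two" rule is to list the code units yourself. This is a sketch in Python, not Objective-C, and utf16_code_units is a helper name I made up for the illustration.

```python
def utf16_code_units(text):
    """Return the 16-bit code units that UTF-16 uses for text."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

letter_units = utf16_code_units("H")
emoji_units = utf16_code_units("\U0001F60E")  # the sunglasses emoji

print(letter_units)                    # [72]
print([hex(u) for u in emoji_units])   # ['0xd83d', '0xde0e']
```

The emoji does not fit in 16 bits, so UTF-16 splits it across two code units (a surrogate pair).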

Example Unicode chart (the numbers under the symbols are in base 16, so H = 0048, which is 72 in decimal).

What is Unicode?

Computers don’t know what an “H” is. They only work with numbers, so we use a number to represent the “H” and tell the computer which symbol to draw for it.

ASCII was an early way of translating between numbers and symbols, but it was really basic. It defines only 128 codes (0 to 127), some of which are non-printing control codes, so you only get digits, unaccented letters and some punctuation. You only need 7 bits to represent any ASCII symbol.

What happens when you don’t speak English?

ASCII doesn’t let you make the accented letters used in Spanish and French. So, if you’re trying to read French, the text won’t have the accents or won’t show those letters at all. Forget about trying to read Chinese or Russian, because they have completely different characters and ASCII has no idea how to show them.
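You can watch ASCII give up on an accented letter. The snippet below is a small Python sketch (not Objective-C), and the word "café" is just my example. A strict conversion fails, and a lenient one substitutes the familiar question mark.

```python
word = "caf\u00e9"  # "café", with the accented e written as an escape

# Plain English letters all fit below 128, so ASCII handles them.
plain_fits = max(ord(ch) for ch in "cafe") < 128
print(plain_fits)  # True

# The accented letter has no ASCII number, so a strict conversion fails.
try:
    word.encode("ascii")
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False
print(strict_ok)  # False

# A lenient conversion swaps in a question mark instead.
lossy = word.encode("ascii", errors="replace")
print(lossy)  # b'caf?'
```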

Obviously, people in France and Russia and China wanted their computers to show their native language, so they made systems for translating numbers to symbols too. The hard part was that many of these systems overlapped, using the same numbers for different symbols, and each one only covered one language or a small group of languages. Latin-1 was one of these encodings: it covers Western European languages, but not Russian or Chinese.
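Here is what "overlapped" looks like in practice. This is a Python sketch (not Objective-C), and Windows-1251 is my pick for a Cyrillic encoding to compare against; the post itself only names Latin-1. The same single byte means two different letters depending on which system you assume.

```python
raw = bytes([0xE9])  # one byte with the value 233

as_latin1 = raw.decode("latin-1")  # Western European reading
as_cp1251 = raw.decode("cp1251")   # Cyrillic (Windows-1251) reading

print(as_latin1)  # é
print(as_cp1251)  # й
```

If a file doesn't say which encoding it uses and the reader guesses wrong, this is how you end up with the wrong letters on screen.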

How do we solve this problem?

We need a new system that has all the English characters, all the accents and all the Russian and Chinese characters too and every other possible symbol like emojis. That’s Unicode.

Check out all the code charts.

Does this actually work? Let’s find out

I looked at the Arabic code chart. So what I’m saying is that if I put the number under any of these Arabic characters into an NSString, it’ll show up on my screen?

Well yeah. Let’s try it.

Here are three random Arabic characters. I took the hexadecimal values under the characters, converted them into decimal numbers (1692, 1693, 1695) and then put them into an NSString, with spaces in between, in a Swift Playground.

[Screenshots: the three Arabic characters]

[Screenshot: the Swift Playground]

Yay it works! (Arabic is read right to left.) 😎

[Screenshot]
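If you don't have a Playground handy, the same experiment can be sketched in Python (used here instead of Swift or Objective-C just for convenience; the variable names are mine). It starts from the three decimal numbers above, checks them against the hexadecimal labels in the chart, and builds the string.

```python
code_points = [1692, 1693, 1695]  # the decimal values used above

# The hexadecimal form is what appears under each character in the chart.
hex_labels = [format(cp, "04X") for cp in code_points]
print(hex_labels)  # ['069C', '069D', '069F']

# Build the string with spaces in between, as in the Playground.
arabic = " ".join(chr(cp) for cp in code_points)
print(arabic)
```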

What’s UTF-8?

Unicode Transformation Format 8-bit. Each code unit is only 8 bits instead of 16. That means a single code unit can only take one of 256 values (2^8) instead of the 65,536 values (2^16) you get with 16 bits. So is less more here?

In English, most of the characters that we use are in the range of 65 (0041 hexadecimal) to 122 (007A hexadecimal). In UTF-16, the first 8 bits of each of these are always 00, so some people thought it would be a good idea to get rid of them to save space.
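To make the "always 00" point concrete, here is a quick Python sketch (not Objective-C) that splits the UTF-16 bytes of an English word into the first and second byte of each code unit. The word "Hello" is just an example.

```python
data = "Hello".encode("utf-16-be")

high_bytes = data[0::2]  # first byte of every 16-bit unit
low_bytes = data[1::2]   # second byte of every 16-bit unit

print(high_bytes)  # b'\x00\x00\x00\x00\x00'
print(low_bytes)   # b'Hello'
```

Drop the zero bytes and what is left is exactly what UTF-8 stores for the same word.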

In UTF-16, storing “H” in memory requires one 16-bit unit.

In UTF-8, storing “H” requires one 8-bit unit. Your English characters take half as much space.

But what if I need to show the Arabic characters from above?

You just need more 8-bit units to represent them. In this case, two will do.
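Here is a hedged sketch of how those two units are built, in Python rather than Objective-C. The helper utf8_two_bytes is my own illustration and only covers the two-byte range (0x80 to 0x7FF); real UTF-8 also has one-, three- and four-byte forms. The first byte starts with the bits 110 and carries the top 5 bits of the number; the second starts with 10 and carries the low 6 bits.

```python
def utf8_two_bytes(code_point):
    """Encode a code point in the range 0x80..0x7FF as two UTF-8 bytes."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("this sketch only handles the two-byte range")
    first = 0b11000000 | (code_point >> 6)         # 110 + top 5 bits
    second = 0b10000000 | (code_point & 0b111111)  # 10 + low 6 bits
    return bytes([first, second])

manual = utf8_two_bytes(1692)        # first Arabic character from above
builtin = chr(1692).encode("utf-8")  # Python's own encoder, for comparison

print(manual.hex())   # da9c
print(builtin.hex())  # da9c
```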

It’s really nice to be able to assume that one code unit is one character, but if you make the code unit too small, you need more units per character.

The tradeoff between UTF-8 and UTF-16 is between having code units that are big enough to contain all or most of the characters you need and conserving space.

There’s also UTF-32, where each code unit is 32 bits. You can make any character with one code unit, but for most of your characters you’ll be wasting a lot of space.
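The tradeoff is easy to measure. This Python sketch (again, not Objective-C; size_in_bytes and the sample strings are mine) counts the bytes each encoding needs for an English string and for the three Arabic characters used earlier.

```python
def size_in_bytes(text, encoding):
    """Number of bytes needed to store text in the given encoding."""
    return len(text.encode(encoding))

samples = {
    "english": "Hello world",
    "arabic": chr(1692) + chr(1693) + chr(1695),
}

sizes = {
    name: {enc: size_in_bytes(text, enc) for enc in ("utf-8", "utf-16-be", "utf-32-be")}
    for name, text in samples.items()
}

print(sizes["english"])  # {'utf-8': 11, 'utf-16-be': 22, 'utf-32-be': 44}
print(sizes["arabic"])   # {'utf-8': 6, 'utf-16-be': 6, 'utf-32-be': 12}
```

For English text UTF-8 is the clear winner; for these Arabic characters UTF-8 and UTF-16 tie, and UTF-32 is the largest in both cases.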

At the time of writing, UTF-8 is the de facto standard on the web, used by 87.8% of websites.

What is character encoding all about?

The story of Unicode is a success story of how people came together when faced with a difficult problem of incompatible systems and actually made one system that works for everyone.

This story also shows how connected the world is: we now need to be able to talk to people in other countries, and the opportunities of the web are accessible to anyone with an internet connection.

 

Further reading:

Inspiration for biting the bullet and actually figuring this stuff out. 

An excellent explanation of how UTF-8 works.