A Beginner’s Guide to Understanding Unicode for the Application Programmer and Web Developer.

22nd December 2011 by Ian Martin

A+Beginner%e2%80%99s+Guide+to+Understanding+Unicode+for+the+Application+Programmer+and+Web+Developer. Image

A Beginner’s Guide to Understanding Unicode for the Application Programmer and Web Developer.

22nd December 2011 by Ian Martin

1 Introduction

To developers who speak and work in English, French, Spanish and other Western European languages, Unicode character encoding is often either ignored, considered late or even retro-fitted to projects, whilst the development team may have only a rudimentary understanding of Unicode and character encoding systems such as UTF-8.

The concepts are quite straightforward and can easily be grasped and whilst a detailed knowledge of, for example, the encoding system used by UTF-8 is not necessary, a general understanding of Unicode and how it is implemented in today’s applications and web servers is useful.

2 ISO-8859-1 and ASCII

2.1 Background

Before looking at Unicode let us examine a simple 8 bit byte-per character encoding scheme.

ISO-8859-1 is an 8-bit single byte character encoding scheme for Western European languages.

The first 128 characters of ISO-8859-1 is the original ASCII character-set (the numbers from 0-9, the uppercase and lowercase English alphabet, and some special characters).

The higher part of ISO-8859-1 (codes from 160-255) contains the characters used in Western European countries and some commonly used special characters.

ISO-8859-1 thus provides character encoding for English and many Western European languages. Character encoding is simple (a character is a single byte) and space efficient, but the scope is limited to 255 possible characters.

2.2 Example Use

In web server usage, if a web page will not contain any non ISO-8859-1 characters, the web server character set may be set to ISO-8859-1;  a generic method is shown below:-

<meta http-equiv="content-type" content="text/html; charset=iso-8859-1" />

A JSP page fragment is shown below:-

<%@ page contentType="text/html;charset= iso-8859-1" %>

2.3 Example Character

An example ISO-8859-1 character is the pound sterling £character, which is code 163 decimal or 0xA3 hexadecimal.

3 Bytes and Characters Confusion

A byte is (almost always) an 8-bit unit of information.  Historically, however, ASCII text strings were represented as a single byte per character, and the word “character” was used as a synonym for “byte”.

Furthermore, the ANSI C byte storage unit is char and since this was used to store a single ASCII character, the terms byte, char and character were often used interchangeably in documentation.

Today’s developer should consider bytes and characters as distinct and different units at all times.

4 How Unicode is Organised

4.1 Introduction

Unicode extends the range of code points far beyond the 8 bit limits of ISO-8859-1 and byte storage. Unicode characters have been assigned for virtually all spoken and written languages, plus mathematical and other symbols, and historic languages.

4.2 Code Points

A code point is simply the numerical value that defines a given character within the Unicode definition.

4.3 ASCII in Unicode

The ASCII characters, with which English speaking developers are so familiar, are of course a subset of the Unicode code point range.

The code points 0-U 7F (or 0-127 decimal) are ASCII characters; an ASCII code is thus identical to its Unicode code point value.

4.4 Unicode Planes

A Unicode plane is a collection of 216 (65536 or 0x10000 hex) code points.

The first plane, that is the code points 0 - U FFFF, is called the Basic Multilingual Plane (BMP). This comprises the most Unicode characters that have been assigned so far. The BMP contains characters for almost all modern languages, and a large number of special characters.

It will be noted that the BMP can be represented in 16 bits; in other words, the Unicode characters that are used in most instances are able to be represented in 16 bits.

4.5 Example Unicode Characters

  • A (U+41)
  • £ (U+A3)
  • χ (U+3c7)
  • 𝄞 Treble Clef (U+1d11e). Note that this lies outside of the Basic Multilingual Plane (BMP).

4.6 Unicode Character Pages (Partial List)

These are some of the principal character page locations within Unicode. This is not intended to be a comprehensive list.

4.6.1. Basic Multilingual Plane (BMP) 0 - U FFFF Most of the Unicode characters assigned lie in the BMP. These are just a few of the blocks of code points:-

  • C0 Controls and Basic Latin (0000-007F)
  • C1 Controls and Latin-1 Supplement (0080-00FF)
  • Latin Extended-A (0100-017F)
  • Latin Extended-B (0180-024F)
  • Greek and Coptic (0370-03FF)
  • Cyrillic (0400-04FF)
  • Armenian (0530-058F)
  • Hebrew (0590-05FF)
  • Arabic (0600-06FF)
  • Syriac (0700-074F)
  • Bengali (0980-09FF)
  • Gujarati (0A80-0AFF)
  • Tamil (0B80-0BFF)
  • Telugu (0C00-0C7F)
  • Kannada (0C80-0CFF)
  • Malayalam (0D00-0D7F)
  • Thai (0E00-0E7F)
  • Lao (0E80-0EFF)
  • Tibetan (0F00-0FFF)
  • Myanmar (1000-109F)
  • Georgian (10A0-10FF)
  • Hangul Jamo (1100-11FF)
  • Ethiopic (1200-137F)
  • Cherokee (13A0-13FF)
  • Unified Canadian Aboriginal Syllabics (1400-167F)
  • Ogham (1680-169F)
  • Runic (16A0-16FF)
  • Tagalog (1700-171F)
  • Khmer (1780-17FF)
  • Mongolian (1800-18AF)
  • Phonetic Extensions (1D00-1D7F)
  • Latin Extended Additional (1E00-1EFF)
  • Greek Extended (1F00-1FFF)
  • General Punctuation (2000-206F)
  • Superscripts and Subscripts (2070-209F)
  • Currency Symbols (20A0-20CF)]
  • Letterlike Symbols (2100-214F)
  • Number Forms (2150-218F)
  • Arrows (2190-21FF)
  • Mathematical Operators (2200-22FF)
  • Miscellaneous Technical (2300-23FF)
  • Control Pictures (2400-243F)
  • Optical Character Recognition (2440-245F)
  • Enclosed Alphanumerics (2460-24FF)
  • Box Drawing (2500-257F)
  • Block Elements (2580-259F)
  • Geometric Shapes (25A0-25FF)
  • Miscellaneous Symbols (2600-26FF)
  • Dingbats (2700-27BF)
  • Miscellaneous Mathematical Symbols-A (27C0-27EF)
  • Supplemental Arrows-A (27F0-2a7FF)
  • Braille Patterns (2800-28FF)
  • Supplemental Arrows-B (2900-297F)
  • Katakana (30A0-30FF)
  • Bopomofo (3100-312F)
  • Hangul Compatibility Jamo (3130-318F)
  • Kanbun (3190-319F)
  • Bopomofo Extended (31A0-31BF)
  • Katakana Phonetic Extensions (31F0-31FF)
  • Enclosed CJK Letters and Months (3200-32FF)
  • CJK Compatibility (3300-33FF)
  • CJK Unified Ideographs Extension A (3400-4DBF)
  • Yijing Hexagram Symbols (4DC0-4DFF)
  • CJK Unified Ideographs (4E00-9FFF)
  • Yi Syllables (A000-A48F)
  • Yi Radicals (A490-A4CF)
  • Hangul Syllables (AC00-D7AF)
  • Surrogates and Private Use Area (D800-F8FF)
  • CJK Compatibility Ideographs (F900-FAFF)
  • Alphabetic Presentation Forms (FB00-FB4F)
  • Arabic Presentation Forms-A (FB50-FDFF)
  • Variation Selectors (FE00-FE0F)
  • Combining Half Marks (FE20-FE2F)
  • CJK Compatibility Forms (FE30-FE4F)
  • Small Form Variants (FE50-FE6F)
  • Arabic Presentation Forms-B (FE70-FEFF)
  • Halfwidth and Fullwidth Forms (FF00-FFEF)
4.6.2 Supplementary Multilingual Plane (SMP) U 1000 - U 1ffff Mostly used for historic scripts and musical and mathematical symbols.

Some of the principal blocks are:-

  • Linear B Syllabary (10000-1007F)
  • Linear B Ideograms (10080-100FF)
  • Aegean Numbers (10100-1013F)
  • Old Italic (10300-1032F)
  • Gothic (10330-1034F)
  • Ugaritic (10380-1039F)
  • Deseret (10400-1044F)
  • Shavian (10450-1047F)
  • Osmanya (10480-104AF)
  • Cypriot Syllabary (10800-1083F)
  • Byzantine Musical Symbols (1D000-1D0FF)
  • Musical Symbols (1D100-1D1FF)
  • Tai Xuan Jing Symbols (1D300-1D35F)
  • Mathematical Alphanumeric Symbols (1D400-1D7FF)
4.6.3 Higher Planes There are further planes in Unicode, providing code points (many not yet assigned) up to U 10FFFF.

5 Unicode Storage

5.1 Introduction

As we have seen, ASCII characters were traditionally stored in a byte per character; the system was very efficient in terms of space, very easy to program (easy to count characters for example) but unable to represent higher value code points.

This section explains the various storage techniques available to developers today.

5.2 UTF-32 - "The Simplest Method"

UTF-32 is simply an extension of the above byte-per-character ASCII storage method. In UTF-32, each character is stored as a 32 bit value.

If storage space and transmission times were of no importance in applications and on the Internet, every text string could be stored and transmitted in UTF-32.

A system using UTF-32 is very easy to program and maintain, but very inefficient in terms of storage. A Web page of 1000 ASCII characters, for example, would require a 4000 byte download, rather than the 1000 bytes we would expect (these figures ignore potential web server compression such as gzip).

5.3 UTF-16

5.3.1 Overview UTF-16 is widely used in today’s environments, and offers the developer a virtually seamless means to use Unicode text strings. It offers simplicity and relatively efficient space encoding, although if you are working only characters in the code point range 0 - U ff, the text will require twice the storage space of a simple byte-per-character system.

If you are a Java, or .NET (C#, Visual Basic) developer or a Windows C programmer working in Unicode you are using UTF-16 when you manipulate text within your code.

Most of the Unicode characters you will be likely to use are found in the range 0-0xffff, the Basic Multilingual Plane (BMP). These code points can be represented by a single 16 bit character.

In the event that you need to store a Unicode character outside of the BMP, the character is represented as a surrogate pair of 16 bit characters. Some of the above languages have some support for surrogate pairs in the event that you would ever use a Unicode character outside of the BMP.

5.4 USC-2

You may hear the terms UCS-2 and UTF-16 used interchangeably. UCS-2 is a 16 bit fixed width character encoding system, identical to UTF-16 except that it does not support surrogate pairs; UCS-2 cannot handle Unicode characters outside of the BMP (above U FFFF).

5.5 UTF-8

5.5.1 UTF-8 Encoding

Many developers will be familiar with the term UTF-8.. This is a highly space efficient means to store a sequence of any Unicode characters. Put simply, it uses as few bytes as possible. Instead of using a fixed 2 or 4 bytes per character as in UTF-16 and UTF-32, it attempts to use 1, 2, 3 or 4 bytes.

Without giving a complete technical explanation:-

  • ASCII characters 0-127 (decimal) are encoded as a single byte. Thus, English text looks identical in UTF-8; this is especially useful when using a debugger.
  • Code points U 80 to U 7FF are stored as 2 bytes.
  • Code points U 800 to U FFFF are stored as 3 bytes.
  • Code points U 10000 U 10FFFF are stored as 4 bytes.
Thus, all Unicode characters can be stored very efficiently, with the most common characters being stored in 1 or 2 bytes.

The disadvantage is the relative complexity in handling the text internally given the different lengths of character encodings.

5.5.2 Web Pages

If, for example, a web page is encoded using UTF-8, the web server responds to HTTP requests with a UTF-8 encoded text string. The number of bytes to be transmitted is reduced in comparison to UTF-16 or UCS-2 encoding. The web browser knows how to decode UTF-8, and each character is re-assembled in the web browser as the correct Unicode code point.

5.5.3 An Encoding Example

Returning to our earlier example of the £ sterling character, code point U A3  (or 163 decimal), the UTF-8 encoding for this character is the 2 byte sequence:-

0xC2 0xA3
It is not uncommon to see the £ character incorrectly displayed on a web page. The reason is usually either:-
  • The web page contains UTF-8 data but the character set is described as ISO-8859-1, so two ISO-8859 characters are displayed, in error.
  • The character set is described as UTF-8 but the character has not been UTF-8 encoded; the browser attempts to decode the text starting at our single byte as a UTF-8 sequence.

6 Code Examples

6.1 HTML

Unicode characters are easily inserted into a web page, such as:-

<p> &#x039C; &#x03B5; &#x03C6;</p>

…which, if the web page is correctly set to a Unicode character set (such as UTF-8), and a suitable font is present in the client system, will produce some Greek characters.

6.2 .NET C#

6.2.1 Looking at UTF-8 Encoding

Consider the following code snippet to create a String which will be written into a web page whose charset is UTF-8:-

String test = "Velocit" (char)0xE0;

We have created a String using ASCII characters but appended the Unicode code point U E0 which is the character à. It may have been simpler to have defined the entire word using non-ASCII characters in the source code, but these can be difficult to type.

The code below shows how to view our String as UTF-8 bytes:-

System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding();

byte[] br = encoding.GetBytes(test);

int i = 0;

while (i < br.Length)

{

Console.WriteLine("Byte " i "=" br[i].ToString("x2"));

i ;

}

The String Velocità is encoded as the ASCII values for all but the final character which as can be seen is in the range U 80 – U 7FF and is thus encoded as two bytes (0xC3 0xA0) under UTF-8.

Whilst the developer does not normally need to be aware of the mechanism of UTF-8, in can become useful when inserting text from a database, for example; some conversion may need to be executed.

6.2.2 Counting Characters

Strings are stored as UTF-16 characters in .NET. This fact is seldom apparent to the developer; the .NET String methods are powerful and allow simple text manipulation.

Consider the following simple code snippet to create a String and count the number of characters:-

String test = "Clef”;

Console.WriteLine("Number of characters=" test.Length);

The number of characters is correctly reported as 4.

If we change the String to include a non-BMP character, in this case the treble clef  (U 1d11e)  we will see a problem. The new character is encoded as two UTF-16 characters, but the String class does not “know” that the two surrogate UTF-16 characters are actually one; thus the following  code returns not 5 characters but 6:-

String test = "clef" Char.ConvertFromUtf32(0x1d11e);

Console.WriteLine("Number of characters=" test.Length);

(Note also the means by which the UTF-32 character 0x1D11E is added to the String.)

To allow the correct counting of characters when surrogate pairs are in use, the following code illustrates the necessary technique in a simple function:-

//return number of characters in a String even if surrogate pairs are //present int CountChars(String text) { int test_len = text.Length; int i = 0;

int num_chars = 0;

while (i < test_len) {

if (Char.IsSurrogate(text, i) == true) { i ;            //skip a character so that the surrogate pair //is counted once. }

num_chars ; i ; } return (num_chars); }

It is probably unlikely that the above situation will occur as most Unicode characters that are likely to be used lie in the Basic Multilingual Plane (0- U FFFF).

6.3 C

6.3.1 Unicode in C

Many C programmers begin text implementation as char arrays of ISO-8859-1 characters. There are a variety of means to implement Unicode in C .

6.3.2 Use Wide Characters Explicitly

It is a relatively simple switch from char to the 16-bit “wide” character  wchar_t.Strings are defined using the L prefix, and the ASCII text functions str??? are replaced in usage by their wide character equivalents wcs???. The following code example shows  the definition and use of a Unicode string in C . Note that wcslen returns the number of characters, not the length of the string in bytes.

const wchar_t test[] = L"Velocità";

printf("Num chars = %d\r\n", wcslen(test));

The advanced code example below shows how to convert any Unicode code point (even outside the BMP) into a single UTF-16 character, or a surrogate pair.

// convert a single unicode char to UTF-16 // note that if src>0xffff it returns two 16 bit wide chars // (surrogate pairs) // // if the char is in the BMP, c1 will always be 0, so only c0 // should be used

void ToUTF16(wchar_t &c0,wchar_t &c1,const unsigned long src) {

if(src > 0xffff) {

// 0x10000 is subtracted from the code point, leaving a 20 bit number // in the  range 0..0xFFFFF.

// The top ten bits (a number in the range 0..0x3FF) are added to 0xD800 to give // the first code point or high surrogate, which will be in the range // 0xD800..0xDBFF. // The low ten bits (also in the range 0..0x3FF) are added to 0xDC00 to give // the second code point or low surrogate, which will be in the range // 0xDC00..0xDFFF.

unsigned long v1 = src - 0x10000;

unsigned long v2 = v1 & 0x3ff;       //lower 10 bits

v2 = 0xdc00;

//get the top 10 bits

v1 >>= 10; v1 &= 0x3ff; v1 = 0xd800;

//add the surrogate pair

c0 = (wchar_t)v1; c1 = (wchar_t)v2;

} else {

//BMP char (basic multilingual plane) c0 = (wchar_t)src; c1 = 0; } }

6.3.3 Use the Standard Library

Many C programmers prefer to use the standard library (in preference to MFC, for example). The 8 bit string class std::string has a wide character equivalent std::wstring.

6.3.4 Define _UNICODE

Windows C programmers (not CLR) can opt to define their project as Unicode. The use of pre-processor directives can make the switch to Unicode within a project very simple, and can even allow for a reversion to 8-bit characters.

In Visual Studio (or Visual C before that) when a project is defined as using Unicode characters (the default):-

  • TCHAR is defined as WCHAR  which is itself defined as the wide character wchar_t
  • Much of the Windows API uses TCHAR to reference Strings; in effect, the API uses wide characters.
  • The MFC class CString now becomes a Unicode vessel.
  • Windows  provides the TEXT macro as a means to define text literals in whichever encoding system is currently in use.

7 References

For further information on Unicode please visit The Unicode Consortium at:-

https://www.unicode.org