The concepts are straightforward and easily grasped. A detailed knowledge of, for example, the encoding system used by UTF-8 is not necessary, but a general understanding of Unicode and how it is implemented in today’s applications and web servers is useful.
ISO-8859-1 is an 8-bit single byte character encoding scheme for Western European languages.
The first 128 characters of ISO-8859-1 are the original ASCII character set (the digits 0-9, the uppercase and lowercase English alphabet, and some special characters).
The higher part of ISO-8859-1 (codes from 160-255) contains the characters used in Western European countries and some commonly used special characters.
ISO-8859-1 thus provides character encoding for English and many Western European languages. Character encoding is simple (a character is a single byte) and space efficient, but the scope is limited to the 256 values that a single byte can hold.
<meta http-equiv="content-type"
content="text/html; charset=iso-8859-1" />
A JSP page fragment is shown below:-
<%@ page contentType="text/html;charset=iso-8859-1" %>
Furthermore, the ANSI C byte storage unit is char and since this was used to store a single ASCII character, the terms byte, char and character were often used interchangeably in documentation.
Today’s developer should consider bytes and characters as distinct and different units at all times.
The code points U+0000 to U+007F (or 0-127 decimal) are ASCII characters; an ASCII code is thus identical to its Unicode code point value.
The first plane, that is the code points U+0000 to U+FFFF, is called the Basic Multilingual Plane (BMP). This comprises most of the Unicode characters that have been assigned so far. The BMP contains characters for almost all modern languages, and a large number of special characters.
Note that the BMP can be represented in 16 bits; in other words, the Unicode characters that are used in most instances can be represented in 16 bits.
4.6.1. Basic Multilingual Plane (BMP) U+0000 - U+FFFF
Most of the Unicode characters assigned lie in the BMP. These are just a few of the blocks of code points:-
Some of the principal blocks are:-
This section explains the various storage techniques available to developers today.
If storage space and transmission times were of no importance in applications and on the Internet, every text string could be stored and transmitted in UTF-32.
A system using UTF-32 is very easy to program and maintain, but very inefficient in terms of storage. A Web page of 1000 ASCII characters, for example, would require a 4000 byte download, rather than the 1000 bytes we would expect (these figures ignore potential web server compression such as gzip).
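The arithmetic above can be checked with a few lines of C++. This is only a sketch: the function name utf32_size_bytes is hypothetical, and it assumes the input is pure ASCII and that char32_t occupies 4 bytes, as it does on common platforms.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes needed to hold an ASCII-only
// string once every character is widened to one UTF-32 code unit.
std::size_t utf32_size_bytes(const std::string& ascii)
{
    std::u32string wide;
    for (unsigned char c : ascii)
    {
        assert(c < 0x80); // this sketch only handles ASCII input
        wide.push_back(static_cast<char32_t>(c));
    }
    return wide.size() * sizeof(char32_t);
}
```

Called with a string of 1000 ASCII characters, the function reports 4000 bytes, against the 1000 bytes that the same text occupies in an 8-bit encoding.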
If you are a Java or .NET (C#, Visual Basic) developer, or a Windows C++ programmer working in Unicode, you are using UTF-16 when you manipulate text within your code.
Most of the Unicode characters you are likely to use are found in the range 0-0xffff, the Basic Multilingual Plane (BMP). Each of these code points can be represented by a single 16-bit character.
If you need to store a Unicode character outside the BMP, the character is represented as a surrogate pair of 16-bit characters. Some of the above languages provide some support for surrogate pairs.
Many developers will be familiar with the term UTF-8. This is a highly space-efficient means of storing a sequence of any Unicode characters. Put simply, it uses as few bytes as possible. Instead of using a fixed 4 bytes per character as in UTF-32, or 2 or 4 bytes as in UTF-16, it uses 1, 2, 3 or 4 bytes per character.
Without giving a complete technical explanation:-
The disadvantage is the relative complexity in handling the text internally given the different lengths of character encodings.
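For readers who would like to see the variable-length scheme in action, the following C++ sketch encodes a single code point as UTF-8. The function name encode_utf8 is hypothetical and is not taken from any library; the sketch assumes a valid code point is passed in and relies on assert for checking, so it is not hardened for production use.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper: encode one Unicode code point as 1 to 4 UTF-8 bytes.
std::string encode_utf8(std::uint32_t cp)
{
    assert(cp <= 0x10FFFF);             // beyond the Unicode range
    assert(cp < 0xD800 || cp > 0xDFFF); // surrogates are not characters
    std::string out;
    if (cp < 0x80)
    {
        // ASCII: a single byte, identical to the code point
        out += static_cast<char>(cp);
    }
    else if (cp < 0x800)
    {
        // two bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else if (cp < 0x10000)
    {
        // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    else
    {
        // four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

An ASCII letter comes back as one byte, whereas a character in the range U+0080 to U+07FF comes back as two bytes.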
5.5.2 Web Pages
If, for example, a web page is encoded using UTF-8, the web server responds to HTTP requests with a UTF-8 encoded text string. The number of bytes to be transmitted is reduced in comparison to UTF-16 or UCS-2 encoding. The web browser knows how to decode UTF-8, and each character is re-assembled in the web browser as the correct Unicode code point.
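The re-assembly step performed by the receiver can be sketched in C++ as follows. The function name decode_utf8 is hypothetical, and the sketch assumes well-formed UTF-8 input: malformed bytes are caught only by assert, whereas a real browser has to recover from them gracefully.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: turn a well-formed UTF-8 byte string back into
// a list of Unicode code points.
std::vector<std::uint32_t> decode_utf8(const std::string& bytes)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size())
    {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        std::uint32_t cp = 0;
        std::size_t extra = 0; // number of continuation bytes expected
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else
        {
            assert((lead & 0xF8) == 0xF0); // must be a 4-byte lead
            cp = lead & 0x07;
            extra = 3;
        }
        assert(i + extra < bytes.size()); // sequence must not be truncated
        for (std::size_t k = 1; k <= extra; ++k)
        {
            unsigned char cont = static_cast<unsigned char>(bytes[i + k]);
            assert((cont & 0xC0) == 0x80); // continuation bytes are 10xxxxxx
            cp = (cp << 6) | (cont & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

Each lead byte announces how many continuation bytes follow, which is what allows the receiver to recover the original code points from the byte stream.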
5.5.3 An Encoding Example
Returning to our earlier example of the £ sterling character, code point U+00A3 (or 163 decimal), the UTF-8 encoding for this character is the 2 byte sequence:-
0xC2 0xA3
It is not uncommon to see the £ character incorrectly displayed on a web page. The reason is usually either:-
<p> Μ ε φ</p>
…which, if the web page is correctly set to a Unicode character set (such as UTF-8), and a suitable font is present in the client system, will produce some Greek characters.
Consider the following code snippet to create a String which will be written into a web page whose charset is UTF-8:-
String test = "Velocit" + (char)0xE0;
We have created a String using ASCII characters but appended the Unicode code point U+00E0 which is the character à. It may have been simpler to have defined the entire word using non-ASCII characters in the source code, but these can be difficult to type.
The code below shows how to view our String as UTF-8 bytes:-
System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding();
byte[] br = encoding.GetBytes(test);
int i = 0;
while (i < br.Length)
{
    // print each byte of the UTF-8 encoding as two hex digits
    Console.WriteLine("Byte " + i + "=" + br[i].ToString("x2"));
    i++;
}
The String Velocità is encoded as the ASCII values for all but the final character which, as can be seen, is in the range U+0080 to U+07FF and is thus encoded as two bytes (0xC3 0xA0) under UTF-8.
Whilst the developer does not normally need to be aware of the mechanism of UTF-8, it can become useful when inserting text from a database, for example; some conversion may need to be executed.
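One conversion that commonly arises is from ISO-8859-1 bytes to UTF-8. The C++ sketch below assumes, purely for illustration, that the stored text is ISO-8859-1; the function name latin1_to_utf8 is hypothetical. It relies on the fact that each ISO-8859-1 byte value is identical to its Unicode code point, so bytes below 0x80 pass through unchanged and all others become two UTF-8 bytes.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: convert ISO-8859-1 text to UTF-8.
std::string latin1_to_utf8(const std::string& latin1)
{
    std::string utf8;
    for (unsigned char byte : latin1)
    {
        if (byte < 0x80)
        {
            // ASCII range: same single byte in both encodings
            utf8 += static_cast<char>(byte);
        }
        else
        {
            // 0x80..0xFF: code point U+0080..U+00FF, two bytes in UTF-8
            utf8 += static_cast<char>(0xC0 | (byte >> 6));
            utf8 += static_cast<char>(0x80 | (byte & 0x3F));
        }
    }
    assert(utf8.size() >= latin1.size()); // UTF-8 output is never shorter
    return utf8;
}
```

Given the ISO-8859-1 byte 0xE0 (à), the function emits 0xC3 0xA0, matching the bytes printed by the C# example above.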
6.2.2 Counting Characters
Strings are stored as UTF-16 characters in .NET. This fact is seldom apparent to the developer; the .NET String methods are powerful and allow simple text manipulation.
Consider the following simple code snippet to create a String and count the number of characters:-
String test = "Clef";
Console.WriteLine("Number of characters=" + test.Length);
The number of characters is correctly reported as 4.
If we change the String to include a non-BMP character, in this case the treble clef (U+1D11E), we will see a problem. The new character is encoded as two UTF-16 characters, but the String class does not “know” that the two surrogate UTF-16 characters are actually one; thus the following code returns not 5 characters but 6:-
String test = "clef" + Char.ConvertFromUtf32(0x1d11e);
Console.WriteLine("Number of characters=" + test.Length);
(Note also the means by which the UTF-32 character 0x1D11E is added to the String.)
To allow the correct counting of characters when surrogate pairs are in use, the following code illustrates the necessary technique in a simple function:-
// Return the number of characters in a String, even if surrogate pairs
// are present.
int CountChars(String text)
{
    int test_len = text.Length;
    int i = 0;
    int num_chars = 0;
    while (i < test_len)
    {
        // IsSurrogatePair is true only for a high surrogate that is
        // immediately followed by a low surrogate.
        if (Char.IsSurrogatePair(text, i))
        {
            i++; // skip a character so that the surrogate pair
                 // is counted once.
        }
        num_chars++;
        i++;
    }
    return num_chars;
}
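The same counting technique carries over to other UTF-16 environments. As a sketch only, the C++ function below applies it to a std::u16string; the name count_code_points is hypothetical, and an unpaired surrogate is simply counted as one character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: count Unicode characters (code points) in a
// UTF-16 string, treating each surrogate pair as a single character.
std::size_t count_code_points(const std::u16string& text)
{
    std::size_t count = 0;
    for (std::size_t i = 0; i < text.size(); ++i)
    {
        const bool high = text[i] >= 0xD800 && text[i] <= 0xDBFF;
        const bool next_low = i + 1 < text.size()
                           && text[i + 1] >= 0xDC00 && text[i + 1] <= 0xDFFF;
        if (high && next_low)
        {
            ++i; // skip the low surrogate so the pair is counted once
        }
        ++count;
    }
    assert(count <= text.size()); // never more characters than code units
    return count;
}
```

For the string "clef" followed by U+1D11E, the string holds 6 code units but the function reports 5 characters.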
The above situation is unlikely to occur often, as most Unicode characters in common use lie in the Basic Multilingual Plane (U+0000 to U+FFFF).
Many C++ programmers begin text implementation as char arrays of ISO-8859-1 characters. There are a variety of means to implement Unicode in C++.
6.3.2 Use Wide Characters Explicitly
It is a relatively simple switch from char to the “wide” character wchar_t (16 bits on Windows). Strings are defined using the L prefix, and the ASCII text functions str??? are replaced by their wide character equivalents wcs???. The following code example shows the definition and use of a Unicode string in C++. Note that wcslen returns the number of characters, not the length of the string in bytes.
const wchar_t test[] = L"Velocità";
printf("Num chars = %zu\r\n", wcslen(test)); // wcslen returns size_t, hence %zu
The advanced code example below shows how to convert any Unicode code point (even outside the BMP) into a single UTF-16 character, or a surrogate pair.
// Convert a single Unicode code point to UTF-16.
// Note that if src > 0xffff it returns two 16 bit wide chars
// (a surrogate pair).
//
// If the char is in the BMP, c1 will always be 0, so only c0
// should be used.
void ToUTF16(wchar_t &c0, wchar_t &c1, const unsigned long src)
{
    if (src > 0xffff)
    {
        // 0x10000 is subtracted from the code point, leaving a 20 bit number
        // in the range 0..0xFFFFF.
        // The top ten bits (a number in the range 0..0x3FF) are added to
        // 0xD800 to give the first code unit or high surrogate, which will
        // be in the range 0xD800..0xDBFF.
        // The low ten bits (also in the range 0..0x3FF) are added to 0xDC00
        // to give the second code unit or low surrogate, which will be in
        // the range 0xDC00..0xDFFF.
        unsigned long v1 = src - 0x10000;
        unsigned long v2 = v1 & 0x3ff; //lower 10 bits
        v2 += 0xdc00;
        //get the top 10 bits
        v1 >>= 10;
        v1 &= 0x3ff;
        v1 += 0xd800;
        //add the surrogate pair
        c0 = (wchar_t)v1;
        c1 = (wchar_t)v2;
    }
    else
    {
        //BMP char (basic multilingual plane)
        c0 = (wchar_t)src;
        c1 = 0;
    }
}
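The reverse operation, recombining a surrogate pair into its code point, follows from the same arithmetic. The function below is a sketch with a hypothetical name, from_surrogate_pair; it assumes the caller has already established that the two values form a valid pair, and checks this only with assert.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: combine a UTF-16 high and low surrogate into the
// Unicode code point they represent.
std::uint32_t from_surrogate_pair(std::uint16_t high, std::uint16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF); // must be a high surrogate
    assert(low >= 0xDC00 && low <= 0xDFFF);   // must be a low surrogate
    // Recover the two 10 bit halves, rejoin them, then add back 0x10000.
    const std::uint32_t top = static_cast<std::uint32_t>(high) - 0xD800u;
    const std::uint32_t bottom = static_cast<std::uint32_t>(low) - 0xDC00u;
    return 0x10000u + (top << 10) + bottom;
}
```

For the treble clef, whose surrogate pair is 0xD834 0xDD1E, the function returns 0x1D11E.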
6.3.3 Use the Standard Library
Many C++ programmers prefer to use the standard library (in preference to MFC, for example). The 8 bit string class std::string has a wide character equivalent std::wstring.
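As a small illustration of moving from one class to the other, the sketch below widens an ISO-8859-1 std::string into a std::wstring. The function name widen_latin1 is hypothetical, and the approach is only valid because every ISO-8859-1 byte value equals its Unicode code point; text in any other 8-bit encoding would need a proper conversion table.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: widen ISO-8859-1 text, one byte per character,
// into a wide string with one wchar_t per character.
std::wstring widen_latin1(const std::string& narrow)
{
    std::wstring wide;
    wide.reserve(narrow.size());
    for (char c : narrow)
    {
        // go via unsigned char so that bytes above 0x7F are not sign-extended
        wide.push_back(static_cast<wchar_t>(static_cast<unsigned char>(c)));
    }
    assert(wide.size() == narrow.size()); // one wide char per input byte
    return wide;
}
```

The resulting std::wstring offers the same member functions as std::string (size, substr, find and so on), so most string-handling code needs little change beyond the type name and the L prefix on literals.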
6.3.4 Define _UNICODE
Windows C++ programmers (not CLR) can opt to define their project as Unicode. The use of pre-processor directives can make the switch to Unicode within a project very simple, and can even allow for a reversion to 8-bit characters.
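The pre-processor pattern can be sketched portably as follows. On Windows, the header tchar.h provides the real mechanism (TCHAR, _T() and functions such as _tcslen, all switched by _UNICODE); the names SKETCH_UNICODE, sketch_char, SKETCH_TEXT, sketch_strlen and greeting_length below are hypothetical stand-ins, chosen so that the example compiles on any platform.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical switch standing in for _UNICODE; remove this line to
// revert the whole program to 8-bit characters.
#define SKETCH_UNICODE

#ifdef SKETCH_UNICODE
typedef wchar_t sketch_char;               // stands in for TCHAR
#define SKETCH_TEXT(s) L##s                // stands in for _T()
inline std::size_t sketch_strlen(const sketch_char* s) { return std::wcslen(s); }
#else
typedef char sketch_char;
#define SKETCH_TEXT(s) s
inline std::size_t sketch_strlen(const sketch_char* s) { return std::strlen(s); }
#endif

// Application code is written once, in terms of the neutral names.
std::size_t greeting_length()
{
    const sketch_char* greeting = SKETCH_TEXT("Hello");
    assert(greeting != nullptr);
    return sketch_strlen(greeting);
}
```

Because the application code refers only to the neutral names, a single definition decides whether the whole project is built with wide or 8-bit characters.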
In Visual Studio (or Visual C++ before that) when a project is defined as using Unicode characters (the default):-