Spell Checking or Search Engine Suggestions using BK-Trees


A search facility on a website or in an application should be able to function when minor spelling errors are present in the text entered by the human user. The facility may either automatically execute the search, using the corrected words, or may return one
or more suggestions to the end user.

The Problem

Whilst a database index may store keys that enable millions of records to be swiftly retrieved by means of a perfect match between the entered text words and the contents of that database, providing means to retrieve data where minor spelling errors are present
cannot be satisfied by a regular database index. For example, the word Leicester should be returned if a user enters:-

  1. liecester
  2. leicestre
  3. lecester

Each of the above entries contains a minor error, and entry (3) has a missing letter.

It will be noted from example (2) above, that one cannot assume that the first 3 or 4 letters are correct, and provide a wild card facility after the third or fourth character; this approach will limit the power of a fuzzy matching or spelling correction engine.

String Distance Calculations

Inherent in the solution presented below, is the ability to compute the difference between text strings as a numerical value, or distance.

I currently use the Levenshtein distance formula to calculate the distance between strings. The Levenshtein distance returns the number of changes required to transform string a to string b; for example the distance between ‘bristok’ and ‘bristol’ is 1 (change ‘k’ to ‘l’) and the distance between ‘bristle’ and ‘bristol’ is 2 (remove ‘e’ and insert ‘o’ before ‘l’).


The BK-tree was invented by Walter Austin Burkhard and Robert M. Keller. It is a metric tree specifically adapted to discrete metric spaces.

A loose description which is easier to understand is that the BK-tree is a means to greatly sub-divide a large ‘bag’ of objects into smaller collections in such a way that a search operation can easily find the correct collection and from there quickly find the desired matches.

Unlike a binary tree or a multi-branch B*Tree, a BK-tree is suitable to allow fuzzy string matching.

Creating a BK-tree

The creation of a BK-tree is quite simple. The root node is an arbitrary entry from our ‘bag’ of words. The actual strings are added as lowercase, and all searching would normally be case-insensitive.

Our words in this example, in no particular order, are:-

  • Leeds
  • York
  • Bristol
  • Leicester
  • Hull
  • Durham

The root node is set to ‘Leeds’:-

BK-tree example small tree, the root node is added, text='Leeds'

The nodes can have multiple branches, each branch being a string distance value between the node’s text, and the text of the child node; this will be seen below.

To add the first child, ‘York’, calculate the string distance between it can the root node’s text, ‘Leeds’. The Levenshtein distance is 5. Create a child node with the value ‘York’, linked from the root node with a distance value of 5.

BK-tree example small tree, the second node is added, text='York'; this is a child of the root node 'Leeds'.

The next text to be added is ‘Bristol’. Repeating the procedure above by calculating the distance between ‘Bristol’ and ‘Leeds’, the value is 7.Next, check that there are no existing branches of value 7. There are none, so repeat the procedure as performed above, creating a new node and linking to the root node:-

BK-tree example, the third node is added, text='Bristol'. This is a child of the root node.

The fourth text to be added is ‘Leicester’. The procedure is identical to the second and third values, as above. The Levenshtein distance between ‘Leicester’ and the root node ‘Leeds’ is 6, so a new node is linked as shown below:-

BK-tree example, the fourth node is added, text ='Leicester. This is a child of the root node.

The next text value is ‘Hull’. Calculate the distance between ‘Hull’ and ‘Leeds’ ; it is 5. It will be noted that there is already a branch from the root node, of distance value 5. In this event, ‘move’ down that branch and repeat the insert procedure but compare the new text value to the current node, i.e. the one to which you have just moved, in this case,’York’. The Distance between ‘Hull’ and ‘York’ is 4, and as there are no branches from ‘York’ with distance=4, now add a new node as shown below:-

BK-tree example, the fifth node is added, text='Hull'. This is a child of the 'York' node.

Note that if the ‘York’ branch already had a child to it with distance value 4, you would have moved down that branch to the child node, and repeated the above procedure until a node without a branch equal to the Levenshtein distance, is found.

Finally, add ‘Durham’ to the BK-tree, as illustrated below:-

Bk-tree example, the final node is added, text='Durham'. This is a child of the 'Leicester' node.

‘Durham’ had the Levenshtein distance from ‘Leeds’ of 6, so it was necessary to move down the branch to ‘Leicester’ and add it to that node.

The small BK-tree is now complete. Here’s how to search it.

Searching a BK-tree

The procedure for searching a BK-tree, to obtain a list of approximate matches to a word, is quite simple:-

  • Set the ‘current node’ to the root node.
  • Invoke a ‘search node’ function on the current node, passing the search word and maximum permitted Levenshtein distance from that word to the matches (max).
  • The ‘search node’ function does the following:-
    1. Calculate dist = distance from the current node’s text to the search word.
    2. If dist <= max, add current node’s text to the final text list.
    3. Recursively invoke the search node function on any child nodes of the current node where the distance of the branch is in the inclusive range dist – max to dist + max.
  • When the final invocation of the ‘search node’ function has returned, return the list of fuzzy matches found (if any).

A Worked Example

These are the detailed steps to search the above tree for the input text ‘Hill’ with a maximum allowed string distance of 1, for fuzzy matches to ‘Hill’.

The worked example is illustrated below. Note that the ‘Bristol’ branch from the root, which is not used during this search, is greyed out.

BK-tree worked example searching for fuzzy matches for 'Hill'

  1. Set the current node to the root node ‘Leeds’.
  2. Calculate the Levenshtein distance from the input text ‘Hill’ to the root node ‘Leeds’; the distance is 5.
  3. If the distance is less than or equal to maximum allowed distance, add the current node’s text to the final text list; it is not.
  4. Follow any branches from the current node, whose branch distance links are in the inclusive range distance – max to distance + max, in this case 4,5 and 6. There is no 4 branch, so go down the 5 and 6 branches.
  5. It will be noted that the ‘Bristol’ branch is not used. Typically, at each node, one or more child branches are not followed.
  6. Set the current node to ‘York’.
  7. Repeat as at stage 2; calculate the distance between the search text ‘Hill’ and the current node ‘York’. The distance is 4. If the distance is less than or equal to max, add the current node’s text to the final list. Follow branches from the ‘York’ node in the range distance – max to distance + max, that is 3,4 and 5.
  8. Follow any branches from ‘York’ with distances 3,4 and 5; there is only one, at distance 4. Set the current node to the ‘Hull’ node and execute the ‘search node’ function.
  9. Repeat as at stage 2; calculate the distance between the search text ‘Hill’ and the current node’s text value ‘Hull’. The distance is 1, which is within the maximum allowed range, so add the current node’s text value ‘Hull’ to the final list.
  10. There are no branches from the ‘Hull’ node, so exit the ‘search node’ function.
  11. This will return the current node to the previous node, ‘York’. Follow any more branches in the range 3,4,5 (as at stage 8). There are none, so exit the ‘search node’ function.
  12. This will return the current node to the previous node, the ‘Leeds’ node. Follow any further branches in the range 4,5 and 6 as stated at stage 4. The next branch to be followed is distance 6. Set the current node to ‘Leicester’, and invoke the ‘search node’ function.
  13. Hopefully the next stage will be clear. Test the current node’s text against the search test, by calculating the Levenshtein distance. If the distance is 1 or less, add the current node’s text to the final list, and follow any child branches within the range as described above.
  14. When the final invocation of ‘search node’ has completed, exit with the list of matches, in this case just a single match, ‘Hull’.

Google Android Operatng System Logo - a medium green robotAndroid Versions for Android Virgins


Google’s Android Operating System for mobile devices has now evolved through four major versions. This article briefly describes the major features of the principle releases of Android 1-4.

From Android 1.5 onwards, each release has been named.

Android 1.0

Android was born in September 2008 with the release of version 1.0. At that time, Android releases were un-named.

The first Android device was the HTC Dream.

Android 1.5 Cupcake

Android 1.5 Cupcake Logo

Released in April 2009, Cupcake introduced some familiar features to the Android world:-

  • Uploading of videos and photos to YouTube and Picassa respectively
  • MPEG-4 (mp4) recording and playback
  • Widget support
  • Auto screen rotation
  • Animated screen transitions

Android 1.6 Donut

Android 1.6 Cupcake Logo

Version 1.6 was released a year after the first Android release, and included:-

  • Support for 800×480 screen resolution
  • Ability for developers to include their content in search results
  • Android Marketplace inprovements, including app screenshots
  • Performance improvements to searching and camera applications

Android 2.0-2.1 Eclair

Android 2 Eclair Logo

Android 2 was introduced in October 2009 with the release of “Eclair”.

Android 2.2 Froyo

Android 2.2 Froyo Logo

In May 2010, 2.2 was released, codenamed “Froyo”. Features included:-

  • Speed, memory, and performance optimizations
  • Adobe Flash support
  • Wi-fi hotspots support
  • Impoved Microsoft Exchange support
  • Numeric and alphanumeric passwords

Android 2.3x Gingerbread

Android 2.3x Gingerbread Logo

This is currently (May 2012) probably the most common pre-installed OS on Android phones and some tablets.

Introduced in December 2010, this version’s features include:-

  • Updated and improved UI
  • Support for larger screen resolutions (1280×768 and above)
  • Improvements to garbage collection and power management
  • Multiple camera support
  • Native VOIP support

Some of the latest devices which are pre-installed with Gingerbread can now be updated to Ice Cream Sandwich (such as the Samsung Galaxy S2).

Android 3.x – Honeycomb

Android 3.x Honeycomb Logo

Android 3 is exclusively for tablet devices, though some tablets continue to be pre-installed with Android 2.xx releases even after Honeycomb’s introduction in February 2011.

Android 4.0.x – Ice Cream Sandwich

Android 4.x Ice Cream Sandwich Logo

The latest release of the Android Operating System, Ice Cream Sandwich is now pre-installed on the newest Android phones, such as the Galaxy S3.

Android 4 was released in October 2011 and includes:-

  • Enhanced speed and performance
  • Virtual buttons in the UI, in place of capacitive or physical buttons
  • Ability to access apps directly from lock screen
  • New tabbed web browser
  • 1080p video recording
  • UI hardware acceleration

PHP Shortcuts Episode 2: The ternary operator

The ternary operator in PHP is basically a shorthand way of writing conditional statements, for which we would usually use the familiar if and else etc.

For example, in a program that decides what I’m going to do in my lunch hour:

[sourcecode language=”php”]
if($is_friday) {
$activity = ‘go to the pub’;
} else {
$activity = ‘stay at work’;

We could write this with less code (less code is better, right?)

[sourcecode language=”php”]
$activity = $is_friday ? ‘go to the pub’ : ‘stay at work’;

So, what happens here is if $is_friday equates to TRUE, the value after the question mark (‘go to the pub’) is assigned to the variable $activity. If $is_friday equates to FALSE then the value after the semi-colon (‘stay at work’) is assigned to $activity instead.

The other neat thing about the ternary operator is we can use it as part of an echo statement to decide what we’re doing and output it to the page in one easy line of code like so:

It's my lunch hour and today I'm going to <?php echo $is_friday ? 'go to the pub' : 'stay at work'; ?>.

Hooray for ternary operators, unfortunately it’s Thursday today so I’ll be staying at work!

PHP Shortcuts Episode 1: Simple form/query string values

Hello and welcome to my new mini blog series on PHP shortcuts. I’ll be posting some handy code snippets that I’ve used over the years that will save you both time and possibly also from Repetitive Strain Injury!

First off – A quick function to get a value from both the $_GET or $_POST arrays without potentially throwing errors by forgetting to check if your particular array key exists before trying to access it.

The old, long-winded way:

[sourcecode language=”php”]

if(isset($_GET[‘first_name’])) {
$first_name = $_GET[‘first_name’];
} else if(isset($_POST[‘first_name’])) {
$first_name = $_POST[‘first_name’];
} else {
$first_name = NULL;

if(isset($_GET[‘last_name’])) {
$last_name = $_GET[‘last_name’];
} else if(isset($_POST[‘last_name’])) {
$last_name = $_POST[‘last_name’];
} else {
$last_name = NULL;


The new, super-quick and easy way:

[sourcecode language=”php”]

function get_post($p) {
if(isset($_GET[$p])) {
return $_GET[$p];
} else if(isset($_POST[$p])) {
return $_POST[$p];
} else {
return NULL;

$first_name = get_post(‘first_name’);
$last_name = get_post(‘last_name’);


Job done! See you next time.

A Beginner’s Guide to Understanding Unicode for the Application Programmer and Web Developer.

1 Introduction

To developers who speak and work in English, French, Spanish and other Western European languages, Unicode character encoding is often either ignored, considered late or even retro-fitted to projects, whilst the development team may have only a rudimentary understanding of Unicode and character encoding systems such as UTF-8.

The concepts are quite straightforward and can easily be grasped and whilst a detailed knowledge of, for example, the encoding system used by UTF-8 is not necessary, a general understanding of Unicode and how it is implemented in today’s applications and web servers is useful.

2 ISO-8859-1 and ASCII

2.1 Background

Before looking at Unicode let us examine a simple 8 bit byte-per character encoding scheme.

ISO-8859-1 is an 8-bit single byte character encoding scheme for Western European languages.

The first 128 characters of ISO-8859-1 is the original ASCII character-set (the numbers from 0-9, the uppercase and lowercase English alphabet, and some special characters).

The higher part of ISO-8859-1 (codes from 160-255) contains the characters used in Western European countries and some commonly used special characters.

ISO-8859-1 thus provides character encoding for English and many Western European languages. Character encoding is simple (a character is a single byte) and space efficient, but the scope is limited to 255 possible characters.

2.2 Example Use

In web server usage, if a web page will not contain any non ISO-8859-1 characters, the web server character set may be set to ISO-8859-1;  a generic method is shown below:-

[sourcecode language=”html”]
<meta http-equiv="content-type"
content="text/html; charset=iso-8859-1" />

A JSP page fragment is shown below:-

[sourcecode language=”java”]
<%@ page contentType="text/html;charset= iso-8859-1" %>

2.3 Example Character

An example ISO-8859-1 character is the pound sterling £character, which is code 163 decimal or 0xA3 hexadecimal.

3 Bytes and Characters Confusion

A byte is (almost always) an 8-bit unit of information.  Historically, however, ASCII text strings were represented as a single byte per character, and the word “character” was used as a synonym for “byte”.

Furthermore, the ANSI C byte storage unit is char and since this was used to store a single ASCII character, the terms byte, char and character were often used interchangeably in documentation.

Today’s developer should consider bytes and characters as distinct and different units at all times.

4 How Unicode is Organised

4.1 Introduction

Unicode extends the range of code points far beyond the 8 bit limits of ISO-8859-1 and byte storage. Unicode characters have been assigned for virtually all spoken and written languages, plus mathematical and other symbols, and historic languages.

4.2 Code Points

A code point is simply the numerical value that defines a given character within the Unicode definition.

4.3 ASCII in Unicode

The ASCII characters, with which English speaking developers are so familiar, are of course a subset of the Unicode code point range.

The code points 0-U+7F (or 0-127 decimal) are ASCII characters; an ASCII code is thus identical to its Unicode code point value.

4.4 Unicode Planes

A Unicode plane is a collection of 216 (65536 or 0x10000 hex) code points.

The first plane, that is the code points 0 – U+FFFF, is called the Basic Multilingual Plane (BMP). This comprises the most Unicode characters that have been assigned so far. The BMP contains characters for almost all modern languages, and a large number of special characters.

It will be noted that the BMP can be represented in 16 bits; in other words, the Unicode characters that are used in most instances are able to be represented in 16 bits.

4.5 Example Unicode Characters

  • A (U+41)
  • £ (U+A3)
  • χ (U+3c7)
  •  Treble Clef (U+1d11e). Note that this lies outside of the Basic Multilingual Plane (BMP).

4.6 Unicode Character Pages (Partial List)

These are some of the principal character page locations within Unicode. This is not intended to be a comprehensive list.

4.6.1. Basic Multilingual Plane (BMP) 0 – U+FFFF
Most of the Unicode characters assigned lie in the BMP. These are just a few of the blocks of code points:-

  • C0 Controls and Basic Latin (0000-007F)
  • C1 Controls and Latin-1 Supplement (0080-00FF)
  • Latin Extended-A (0100-017F)
  • Latin Extended-B (0180-024F)
  • Greek and Coptic (0370-03FF)
  • Cyrillic (0400-04FF)
  • Armenian (0530-058F)
  • Hebrew (0590-05FF)
  • Arabic (0600-06FF)
  • Syriac (0700-074F)
  • Bengali (0980-09FF)
  • Gujarati (0A80-0AFF)
  • Tamil (0B80-0BFF)
  • Telugu (0C00-0C7F)
  • Kannada (0C80-0CFF)
  • Malayalam (0D00-0D7F)
  • Thai (0E00-0E7F)
  • Lao (0E80-0EFF)
  • Tibetan (0F00-0FFF)
  • Myanmar (1000-109F)
  • Georgian (10A0-10FF)
  • Hangul Jamo (1100-11FF)
  • Ethiopic (1200-137F)
  • Cherokee (13A0-13FF)
  • Unified Canadian Aboriginal Syllabics (1400-167F)
  • Ogham (1680-169F)
  • Runic (16A0-16FF)
  • Tagalog (1700-171F)
  • Khmer (1780-17FF)
  • Mongolian (1800-18AF)
  • Phonetic Extensions (1D00-1D7F)
  • Latin Extended Additional (1E00-1EFF)
  • Greek Extended (1F00-1FFF)
  • General Punctuation (2000-206F)
  • Superscripts and Subscripts (2070-209F)
  • Currency Symbols (20A0-20CF)]
  • Letterlike Symbols (2100-214F)
  • Number Forms (2150-218F)
  • Arrows (2190-21FF)
  • Mathematical Operators (2200-22FF)
  • Miscellaneous Technical (2300-23FF)
  • Control Pictures (2400-243F)
  • Optical Character Recognition (2440-245F)
  • Enclosed Alphanumerics (2460-24FF)
  • Box Drawing (2500-257F)
  • Block Elements (2580-259F)
  • Geometric Shapes (25A0-25FF)
  • Miscellaneous Symbols (2600-26FF)
  • Dingbats (2700-27BF)
  • Miscellaneous Mathematical Symbols-A (27C0-27EF)
  • Supplemental Arrows-A (27F0-2a7FF)
  • Braille Patterns (2800-28FF)
  • Supplemental Arrows-B (2900-297F)
  • Katakana (30A0-30FF)
  • Bopomofo (3100-312F)
  • Hangul Compatibility Jamo (3130-318F)
  • Kanbun (3190-319F)
  • Bopomofo Extended (31A0-31BF)
  • Katakana Phonetic Extensions (31F0-31FF)
  • Enclosed CJK Letters and Months (3200-32FF)
  • CJK Compatibility (3300-33FF)
  • CJK Unified Ideographs Extension A (3400-4DBF)
  • Yijing Hexagram Symbols (4DC0-4DFF)
  • CJK Unified Ideographs (4E00-9FFF)
  • Yi Syllables (A000-A48F)
  • Yi Radicals (A490-A4CF)
  • Hangul Syllables (AC00-D7AF)
  • Surrogates and Private Use Area (D800-F8FF)
  • CJK Compatibility Ideographs (F900-FAFF)
  • Alphabetic Presentation Forms (FB00-FB4F)
  • Arabic Presentation Forms-A (FB50-FDFF)
  • Variation Selectors (FE00-FE0F)
  • Combining Half Marks (FE20-FE2F)
  • CJK Compatibility Forms (FE30-FE4F)
  • Small Form Variants (FE50-FE6F)
  • Arabic Presentation Forms-B (FE70-FEFF)
  • Halfwidth and Fullwidth Forms (FF00-FFEF)

4.6.2 Supplementary Multilingual Plane (SMP) U+1000 – U+1ffff
Mostly used for historic scripts and musical and mathematical symbols.

Some of the principal blocks are:-

  • Linear B Syllabary (10000-1007F)
  • Linear B Ideograms (10080-100FF)
  • Aegean Numbers (10100-1013F)
  • Old Italic (10300-1032F)
  • Gothic (10330-1034F)
  • Ugaritic (10380-1039F)
  • Deseret (10400-1044F)
  • Shavian (10450-1047F)
  • Osmanya (10480-104AF)
  • Cypriot Syllabary (10800-1083F)
  • Byzantine Musical Symbols (1D000-1D0FF)
  • Musical Symbols (1D100-1D1FF)
  • Tai Xuan Jing Symbols (1D300-1D35F)
  • Mathematical Alphanumeric Symbols (1D400-1D7FF)

4.6.3 Higher Planes
There are further planes in Unicode, providing code points (many not yet assigned) up to U+10FFFF.

5 Unicode Storage

5.1 Introduction

As we have seen, ASCII characters were traditionally stored in a byte per character; the system was very efficient in terms of space, very easy to program (easy to count characters for example) but unable to represent higher value code points.

This section explains the various storage techniques available to developers today.

5.2 UTF-32 – “The Simplest Method”

UTF-32 is simply an extension of the above byte-per-character ASCII storage method. In UTF-32, each character is stored as a 32 bit value.

If storage space and transmission times were of no importance in applications and on the Internet, every text string could be stored and transmitted in UTF-32.

A system using UTF-32 is very easy to program and maintain, but very inefficient in terms of storage. A Web page of 1000 ASCII characters, for example, would require a 4000 byte download, rather than the 1000 bytes we would expect (these figures ignore potential web server compression such as gzip).

5.3 UTF-16

5.3.1 Overview
UTF-16 is widely used in today’s environments, and offers the developer a virtually seamless means to use Unicode text strings. It offers simplicity and relatively efficient space encoding, although if you are working only characters in the code point range 0 – U+ff, the text will require twice the storage space of a simple byte-per-character system.

If you are a Java, or .NET (C#, Visual Basic) developer or a Windows C++ programmer working in Unicode you are using UTF-16 when you manipulate text within your code.

Most of the Unicode characters you will be likely to use are found in the range 0-0xffff, the Basic Multilingual Plane (BMP). These code points can be represented by a single 16 bit character.

In the event that you need to store a Unicode character outside of the BMP, the character is represented as a surrogate pair of 16 bit characters. Some of the above languages have some support for surrogate pairs in the event that you would ever use a Unicode character outside of the BMP.

5.4 USC-2

You may hear the terms UCS-2 and UTF-16 used interchangeably. UCS-2 is a 16 bit fixed width character encoding system, identical to UTF-16 except that it does not support surrogate pairs; UCS-2 cannot handle Unicode characters outside of the BMP (above U+FFFF).

5.5 UTF-8

5.5.1 UTF-8 Encoding

Many developers will be familiar with the term UTF-8.. This is a highly space efficient means to store a sequence of any Unicode characters. Put simply, it uses as few bytes as possible. Instead of using a fixed 2 or 4 bytes per character as in UTF-16 and UTF-32, it attempts to use 1, 2, 3 or 4 bytes.

Without giving a complete technical explanation:-

  • ASCII characters 0-127 (decimal) are encoded as a single byte. Thus, English text looks identical in UTF-8; this is especially useful when using a debugger.
  • Code points U+80 to U+7FF are stored as 2 bytes.
  • Code points U+800 to U+FFFF are stored as 3 bytes.
  • Code points U+10000 U+10FFFF are stored as 4 bytes.

Thus, all Unicode characters can be stored very efficiently, with the most common characters being stored in 1 or 2 bytes.

The disadvantage is the relative complexity in handling the text internally given the different lengths of character encodings.

5.5.2 Web Pages

If, for example, a web page is encoded using UTF-8, the web server responds to HTTP requests with a UTF-8 encoded text string. The number of bytes to be transmitted is reduced in comparison to UTF-16 or UCS-2 encoding. The web browser knows how to decode UTF-8, and each character is re-assembled in the web browser as the correct Unicode code point.

5.5.3 An Encoding Example

Returning to our earlier example of the £ sterling character, code point U+A3  (or 163 decimal), the UTF-8 encoding for this character is the 2 byte sequence:-

0xC2 0xA3

It is not uncommon to see the £ character incorrectly displayed on a web page. The reason is usually either:-

  • The web page contains UTF-8 data but the character set is described as ISO-8859-1, so two ISO-8859 characters are displayed, in error.
  • The character set is described as UTF-8 but the character has not been UTF-8 encoded; the browser attempts to decode the text starting at our single byte as a UTF-8 sequence.

6 Code Examples

6.1 HTML

Unicode characters are easily inserted into a web page, such as:-

[sourcecode language=”html”]

<p> &#x039C; &#x03B5; &#x03C6;</p>

…which, if the web page is correctly set to a Unicode character set (such as UTF-8), and a suitable font is present in the client system, will produce some Greek characters.

6.2 .NET C#

6.2.1 Looking at UTF-8 Encoding

Consider the following code snippet to create a String which will be written into a web page whose charset is UTF-8:-

[sourcecode language=”csharp”]

String test = "Velocit" + (char)0xE0;


We have created a String using ASCII characters but appended the Unicode code point U+E0 which is the character à. It may have been simpler to have defined the entire word using non-ASCII characters in the source code, but these can be difficult to type.

The code below shows how to view our String as UTF-8 bytes:-

[sourcecode language=”csharp”]

System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding();

byte[] br = encoding.GetBytes(test);

int i = 0;

while (i < br.Length)


Console.WriteLine("Byte " + i + "=" + br[i].ToString("x2"));




The String Velocità is encoded as the ASCII values for all but the final character which as can be seen is in the range U+80 – U+7FF and is thus encoded as two bytes (0xC3 0xA0) under UTF-8.

Whilst the developer does not normally need to be aware of the mechanism of UTF-8, in can become useful when inserting text from a database, for example; some conversion may need to be executed.

6.2.2 Counting Characters

Strings are stored as UTF-16 characters in .NET. This fact is seldom apparent to the developer; the .NET String methods are powerful and allow simple text manipulation.

Consider the following simple code snippet to create a String and count the number of characters:-

[sourcecode language=”csharp”]

String test = "Clef”;

Console.WriteLine("Number of characters=" + test.Length);


The number of characters is correctly reported as 4.

If we change the String to include a non-BMP character, in this case the treble clef  (U+1d11e)  we will see a problem. The new character is encoded as two UTF-16 characters, but the String class does not “know” that the two surrogate UTF-16 characters are actually one; thus the following  code returns not 5 characters but 6:-

[sourcecode language=”csharp”]

String test = "clef" + Char.ConvertFromUtf32(0x1d11e);

Console.WriteLine("Number of characters=" + test.Length);


(Note also the means by which the UTF-32 character 0x1D11E is added to the String.)

To allow the correct counting of characters when surrogate pairs are in use, the following code illustrates the necessary technique in a simple function:-

[sourcecode language=”csharp”]

//return number of characters in a String even if surrogate pairs are //present
int CountChars(String text)
int test_len = text.Length;
int i = 0;

int num_chars = 0;

while (i < test_len)

if (Char.IsSurrogate(text, i) == true)
i++;            //skip a character so that the surrogate pair
//is counted once.

return (num_chars);


It is probably unlikely that the above situation will occur as most Unicode characters that are likely to be used lie in the Basic Multilingual Plane (0- U+FFFF).

6.3 C++

6.3.1 Unicode in C++

Many C++ programmers begin text implementation as char arrays of ISO-8859-1 characters. There are a variety of means to implement Unicode in C++.

6.3.2 Use Wide Characters Explicitly

It is a relatively simple switch from char to the 16-bit “wide” character  wchar_t.Strings are defined using the L prefix, and the ASCII text functions str??? are replaced in usage by their wide character equivalents wcs???. The following code example shows  the definition and use of a Unicode string in C++. Note that wcslen returns the number of characters, not the length of the string in bytes.

[sourcecode language=”cpp”]

const wchar_t test[] = L"Velocità";

printf("Num chars = %d\r\n", wcslen(test));


The advanced code example below shows how to convert any Unicode code point (even outside the BMP) into a single UTF-16 character, or a surrogate pair.

[sourcecode language=”cpp”]

// convert a single unicode char to UTF-16
// note that if src>0xffff it returns two 16 bit wide chars
// (surrogate pairs)
// if the char is in the BMP, c1 will always be 0, so only c0
// should be used

void ToUTF16(wchar_t &c0,wchar_t &c1,const unsigned long src)

if(src > 0xffff)

// 0x10000 is subtracted from the code point, leaving a 20 bit number
// in the  range 0..0xFFFFF.

// The top ten bits (a number in the range 0..0x3FF) are added to 0xD800 to give
// the first code point or high surrogate, which will be in the range
// 0xD800..0xDBFF.
// The low ten bits (also in the range 0..0x3FF) are added to 0xDC00 to give
// the second code point or low surrogate, which will be in the range
// 0xDC00..0xDFFF.

unsigned long v1 = src – 0x10000;

unsigned long v2 = v1 & 0x3ff;       //lower 10 bits

v2 += 0xdc00;

//get the top 10 bits

v1 >>= 10;
v1 &= 0x3ff;
v1 += 0xd800;

//add the surrogate pair

c0 = (wchar_t)v1;
c1 = (wchar_t)v2;


//BMP char (basic multilingual plane)
c0 = (wchar_t)src;
c1 = 0;


6.3.3 Use the Standard Library

Many C++ programmers prefer to use the standard library (in preference to MFC, for example). The 8 bit string class std::string has a wide character equivalent std::wstring.

6.3.4 Define _UNICODE

Windows C++ programmers (not CLR) can opt to define their project as Unicode. The use of pre-processor directives can make the switch to Unicode within a project very simple, and can even allow for a reversion to 8-bit characters.

In Visual Studio (or Visual C++ before that) when a project is defined as using Unicode characters (the default):-

  • TCHAR is defined as WCHAR  which is itself defined as the wide character wchar_t
  • Much of the Windows API uses TCHAR to reference Strings; in effect, the API uses wide characters.
  • The MFC class CString now becomes a Unicode vessel.
  • Windows  provides the TEXT macro as a means to define text literals in whichever encoding system is currently in use.

7 References

For further information on Unicode please visit The Unicode Consortium at:-


Drop down lists for iPhone: rolling your own

If like me, you’re fairly new to iPhone development, you might be wondering how to implement drop down lists in your app. The fact of the matter is that there is no native control provided by Apple, similar to the <select> tag you’ll know well if you have a background in web development. So, the solution is to roll our own: stick with me for a few minutes – I promise it’s not as tricky as it sounds. Here is how our finished control will look:

We’re going to use a UIButton control with a custom background image and put a label on top of it to show the currently selected option. Go ahead and add a UIButton and a UILabel to your view using the Interface Builder. If you’re not much of a designer, leave the background image for now and just use a regular button – it doesn’t really matter to get the code working as we need it.

Add an IBOutlet for your label to your view controller’s header file and implement the UIPickerViewDelegate and UIPickerViewDataSource delegates. We also need to define an NSMutableArray, a UIActionSheet and an NSInteger – We’ll deal with these and the delegates we are implementing in just a second. Lastly, declare an IBAction showSearchWhereOptions which we need to connect to our button in the Interface Builder. Our finished header file looks like this:

[sourcecode language=”objc”]

@interface MyViewController : UIViewController  <UIPickerViewDelegate, UIPickerViewDataSource> {

// label showing currently selected option

IBOutlet UILabel *searchWhere;

// data for populating picker view

NSMutableArray *searchWhereOptions;

// display picker view in an action sheet

UIActionSheet *actionSheet;

// keep track of selected option’s index

NSInteger selectedOption;


– (IBAction) showSearchWhereOptions;



Next – on to our implementation file. Implement the following methods that are required because of the delegates we have implemented.

[sourcecode language=”objc”]
– (NSInteger)numberOfComponentsInPickerView:(UIPickerView *)pickerView {

return 1;


– (NSInteger)pickerView:(UIPickerView *)pickerView numberOfRowsInComponent:(NSInteger)component {

return [searchWhereOptions count];


– (NSString *)pickerView:(UIPickerView *)pickerView titleForRow:(NSInteger)row forComponent:(NSInteger)component {

return [searchWhereOptions objectAtIndex:row];


// this method runs whenever the user changes the selected list option

– (void)pickerView:(UIPickerView *)pickerView didSelectRow:(NSInteger)row inComponent:(NSInteger)component {

// update label text to show selected option

searchWhere.text = [searchWhereOptions objectAtIndex:row];

// keep track of selected option (for next time we open the picker)

selectedOption = row;


Most of the work happens here.. In our showSearchWhereOptions method, which runs when the user taps our “button” (it is a button really, but it’s sneakily pretending to be a drop down list control).

[sourcecode language=”objc”]

– (IBAction) showSearchWhereOptions {

// create action sheet

actionSheet = [[UIActionSheet alloc] initWithTitle:nil delegate:nil cancelButtonTitle:nil destructiveButtonTitle:nil otherButtonTitles:nil];

[actionSheet setActionSheetStyle:UIActionSheetStyleBlackTranslucent];

// create frame for picker view

CGRect pickerFrame = CGRectMake(0, 40, 0, 0);

// create picker view

UIPickerView *pickerView = [[UIPickerView alloc] initWithFrame:pickerFrame];

pickerView.showsSelectionIndicator = YES;

pickerView.dataSource = self;

pickerView.delegate = self;

// set selected option to what was previously selected

[pickerView selectRow:selectedOption inComponent:0 animated:NO];

// add picker view to action sheet

[actionSheet addSubview:pickerView];

[pickerView release];

// create close button to hide action sheet

UISegmentedControl *closeButton = [[UISegmentedControl alloc] initWithItems:[NSArray arrayWithObject:@"Close"]];

closeButton.momentary = YES;

closeButton.frame = CGRectMake(260, 7.0f, 50.0f, 30.0f);

closeButton.segmentedControlStyle = UISegmentedControlStyleBar;

closeButton.tintColor = [UIColor blackColor];

// link close button to our dismissActionSheet method

[closeButton addTarget:self action:@selector(dismissActionSheet) forControlEvents:UIControlEventValueChanged];

[actionSheet addSubview:closeButton];

[closeButton release];

// show action sheet

[actionSheet showInView:[[UIApplication sharedApplication] keyWindow]];

[actionSheet setBounds:CGRectMake(0, 0, 320, 485)];



We haven’t yet defined our list items – we’ll do this in our viewDidLoad method. We need to make sure we set a default search option too. It’s a zero based index, so setting it to 0 will make “Specified location” our default selection.

[sourcecode language=”objc”]

// define list options

searchWhereOptions = [[NSMutableArray alloc] init];

[searchWhereOptions addObject:@"Specified location"];

[searchWhereOptions addObject:@"Near me"];

[searchWhereOptions addObject:@"Nationally"];

// default list option to select

selectedOption = 0;


One last thing – implement the dismissActionSheet method which is responsible for hiding the action sheet when we click the done button.

[sourcecode language=”objc”]

– (void) dismissActionSheet {

// hide action sheet

[actionSheet dismissWithClickedButtonIndex:0 animated:YES];



Single line addresses and names

Any API functions returning address information now include an addr_single_line field, which is a concatenated string of all the address parts, for your convenience.

Similarly, functions returning person information have a name_single_line field, which is a concatenated string of all the name parts.

Finding people at an address is now even easier

The address_person method (and the AddressPerson class in the Java/JSP library) has been augmented and the parameter “premises”, previously mandatory, is now optional.

If the premises is not entered, or does not indicate a unique address, the method returns a free (no charge) list of premises at which there is one or more person. This allows the final user to select an address from that list and to re-submit the method, using that text as the “premises” value.

This also assists with the difficulty of specifying a house name or number when the method is used with flats – a definitive list of available addresses is returned for free as an optional first stage.