Using an XML SOAP Web Service from Visual Studio 2005, 2008 and 2010

Introduction

Here we show how to use external SOAP web services from within Microsoft Visual Studio versions 2005, 2008 and 2010.

The scope of this article extends only to XML SOAP web services which may have been created in Visual Studio (and are usually identified by the asmx extension at the end of the URL).

In order to demonstrate the use of external web services, we will use T2A’s SOAP service, which was created using Visual Studio 2005.

All code examples here are in C#.

Making Life Easy

If you’ve never made use of an external web service from Visual Studio, you’re about to discover just how easy it is. In comparison, parsing XML yourself can be quite time consuming, even using a powerful XML reader or parser. A SOAP web service allows Visual Studio to discover how its methods operate (their inputs and return formats) and using this information, allows your IDE to create code within your project, to allow your code to seamlessly use the external service, just as if it was a class in your own code.

Using a Web Service from Visual Studio 2005

In order to simply demonstrate the use of T2A’s SOAP service, we created a C# console application. The IDE created an empty project.

We then “told” our project about the web service. We right clicked on the project in the solution explorer, and clicked on Add Web Reference. This can be viewed below, for our project ws_test_2005:-

A new window opened, in which we specified the location of the web service, in this case, https://t2a.co/soap. We clicked the go button:-

The IDE was able to identify the web service. We named the web service t2a. We then clicked on add reference as seen below.

The IDE then created the necessary code to allow our simple project to use the external web service. You can see the information about the added web reference in the right hand pane below:-

We then created a simple console application which uses the T2A soap service, specifically the geo_code method, which as its name suggests, returns geographical co-ordinates for a given postcode, street or place.

The example code shown below uses T2A’s free test mode, during which it returns dummy data at no charge, for the benefit of developers in their initial integration stages.

[sourcecode language=”csharp”]

using System;
using System.Collections.Generic;
using System.Text;

namespace ws_test_2005
{
class Program
{
static void Main(string[] args)
{
string api_key = "test";

t2a.T2A ws = new t2a.T2A();
t2a.geo_code_res ws_res = ws.geo_code(api_key, null,"york");
if (ws_res.status.CompareTo("ok") == 0)
{
//print geo code data
int i = 0;
while (i < ws_res.geo_data_list.Length)
{
Console.WriteLine(ws_res.geo_data_list[i].description +
" at (" +
ws_res.geo_data_list[i].latitude + "," +
ws_res.geo_data_list[i].longitude+")"
);

i++;
}

}

}
}
}

[/sourcecode]

When the console application is executed in the debugger, we can see the class instance that the IDE has created; in this case an array of geo_data instances, this being a member of the result class geo_code_res. Note in the bottom pane the contents of the class instances, in this case, the dummy data returned by T2A in free test mode:-

Detailed Explanation of the Code Example

Normally you would use the API key associated with your T2A account; for this example we are using T2A’s free test mode; this is activated by an API key “test”:-

string api_key = "test";

Remember that we named the external web service t2a? The code created by the IDE is in a namespace t2a. The namespace includes the main class that contains the T2A methods; this is named T2A, or more correctly, t2a.T2A.

We now create an instance of that class.

t2a.T2A ws = new t2a.T2A();

The geo_code method returns an instance of the class t2a.geo_code_res. We now invoke the method; the third parameter is the place, street or postcode for which we want the geo co-ordinates.

t2a.geo_code_res ws_res = ws.geo_code(api_key, null,"york");

If the method has succeeded, the status is ok. If not, we normally would then read the error code.

We then display the contents of the t2a.geo_data instances held in the array t2a.geo_code_res.geo_data_list.

if (ws_res.status.CompareTo("ok") == 0)
{
   //print geo code data

   int i = 0;
   while (i < ws_res.geo_data_list.Length)
   {
      Console.WriteLine(ws_res.geo_data_list[i].description +
      " at (" +
      ws_res.geo_data_list[i].latitude + "," +
      ws_res.geo_data_list[i].longitude+")"
      );

Using a Web Service from Visual Studio 2008

We created a C# console application; it actually uses the same code as shown above. When adding the web reference, the first thing one notices is that whereas formerly beneath Add Reference there was an item Add Web Reference, this seems to be missing from Visual Studio 2008. In its place is Add Service Reference.

We clicked that instead (see below).

A new window opened, named Add Service Reference. In order to add a “legacy” web reference, we clicked the advanced button at the bottom.

Yet another new window opened, named Service Reference Settings. We hit the Add Web Reference button at the bottom.

From that point, adding the web reference is the same as above, for Visual Studio 2005. We created the reference by informing the dialog box that we wished to use https://t2a.co/soap as we had with Visual Studio 2005.

Using a Web Service from Visual Studio 2010

The procedure for adding the web reference is the virtually the same for the 2010 version as for 2008; the image below shows the default Solution Explorer pane for our console application.

Line Segment Intersection for Partial or Lapsed Mathematicians

Introduction

A common problem in 2D geometry is the determination of whether two line segments intersect, and following a positive answer, optionally deriving the intersection point. The most common solution is the use of cross product to derive the intersection point (if any), and whilst to a mathematician (or student thereof) this is a relatively simple solution, research for this article quickly revealed at least two published incorrect code implementations, flawed by their author’s failure to understand that maths.

The purpose of this article is therefore to present a less elegant but much simpler solution, easier to illustrate and comprehend.

The article is therefore written with the assumption of an average knowledge of school mathematics.

Lines and Line Segments

In common parlance a 2D line is the shortest distance between two points, that description actually applies to line segments. A line is of infinite length; a line segment is a portion of the line, between two given points. Unless the two lines are parallel they will by definition intersect somewhere, given their inifinte lengths; the question of intersection of line segments is therefore the concern of this article.

The image below shows two line segments AB and CD

A Simpler Solution

To solve the line segment intersection problem we will use the fact the the intersection calculation is very simple if one of the lines is vertical or horizontal.

We will translate and rotate the “world” so that this is the case; we imagine ourselves at point A, looking along the line towards B, and considering the “world” relative to our own position and orentation.

This solution, described below in stages, also allows for some trivial (i.e. fast) checks that can determine that the line segments do not cross, before the full calculations are executed.

Stage 1 – Initial Checks

Considering again two line segments AB and CD. If the lines are identical (i.e. A and C, and B and D are at the same position (or A and D and B and C) the line segments should be designated as crossing or not crossing, subject to the ultimate objective of the test. No further checks are needed.

If both line segments are actually points (i.e. A=B and C=D), execute a simple test to determine if the the points coincide. No further checks are needed.

Finally if one of the line segment ends is the same as one of the other ends (but only one, not both), the line segments would normally be designated as not crossing. This is shown below, where B and C are coincidental.

Stage 2 – Trivial Rejection

Consider the line segments below. We can clearly see that they do not cross.

If we check the bounding boxes of each line segment, we can quickly in this case deduce that the line segments cannot cross, and no further checks are needed. The follow pseudo-code is used:-

if (box1.xmax < box2.xmin OR box1.ymax < box2.ymin OR box1.xmin > box2.xmax OR box1.ymin > box2.ymax) then reject

In the case shown below, line1’s box.xmax is less than line2’s box xmin, so they can never cross.

Stage 3 – Translate and Rotate Relative to A

Considering again the line segments AB and CD. Clearly they cross;
we will now show how to determine whether they do, and optionally calculate the point at which they do.

In order to determine whether they cross, we will rotate and shift the “world” of AB and CD so that AB is vertical, with A being at the centre (0,0). We will view the “world” from the perspective of point A, looking at point B.

First, translate A to (0,0) and move B,C and D by the same amount:-
Subtract A’s X and Y coordinates from points B,C and D:-

Now we will rotate the “world” around 0,0 so that AB lies at X=0 (on the Y axis). We will rotate both AB and CD around (0,0) by the angle between AB and the Y axis (X=0); we will designate this angle a.

Here’s where the school maths comes in. The Pythagoras formula shows that for a right angled triangle, the square of the longest edge = the sum of the square of the other two edges. Considering the line segment AB shown below:-

If we complete the right angled triangle using point E which is (A.x,B.y), the angle a is that between AB and AE. Using Pythagoras, the length or magnitude of line segment AB is derived thus:-

sqrt( (B.x-A.x)^2 + (B.y-A.y>^2) )

In order to rotate points by a given angle, we need the sine and cosine of that angle. For angle a these are:-

sin a = length(EB) / length(AB) or sin = B.x / length(AB)
cos a = length(EA) / length(AB) or cos = B.y / length(AB)

Remember that we calculated length(AB) using Pythagoras, above.

To rotate points around (0,0) by angle a, use these formulae, which you may have encountered in school maths:-

x' = x cos a - y sin a
y' = y cos a + x sin a

When the above formula is applied to AB and CD we see the result below.

(A shrewd observer will note that A (0,0) does not need to be rotated, and the new position B’ of B is (0,length(AB)).)

Remember that this is the same “world”, merely viewed from the perspetive of A, looking towards B. We will calculate the intersection point (if any) relative to point A – the fact of intersection is not changed by the co-ordinates being relative to A.

Allowing Tolerance

Even when using high precision floating point code, slight errors must be allowed for. Furthermore, although lines and line segments are normally regarded as having zero with, your application may consider them to be “thick”. For these reasons we need to add a small tolerance to some checks applied after the rotation has been applied.

We will define a small value which will be used for the tolerance. For example:-

const double small_value = 0.00001;

Stage 4 – Further Trivial Checks

The new position of C’D’ relative to AB’ allows us to easily ascertain if C’D’ crosses A’B’. We can see from the above diagram that the following pseudo-code checks can be applied:-

if C'.y < 0 AND D'.y < 0 they do not cross
if C'.y > B'.y AND D'.y > B'.y they do not cross
if C'.x < 0 AND D'.x < 0 they do not cross
if C'.x > 0 AND D'.x > 0 they do not cross

Stage 5 – Check for Co-linear

If line segment C’D’ is now on the Y axis (abs(C’.x) < small_value) AND abs(D’.x) < small_value) then the line segments lie on the same infinite line. The only questions that now concerns is whether they share the same space on that line. The stage 4 trivial checks have determined that A’B’ and C’D’ are not above or below each other, we know that if C’D’ lies at or very close to X=0 it must cross A’B’.

Stage 6 – Check for Parallel

If line segment C’D’ is now vertical ((abs(C’.x – D’.x) < small_value) we know that they cannot cross, and we also know that it is not co-linear, having already performed the stage 5 check.

Stage 7 – Do They Cross?

Having completed the above trivial checks, if the question is still not settled, we must calculate where C’D’ crosses the Y axis (X=0).

Here is some more maths, but really simple, similar triangles. Look at C’D’ crossing the Y axis below at point F:-

The smaller shaded triangle is “similar” to the larger triangle formed by C’D’ and (D’.x,C’.y). To compute the Y position of point F:-

F.y = C'.y - (((D'.y - C'.y) * C'.x) / (D'.x - C'.x) )

Note that we already know that C’D’ is not vertical, so D’.x-C’x cannot be zero – there is no need to apply a division by zero check before executing the above calculation.

We now have the position of F. We simply need to check it against line A’B’.

Thus, if F.y < 0 or F.y> B’.y, the line segments do not cross.

Stage 8 – Calculate the Intersection Point (optional)

If we need the intersection position, we will need to reverse the rotation and translation of stage 3, upon point F.

First, rotate point F the other way, using the rotation through the angle -a, using again these formulae:-

x' = x cos a - y sin a
y' = y cos a + x sin a

…but since F.x=0, sin(-a) = -sin(a), and cos(-a) = cos(a), the operation becomes much simpler:-

F'.x = F.y * sin a
F'.y = F.y * cos a

Then, translate the rotated F’ by the position of A:-

F'.x = F'x + A.x
F'.y = F'y + A.y

… to derive the actual “world” position of the intersection point F, rather than relative to AB.

Disadvantages

Clearly this method is not as elegant as the cross product solution. The single square root required is relatively slow.

Advantages

This method much is easier to comprehend and illustrate. The trivial checks at different stages can save un-needed calculations.

Improvements

If a complex system, where many line segments must be checked against each other, a simple improvment to performance could be derived by caching the line segment lengths as they are calculated.

Adding Figures

  1. Using the line segments AB and CD in the worked example:-AB is (-5,-2) to (1,2)
    CD is (-8,-2) to (2,1)
  2. The boundary boxes cross, so we proceed to translate so that A is at (0,0).
  3. Subtract A from B,C,D i.e. subtract (-5,-2)
  4. AB is now (0,0) to (6,4)
    CD is now (-3,0) to (7,3)
  5. Calculate length(AB) using Pythagoras.
    Length(AB) = 7.2111025
  6. Calculate sin(a) = B.x / length(AB) = 0.83205029433784372
  7. Calculate cos(a) = B.y / length(AB) = 0.55470019622522915
  8. Rotate AB and CD around (0,0)
    A’B’ is (0,0) to (0,7.2111025)
    C’D’ is (-1.66410059,-2.49615) to (1.38675,7.4884)
  9. C’D’ is not co-linear or vertical, and crosses X=0, so calculate where it crosses.
  10. C’D’ crosses X=0 at point F.
  11. F.Y = 2.9499964
    (F.X = 0, for clarity)
  12. Check if F.Y is between 0 and B’.Y or 7.2111025. It is, so the line segments do cross.
  13. Rotating F the opposite way, and translating by -A, we calculate the intersection point(-2.5454545,-0.3636363740)

Code Example

This C# example uses two classes Point2D and LineSeg2D

[sourcecode language=”csharp”]

public struct ConstantValue
{
internal const double SmallValue = 0.00001;
}

public class Point2D
{
private float X;
private float Y;

public Point2D(double src_x, double src_y)
{
this.X = (float)src_x;
this.Y = (float)src_y;
}

public Point2D(float src_x, float src_y)
{
this.X = src_x;
this.Y = src_y;
}

public float X
{
set
{
this.X = value;
}
get
{
return this.X;
}
}

public float Y
{
set
{
this.Y = value;
}
get
{
return this.Y;
}
}
}

public class LineSeg2D
{
private Point2D m_Start;
private Point2D m_End;

public LineSeg2D(Point2D start, Point2D end)
{
this.m_Start = startPoint;
this.m_End = endPoint;
}

// returns true if intersection, and returns position
// if a buffer is supplied (intersect_point)
//
public bool CheckIntersect(Point2D intersect_point, LineSeg2D other_line)
{

//check if points are same

if (
this.m_End.X == other_line.m_End.X &&
this.m_End.Y == other_line.m_End.Y &&
this.m_Start.X == other_line.m_Start.X &&
this.m_Start.Y == other_line.m_Start.Y
)
{
//same other_line – mark as crossing
if (intersect_point != null)
{
intersect_point.X = this.m_End.X;
intersect_point.Y = this.m_End.Y;
}

return (true);
}

if (
this.m_End.X == other_line.m_Start.X &&
this.m_End.Y == other_line.m_Start.Y &&
this.m_Start.X == other_line.m_End.X &&
this.m_Start.Y == other_line.m_End.Y
)
{
//same other_line – mark as crossing
if (intersect_point != null)
{
intersect_point.X = this.m_End.X;
intersect_point.Y = this.m_End.Y;
}
return true;
}

//allow end points to coincide
if (
this.m_End.X == other_line.m_Start.X &&
this.m_End.Y == other_line.m_Start.Y
)
{
return false;
}

if (
other_line.m_End.X == this.m_Start.X &&
other_line.m_End.Y == this.m_Start.Y
)
{
return false;
}

//check line1 is not a point

if (
this.m_End.X == this.m_Start.X &&
this.m_End.Y == this.m_Start.Y
)
{
//line1 is a point
return false;
}

//check line2 is not a point
if (
this.m_End.X == this.m_Start.X &&
this.m_End.Y == this.m_Start.Y
)
{
//line1 is a point
return false;
}

//get the magnitude of line1
double line1_dx = this.m_Start.X – this.m_End.X;
double line1_dy = this.m_Start.Y – this.m_End.Y;
double line1_mag = Math.Sqrt((line1_dx * line1_dx)
+ (line1_dy * line1_dy));

//shift line2 relative to line1 START POINT

double line2_sx = other_line.m_End.X – this.m_End.X;
double line2_sy = other_line.m_End.Y – this.m_End.Y;

double line2_fx = other_line.m_Start.X – this.m_End.X;
double line2_fy = other_line.m_Start.Y – this.m_End.Y;

double sin = line1_dx / line1_mag;
double cos = line1_dy / line1_mag;

//rotate

//x’ = x cos f – y sin f
//y’ = y cos f + x sin f

double line2a_sx = (line2_sx * cos) – (line2_sy * sin);
double line2a_sy = (line2_sy * cos) + (line2_sx * sin);

double line2a_fx = (line2_fx * cos) – (line2_fy * sin);
double line2a_fy = (line2_fy * cos) + (line2_fx * sin);

//check if co-linear
if (Math.Abs(line2a_sx) < ConstantValue.SmallValue &&
Math.Abs(line2a_fx) < ConstantValue.SmallValue ) { //line2 lies on the Y axis if (line2a_sy > line2a_fy)
{
double t = line2a_sy;
line2a_sy = line2a_fy;
line2a_fy = t;
}

if (line2a_sy > line1_mag)
{
//line2 is ABOVE line1
return false;
}
if (line2a_fy < 0)
{
//line2 is BELOW line1
return false;
}

//lines are co-linear and DO interesect

if (intersect_point != null)
{
intersect_point.X = this.m_End.X;
intersect_point.Y = this.m_End.Y;
}
return true;

}

//ensure line2 crosses Y axis
if (line2a_sx < 0 && line2a_fx < 0) { //both ends are BELOW x=0 return false; } if (line2a_fx > 0 && line2a_sx > 0)
{
//both ends ABOVE y=0
return false;
}

//check if line2a is all
if (line2a_sy < 0 && line2a_fy < 0) { return false; } //check if line2a is all >mag1
if (line2a_sy >line1_mag && line2a_fy > line1_mag)
{
return false;
}

//check if line2a is parallel
if (Math.Abs(line2a_sx – line2a_fx) < ConstantValue.SmallValue)
{
return false;
}

//now calculate where line2a crosses Y axis

//using "similar trianges"
double ry = ((line2a_fy – line2a_sy) * line2a_sx) /
(line2a_fx – line2a_sx);

ry = line2a_sy-ry; //ry is where it crosses x=0;

if (ry < 0 || ry > line1_mag)
{
//crosses Y axis OUTSIDE of line1
return false;
}

if (intersect_point != null)
{
// calculate intersection point by roating (0,ry)
// by the reverse of the original
// rotation, and adding to line1 start point
//
intersect_point.X = (float)((ry * sin) + this.m_End.X);
intersect_point.Y = (float)((ry * cos) + this.m_End.Y);
}
return true;
}

}

[/sourcecode]

Spell Checking or Search Engine Suggestions using BK-Trees

Introduction

A search facility on a website or in an application should be able to function when minor spelling errors are present in the text entered by the human user. The facility may either automatically execute the search, using the corrected words, or may return one
or more suggestions to the end user.

The Problem

Whilst a database index may store keys that enable millions of records to be swiftly retrieved by means of a perfect match between the entered text words and the contents of that database, providing means to retrieve data where minor spelling errors are present
cannot be satisfied by a regular database index. For example, the word Leicester should be returned if a user enters:-

  1. liecester
  2. leicestre
  3. lecester

Each of the above entries contains a minor error, and entry (3) has a missing letter.

It will be noted from example (2) above, that one cannot assume that the first 3 or 4 letters are correct, and provide a wild card facility after the third or fourth character; this approach will limit the power of a fuzzy matching or spelling correction engine.

String Distance Calculations

Inherent in the solution presented below, is the ability to compute the difference between text strings as a numerical value, or distance.

I currently use the Levenshtein distance formula to calculate the distance between strings. The Levenshtein distance returns the number of changes required to transform string a to string b; for example the distance between ‘bristok’ and ‘bristol’ is 1 (change ‘k’ to ‘l’) and the distance between ‘bristle’ and ‘bristol’ is 2 (remove ‘e’ and insert ‘o’ before ‘l’).

BK-trees

The BK-tree was invented by Walter Austin Burkhard and Robert M. Keller. It is a metric tree specifically adapted to discrete metric spaces.

A loose description which is easier to understand is that the BK-tree is a means to greatly sub-divide a large ‘bag’ of objects into smaller collections in such a way that a search operation can easily find the correct collection and from there quickly find the desired matches.

Unlike a binary tree or a multi-branch B*Tree, a BK-tree is suitable to allow fuzzy string matching.

Creating a BK-tree

The creation of a BK-tree is quite simple. The root node is an arbitrary entry from our ‘bag’ of words. The actual strings are added as lowercase, and all searching would normally be case-insensitive.

Our words in this example, in no particular order, are:-

  • Leeds
  • York
  • Bristol
  • Leicester
  • Hull
  • Durham

The root node is set to ‘Leeds’:-

BK-tree example small tree, the root node is added, text='Leeds'

The nodes can have multiple branches, each branch being a string distance value between the node’s text, and the text of the child node; this will be seen below.

To add the first child, ‘York’, calculate the string distance between it can the root node’s text, ‘Leeds’. The Levenshtein distance is 5. Create a child node with the value ‘York’, linked from the root node with a distance value of 5.

BK-tree example small tree, the second node is added, text='York'; this is a child of the root node 'Leeds'.

The next text to be added is ‘Bristol’. Repeating the procedure above by calculating the distance between ‘Bristol’ and ‘Leeds’, the value is 7.Next, check that there are no existing branches of value 7. There are none, so repeat the procedure as performed above, creating a new node and linking to the root node:-

BK-tree example, the third node is added, text='Bristol'. This is a child of the root node.

The fourth text to be added is ‘Leicester’. The procedure is identical to the second and third values, as above. The Levenshtein distance between ‘Leicester’ and the root node ‘Leeds’ is 6, so a new node is linked as shown below:-

BK-tree example, the fourth node is added, text ='Leicester. This is a child of the root node.

The next text value is ‘Hull’. Calculate the distance between ‘Hull’ and ‘Leeds’ ; it is 5. It will be noted that there is already a branch from the root node, of distance value 5. In this event, ‘move’ down that branch and repeat the insert procedure but compare the new text value to the current node, i.e. the one to which you have just moved, in this case,’York’. The Distance between ‘Hull’ and ‘York’ is 4, and as there are no branches from ‘York’ with distance=4, now add a new node as shown below:-

BK-tree example, the fifth node is added, text='Hull'. This is a child of the 'York' node.

Note that if the ‘York’ branch already had a child to it with distance value 4, you would have moved down that branch to the child node, and repeated the above procedure until a node without a branch equal to the Levenshtein distance, is found.

Finally, add ‘Durham’ to the BK-tree, as illustrated below:-

Bk-tree example, the final node is added, text='Durham'. This is a child of the 'Leicester' node.

‘Durham’ had the Levenshtein distance from ‘Leeds’ of 6, so it was necessary to move down the branch to ‘Leicester’ and add it to that node.

The small BK-tree is now complete. Here’s how to search it.

Searching a BK-tree

The procedure for searching a BK-tree, to obtain a list of approximate matches to a word, is quite simple:-

  • Set the ‘current node’ to the root node.
  • Invoke a ‘search node’ function on the current node, passing the search word and maximum permitted Levenshtein distance from that word to the matches (max).
  • The ‘search node’ function does the following:-
    1. Calculate dist = distance from the current node’s text to the search word.
    2. If dist <= max, add current node’s text to the final text list.
    3. Recursively invoke the search node function on any child nodes of the current node where the distance of the branch is in the inclusive range dist – max to dist + max.
  • When the final invocation of the ‘search node’ function has returned, return the list of fuzzy matches found (if any).

A Worked Example

These are the detailed steps to search the above tree for the input text ‘Hill’ with a maximum allowed string distance of 1, for fuzzy matches to ‘Hill’.

The worked example is illustrated below. Note that the ‘Bristol’ branch from the root, which is not used during this search, is greyed out.

BK-tree worked example searching for fuzzy matches for 'Hill'

  1. Set the current node to the root node ‘Leeds’.
  2. Calculate the Levenshtein distance from the input text ‘Hill’ to the root node ‘Leeds’; the distance is 5.
  3. If the distance is less than or equal to maximum allowed distance, add the current node’s text to the final text list; it is not.
  4. Follow any branches from the current node, whose branch distance links are in the inclusive range distance – max to distance + max, in this case 4,5 and 6. There is no 4 branch, so go down the 5 and 6 branches.
  5. It will be noted that the ‘Bristol’ branch is not used. Typically, at each node, one or more child branches are not followed.
  6. Set the current node to ‘York’.
  7. Repeat as at stage 2; calculate the distance between the search text ‘Hill’ and the current node ‘York’. The distance is 4. If the distance is less than or equal to max, add the current node’s text to the final list. Follow branches from the ‘York’ node in the range distance – max to distance + max, that is 3,4 and 5.
  8. Follow any branches from ‘York’ with distances 3,4 and 5; there is only one, at distance 4. Set the current node to the ‘Hull’ node and execute the ‘search node’ function.
  9. Repeat as at stage 2; calculate the distance between the search text ‘Hill’ and the current node’s text value ‘Hull’. The distance is 1, which is within the maximum allowed range, so add the current node’s text value ‘Hull’ to the final list.
  10. There are no branches from the ‘Hull’ node, so exit the ‘search node’ function.
  11. This will return the current node to the previous node, ‘York’. Follow any more branches in the range 3,4,5 (as at stage 8). There are none, so exit the ‘search node’ function.
  12. This will return the current node to the previous node, the ‘Leeds’ node. Follow any further branches in the range 4,5 and 6 as stated at stage 4. The next branch to be followed is distance 6. Set the current node to ‘Leicester’, and invoke the ‘search node’ function.
  13. Hopefully the next stage will be clear. Test the current node’s text against the search test, by calculating the Levenshtein distance. If the distance is 1 or less, add the current node’s text to the final list, and follow any child branches within the range as described above.
  14. When the final invocation of ‘search node’ has completed, exit with the list of matches, in this case just a single match, ‘Hull’.

Google Android Operatng System Logo - a medium green robotAndroid Versions for Android Virgins

Introduction

Google’s Android Operating System for mobile devices has now evolved through four major versions. This article briefly describes the major features of the principle releases of Android 1-4.

From Android 1.5 onwards, each release has been named.

Android 1.0

Android was born in September 2008 with the release of version 1.0. At that time, Android releases were un-named.

The first Android device was the HTC Dream.

Android 1.5 Cupcake

Android 1.5 Cupcake Logo

Released in April 2009, Cupcake introduced some familiar features to the Android world:-

  • Uploading of videos and photos to YouTube and Picassa respectively
  • MPEG-4 (mp4) recording and playback
  • Widget support
  • Auto screen rotation
  • Animated screen transitions

Android 1.6 Donut

Android 1.6 Cupcake Logo

Version 1.6 was released a year after the first Android release, and included:-

  • Support for 800×480 screen resolution
  • Ability for developers to include their content in search results
  • Android Marketplace inprovements, including app screenshots
  • Performance improvements to searching and camera applications

Android 2.0-2.1 Eclair

Android 2 Eclair Logo

Android 2 was introduced in October 2009 with the release of “Eclair”.

Android 2.2 Froyo

Android 2.2 Froyo Logo

In May 2010, 2.2 was released, codenamed “Froyo”. Features included:-

  • Speed, memory, and performance optimizations
  • Adobe Flash support
  • Wi-fi hotspots support
  • Impoved Microsoft Exchange support
  • Numeric and alphanumeric passwords

Android 2.3x Gingerbread

Android 2.3x Gingerbread Logo

This is currently (May 2012) probably the most common pre-installed OS on Android phones and some tablets.

Introduced in December 2010, this version’s features include:-

  • Updated and improved UI
  • Support for larger screen resolutions (1280×768 and above)
  • Improvements to garbage collection and power management
  • Multiple camera support
  • Native VOIP support

Some of the latest devices which are pre-installed with Gingerbread can now be updated to Ice Cream Sandwich (such as the Samsung Galaxy S2).

Android 3.x – Honeycomb

Android 3.x Honeycomb Logo

Android 3 is exclusively for tablet devices, though some tablets continue to be pre-installed with Android 2.xx releases even after Honeycomb’s introduction in February 2011.

Android 4.0.x – Ice Cream Sandwich

Android 4.x Ice Cream Sandwich Logo

The latest release of the Android Operating System, Ice Cream Sandwich is now pre-installed on the newest Android phones, such as the Galaxy S3.

Android 4 was released in October 2011 and includes:-

  • Enhanced speed and performance
  • Virtual buttons in the UI, in place of capacitive or physical buttons
  • Ability to access apps directly from lock screen
  • New tabbed web browser
  • 1080p video recording
  • UI hardware acceleration

A Beginner’s Guide to Understanding Unicode for the Application Programmer and Web Developer.

1 Introduction

To developers who speak and work in English, French, Spanish and other Western European languages, Unicode character encoding is often either ignored, considered late or even retro-fitted to projects, whilst the development team may have only a rudimentary understanding of Unicode and character encoding systems such as UTF-8.

The concepts are quite straightforward and can easily be grasped and whilst a detailed knowledge of, for example, the encoding system used by UTF-8 is not necessary, a general understanding of Unicode and how it is implemented in today’s applications and web servers is useful.

2 ISO-8859-1 and ASCII

2.1 Background

Before looking at Unicode let us examine a simple 8 bit byte-per character encoding scheme.

ISO-8859-1 is an 8-bit single byte character encoding scheme for Western European languages.

The first 128 characters of ISO-8859-1 is the original ASCII character-set (the numbers from 0-9, the uppercase and lowercase English alphabet, and some special characters).

The higher part of ISO-8859-1 (codes from 160-255) contains the characters used in Western European countries and some commonly used special characters.

ISO-8859-1 thus provides character encoding for English and many Western European languages. Character encoding is simple (a character is a single byte) and space efficient, but the scope is limited to 255 possible characters.

2.2 Example Use

In web server usage, if a web page will not contain any non ISO-8859-1 characters, the web server character set may be set to ISO-8859-1;  a generic method is shown below:-

[sourcecode language=”html”]
<meta http-equiv="content-type"
content="text/html; charset=iso-8859-1" />
[/sourcecode]

A JSP page fragment is shown below:-

[sourcecode language=”java”]
<%@ page contentType="text/html;charset= iso-8859-1" %>
[/sourcecode]

2.3 Example Character

An example ISO-8859-1 character is the pound sterling £character, which is code 163 decimal or 0xA3 hexadecimal.

3 Bytes and Characters Confusion

A byte is (almost always) an 8-bit unit of information.  Historically, however, ASCII text strings were represented as a single byte per character, and the word “character” was used as a synonym for “byte”.

Furthermore, the ANSI C byte storage unit is char and since this was used to store a single ASCII character, the terms byte, char and character were often used interchangeably in documentation.

Today’s developer should consider bytes and characters as distinct and different units at all times.

4 How Unicode is Organised

4.1 Introduction

Unicode extends the range of code points far beyond the 8 bit limits of ISO-8859-1 and byte storage. Unicode characters have been assigned for virtually all spoken and written languages, plus mathematical and other symbols, and historic languages.

4.2 Code Points

A code point is simply the numerical value that defines a given character within the Unicode definition.

4.3 ASCII in Unicode

The ASCII characters, with which English speaking developers are so familiar, are of course a subset of the Unicode code point range.

The code points 0-U+7F (or 0-127 decimal) are ASCII characters; an ASCII code is thus identical to its Unicode code point value.

4.4 Unicode Planes

A Unicode plane is a collection of 216 (65536 or 0x10000 hex) code points.

The first plane, that is the code points 0 – U+FFFF, is called the Basic Multilingual Plane (BMP). This comprises the most Unicode characters that have been assigned so far. The BMP contains characters for almost all modern languages, and a large number of special characters.

It will be noted that the BMP can be represented in 16 bits; in other words, the Unicode characters that are used in most instances are able to be represented in 16 bits.

4.5 Example Unicode Characters

  • A (U+41)
  • £ (U+A3)
  • χ (U+3c7)
  •  Treble Clef (U+1d11e). Note that this lies outside of the Basic Multilingual Plane (BMP).

4.6 Unicode Character Pages (Partial List)

These are some of the principal character page locations within Unicode. This is not intended to be a comprehensive list.

4.6.1. Basic Multilingual Plane (BMP) 0 – U+FFFF
Most of the Unicode characters assigned lie in the BMP. These are just a few of the blocks of code points:-

  • C0 Controls and Basic Latin (0000-007F)
  • C1 Controls and Latin-1 Supplement (0080-00FF)
  • Latin Extended-A (0100-017F)
  • Latin Extended-B (0180-024F)
  • Greek and Coptic (0370-03FF)
  • Cyrillic (0400-04FF)
  • Armenian (0530-058F)
  • Hebrew (0590-05FF)
  • Arabic (0600-06FF)
  • Syriac (0700-074F)
  • Bengali (0980-09FF)
  • Gujarati (0A80-0AFF)
  • Tamil (0B80-0BFF)
  • Telugu (0C00-0C7F)
  • Kannada (0C80-0CFF)
  • Malayalam (0D00-0D7F)
  • Thai (0E00-0E7F)
  • Lao (0E80-0EFF)
  • Tibetan (0F00-0FFF)
  • Myanmar (1000-109F)
  • Georgian (10A0-10FF)
  • Hangul Jamo (1100-11FF)
  • Ethiopic (1200-137F)
  • Cherokee (13A0-13FF)
  • Unified Canadian Aboriginal Syllabics (1400-167F)
  • Ogham (1680-169F)
  • Runic (16A0-16FF)
  • Tagalog (1700-171F)
  • Khmer (1780-17FF)
  • Mongolian (1800-18AF)
  • Phonetic Extensions (1D00-1D7F)
  • Latin Extended Additional (1E00-1EFF)
  • Greek Extended (1F00-1FFF)
  • General Punctuation (2000-206F)
  • Superscripts and Subscripts (2070-209F)
  • Currency Symbols (20A0-20CF)]
  • Letterlike Symbols (2100-214F)
  • Number Forms (2150-218F)
  • Arrows (2190-21FF)
  • Mathematical Operators (2200-22FF)
  • Miscellaneous Technical (2300-23FF)
  • Control Pictures (2400-243F)
  • Optical Character Recognition (2440-245F)
  • Enclosed Alphanumerics (2460-24FF)
  • Box Drawing (2500-257F)
  • Block Elements (2580-259F)
  • Geometric Shapes (25A0-25FF)
  • Miscellaneous Symbols (2600-26FF)
  • Dingbats (2700-27BF)
  • Miscellaneous Mathematical Symbols-A (27C0-27EF)
  • Supplemental Arrows-A (27F0-2a7FF)
  • Braille Patterns (2800-28FF)
  • Supplemental Arrows-B (2900-297F)
  • Katakana (30A0-30FF)
  • Bopomofo (3100-312F)
  • Hangul Compatibility Jamo (3130-318F)
  • Kanbun (3190-319F)
  • Bopomofo Extended (31A0-31BF)
  • Katakana Phonetic Extensions (31F0-31FF)
  • Enclosed CJK Letters and Months (3200-32FF)
  • CJK Compatibility (3300-33FF)
  • CJK Unified Ideographs Extension A (3400-4DBF)
  • Yijing Hexagram Symbols (4DC0-4DFF)
  • CJK Unified Ideographs (4E00-9FFF)
  • Yi Syllables (A000-A48F)
  • Yi Radicals (A490-A4CF)
  • Hangul Syllables (AC00-D7AF)
  • Surrogates and Private Use Area (D800-F8FF)
  • CJK Compatibility Ideographs (F900-FAFF)
  • Alphabetic Presentation Forms (FB00-FB4F)
  • Arabic Presentation Forms-A (FB50-FDFF)
  • Variation Selectors (FE00-FE0F)
  • Combining Half Marks (FE20-FE2F)
  • CJK Compatibility Forms (FE30-FE4F)
  • Small Form Variants (FE50-FE6F)
  • Arabic Presentation Forms-B (FE70-FEFF)
  • Halfwidth and Fullwidth Forms (FF00-FFEF)

4.6.2 Supplementary Multilingual Plane (SMP) U+1000 – U+1ffff
Mostly used for historic scripts and musical and mathematical symbols.

Some of the principal blocks are:-

  • Linear B Syllabary (10000-1007F)
  • Linear B Ideograms (10080-100FF)
  • Aegean Numbers (10100-1013F)
  • Old Italic (10300-1032F)
  • Gothic (10330-1034F)
  • Ugaritic (10380-1039F)
  • Deseret (10400-1044F)
  • Shavian (10450-1047F)
  • Osmanya (10480-104AF)
  • Cypriot Syllabary (10800-1083F)
  • Byzantine Musical Symbols (1D000-1D0FF)
  • Musical Symbols (1D100-1D1FF)
  • Tai Xuan Jing Symbols (1D300-1D35F)
  • Mathematical Alphanumeric Symbols (1D400-1D7FF)

4.6.3 Higher Planes
There are further planes in Unicode, providing code points (many not yet assigned) up to U+10FFFF.

5 Unicode Storage

5.1 Introduction

As we have seen, ASCII characters were traditionally stored in a byte per character; the system was very efficient in terms of space, very easy to program (easy to count characters for example) but unable to represent higher value code points.

This section explains the various storage techniques available to developers today.

5.2 UTF-32 – “The Simplest Method”

UTF-32 is simply an extension of the above byte-per-character ASCII storage method. In UTF-32, each character is stored as a 32 bit value.

If storage space and transmission times were of no importance in applications and on the Internet, every text string could be stored and transmitted in UTF-32.

A system using UTF-32 is very easy to program and maintain, but very inefficient in terms of storage. A Web page of 1000 ASCII characters, for example, would require a 4000 byte download, rather than the 1000 bytes we would expect (these figures ignore potential web server compression such as gzip).

5.3 UTF-16

5.3.1 Overview
UTF-16 is widely used in today’s environments, and offers the developer a virtually seamless means to use Unicode text strings. It offers simplicity and relatively efficient space encoding, although if you are working only characters in the code point range 0 – U+ff, the text will require twice the storage space of a simple byte-per-character system.

If you are a Java, or .NET (C#, Visual Basic) developer or a Windows C++ programmer working in Unicode you are using UTF-16 when you manipulate text within your code.

Most of the Unicode characters you will be likely to use are found in the range 0-0xffff, the Basic Multilingual Plane (BMP). These code points can be represented by a single 16 bit character.

In the event that you need to store a Unicode character outside of the BMP, the character is represented as a surrogate pair of 16 bit characters. Some of the above languages have some support for surrogate pairs in the event that you would ever use a Unicode character outside of the BMP.

5.4 USC-2

You may hear the terms UCS-2 and UTF-16 used interchangeably. UCS-2 is a 16 bit fixed width character encoding system, identical to UTF-16 except that it does not support surrogate pairs; UCS-2 cannot handle Unicode characters outside of the BMP (above U+FFFF).

5.5 UTF-8

5.5.1 UTF-8 Encoding

Many developers will be familiar with the term UTF-8.. This is a highly space efficient means to store a sequence of any Unicode characters. Put simply, it uses as few bytes as possible. Instead of using a fixed 2 or 4 bytes per character as in UTF-16 and UTF-32, it attempts to use 1, 2, 3 or 4 bytes.

Without giving a complete technical explanation:-

  • ASCII characters 0-127 (decimal) are encoded as a single byte. Thus, English text looks identical in UTF-8; this is especially useful when using a debugger.
  • Code points U+80 to U+7FF are stored as 2 bytes.
  • Code points U+800 to U+FFFF are stored as 3 bytes.
  • Code points U+10000 U+10FFFF are stored as 4 bytes.

Thus, all Unicode characters can be stored very efficiently, with the most common characters being stored in 1 or 2 bytes.

The disadvantage is the relative complexity in handling the text internally given the different lengths of character encodings.

5.5.2 Web Pages

If, for example, a web page is encoded using UTF-8, the web server responds to HTTP requests with a UTF-8 encoded text string. The number of bytes to be transmitted is reduced in comparison to UTF-16 or UCS-2 encoding. The web browser knows how to decode UTF-8, and each character is re-assembled in the web browser as the correct Unicode code point.

5.5.3 An Encoding Example

Returning to our earlier example of the £ sterling character, code point U+A3  (or 163 decimal), the UTF-8 encoding for this character is the 2 byte sequence:-

0xC2 0xA3

It is not uncommon to see the £ character incorrectly displayed on a web page. The reason is usually either:-

  • The web page contains UTF-8 data but the character set is described as ISO-8859-1, so two ISO-8859 characters are displayed, in error.
  • The character set is described as UTF-8 but the character has not been UTF-8 encoded; the browser attempts to decode the text starting at our single byte as a UTF-8 sequence.

6 Code Examples

6.1 HTML

Unicode characters are easily inserted into a web page, such as:-

[sourcecode language=”html”]

<p> &#x039C; &#x03B5; &#x03C6;</p>
[/sourcecode]

…which, if the web page is correctly set to a Unicode character set (such as UTF-8), and a suitable font is present in the client system, will produce some Greek characters.

6.2 .NET C#

6.2.1 Looking at UTF-8 Encoding

Consider the following code snippet to create a String which will be written into a web page whose charset is UTF-8:-

[sourcecode language=”csharp”]

String test = "Velocit" + (char)0xE0;

[/sourcecode]

We have created a String using ASCII characters but appended the Unicode code point U+E0 which is the character à. It may have been simpler to have defined the entire word using non-ASCII characters in the source code, but these can be difficult to type.

The code below shows how to view our String as UTF-8 bytes:-

[sourcecode language=”csharp”]

System.Text.UTF8Encoding encoding = new System.Text.UTF8Encoding();

byte[] br = encoding.GetBytes(test);

int i = 0;

while (i < br.Length)

{

Console.WriteLine("Byte " + i + "=" + br[i].ToString("x2"));

i++;

}

[/sourcecode]

The String Velocità is encoded as the ASCII values for all but the final character which as can be seen is in the range U+80 – U+7FF and is thus encoded as two bytes (0xC3 0xA0) under UTF-8.

Whilst the developer does not normally need to be aware of the mechanism of UTF-8, in can become useful when inserting text from a database, for example; some conversion may need to be executed.

6.2.2 Counting Characters

Strings are stored as UTF-16 characters in .NET. This fact is seldom apparent to the developer; the .NET String methods are powerful and allow simple text manipulation.

Consider the following simple code snippet to create a String and count the number of characters:-

[sourcecode language=”csharp”]

String test = "Clef”;

Console.WriteLine("Number of characters=" + test.Length);

[/sourcecode]

The number of characters is correctly reported as 4.

If we change the String to include a non-BMP character, in this case the treble clef  (U+1d11e)  we will see a problem. The new character is encoded as two UTF-16 characters, but the String class does not “know” that the two surrogate UTF-16 characters are actually one; thus the following  code returns not 5 characters but 6:-

[sourcecode language=”csharp”]

String test = "clef" + Char.ConvertFromUtf32(0x1d11e);

Console.WriteLine("Number of characters=" + test.Length);

[/sourcecode]

(Note also the means by which the UTF-32 character 0x1D11E is added to the String.)

To allow the correct counting of characters when surrogate pairs are in use, the following code illustrates the necessary technique in a simple function:-

[sourcecode language=”csharp”]

//return number of characters in a String even if surrogate pairs are //present
int CountChars(String text)
{
int test_len = text.Length;
int i = 0;

int num_chars = 0;

while (i < test_len)
{

if (Char.IsSurrogate(text, i) == true)
{
i++;            //skip a character so that the surrogate pair
//is counted once.
}

num_chars++;
i++;
}
return (num_chars);
}

[/sourcecode]

It is probably unlikely that the above situation will occur as most Unicode characters that are likely to be used lie in the Basic Multilingual Plane (0- U+FFFF).

6.3 C++

6.3.1 Unicode in C++

Many C++ programmers begin text implementation as char arrays of ISO-8859-1 characters. There are a variety of means to implement Unicode in C++.

6.3.2 Use Wide Characters Explicitly

It is a relatively simple switch from char to the 16-bit “wide” character  wchar_t.Strings are defined using the L prefix, and the ASCII text functions str??? are replaced in usage by their wide character equivalents wcs???. The following code example shows  the definition and use of a Unicode string in C++. Note that wcslen returns the number of characters, not the length of the string in bytes.

[sourcecode language=”cpp”]

const wchar_t test[] = L"Velocità";

printf("Num chars = %d\r\n", wcslen(test));

[/sourcecode]

The advanced code example below shows how to convert any Unicode code point (even outside the BMP) into a single UTF-16 character, or a surrogate pair.

[sourcecode language=”cpp”]

// convert a single unicode char to UTF-16
// note that if src>0xffff it returns two 16 bit wide chars
// (surrogate pairs)
//
// if the char is in the BMP, c1 will always be 0, so only c0
// should be used

void ToUTF16(wchar_t &c0,wchar_t &c1,const unsigned long src)
{

if(src > 0xffff)
{

// 0x10000 is subtracted from the code point, leaving a 20 bit number
// in the  range 0..0xFFFFF.

// The top ten bits (a number in the range 0..0x3FF) are added to 0xD800 to give
// the first code point or high surrogate, which will be in the range
// 0xD800..0xDBFF.
// The low ten bits (also in the range 0..0x3FF) are added to 0xDC00 to give
// the second code point or low surrogate, which will be in the range
// 0xDC00..0xDFFF.

unsigned long v1 = src – 0x10000;

unsigned long v2 = v1 & 0x3ff;       //lower 10 bits

v2 += 0xdc00;

//get the top 10 bits

v1 >>= 10;
v1 &= 0x3ff;
v1 += 0xd800;

//add the surrogate pair

c0 = (wchar_t)v1;
c1 = (wchar_t)v2;

}
else
{

//BMP char (basic multilingual plane)
c0 = (wchar_t)src;
c1 = 0;
}
}

[/sourcecode]

6.3.3 Use the Standard Library

Many C++ programmers prefer to use the standard library (in preference to MFC, for example). The 8 bit string class std::string has a wide character equivalent std::wstring.

6.3.4 Define _UNICODE

Windows C++ programmers (not CLR) can opt to define their project as Unicode. The use of pre-processor directives can make the switch to Unicode within a project very simple, and can even allow for a reversion to 8-bit characters.

In Visual Studio (or Visual C++ before that) when a project is defined as using Unicode characters (the default):-

  • TCHAR is defined as WCHAR  which is itself defined as the wide character wchar_t
  • Much of the Windows API uses TCHAR to reference Strings; in effect, the API uses wide characters.
  • The MFC class CString now becomes a Unicode vessel.
  • Windows  provides the TEXT macro as a means to define text literals in whichever encoding system is currently in use.

7 References

For further information on Unicode please visit The Unicode Consortium at:-

https://www.unicode.org