Tips on Using BK-Trees
A search facility on a website or in an application should be able to function when minor spelling errors are present in the text entered by the human user. The facility may either automatically execute the search, using the corrected words, or may return one
or more suggestions to the end user.
Whilst a database index may store keys that enable millions of records to be swiftly retrieved by means of a perfect match between the entered text words and the contents of that database, providing means to retrieve data where minor spelling errors are present
cannot be satisfied by a regular database index. For example, the word Leicester should be returned if a user enters:-
Each of the above entries contains a minor error, and entry (3) has a missing letter.
It will be noted from example (2) above, that one cannot assume that the first 3 or 4 letters are correct, and provide a wild card facility after the third or fourth character; this approach will limit the power of a fuzzy matching or spelling correction engine.
Inherent in the solution presented below, is the ability to compute the difference between text strings as a numerical value, or distance.
I currently use the Levenshtein distance formula to calculate the distance between strings. The Levenshtein distance returns the number of changes required to transform string a to string b; for example the distance between ‘bristok’ and ‘bristol’ is 1 (change ‘k’ to ‘l’) and the distance between ‘bristle’ and ‘bristol’ is 2 (remove ‘e’ and insert ‘o’ before ‘l’).
The BK-tree was invented by Walter Austin Burkhard and Robert M. Keller. It is a metric tree specifically adapted to discrete metric spaces.
A loose description which is easier to understand is that the BK-tree is a means to greatly sub-divide a large ‘bag’ of objects into smaller collections in such a way that a search operation can easily find the correct collection and from there quickly find the desired matches.
Unlike a binary tree or a multi-branch B*Tree, a BK-tree is suitable to allow fuzzy string matching.
The creation of a BK-tree is quite simple. The root node is an arbitrary entry from our ‘bag’ of words. The actual strings are added as lowercase, and all searching would normally be case-insensitive.
Our words in this example, in no particular order, are:-
The root node is set to ‘Leeds’:-
The nodes can have multiple branches, each branch being a string distance value between the node’s text, and the text of the child node; this will be seen below.
To add the first child, ‘York’, calculate the string distance between it can the root node’s text, ‘Leeds’. The Levenshtein distance is 5. Create a child node with the value ‘York’, linked from the root node with a distance value of 5.
The next text to be added is ‘Bristol’. Repeating the procedure above by calculating the distance between ‘Bristol’ and ‘Leeds’, the value is 7.Next, check that there are no existing branches of value 7. There are none, so repeat the procedure as performed above, creating a new node and linking to the root node:-
The fourth text to be added is ‘Leicester’. The procedure is identical to the second and third values, as above. The Levenshtein distance between ‘Leicester’ and the root node ‘Leeds’ is 6, so a new node is linked as shown below:-
The next text value is ‘Hull’. Calculate the distance between ‘Hull’ and ‘Leeds’ ; it is 5. It will be noted that there is already a branch from the root node, of distance value 5. In this event, ‘move’ down that branch and repeat the insert procedure but compare the new text value to the current node, i.e. the one to which you have just moved, in this case,’York’. The Distance between ‘Hull’ and ‘York’ is 4, and as there are no branches from ‘York’ with distance=4, now add a new node as shown below:-
Note that if the ‘York’ branch already had a child to it with distance value 4, you would have moved down that branch to the child node, and repeated the above procedure until a node without a branch equal to the Levenshtein distance, is found.
Finally, add ‘Durham’ to the BK-tree, as illustrated below:-
‘Durham’ had the Levenshtein distance from ‘Leeds’ of 6, so it was necessary to move down the branch to ‘Leicester’ and add it to that node.
The small BK-tree is now complete. Here’s how to search it.
The procedure for searching a BK-tree, to obtain a list of approximate matches to a word, is quite simple:-
These are the detailed steps to search the above tree for the input text ‘Hill’ with a maximum allowed string distance of 1, for fuzzy matches to ‘Hill’.
The worked example is illustrated below. Note that the ‘Bristol’ branch from the root, which is not used during this search, is greyed out.
Please refresh the page and try again.