Super cool yet unpopular text functions for Spelling Distance

Text mining and analytics on unstructured data are buzzwords these days in the world of analytics.

So why should we at Ask Analytics lag behind?

Let's start with a few basic text functions that can be used in text mining exercises for string comparison and standardization; we will build up to more complex algorithms later. These functions are good, and I don't know why they have remained unpopular.

So, are you ready?


Text difference or Spelling Distance

What is 2 + 3? ... It is 5! ... What is 5 - 3? ... It is 2!

All right, good! Now tell me, what is A - B? I know you will ask me to provide the values of A and B, right?

So let me ask you: what is Analytics - Ask, or say ... XYZ - ABC?

"Have you gone nuts?" I know this would be your first reaction if you have never heard of the SPEDIS function in SAS.

SPEDIS is an abbreviation of Spelling Distance. It is used to measure the difference between two spellings of a similar word. It also follows the garbage in, garbage out policy of the computer world, so it will calculate the difference between any two strings.

It has two sibling functions: COMPLEV and COMPGED.

So basically all three functions measure the degree of difference between two strings, called the spelling distance.

A brief Intro :


COMPGED and SPEDIS are very similar, though COMPGED is much more efficient.


Both COMPGED and SPEDIS work by performing operations on one string to turn it into the other. The operations can be:
1. Deletion
2. Insertion
3. Replacement
and so on. A cost penalty is associated with each operation.

It is best learned by example (you can try your own examples to understand the penalty pattern):

Data Demo;
input a  $ b $ ;
cards ;
abacus abcus
abcus abacus
child chill
spot sport
cats cat
cat cats
ad mad
mad ad
albert calbert
mango manbo
random radio
;
run;

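/* Compute all three distances for every pair of strings */
Data distance;
set Demo;
spelling_distance = spedis(a,b);   /* SPEDIS: spelling distance */
lev_distance = complev(a,b);       /* COMPLEV: Levenshtein edit distance */
GED_distance = compged(a,b);       /* COMPGED: generalized edit distance */
run;

proc print data=distance;
run;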

Result of the above code:





There is also a penalty for a change of case, so it is better to convert all text to lowercase before comparing. The penalty for a change at the start of a string is higher than for a change at the end, which makes sense intuitively.
We have explained the details of the COMPGED function because we recommend it over COMPLEV and SPEDIS.
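
The point about case can be tried out with a small sketch like the one below. The dataset name case_demo and the sample words are made up for illustration. It compares each pair as entered, after LOWCASE, and with the 'i' (ignore case) modifier that COMPGED and COMPLEV accept as an optional argument; SPEDIS takes no modifiers, so LOWCASE is applied explicitly there.

```sas
/* Illustrative data only: pairs that differ purely by case */
data case_demo;
  length a b $ 12;
  input a $ b $;
  ged_raw      = compged(a, b);                    /* case differences are counted */
  ged_lower    = compged(lowcase(a), lowcase(b));  /* normalise case first */
  ged_i        = compged(a, b, 'i');               /* 'i' modifier: ignore case */
  lev_i        = complev(a, b, 'i');               /* same modifier on COMPLEV */
  spedis_lower = spedis(lowcase(a), lowcase(b));   /* SPEDIS has no modifiers */
  datalines;
Mango mango
CATS cats
New new
;
run;

proc print data=case_demo;
run;
```

If the case-insensitive columns come out as zero while ged_raw does not, the whole distance was due to case alone.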

Levenshtein distance (COMPLEV) counts only the number of operations, which is not a very effective way to judge how different two strings are.
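
A minimal sketch of that limitation, using two made-up pairs: each pair differs by exactly one replaced letter, so COMPLEV returns 1 for both, whereas COMPGED with its default costs should penalise the change in the first letter more heavily than the change in the last one.

```sas
/* Illustrative pairs: one replacement at the start vs. one at the end */
data position_demo;
  length a b $ 8;
  input a $ b $;
  lev_distance = complev(a, b);   /* counts single-character edits only */
  ged_distance = compged(a, b);   /* cost-weighted, position-sensitive */
  datalines;
cat bat
cat cap
;
run;

proc print data=position_demo;
run;
```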

By the way, SPEDIS and COMPGED are also known as functions for Fuzzy matching of text.

So where do we use these functions?

We need them when the data is slightly unstructured and we want to make it structured.

Example: Suppose you have a dataset with city names and sales. The city names were typed in manually at data entry, and minor spelling mistakes crept in, as shown below:

New York  $10,000
Neo York  $12,000
.
.
.
New Jersey  $12,500
New Jersy  $12,500
.
.
.
Manhattan  $14,100
Manhatan   $7,000
.
.
.
You can use these functions to identify the matching city names and correct them.

There are many other uses that you can think of and innovate on at your end.
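
One way this could be sketched is shown below. It assumes you have, or can build, a reference list of correctly spelled city names. The dataset names city_master, sales_raw and city_matched, the variable names, and the cutoff of 200 are all assumptions for illustration, and the cutoff in particular needs tuning on real data.

```sas
/* Hypothetical reference list of correctly spelled city names */
data city_master;
  infile datalines truncover;
  input city_ok $20.;
  datalines;
New York
New Jersey
Manhattan
;
run;

/* Sales data as typed in at data entry (values from the example above) */
data sales_raw;
  infile datalines dlm=',' truncover;
  length city $ 20;
  input city $ sales;
  datalines;
New York,10000
Neo York,12000
New Jersey,12500
New Jersy,12500
Manhattan,14100
Manhatan,7000
;
run;

/* Pair every raw name with every reference name and keep close matches */
proc sql;
  create table city_matched as
  select r.city, r.sales, m.city_ok,
         compged(r.city, m.city_ok, 'i') as ged
  from sales_raw as r, city_master as m
  where calculated ged <= 200   /* assumed cutoff, tune for your data */
  order by r.city, ged;
quit;

proc print data=city_matched;
run;
```

The column city_ok then holds the suggested standard spelling for each raw name. Review the matches before overwriting anything, since a loose cutoff can pair up cities that are genuinely different.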




Coming next in the series: sound-based matching of text.



Enjoy reading our other articles and stay tuned with ...

Kindly do provide your feedback in the 'Comments' Section and share as much as possible.

