Ask Analytics: Spellbinding Proc Spell

Hidden Gems of SAS - 2

Supercalifragilisticexpialidocious is the longest adjective I could find to describe this hidden gem of SAS: PROC SPELL. Learn it, and believe me, if you ever need to work in text mining, it will make your life much easier.

The pic is no exaggeration for this SAS procedure!

Let me share one of my past experiences to set the context.

The story is a little boring, but it is worth listening to:

Once upon a time ... in one of my previous organizations, I got to work on a Text Mining project where we were dealing with the worst possible text ever. The text data was about all possible commodities/things on earth and was full of all possible typos and spelling variations.

Example: for the word "Industry", the data contained correct and incorrect variants such as "Industries", "Industrys", "industri", "indastry", "industree" and so on. We were supposed to correct the spelling of all such words first and bring them back to the base word "Industry".

I would not like to steal the credit: there used to be an excellent SAS programmer on the team (I never had a chance to meet her), who had written a 500+ line algorithm that accomplished this herculean task.

Basically, the algorithm parsed all the text into words and then built a huge cartesian product of the word list with itself. Similar words were then identified on the basis of "Spelling Distance" and "Phonetic Similarity". The root word was chosen by mode, i.e. the most frequently occurring word in a group of similar words was treated as the ROOT.
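To make the idea concrete, here is a minimal sketch of that kind of logic. It is not the original 500+ line program: the data set names, the tiny word list with its frequencies, and the SPEDIS cutoff of 30 are all assumptions made purely for illustration.

```sas
/* Hypothetical word list with frequencies; in practice this would
   come from parsing the raw text into words and counting them */
data words;
    length word $40;
    input word $ freq;
    datalines;
industry 5
industries 3
industrys 1
industri 1
indastry 1
;
run;

/* Cartesian product of the list with itself, keeping only pairs that
   look similar. The cutoff of 30 is an assumed value, not a tuned one. */
proc sql;
    create table similar_pairs as
    select a.word as word,
           b.word as candidate,
           b.freq as candidate_freq
    from words as a, words as b
    where spedis(strip(a.word), strip(b.word)) <= 30
       or soundex(a.word) = soundex(b.word)
    order by word, candidate_freq desc;
quit;

/* Mode rule: the most frequent candidate in each group becomes the root */
data roots;
    set similar_pairs;
    by word;
    if first.word;
    rename candidate = root;
run;
```

Note that the join compares every word with every other word, so the work grows with the square of the vocabulary size, which is one reason such code gets slow on large data.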

For learning "Spelling Distance", read:

Spelling distance based matching (Spedis, Compged and complev functions)

For learning "Phonetic Similarity", read:

Sound based matching (Soundex)
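If you just want a quick feel for these functions before reading those posts, the small sketch below prints the scores for the "Industry" variants mentioned earlier. The variable names are made up for this demo; run it and look at the log to see how each function rates each variant.

```sas
/* Compare each variant with the base word and write the scores to the log */
data _null_;
    length variant $20;
    root = "industry";
    do variant = "industries", "industrys", "industri", "indastry", "industree";
        spedis_score  = spedis(strip(variant), root);   /* asymmetric spelling distance */
        complev_score = complev(strip(variant), root);  /* Levenshtein edit distance */
        compged_score = compged(strip(variant), root);  /* generalized edit distance */
        same_soundex  = (soundex(variant) = soundex(root)); /* 1 if the codes match */
        put variant= spedis_score= complev_score= compged_score= same_soundex=;
    end;
run;
```

A lower distance means a closer match; what counts as "close enough" is a threshold you have to choose for your own data.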

Though the logic didn't fail in most of our test cases, there was a flaw in the algorithm: what if the most frequently occurring word is itself misspelled? Hence the idea was not foolproof (no offense to anyone, and I mean it; I have the utmost respect for the person who wrote that code). Also, the code took a long time to execute on large data.

I am not blaming anyone, nor am I saying that the algorithm is useless; in fact, the same logic would still be required even when you use Proc Spell. The only thing I want to emphasize is that most of us are not aware of this beautiful and powerful procedure, Proc Spell, and the idea I want to convey is that the algorithm described above can be improved with its help.

Let's see how Proc Spell works:

First create a file named "misspelled words.txt" with the following content:

Industries understand special traininng needs
Industry understand special training needs
Industrys understund special traininng needs
industri understand spesial training needs
indastry usderstand special training needs
industre undarstund special trainng needs
You should not trast anyone blindly

For the demo, we have deliberately put a lot of spelling errors in it.

/* Let's now import the file into SAS. */

%let location = G:\AA\SAS gems;
libname AA "&location.";
filename sample "&location.\misspelled words.txt";

/* In the first step, we try to create a catalogue of words in the file */
Proc Spell words = sample
    create
    dict = AA.mycatgalog.Spell;
Run;

/* Now initiate a file for accommodating required output */

Proc Printto print = "&location.\output.txt" new; Run;

/* Now, with the help of Proc Spell, we identify the misspelled words, ask for suggested corrections, and send the output to the file initialized above */

Proc Spell in = sample
    dictionary = AA.mycatgalog.Spell
    verify
    suggest;
Run;

/* Route procedure output back to the default destination */
Proc Printto print = print; Run;

/* Open the output file to understand the output of Proc Spell. Then let's get the output back into SAS, skipping the first 6 header lines */

Data AA.List_correction;
    infile "&location.\output.txt" missover firstobs = 7;
    input A & $1000.;
Run;

/* Transform the output file into readily usable form */

Data AA.List_correction;
    set AA.List_correction;
    retain id 1;
    /* A blank line separates two words in the output, so start a new id */
    if A = "" then id + 1;
Run;

Proc transpose data = AA.List_correction out = aa.transposed;
    by id;
    var A;
    where A ~= "";
Run;

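Data aa.transposed;
    length suggested $1000;
    /* Rename on input so that the new name does not clash with the retain list */
    set aa.transposed (drop = _name_ rename = (Col1 = original_word));
    /* Keep the text that follows the colon in the second output line */
    suggested = scan(Col2, 2, ":");
    drop Col2;
Run;

/* Reorder the columns: id, original_word, suggested */
Data aa.transposed;
    retain id original_word suggested;
    set aa.transposed;
Run;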

... and here we are with a list of incorrect words along with suggested corrections. For a few words we might not get any suggestion, and for others we might get more than one. Now it is time to build the rest of the algorithm, which replaces each wrong word with the most appropriate correction. You can build a macro and use the TRANWRD function to replace the words.
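As a starting point, here is a minimal sketch of such a macro. The macro name fix_word, its parameters, and the data set corrected are hypothetical, and the two wrong/right pairs are typed in by hand for the demo rather than taken from the Proc Spell output.

```sas
/* Hypothetical macro: replace one misspelled word with its correction.
   The text is padded with blanks so that only whole words are matched. */
%macro fix_word(wrong=, right=);
    text = tranwrd(" " || strip(text) || " ", " &wrong ", " &right ");
    text = strip(text);
%mend fix_word;

data corrected;
    length text $200;
    infile datalines truncover;
    input text $200.;
    /* One call per word pair; in practice these calls could be generated
       from the table of original and suggested words */
    %fix_word(wrong=traininng, right=training)
    %fix_word(wrong=trast, right=trust)
    datalines;
Industries understand special traininng needs
You should not trast anyone blindly
;
run;
```

Keep in mind that TRANWRD is case sensitive and that this sketch assumes words are separated by single blanks, so punctuation and mixed case would need extra handling.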

Humble appeal: