Add music to your SAS code

 Don't you worry, I won't start singing a song !

Personally I believe that Analytics and Music are alike ! In fact I believe that Analytics practices are inspired from Music ... how come ?

Well, like we analysts find trend and pattern, ups and downs in data, Music is also all about finding pattern with various combinations of upper and lower notes; Music, like sales graph, goes up and down in scale, tempo etc.

While term "Data Scientist" is trending across analytics industry, I yet want to be known as a "Data Artist". So why don't we merge music and analytics ... Let's learn how music (sound) works in Analytics.


Introducing Soundex function to you...


In one of the previous articles we had covered fuzzy matching of text. In this article we are covering the concept of text matching with Sound.

Consider the following example :

We have two datasets 1. English, 2. Maths with respective subject's marks of students. Teachers of Maths and English have written slightly different spelling of students' names . Now we need to collate the data.

Data English;
input Name $ English_marks;
cards;
smyth 87
Willy 67
Rama 65
Geetha 95
Soorya 45
;
Run;

Data Maths;
input Name $ Maths_marks;
cards;
smith 78
Wily 76
Ram 56
Geeta 59
Surya 54
;
Run;

We can use sound based matching of strings in such cases :

Proc Sql;
select a.name , a.English_marks, b.Maths_marks
from English as a, maths as b
where a.name =* b.name;                              /*   read  =* as  sounds like     */
Quit;

Another way : Using Soundex function

Proc Sql;
select a.name , a.English_marks, b.Maths_marks
from English as a, maths as b
where soundex(a.name) = soundex(b.name);
Quit;


Soundex Function generates Sound Code for string which is the used for comparison.

Let's delve a little on this code generation algo.


How Soundex works :   learn by example : LETTERHEAD

1) Deletes any W’s or H’s from the string.


LETTERHEAD   -- LETTEREAD

2) Assigns a number to each letter in the string, according to the following chart:
                a. A, E, I, O, U, Y                  – 0            no weightage to vowels 
                b. B, F, P, V                          – 1
                c. C, G, J, K, Q, S, X, Z        – 2
                d. D, T                                  – 3
                e. L                                       – 4
                f. M, N                                 – 5
                g. R                                      – 6


So we have LETTEREAD = 403306003

3) If two or more adjacent letters have the same number, then delete all but one.


            403306003 = 4030603

4)  If it is still there, delete the first letter in the name
                                                                    4030603  =    030603

5) Delete all remaining occurrences of A, E, I, O, U, and Y (actually 0s now)
                                                                      
                                                                        030603  = 363


6) Now, your Soundex code is ready ! It is first letter with the code derived as a result of above 5 steps.                                                                
                                                                                L363


Now you know sound based matching and  fuzzy matching of text. Blend these and you can do wonders during text mining standardization exercises.


Enjoy reading our other articles and stay tuned with ...

Kindly do provide your feedback in the 'Comments' Section and share as much as possible.



No comments:

Post a Comment

Do provide us your feedback, it would help us serve your better.