Hidden Gems of SAS - 2
Supercalifragilisticexpialidocious: this is the longest adjective I could find, and it fits how I felt when I learned about this hidden gem of SAS: PROC SPELL. Learn it, and believe me, if you ever need to work in text mining, it will make your life much easier. The pic is no exaggeration for this SAS procedure!
Let me share one of my past experiences to set the context.
The story is a little boring, but it is worth listening to:
Once upon a time ... in one of my previous organizations, I got to work on a text mining project where we were dealing with the worst possible text ever. The text data was about all possible commodities/things on earth and was full of all possible typos and spelling variations.
Example: for the word "Industry", there were all sorts of correct and incorrect variants, such as "Industries", "Industrys", "industri", "indastry", "industree" and so on. We were supposed to correct the spelling of all the words first and bring them back to the base word "Industry".
I would not like to steal the credit: there was an excellent SAS programmer in the team (I never had a chance to meet her) who wrote a 500+ line algorithm that accomplished this herculean task.
Basically, the algorithm parsed all the text into words and then built a huge Cartesian product of the word list. Similar words were then identified on the basis of "Spelling Distance" and "Phonetic Similarity". The root word was chosen by mode, i.e. the most frequently occurring word among the similar words was treated as the ROOT.
For learning "Spelling Distance", read:
Spelling distance based matching (Spedis, Compged and complev functions)
For learning "Phonetic Similarity", read:
Sound based matching (Soundex)
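The matching logic described above can be sketched in a few lines of SAS. This is only a minimal illustration, not the original 500+ line program: the data set names (words, word_freq, similar_pairs, root_map), the sample words and the distance threshold are all assumptions made for the example. It uses the COMPLEV and SOUNDEX functions to find similar words and then picks the most frequent one as the root.

```sas
/* Assumed threshold for "similar" spellings; tune it for your data */
%let max_dist = 3;

/* Hypothetical input: one row per parsed word, lower-cased */
data words;
    length word $40;
    input word $;
    word = lowcase(word);
    datalines;
Industry
Industry
Industry
Industries
Industrys
industri
indastry
industree
;
run;

proc sql;
    /* Frequency of each distinct spelling */
    create table word_freq as
    select word, count(*) as freq
    from words
    group by word;

    /* Cartesian product of the word list, keeping only pairs that are
       close in spelling (COMPLEV) or that sound alike (SOUNDEX) */
    create table similar_pairs as
    select a.word as word,
           b.word as candidate,
           b.freq as candidate_freq
    from word_freq as a, word_freq as b
    where complev(strip(a.word), strip(b.word)) <= &max_dist.
       or soundex(a.word) = soundex(b.word);
quit;

/* "Mode" rule: for each word, the most frequent similar word is the ROOT */
proc sort data=similar_pairs;
    by word descending candidate_freq candidate;
run;

data root_map;
    set similar_pairs;
    by word;
    if first.word;
    rename candidate = root;
    keep word candidate;
run;
```

Note that the root is simply whichever spelling occurs most often, and the Cartesian product grows with the square of the number of distinct words.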
The logic didn't fail in most of our test cases, but there was a flaw in the algorithm: what if the most frequently occurring word is itself misspelled? Hence the idea was not foolproof (no offense to anyone, and I mean it; I really pay my best respects to the person who wrote that code). The code also took a long time to run on large data.
I am not blaming anyone, nor am I saying that the algorithm is useless; in fact, the same logic would still be required even when you use Proc Spell. The only thing I want to emphasize is that most of us are not aware of this beautiful and powerful procedure, Proc Spell, and the idea I want to convey is that the algorithm described above can be improved with its help.
Let's see how Proc Spell works:
First create a "misspelled words.txt" file with the following content:
Industries understand special traininng needs
Industry understand special training needs
Industrys understund special traininng needs
industri understand spesial training needs
indastry usderstand special training needs
industre undarstund special trainng needs
You should not trast anyone blindly
/* Let's now import the file into SAS. */
%let location = G:\AA\SAS gems;
libname AA "&location.";
filename sample "&location.\misspelled words.txt";
/* In the first step, we try to create a catalogue of words in the file */
Proc SPELL words = sample
Create dict = AA.mycatgalog.Spell;
Run;
/* Now initiate a file for accommodating required output */
Proc Printto print = "&location.\output.txt" new;
Run;
/* Now with the help of Proc Spell, we try to identify the misspelled words and seek suggestions to correct those, and take output in the above initialized file */
Proc Spell in = sample
dictionary = AA.mycatgalog.Spell
verify suggest;
run;
/* Route the printed output back to its default destination */
Proc Printto print = print; Run;
/* Open the output file to understand the output of Proc Spell, let's get the output back into SAS */
Data AA.List_correction;
infile "&location.\output.txt" missover firstobs = 7 ;
input A & $1000. ;
Run;
/* Transform the output file into readily usable form */
Data AA.List_correction;
set AA.List_correction;
retain id 1;
if A = "" then id +1;
Run;
Proc transpose data = AA.List_correction out = aa.transposed;
by id;
var A;
where A ~="";
Run;
/* Name the columns, keep the suggestions found after the colon, and fix the column order */
Data aa.transposed;
length id 8 original_word suggested $1000;
set aa.transposed (drop = _name_ rename = (Col1 = original_word));
suggested = scan(Col2, 2, ":");
drop Col2;
Run;
... and here we are with a list of incorrect words and their suggested corrections. For a few words we might not get any suggestion, and for others we might get more than one. Now it is time to build the rest of the algorithm to replace each wrong word with the most appropriate correction. You can build a macro and use the tranwrd function to replace the words.
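As a rough sketch of that last step, the data step below applies a correction list to some text with TRANWRD. It loops over the list inside a data step instead of using a macro. The data sets corrections, text and corrected, and the rows in them, are hypothetical and only for illustration; in practice the correction list would come from the transposed Proc Spell output, after choosing one suggestion per word.

```sas
/* Hypothetical correction list: one chosen suggestion per misspelled word */
data corrections;
    length original_word suggested $40;
    input original_word $ suggested $;
    datalines;
traininng training
spesial special
trast trust
;
run;

/* Hypothetical text to be corrected, one line per row */
data text;
    length line $200;
    infile datalines truncover;
    input line $200.;
    datalines;
Industry understand special traininng needs
You should not trast anyone blindly
;
run;

data corrected;
    set text;
    length fixed $200;
    fixed = line;
    /* Apply every correction in turn; POINT= reads the list row by row */
    do i = 1 to n_corr;
        set corrections point=i nobs=n_corr;
        /* Pad with blanks so that only whole words are replaced.
           TRANWRD is case-sensitive. */
        fixed = strip(tranwrd(" " || strip(fixed) || " ",
                              " " || strip(original_word) || " ",
                              " " || strip(suggested) || " "));
    end;
    drop i original_word suggested;
run;
```

The blank padding is a simple way to avoid replacing a misspelling that happens to be part of a longer word; text with punctuation attached to words would need extra handling.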
Enjoy reading our other articles and stay tuned with us.
Kindly do provide your feedback in the 'Comments' Section and share as much as possible.