Text Mining in R - Part 4

A few more pieces about Text Mining in R

So far we have learned about scraping data from Twitter, building a basic word cloud and then advanced word clouds. We also covered basic terms such as TDM, DTM and Corpus. Before we proceed ahead, we would like to cover a few things that we left uncovered intentionally, just to keep the topic simple.

Let's strengthen our basics!
We could explain these concepts using Twitter data, but in industry you rarely work with such data. You usually need to deal with data in a flat/.csv file or a database.

Here we are picking up the data from a .csv file. Please download the file.

The data contains reviews of two Steven Spielberg movies, one of which is my all-time favorite, "The Terminal".

The file contains a column Review that needs to be analyzed.

Now let's understand a few concepts:

Let's count the frequencies of words

# Being a data artist, I love to have a fresh and clean canvas while painting
rm(list = ls())
# Let's import the data
setwd("G:/AA/Text Mining")
data_1 = read.csv("movies_reviews.csv")
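
# Optional : a quick peek at the imported data before cleaning (just a sanity check)
str(data_1)
head(data_1$Review, 2)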

# I have inserted @tags and a website link in the first review just to make it similar to the tweet examples

# Let's first clean the reviews now
data_1$Clean = tolower(data_1$Review)                   # convert to lower case
data_1$Clean = gsub("@\\w+", "", data_1$Clean)          # remove @tags
data_1$Clean = gsub("[[:punct:]]", "", data_1$Clean)    # remove punctuation
data_1$Clean = gsub("http\\w+", "", data_1$Clean)       # remove links
data_1$Clean = gsub("\\d+", "", data_1$Clean)           # remove digits
data_1$Clean = gsub("[^\x20-\x7E]", "", data_1$Clean)   # keep only printable ASCII characters
data_1$Clean = gsub("^\\s+|\\s+$", "", data_1$Clean)    # remove leading and trailing blanks
data_1$Clean = trimws(data_1$Clean)
# The last two commands do the same thing (both remove leading and trailing blanks), so you can use just one of them
# You can add more of these cleaning steps, for example :
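# (just suggestions, adjust as per your data)
# data_1$Clean = gsub("\\brt\\b", "", data_1$Clean)   # drop the "rt" retweet marker, if any
# data_1$Clean = gsub("\\s+", " ", data_1$Clean)      # collapse multiple spaces into a single space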

# Now let's make a corpus and then a Document-Term Matrix.
library(tm)
clean_Corpus = Corpus(VectorSource(data_1$Clean))

# You can try running the whole code with and without the following line and see how the results change
clean_Corpus = tm_map(clean_Corpus, removeWords, stopwords("english"))

dtm = as.matrix(DocumentTermMatrix(clean_Corpus))        # one row per review, one column per term
sorted = sort(colSums(dtm), decreasing = T)              # total frequency of each term, highest first
terms_wid_freq = data.frame(word=names(sorted),freq=sorted)

And there we are, we have all the words with their frequencies.
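You can view the top of the frequency table with a quick head() call (10 is just an arbitrary cut-off):

head(terms_wid_freq, 10)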

But you can see that the terms "movie" and "movies" together have the highest occurrence (11 + 10 = 21), yet this slight variation has allowed "spielberg" to sit at the top.
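You can verify this yourself; since sorted is a named vector, the terms end up as the row names of terms_wid_freq, so a quick lookup does the job:

terms_wid_freq[c("movie", "movies"), ]
sum(terms_wid_freq[c("movie", "movies"), "freq"])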

Lesson learned: United we stand, divided we fall!

But suppose we want to consolidate all the similar words together, like "movie" and "movies" ... we can do it by stemming the documents.


Stemming of terms in documents

# Try the same code again with one extra command

library(tm)
clean_Corpus = Corpus(VectorSource(data_1$Clean))
clean_Corpus = tm_map(clean_Corpus, removeWords, stopwords("english"))

# Note : stemDocument relies on the SnowballC package, so make sure it is installed
clean_Corpus = tm_map(clean_Corpus, stemDocument)


dtm = as.matrix(DocumentTermMatrix(clean_Corpus))
sorted = sort(colSums(dtm), decreasing = T)
terms_wid_freq = data.frame(word=names(sorted),freq=sorted)

If we check the result now, the variants have been merged: "movie" and "movies" are counted together under the stem "movi".
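Since the terms are again the row names of the frequency table, a quick lookup confirms the consolidated count:

terms_wid_freq["movi", ]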

A few of us might not like it, as the word "movie" is now "movi", but it does help sometimes. You can use stemming at your own discretion during corpus cleaning, and try to make a comparison cloud again as demonstrated in the previous article.
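As a minimal sketch, assuming you have the wordcloud package from the earlier articles installed, you can feed these stemmed frequencies straight into a basic cloud (the comparison cloud itself needs the reviews grouped by movie, exactly as covered in the previous article):

library(wordcloud)
library(RColorBrewer)
set.seed(123)   # so the layout is reproducible
wordcloud(words = terms_wid_freq$word, freq = terms_wid_freq$freq,
          min.freq = 2, colors = brewer.pal(8, "Dark2"))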

I am quite excited to write the next article now, as I have been waiting for it for quite a while.
The next article in the series will be on Sentiment Analysis; till then ...

Enjoy reading our other articles and stay tuned with us.

Kindly do provide your feedback in the 'Comments' Section and share as much as possible.


A humble appeal: Please do like us @ Facebook


