Text Mining in R - Part 2

Make a Word Cloud in R

In the previous article of this series, we learned how to scrape data from Twitter. We will now use that data to create a word cloud, but first we will cover a few basics of text mining, such as the cleaning operations that are required in every text mining exercise.

Let's start building the concepts.



We will now use the tweets fetched from Twitter, as described in the previous article, for text mining.

Let's see what is happening on Twitter today by looking at the trending hashtags.

I am picking the top hashtag, #NamoAtSimhastha, so let's analyze what people are saying about it.

# Extracting 500 tweets just to demonstrate
# (assumes the twitteR package is loaded and authenticated as in the previous article)
Namo_tweets = searchTwitter("#NamoAtSimhastha", n = 500, lang = "en", since = "2016-05-14")

# Check a few of the tweets
head(Namo_tweets)
Namo_tweets[[1]]

# Extracting the text portion of each tweet
tweets <- sapply(Namo_tweets, function(x) x$getText())

# Now it is time to install the text mining package (tm), which we will use to process/clean the data

install.packages("tm")
library(tm)

# In order to process any text, we first build a corpus from it. A corpus can be permanent or temporary (held in memory). Here we are making a temporary one; you should also learn how to make a permanent corpus (see ?PCorpus).

textCorpus = Corpus(VectorSource(tweets))

# It is time to clean the corpus text

# 1. convert all text to lower case so that CAT, Cat and cat all become cat
textCorpus = tm_map(textCorpus, content_transformer(tolower))
# 2. remove all punctuation
textCorpus = tm_map(textCorpus, removePunctuation)
# 3. remove common English stop words **
textCorpus = tm_map(textCorpus, removeWords, stopwords("english"))
# 4. remove more common English words using the larger SMART stop word list **
textCorpus = tm_map(textCorpus, removeWords, stopwords("SMART"))
# 5. remove all numbers
textCorpus = tm_map(textCorpus, removeNumbers)
# 6. remove extra blanks; this should always be the last operation, as the operations above may leave extra white space behind
textCorpus = tm_map(textCorpus, stripWhitespace)
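
To see what these cleaning steps do to a single piece of text, here is a minimal sketch in base R that mimics steps 1, 2, 5 and 6 with `tolower()` and `gsub()`. The sample string `raw` is made up for illustration, and the regular expressions are an approximation of the tm transformations, not their actual implementation.

```r
# Hypothetical raw tweet text (made up for illustration)
raw <- "Namo at Simhastha 2016!!   Great   speech, CAT Cat cat."

# 1. convert to lower case
step1 <- tolower(raw)
# 2. drop punctuation characters
step2 <- gsub("[[:punct:]]", "", step1)
# 5. drop digits
step3 <- gsub("[[:digit:]]", "", step2)
# 6. collapse runs of white space and trim both ends
clean <- trimws(gsub("\\s+", " ", step3))

print(clean)
```

Note how removing the punctuation and the number leaves extra blanks behind, which is why the white space clean-up comes last.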

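# To learn more about the available transformations
?getTransformations

# Finally we convert the corpus documents into plain text documents
# (with recent versions of tm this step may be unnecessary, since content_transformer() was used above)
tcorpus = tm_map(textCorpus, PlainTextDocument)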


# To create a word cloud, we need to install a package called wordcloud
install.packages("wordcloud")
library(wordcloud)

# Let's first create a randomly ordered, purple colored word cloud
wordcloud(tcorpus, scale = c(5, 0.8),
          rot.per = 0.25, colors = "purple", random.order = TRUE,
          random.color = FALSE, min.freq = 4)


# Let's now make it multicolored.

library("RColorBrewer")
col = brewer.pal(5, "Dark2")

# you can learn more about the package and try various color options

wordcloud(tcorpus, min.freq = 4, rot.per = 0.25, scale = c(5, 1),
          random.color = TRUE, random.order = TRUE, colors = col)




The size of a word grows with the frequency of the term in the corpus: the more often a term appears, the larger it is drawn.
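
To make this concrete, the following minimal sketch counts term frequencies in base R, which is roughly the kind of count a word cloud is drawn from. The vector `sample_tweets` and the other variable names are assumptions made up for illustration; they are not part of the code above.

```r
# Hypothetical cleaned tweets (made up for illustration)
sample_tweets <- c("modi visits simhastha",
                   "simhastha kumbh ujjain",
                   "modi simhastha speech")

# Split each tweet on white space and flatten into one vector of words
words <- unlist(strsplit(sample_tweets, "\\s+"))

# Count the occurrences of each word, most frequent first
freq <- sort(table(words), decreasing = TRUE)
print(freq)

# The most frequent term is the one that would be drawn largest
top_term <- names(freq)[1]
print(top_term)
```

With min.freq = 4 as in the calls above, only terms that occur at least four times in the corpus would be plotted at all.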

# ** to check the stop word lists, please run the following commands
stopwords("english")
stopwords("SMART")

That's not all: we will cover more examples and variations in the upcoming articles of this series. Until then ...

Enjoy reading our other articles and stay tuned with us.

