Ask Analytics: Text Mining in R

Sentiment Analysis in R - Coolest Method So Far

So far we have discussed the basics, a rudimentary method, an evolved method, and a cool way to visualize Sentiment Analysis. Let's now explore one of the most evolved methods that I have come across while learning Text Analytics. It took me a lot of time to learn, but it won't take much of yours ... because Ask Analytics has made it easy!

Ask Analytics recommends step-wise learning, so please go through the following articles before you read this one.

Related articles:

Text Mining in R - Part 1

Text Mining in R - Part 2

Text Mining in R - Part 3

Text Mining in R - Part 4

Text Mining in R - Part 5

Text Mining in R - Part 6

For the earlier methods we used a .csv file; in this exercise, let us try scoring Twitter data.

We have already covered how to scrape and prepare data from Twitter in one of the previous articles @ Ask Analytics.

# Let's first scrape the Twitter data for a particular hashtag

Consumer_key<- "hw1U_______________XoF7dHti"
Consumer_secret <- "tlnGbRKVIkjW___________________oJFmWmgcmPwokruaQ"
access_token <- "3154348417-VsEmRQgp_______________6G3dPFJ3Q3uO"
access_token_secret <- "G258EyQ6______________________95XH2RzoxsC"

# Sorry, but I cannot share my credentials; you need to get your own. To learn how to get them, please follow our first blog on text mining in R.

if (!require(twitteR)) install.packages("twitteR")
library(twitteR)
setup_twitter_oauth(Consumer_key,Consumer_secret,access_token,access_token_secret)

# We have started R's engine; it is now time for action. We have chosen today's trending hashtag for analysis.
roots = searchTwitter("#TanmayRoasted", n=150,lang="en", since = "2016-05-30" )

# Let's take a look at the first 6 tweets
head(roots)

# we just want the text part from the tweets for analysis
tweets <- sapply(roots, function(x) x$getText())

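# Let's define the NAMO function for text cleaning
NAMO = function(x)
{
  y = gsub("[^\x20-\x7E]", "", x)       # keep printable ASCII characters only
  y = tolower(y)
  y = gsub("http\\S+|www\\S+", "", y)   # remove whole links, before punctuation is stripped
  y = gsub("@\\w+", "", y)              # remove @mentions
  y = gsub("[[:punct:]]", " ", y)       # replace punctuation with a space
  y = gsub("\\d+", "", y)               # remove digits
  y = gsub("^\\s+|\\s+$", "", y)        # trim leading and trailing spaces
  return(y)
}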
# Let's clean the text now using the NAMO function
clean_tweets = NAMO(tweets)

# check the cleaned tweets text now
head(clean_tweets)
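
Before scoring real tweets, it can help to sanity-check the cleaning steps on one made-up tweet. The sketch below is only an illustration: `clean_text_demo` and `sample_tweet` are hypothetical names, and the function is a stand-alone variant of the cleaning idea above that removes whole links first and also collapses repeated spaces.

```r
# Hypothetical stand-alone cleaning function, for illustration only
clean_text_demo <- function(x) {
  y <- gsub("[^\x20-\x7E]", "", x)        # keep printable ASCII characters only
  y <- tolower(y)
  y <- gsub("http\\S+|www\\S+", "", y)    # drop whole links
  y <- gsub("@\\w+", "", y)               # drop @mentions
  y <- gsub("[[:punct:]]", " ", y)        # punctuation becomes a space
  y <- gsub("\\d+", "", y)                # drop digits
  y <- gsub("\\s+", " ", y)               # collapse repeated whitespace
  y <- gsub("^\\s+|\\s+$", "", y)         # trim both ends
  y
}

# A made-up tweet, not real data
sample_tweet <- "RT @someone: Loved it!!! 10/10 https://t.co/abc123 #fun"
cleaned <- clean_text_demo(sample_tweet)
print(cleaned)
```

If the printed text still contains leftovers such as link fragments, the order of the `gsub` steps is usually the first thing to check.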

# Now we need to install the set of packages required for this third type of Sentiment Analysis. Here I am installing a few supporting packages first; the main package is {sentiment}. Please watch the warnings in your console, as you might need to install more packages if it asks for them.

if(!require(devtools)) install.packages("devtools")
if(!require(Rstem)) install.packages("Rstem")
if(!require(slam)) install.packages("slam")
require(devtools)
require(Rstem)
require(slam)
# Now installing the main package from the CRAN archive (only if it is not installed already)
if(!require(sentiment)) install_url("http://cran.r-project.org/src/contrib/Archive/sentiment/sentiment_0.2.tar.gz")
require(sentiment)

# Now the following two commands will do all the magic!
emotion = classify_emotion(clean_tweets, algorithm="bayes", prior=1.0)
polarity = classify_polarity(clean_tweets, algorithm="bayes")

# Voila! It's done.

Let's look at the results, and then we can tweak them as per our requirement.

As a result of all the above code, you get two data sets: emotion and polarity.

emotion

As you can see here, the Bayes technique checks the string for various emotions (Anger, Disgust, Fear, Joy, Sadness and Surprise) and then gives the best fit on the basis of the highest score in the row.
It is sometimes quite indecisive, e.g. in Line 1. But you can then use IF ELSE logic to make it "Disgust", as the Disgust score is the highest in that row.
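
The IF ELSE idea can be sketched roughly as follows. This is a minimal illustration on a toy matrix, not real output: it assumes the result has one score column per emotion plus a BEST_FIT column, and that an undecided row shows up as NA in BEST_FIT. The object names (`emotion_demo`, `best_fit`) and the scores are made up.

```r
# Toy stand-in for the emotion result (assumed layout, made-up scores)
emotion_demo <- cbind(
  ANGER    = c("1.47", "7.34", "1.47"),
  DISGUST  = c("3.09", "3.09", "3.09"),
  FEAR     = c("2.07", "2.07", "2.07"),
  JOY      = c("1.03", "1.03", "7.34"),
  SADNESS  = c("1.73", "1.73", "1.73"),
  SURPRISE = c("2.79", "2.79", "2.79"),
  BEST_FIT = c(NA, "anger", "joy")
)

score_cols <- c("ANGER", "DISGUST", "FEAR", "JOY", "SADNESS", "SURPRISE")

# Convert the score columns from text to numbers, keeping the matrix shape
scores <- matrix(as.numeric(emotion_demo[, score_cols]),
                 ncol = length(score_cols),
                 dimnames = list(NULL, score_cols))

# Emotion with the highest score in each row
top_emotion <- tolower(score_cols[max.col(scores, ties.method = "first")])

# IF the best fit is missing, use the top-scoring emotion, ELSE keep it
best_fit <- emotion_demo[, "BEST_FIT"]
best_fit <- ifelse(is.na(best_fit), top_emotion, best_fit)
print(best_fit)
```

In the first toy row the Disgust score is the highest, so the missing best fit is filled in as "disgust", while the rows that were already decided are left untouched.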

Hope the concept is quite clear now. The second result is polarity, which you already know from our previous blogs.

polarity

Here too, you can choose your own cut-off for the Positive/Negative ratio to better decide the sentiments.
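
A custom cut-off could look something like the sketch below. It works on a toy matrix with made-up numbers and assumes the polarity result carries a "POS/NEG" ratio column; the thresholds (`upper_cut`, `lower_cut`) are arbitrary choices for illustration, not recommended values.

```r
# Toy stand-in for the polarity result (assumed layout, made-up scores)
polarity_demo <- cbind(
  POS       = c("16", "9", "4"),
  NEG       = c("4", "8", "10"),
  "POS/NEG" = c("4", "1.125", "0.4"),
  BEST_FIT  = c("positive", "positive", "negative")
)

# Hypothetical cut-offs: ratios close to 1 are treated as neutral
upper_cut <- 1.5
lower_cut <- 1 / upper_cut

ratio <- as.numeric(polarity_demo[, "POS/NEG"])

custom_polarity <- ifelse(ratio >= upper_cut, "positive",
                   ifelse(ratio <= lower_cut, "negative", "neutral"))

# How many tweets fall into each bucket
print(table(custom_polarity))
```

Widening the gap between the two cut-offs sends more borderline tweets to "neutral"; narrowing it makes the labels more decisive.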

And that is IT.

Now here is some GYAN I would like to share.

1. Now that you know all the techniques, you can easily learn more by yourself; just remember that the principle mostly remains the same.

2. There might be a few sarcastic texts, and these would add to your error. There is not much that can be done about it.

3. Don't be too judgmental while doing Twitter analytics, especially in India, as not many people use Twitter. Also, the population using Twitter is not always a truly representative sample of India.

4. People often use slang, abbreviations and acronyms in tweets that carry their respective emotions, but these are hard to detect here.

All right then ...

Humble appeal: