Text Mining in R - Part 5

Sentiment Analysis in R - First Method


The article covers Sentiment Analysis, one of the most important subsets of Natural Language Processing (NLP), in R. Here we explain a basic method for performing Sentiment Analysis; two more evolved methods will follow soon. The article will not only clear up the basics but also set up a milestone for the next two articles.

Additionally, the article covers an important feature of R: "How to define a function in R?" and its applicability. In a nutshell, the article is too important to miss!
Natural Language Processing (NLP), especially Sentiment Analysis, is quite a sensational subject these days in the world of data and analytics. Things keep evolving in the form of new algorithms and much more, but most of them are similar in principle. We need to learn the principles behind Sentiment Analysis first, and then we can go ahead and learn the more complex variations.



Guys, you are being watched and analyzed for everything you write on social media. Be it your review of a movie or a product, the concerned organizations are eager to know how you feel about their products. Sentiment Analysis has become a subject of prime importance: a lot of research is going on around the world, and every new day brings an evolution in the algorithms for Sentiment Analysis. We will catch up with the evolved algorithms, but for that we need to make our basics strong.

Before we jump to Sentiment Analysis, we will first learn how to define functions in R.

How to define a function in R

In R, a user can write a sequence of commands and then wrap them in a function, making them handy for re-use. Let's see a very basic example:


something = function(a, b)
{
  x = a + b
  y = x / 5
  return(y)
}

# We have defined a function "something" which needs two arguments (a and b). We do some calculations and then return "y" as the output.
    
# Let's now see, how to use the function defined above

    xyz = something(15,5)

# Results in xyz = 4 in the Global Environment
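# As an aside, R functions can also have default argument values. A quick sketch (the function name "something2" is made up for illustration):

```r
# A variant of the earlier function with a default value for b
something2 = function(a, b = 5)
{
  x = a + b
  y = x / 5
  return(y)
}

something2(15)      # b defaults to 5, so the result is 4
something2(15, 10)  # result is 5
```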
#-----------------------------------------------------------------------------------------------------------#
Let's come back to Sentiment Analysis; we will use this function facility within this article.

Basics of Sentiment Analysis

Suppose someone says to you, "You are bad and ugly." You start feeling sad or angry about it.

Right?

How do you know that person is talking negatively about you? Your mind processes the sentence and then tags the words that hint toward some positive or negative emotion. Here BAD and UGLY are the two negative words.

Well, we say that you are nice and beautiful, so start smiling, as we have used two positive words for you.



Just as our mind processes language naturally, algorithms are being designed to train computers to process text similarly. Finding it exciting? Let's now write a basic algorithm for Sentiment Analysis.
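The tagging idea above can be sketched in a few lines of base R (the tiny word lists here are hand-made for illustration, not the real dictionaries):

```r
# Tag the words of a sentence against tiny hand-made sentiment lists
sentence = tolower("You are bad and ugly")
words = strsplit(sentence, " ")[[1]]
positive = c("nice", "beautiful", "good")
negative = c("bad", "ugly")
score = sum(words %in% positive) - sum(words %in% negative)
score  # -2: two negative words found, none positive
```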
Hu and Liu

First we need a list of positive and negative words. Well, there are far too many negative and positive words in English to compile by hand.

Don't worry, you need not do this herculean task, as Hu and Liu have already compiled the lists for us. Kudos to these guys.

Generally, people download the lists of positive and negative words compiled by Hu and Liu and then read them into R, but here is a trick to extract and use the list built into a package.


rm(list = ls())
# Install package qdap if not already installed
if (!require(qdap)) install.packages("qdap")
library(qdap)
# key.pol is a polarity dictionary shipped with qdap
all_words = as.data.frame(key.pol)
Pos_words = all_words[all_words$y == 1, 1]   # words tagged +1
Neg_words = all_words[all_words$y == -1, 1]  # words tagged -1
rm(all_words)

# You can see that you now have two lists of words in the R environment: 4776 negative words and 2003 positive words.



Lesson learned: the world is full of negativity!


We now define two functions that we will use in the Sentiment Analysis: the first for data cleaning and the second for sentiment mining.


# Building Data Cleaning Function


NAMO = function(x)
{
  y = tolower(x)
  y = gsub("@\\w+", "", y)         # remove @ tags
  y = gsub("[[:punct:]]", "", y)   # remove punctuation
  y = gsub("http\\w+", "", y)      # remove website links
  y = gsub("\\d+", "", y)          # remove digits
  y = gsub("[^\x20-\x7E]", "", y)  # remove non-ASCII characters
  y = gsub("^\\s+|\\s+$", "", y)   # remove leading and trailing spaces
  return(y)
}

# In the function NAMO, we have removed @ tags, punctuation, website links, digits, non-English characters, and leading and trailing spaces.

# We have named the function NAMO as a gesture of thanks to the honorable Prime Minister of India, Mr. Narendra Modi, for promoting the Clean India Campaign.
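# To see what each substitution in NAMO does, you can walk a made-up messy review through the same steps one by one (the example string is hypothetical):

```r
# Walk a made-up messy string through the same cleaning steps as NAMO
y = tolower("Loved it!! 10/10, a must-watch http://example.com @critic")
y = gsub("@\\w+", "", y)         # drops the @ tag "@critic"
y = gsub("[[:punct:]]", "", y)   # drops punctuation (the URL loses its : / . here)
y = gsub("http\\w+", "", y)      # the collapsed link now matches http\w+ and is dropped
y = gsub("\\d+", "", y)          # drops the digits
y = gsub("[^\x20-\x7E]", "", y)  # drops non-ASCII characters
y = gsub("^\\s+|\\s+$", "", y)   # trims leading and trailing spaces
y
```

# Notice that "must-watch" is glued into "mustwatch" because punctuation was replaced with nothing; this matters later in the article.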



# Building Sentiment Mining Function

Find_emotions = function(x, Pos, Neg)
{
  if (!require(plyr)) install.packages("plyr")
  library(plyr)
  senti_score = laply(x, function(x, Pos, Neg)
  {
    if (!require(stringr)) install.packages("stringr")
    library(stringr)

# step 1 : parse the sentence into words; str_split gives a list, which is then unlisted
    word.list = str_split(x, "\\s+")
    words = unlist(word.list)
# step 2 : match the words against the positive and negative word lists one by one
    pos.matches = match(words, Pos)
    neg.matches = match(words, Neg)
# step 3 : convert the match positions into binary (TRUE/FALSE) indicators
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)
# step 4 : subtract the number of negative words from the number of positive words
    senti_score = sum(pos.matches) - sum(neg.matches)
    return(senti_score)
  }, Pos, Neg)
  return(senti_score)
}
# We recommend the user learn about the plyr and stringr packages, as these packages are quite useful for text mining.

# Let's understand the function Find_emotions. It takes 3 arguments:

1. x -- the text on which we need to perform sentiment analysis
2. Pos -- list of positive words (in this case Pos_words)
3. Neg -- list of negative words (in this case Neg_words)

Let's now see how the 4 steps in the function work. Please run the following commands one by one and keep checking the objects created in your environment.

if(!require(stringr)) install.packages("stringr")
library(stringr)
x = "He is a good bad and ugly guy"
word.list = str_split(x, "\\s+")
words = unlist(word.list)
pos.matches = match(words, Pos_words)
neg.matches = match(words, Neg_words)
pos.matches = !is.na(pos.matches)
neg.matches = !is.na(neg.matches)
senti_score = sum(pos.matches) - sum(neg.matches)

# You find the result is -1 ... How?

# Since the string contains two negative words and only one positive word,

senti_score = number of positive words - number of negative words = 1 - 2 = -1

Hope the logic is now clear to you. Let's now use the functions on a dataset and check the result. laply has been used to apply the function to each and every observation independently.

It is now time for some real action!

Download the csv file using the following link; we will use it for Sentiment Analysis. It is the same file we used for counting word frequencies in the 4th article of the text mining series:


# Read the csv first
setwd("G:/AA/Text Mining")
data_1 = read.csv("movies_reviews.csv", header = T)

# First clean the data using NAMO function
data_1$clean = NAMO(data_1$Review)


# Let's calculate the scores
data_1$score = Find_emotions(data_1$clean, Pos_words, Neg_words)

Voila! It's done.

You can see that a new column "score" is populated in the dataframe data_1: if it is > 0 it depicts positive emotion, and if < 0, negative emotion. Sometimes it may be zero as well; we call that a neutral comment.
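If you prefer a categorical label over the raw number, a small helper can bucket the scores (this helper is hypothetical, not part of the method above):

```r
# Hypothetical helper: bucket a numeric score into a sentiment label
label_sentiment = function(score)
{
  ifelse(score > 0, "positive",
         ifelse(score < 0, "negative", "neutral"))
}

label_sentiment(c(3, -1, 0))  # "positive" "negative" "neutral"
```

Being built on ifelse, it is vectorized, so you can apply it to the whole score column at once.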

For example, take the second review:

superb moviethis is one of my all time favourite moviestom hanks has outshone himself in this great movie the directioncameraediting are all awesomelove this movie

The score is 3 because of the 3 correctly spelled positive words. Here "awesome" and "love" got concatenated because punctuation was substituted with nothing; had we substituted it with a space, the score would have been 5.

Try the following version in place of the earlier one and check the difference. This version fails to remove website links, though, because replacing the punctuation inside a URL with spaces breaks the link apart before the "http\\w+" pattern can match it. You need to take a call here. I feel that for Sentiment Analysis, the version below is better than the first one.

NAMO = function(x)
{
  y = tolower(x)
  y = gsub("@\\w+", "", y)
  y = gsub("[[:punct:]]", " ", y)  # punctuation replaced with a space this time
  y = gsub("http\\w+", "", y)
  y = gsub("\\d+", "", y)
  y = gsub("[^\x20-\x7E]", "", y)
  y = gsub("^\\s+|\\s+$", "", y)
  return(y)
}

Hope we have elaborated enough on it. Guys, do this exercise to become an expert in the subject.

Make a csv file with some product or movie reviews yourself. Take the code from this blog, make the necessary changes such as "file path", "file name", and "column name", execute the code, and then try to analyse the results.

Use both variants of the NAMO function one by one.

Now that we have learnt the first method, let's understand its demerits; this will motivate us to learn the more evolved ones. Also try to plot the scores, compute averages, etc., to interpret the results.
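For the exploration part, something like this works (the scores vector here is a toy stand-in for your score column):

```r
# Toy scores standing in for the score column of your dataframe
scores = c(3, -1, 0, 2, -2)
mean(scores)         # average sentiment: 0.4
table(sign(scores))  # counts of negative (-1), neutral (0) and positive (1) reviews
hist(scores, main = "Distribution of sentiment scores")
```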

The method is fine in principle, but it doesn't take care of the following:


For this method, GOOD and VERY GOOD are the same, while they are not.

For this method, GOOD and NOT GOOD are the same, while NOT GOOD means negative.
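You can verify the second point with the same matching logic and tiny hand-made word lists:

```r
# Pure word matching scores "not good" as positive
words = c("not", "good")
Pos = c("good", "great")
Neg = c("bad", "ugly")
score = sum(words %in% Pos) - sum(words %in% Neg)
score  # 1, even though "not good" expresses a negative sentiment
```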

These things will be taken care of in the next episode of Sentiment Analysis. Till then ...


Enjoy reading our other articles and stay tuned with us.

Kindly do provide your feedback in the 'Comments' Section and share as much as possible.


Related Articles :

Text Mining in R - Part 1