Ask Analytics: Text Mining in R

Comparison and Commonality Cloud and much more

In the previous articles of the series, we covered web scraping and the basics of text mining, including a basic word cloud. Now it is time to learn some very useful text functions and a few web scraping variants. We will also learn how to create comparison and commonality word clouds, and how to analyse them.

With reference to the first article of the series : Text Mining in R - Part 1

We earlier learned how to extract tweets on a particular hashtag. What if ...

Q. Can we scrape tweets based on two hashtags occurring together?

Ans. Yes, it is very much possible. Use :

xyz = searchTwitter("#MannKiBaat AND #NAMO", n=50)

Q. Can we scrape tweets from a specific user's timeline, instead of by hashtag?

Ans. Yes. An example follows in this article.

Hashtag-based scraping is done when we want to know people's opinion about a certain topic; timeline-based scraping is done to know what a person or institution is up to. We should learn both.

I was planning to buy a new mobile connection and had to choose between Airtel and Vodafone. I thought I should analyse these companies' behaviour on Twitter and see whether it helps in making the decision. In this exercise, I got to learn two things : Comparison Cloud and Commonality Cloud.

#---------Let's first make the connection between R and Twitter ---------------------#

Consumer_key = "6NY7fDv___________QDT6WtrDK2p"
Consumer_secret = "6R06rlKb5LEy3yIb_____________HChZCBzXvgXXHV8V6oZC"
access_token = "3154348417-u0al6vBfU___________YFQwjQJIjQHeMErdJVI"
access_token_secret = "0fZ5WxRDNfH________________tsqAtIkhAC0NQQaSVWx"

# I have masked my credentials; you need to get your own. (If you don't know where to get them, I believe you have missed the first blog of the Text Mining series.)

if(!require(twitteR)) install.packages("twitteR")
library(twitteR)
setup_twitter_oauth(Consumer_key,Consumer_secret,access_token,access_token_secret)
rm(list = ls())

#------------------CONNECTION DONE------------------------------------#

# We would now fetch tweets from Airtel and Vodafone India timeline

#Twitter name for Airtel India : airtelindia
#Twitter name for Vodafone India : VodafoneIN

airtel_tweets = userTimeline("airtelindia", n=500, since = "2016-01-01")
vodafone_tweets = userTimeline("VodafoneIN", n=500, since = "2016-01-01")

# we now get the text part of the tweets from both the extracts
airtel_tweets = sapply(airtel_tweets, function(x) x$getText())
vodafone_tweets = sapply(vodafone_tweets, function(x) x$getText())

# We now need to clean the extracted text, we will perform cleaning in two phases

# Cleaning Phase 1 -- Twitter specific cleaning -- For this we define a function
# gsub is a very useful function, do learn about it. We will write about it soon.

clean.twitter = function(x)
{
  # remove @ taggings
  x = gsub("@\\w+", "", x)
  # remove punctuation (this also strips "://", "." and "/" from links)
  x = gsub("[[:punct:]]", "", x)
  # remove what is left of the links, which still starts with http
  x = gsub("http\\w+", "", x)
  # collapse tabs, newlines and repeated spaces into a single space,
  # so that neighbouring words do not get merged
  x = gsub("[[:space:]]+", " ", x)
  # remove blank spaces at the beginning
  x = gsub("^ +", "", x)
  # remove blank spaces at the end
  x = gsub(" +$", "", x)
  # remove characters outside printable ASCII (emoji, non-English letters)
  x = gsub("[^\x20-\x7E]", "", x)
  return(x)
}

# Now we shall use the function defined above
airtel = clean.twitter(airtel_tweets)
vodafone = clean.twitter(vodafone_tweets)
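
To see what the Twitter-specific cleaning does, here is a minimal sketch that applies the gsub steps one by one. The tweet text is made up for illustration, and in this sketch runs of spaces are squeezed into a single space so that neighbouring words stay separate.

```r
# Hypothetical tweet, invented for illustration only
tweet <- "@someuser Check our new plan!! http://t.co/abc123 #offer"

step1 <- gsub("@\\w+", "", tweet)        # drop the @mention
step2 <- gsub("[[:punct:]]", "", step1)  # drop punctuation (the link becomes one long word)
step3 <- gsub("http\\w+", "", step2)     # drop what is left of the link
step4 <- gsub("[ \t]{2,}", " ", step3)   # squeeze runs of spaces/tabs into one space
step5 <- gsub("^ +| +$", "", step4)      # trim spaces at both ends

print(step2)
print(step5)
```

Note that the link is removed only because the punctuation step runs first: it turns the link into a single word starting with http, which the next pattern can then match as a whole.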

# Let us now make the consolidated vectors with all the tweets related to one entity together

airtel_1 = paste(airtel, collapse=" ")
vodafone_1 = paste(vodafone, collapse=" ")

# and now make it one vector, by putting everything in a single vector
The_one = c(airtel_1, vodafone_1)
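
A small sketch of what this consolidation produces, using made-up tweets for two hypothetical accounts (brand_a and brand_b are illustrative names, not part of the exercise): each account ends up as exactly one long string, and the final vector has one element per account.

```r
# Hypothetical cleaned tweets for two made-up accounts
brand_a <- c("new plan launched", "recharge offer today")
brand_b <- c("sorry for the trouble", "please share your number")

# collapse = " " joins all elements of one vector into a single string
doc_a <- paste(brand_a, collapse = " ")
doc_b <- paste(brand_b, collapse = " ")

# One element per account; each element later becomes one document in the corpus
docs <- c(doc_a, doc_b)
print(length(docs))
print(docs[1])
```

This is why the term matrix built later has exactly two columns: one document per entity.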

# Cleaning Phase 2 -- Generic cleaning, which is done by using tm package functions

if(!require(tm)) install.packages("tm")
library(tm)
corpus = Corpus(VectorSource(The_one ))

textCorpus = tm_map(corpus, content_transformer(tolower))
textCorpus = tm_map(textCorpus, removeWords, stopwords("english"))
textCorpus = tm_map(textCorpus, removeNumbers)
textCorpus = tm_map(textCorpus, stripWhitespace)
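
To get a feel for what these four steps do to a document, here is a rough base R imitation. It is only a sketch and not how tm implements them: the sentence is invented, and the stop-word list is a tiny hypothetical one (the list returned by stopwords("english") is much longer).

```r
# Hypothetical document and a tiny illustrative stop-word list
doc <- "We are offering 2 New Plans and the plans are live"
stop_words <- c("we", "are", "and", "the")

doc <- tolower(doc)                         # similar to content_transformer(tolower)

# Build one pattern that matches any stop word as a whole word
pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
doc <- gsub(pattern, "", doc, perl = TRUE)  # similar to removeWords

doc <- gsub("[[:digit:]]+", "", doc)        # similar to removeNumbers
doc <- gsub("[[:space:]]+", " ", doc)       # similar to stripWhitespace
doc <- trimws(doc)                          # extra trim, added here for readability

print(doc)
```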

# After cleaning, it is time to create the Term Document Matrix. Two such matrices can be made using the tm package :

1. Document Term Matrix (DTM) : A matrix that describes the frequency of the terms occurring in each document. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms.

2. Term Document Matrix (TDM) : The transpose of the DTM. In a term-document matrix, rows correspond to terms in the collection and columns correspond to documents.
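
The difference between the two matrices is easy to see on a toy example. The sketch below builds both by hand in base R from two invented documents; it does not use tm, and the names tdm_toy and dtm_toy are only illustrative.

```r
# Two hypothetical, already-cleaned documents
docs <- c(Doc1 = "plan offer plan", Doc2 = "sorry offer")

# Split each document into words
words <- strsplit(docs, " ", fixed = TRUE)

# Vocabulary: every distinct term, sorted alphabetically
terms <- sort(unique(unlist(words)))

# Term-document matrix: one row per term, one column per document
tdm_toy <- vapply(words,
                  function(w) as.integer(table(factor(w, levels = terms))),
                  integer(length(terms)))
rownames(tdm_toy) <- terms
colnames(tdm_toy) <- names(docs)
print(tdm_toy)

# Document-term matrix: simply the transpose
dtm_toy <- t(tdm_toy)
print(dtm_toy)
```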

# Back to code

tdm = as.matrix(TermDocumentMatrix(textCorpus))
head(tdm)

# It looks like the picture on the right; we now name columns 1 and 2 after their respective entities

colnames(tdm) = c("Airtel", "Vodafone")
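
Once the columns are named, the matrix can also be inspected directly, before drawing anything. The sketch below lists the most frequent terms per entity; the counts in toy are invented for illustration and are not real tweet data.

```r
# Toy term-document matrix with hypothetical counts (not real tweet data)
toy <- matrix(c(12, 3, 0, 7,
                 1, 9, 8, 7),
              ncol = 2,
              dimnames = list(c("plan", "sorry", "number", "offer"),
                              c("Airtel", "Vodafone")))

# Most frequent terms in each column, highest count first
top_airtel   <- head(sort(toy[, "Airtel"],   decreasing = TRUE), 2)
top_vodafone <- head(sort(toy[, "Vodafone"], decreasing = TRUE), 2)

print(top_airtel)
print(top_vodafone)
```

The same two lines should work on the real tdm, with a larger number in place of 2.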

Now we shall make two types of word cloud (the basic type we have already learnt previously):

Comparison Cloud : Used to check the contrast between two text corpora
Commonality Cloud : Used to check the terms common across various corpora

# Let's make a comparison cloud

if(!require(wordcloud)) install.packages("wordcloud")
library(wordcloud)
# comparison cloud
comparison.cloud(tdm, random.order=FALSE,
colors = c("#00B2FF", "red"),
title.size=1.5, max.words=500)
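
To understand why a word lands on one side of the comparison cloud, it helps to compute the numbers by hand. As I read the wordcloud documentation, each word is sized by how far its rate in one document exceeds its average rate across documents, and it is placed with the document where that excess is largest. The sketch below follows that description on invented counts; it is an illustration under that assumption, not the package's own code.

```r
# Toy term-document matrix with hypothetical counts (not real tweet data)
toy <- matrix(c(12, 3, 0, 5,
                 1, 9, 5, 5),
              ncol = 2,
              dimnames = list(c("plan", "sorry", "number", "offer"),
                              c("Airtel", "Vodafone")))

# Rate of each term within its own document (each column sums to 1)
rates <- sweep(toy, 2, colSums(toy), "/")

# Average rate of each term across the documents
avg_rate <- rowMeans(rates)

# How far each document is above or below that average, term by term
deviation <- rates - avg_rate

# Word size ~ largest deviation; word side ~ document where it occurs
size  <- apply(deviation, 1, max)
owner <- colnames(toy)[apply(deviation, 1, which.max)]
names(owner) <- rownames(toy)

print(round(size, 3))
print(owner)
```

A word used equally often by both handles (like "offer" here) gets a deviation of zero, so it barely shows up in a comparison cloud.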

We can see that the Airtel Twitter handle is mostly talking about its products, plans, features or events; Vodafone, on the other hand, is mainly replying to unsatisfied customers. This guy Amit, especially, is writing most of their tweets.

Thoughts that came to my mind : EITHER Airtel has got fewer complaints while Vodafone has got too many of those, OR Vodafone is more focused on customer satisfaction and hence uses its Twitter handle to reply to customers' complaints, unlike Airtel, which uses it for advertising its products.

One thing is sure : if I take a Vodafone connection, I will need to talk to this Amit one day.

# Let's now make a commonality cloud

commonality.cloud(tdm, random.order=FALSE,
colors = brewer.pal(8, "Dark2"))

Commonality Cloud gives an idea of which common terms two (or more) entities are using. In this case, there is not much that can be interpreted.
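
The idea of a "common" term can also be checked by hand. As I understand the wordcloud documentation, the default commonality measure is the minimum across documents, so a word scores high only if every entity uses it a lot. The sketch below applies that to invented counts; turning counts into per-document rates first is an assumption of this sketch.

```r
# Toy term-document matrix with hypothetical counts (not real tweet data)
toy <- matrix(c(12, 3, 0, 5,
                 1, 9, 5, 5),
              ncol = 2,
              dimnames = list(c("plan", "sorry", "number", "offer"),
                              c("Airtel", "Vodafone")))

# Rate of each term within its own document (assumed normalisation)
rates <- sweep(toy, 2, colSums(toy), "/")

# Commonality score: the smallest rate across the documents
common <- apply(rates, 1, min)

# Terms missing from any document score zero and would not be drawn
shown <- sort(common[common > 0], decreasing = TRUE)
print(shown)
```

Here "offer" comes out on top because both handles use it equally, while "number" drops out because one handle never uses it.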

What's next in the series :

We are going to cover a few more functions of the tm package, text association, sentiment analysis and much more. Till then ...

Enjoy reading our other articles and stay tuned with us.

Kindly do provide your feedback in the 'Comments' Section and share as much as possible.
