A Definitive Guide to Twitter Analytics using R

social media, word cloud, marketing

A Definitive Guide To Twitter Analytics Using R is an in-depth guide that covers text data mining techniques, Natural Language Processing, and clustering methods to extract insights.

Twitter analytics is one of the most powerful methods for analyzing the voice of the customer: product reviews, movie reviews, online surveys, brand marketing, and public reaction to major events in politics, sports, and many other areas.

In the last couple of years, social media has become an influential place to express emotion. Many researchers analyze tweets on Twitter, a microblogging website where users read and write millions of tweets on a variety of topics on a daily basis.

In this post, you will learn to analyze tweets, perform profile analysis, apply data pre-processing techniques, and use natural language processing and advanced machine learning techniques. You can perform such detailed analysis using R, Python, or any other advanced language.

The R programming language is widely used for statistical computing, data mining, text analytics, natural language processing, and building machine learning models. Due to these advantages, we will be using the R language and the RStudio IDE to perform this study.

So, let us get started.

Step 1: Install - R Programming And RStudio

The most important step is to set up your machine correctly, so let’s begin with the installation guide for the R programming language and RStudio.

If you have not downloaded R yet, follow along to complete the R software installation and get started with your journey into Twitter analysis. I am using a Windows 10 machine; for Mac users, some of the steps may not be relevant.

Installing R Programming Language

First, download the R language tool and complete the installation. You can refer to the image below if you are not sure how to proceed.

R programming language

Just search on Google and click on the first link. Based on your operating system, download the relevant R version and complete the installation. I am using Windows, so I will download R 4.0.2 for Windows.

R programming language

Installing RStudio

RStudio is an integrated development environment (IDE) for the R programming language. RStudio is available in two formats: a Desktop version and a Server version, which runs on a remote server.

RStudio offers a user-friendly environment to manage R projects and comes with many other friendly features that make learning quite fun.

RStudio Download

Download the RStudio Desktop version which is free and comes with all the necessary functionality required to complete this project.

Select the RStudio version, download and install it.

Install R Studio

Go to the Windows Start menu; you will find the RStudio icon there. This completes the RStudio installation.

Step 2: Setting Up Twitter App

Create Twitter Account And login

The next important step is to create a Twitter app; Twitter provides APIs to interact with the application. To create a developer account, you need to sign up for a Twitter account. Go to twitter.com and create your account. Also, update and verify your contact details before applying for a developer account.

Twitter Developer Login

Now, head over to the Twitter developer site using your Twitter credentials. If you are logging in for the first time, you will not see any apps on the home page. Once you are logged into the developer account, click on Apps at the top right of the page.

twitter developer login

Create Twitter App

It shows “No apps here”. Now, click on Create an app at the top right of the page.

twitter app

Twitter has updated the app creation process, and the options have changed. Select “Academic” and proceed to the next page.

twitter app 1

There are a couple of questions that you need to answer before getting to the final page. Make sure you clearly state how you are going to consume the API and what services you need to perform your study.

Finally, proceed to the next page; you will receive a confirmation email to complete the application process. You have successfully applied for a Twitter developer account.

Twitter App Keys And Secrets

To access twitter APIs, you need to generate the keys and secrets. For example, I have provided machinelearningsol as my app name here.

twitter get keys

You can access your app page; there you can find the required keys and secrets. Keys and secrets are confidential information, so make sure you do not share them with anyone else.

When you logged in for the first time, you may need to generate the Access Token and Secret. 

twitter app keys and secrets

Congratulations! You have successfully created your Twitter developer account and app. You now have all the prerequisites to begin your Twitter analytics journey.

Step 3: The Tweet And Its Structure

Before starting Twitter analytics, it is important to understand tweets and their components. Here is a quick summary of the information we receive through the Twitter API. Do not worry; we will look into the R code and relate everything back to it.

A tweet is a social media message posted on Twitter.com. Tweets were originally restricted to 140 characters, though Twitter has since doubled that limit to 280 characters. While most tweets contain mostly text, it is possible to embed URLs, pictures, videos, vines, and GIFs.

Tweets contain components called hashtags, which are words that capture the subject of the tweet. They are prefixed by the ‘#’ character. Usernames or handles of those who post are recognized by the ‘@’ symbol. A user can direct a message to another user by adding the handle, with the ‘@’ symbol.

A retweet (‘rt’ for short) is a tweet by a user X that has been shared by user Y to all of Y’s followers. Thus, retweets are a way of measuring how popular the tweet is.

A user can ‘favorite’ a tweet; this is analogous to a ‘Like’ on Facebook. A Reply on twitter means responding to a message or tweet from a person while to retweet is to broadcast (like forwarding an email) a tweet or message posted by a person to others.

There are two ways to reply to tweets. There is a @ reply where you use “@username” in your message. Such replies are public i.e. visible on your Twitter page. If you want to reply privately, you can send a DM (direct message) which is sent only to the recipient like a private e-mail.

A few other pieces of information, such as the username, timestamp, and location, can also play an important role in understanding the origin of tweets, profile inclination, and timelines.
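Before touching the API, you can see these components in action with plain R pattern matching. The sketch below is illustrative only (the sample tweet is invented); it uses base R regular expressions to pull out the hashtags, handles, and retweet marker described above.

```r
# A minimal base-R sketch (separate from the twitteR workflow) that extracts
# the components of a made-up sample tweet.
tweet <- "RT @newsdesk: Big rally today with @POTUS #Election2020 #USA"

# Hashtags start with '#', handles with '@'; retweets start with "RT "
hashtags   <- regmatches(tweet, gregexpr("#\\w+", tweet))[[1]]
handles    <- regmatches(tweet, gregexpr("@\\w+", tweet))[[1]]
is_retweet <- grepl("^RT ", tweet)
```

The same patterns (`#\\w+`, `@\\w+`, `^RT`) reappear later when we clean and analyze the real tweet text.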

Step 4: Setting Up The Handshake Between Twitter App And R

The handshake is done using the OAuth mechanism. The consumer key, consumer secret, access token, and access token secret are created in the Twitter app management portal, and then the connection is established with setup_twitter_oauth() to pull the data.

install.packages("ROAuth")
install.packages("twitteR")
install.packages("base64enc")
library (twitteR)
library(ROAuth)
library(base64enc)

# Keys & Secrets
consumerKey <- "XXXX"        # Enter your consumer key here
consumerSecret <- "XXXX"     # Enter your consumer secret here
accessToken <- "XXXX"        # Enter your access token here
accessTokenSecret <- "XXXX"  # Enter your access token secret here

# Download "cacert.pem" file
download.file(url="http://curl.haxx.se/ca/cacert.pem",destfile="cacert.pem")
setup_twitter_oauth(consumerKey,consumerSecret,accessToken,accessTokenSecret)

Step 5: Extracting Tweets For Keyword "Donald Trump"

Once the handshake between the Twitter app and R is in place, you can start extracting tweets with the twitteR package. You need to install a few supporting packages as well.

Searching for "Donald Trump" Tweets

You can search for tweets using keywords, a user timeline, or hashtags. Use the searchTwitter function with a few parameters: the keyword or hashtag, the number of tweets to extract (n), and the language of the tweets. If you are analyzing events that span multiple days, don’t forget to provide the ‘since’ and ‘until’ parameters.

There is a limitation on how far back you can search for tweets, so it is a good idea to save these tweets in CSV files.

tweets <- twListToDF(searchTwitter("Donald Trump", n=10000, lang="en"))
write.csv(tweets, <CSV file path>, row.names = FALSE)

I chose to extract tweets containing the phrase “Donald Trump”, set the number of tweets to extract to 10,000, and set the language to English. Why Donald Trump? He is one of the most followed personalities on Twitter, and his tweets attract a lot of attention, so I am sure this keyword will fetch a good number of tweets to start our analysis.

The twListToDF function from the twitteR package transforms the tweets into a tidy data frame, and write.csv saves them to a CSV file.

Step 6: Analyse The Tweets

Now, let’s dive in and analyze the tweets. Load the extracted tweets into the RStudio workspace using the read.csv function, setting ‘stringsAsFactors’ to FALSE to load string variables as plain strings.

Create another R script on Rstudio, and import and load all the required packages.

###########################################################################
# Import Packages
###########################################################################
install.packages('ggplot2')
install.packages('dplyr')
install.packages('tidyverse')
install.packages('igraph')
install.packages('tm')
install.packages('wordcloud')
install.packages('syuzhet')      # sentiment analysis (used in Step 7)
install.packages('tidytext')     # n-gram tokenization (used in Step 8)
install.packages('ggraph')       # bigram network plots (used in Step 8)
install.packages('topicmodels')  # LDA topic modelling (used in Step 8)

###########################################################################
# Load Packages
###########################################################################
library(ggplot2)
library(dplyr)
library(tidyverse)
library(igraph)
library(tm)
library(wordcloud)
library(syuzhet)
library(tidytext)
library(ggraph)
library(topicmodels)

################################################################################
# Load the tweet file for keyword Donald Trump
################################################################################
dt_tweets = read.csv("<CSV file path>", stringsAsFactors = FALSE)
print(colnames(dt_tweets)) #Columns Name

Tweets, Retweets, And Likes

A retweet is a tweet by a user X that has been shared by user Y with all of Y’s followers. It is like sharing on Facebook. Usually, a higher retweet count means higher popularity.

You need to start your analysis by finding the organic tweets. To find the organic tweets, you need to eliminate the retweets and replies, then arrange them in decreasing order of favorite count and retweet count.

ggplot is used to plot bar charts showing the top 10 most liked and most retweeted tweets.

################################################################################
# Organic Tweets, Retweets and Replies
################################################################################
dt_tweets_organic = dt_tweets[dt_tweets$isRetweet==FALSE,]
dt_tweets_organic <- subset(dt_tweets_organic,
                          is.na(dt_tweets_organic$replyToSN))
dt_tweets_organic <- dt_tweets_organic %>% arrange(-favoriteCount)
tmp=dt_tweets_organic[1:10,]
ggplot(tmp, aes(x=reorder(text, favoriteCount), y=favoriteCount))+
geom_bar(stat="identity", fill =  "#377F97")+
ggtitle("Top 10 Most Liked Tweets")+
labs(x="Organic Tweets", y=("Favorite Count"))+
theme(axis.text.y = element_text(face="bold", color="black", size=8, hjust = 1), 
      plot.title = element_text(hjust = 0.5))+
coord_flip()+
scale_x_discrete(labels = function(x) str_wrap(x, width = 80))
most liked tweets twitter analysis
# Sort the organic tweets by retweet count, and then select only first 10 tweets
dt_tweets_organic <- dt_tweets_organic %>% arrange(-retweetCount)
tmp=dt_tweets_organic[1:10,]
ggplot(tmp, aes(x=reorder(text, retweetCount), y=retweetCount))+
geom_bar(stat="identity", fill =  "#377F97")+
ggtitle("Top 10 Most Retweeted Tweets")+labs(x="Organic Tweets", y=("Retweet Count"))+
theme(axis.text.y = element_text(face="bold", color="black", size=8, hjust = 1), 
      plot.title = element_text(hjust = 0.5))+coord_flip()+
scale_x_discrete(labels = function(x) str_wrap(x, width = 80))
Twitter Analysis - most retweeted tweets

Distribution of Tweets, Retweets And Replies

Usually, retweets outnumber organic tweets. You can retweet your own tweet or a tweet from someone else. People use ‘RT’ to indicate that it’s a retweet. You can also think of a retweet as a share. A reply, on the other hand, is a response to another person’s tweet.

The ratio of organic tweets, retweets, and replies helps us understand how well the tweet’s author connects with their followers. Therefore, it is preferable to have a good balance among them.
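To make the idea concrete, here is a tiny base-R sketch that turns hypothetical organic/retweet/reply counts (invented numbers, not taken from the dataset) into percentage shares:

```r
# Hypothetical counts for illustration only
counts <- c(Organic = 2200, Retweets = 7100, Replies = 700)
share <- round(100 * counts / sum(counts), 1)  # percentage share of each type
```

The pie chart built below with ggplot and coord_polar visualizes exactly these proportions, computed from the real dataset.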

################################################################################
# Tweets Vs Retweets Vs Replies
################################################################################
retweets = dt_tweets[dt_tweets$isRetweet==TRUE,]
replies <- subset(dt_tweets, !is.na(dt_tweets$replyToSN))
# Create Dataframe
tmp <- data.frame(
type=c("Organic", "Retweets", "Replies"),
count=c(dim(dt_tweets_organic)[1], dim(retweets)[1], dim(replies)[1]))
ggplot(tmp, aes(x="", y=count, fill=type))+
geom_bar(stat="identity")+  
ggtitle("Organic Tweets Vs Retweets Vs Replies")+
labs(x="Tweets Vs Retweets", y="Frequency")+
theme(plot.title = element_text(hjust = 0.5))+
coord_polar("y", start=0)

Tweet Frequency Per Minute

Tweet frequency is another important insight to look for. Just imagine: there are more than 10,000 tweets/retweets/replies here in less than 30 minutes. This indicates a strong presence and popularity on the social media platform, and it is also one of the reasons why I selected this keyword.

You can plot a timeline chart on any timeframe (monthly, weekly, daily, or hourly). As I have data for only 25 minutes, I am plotting the timeline by minute. On average, about 400 tweets/retweets/replies occurred per minute.

################################################################################ 
# When People tweeted or Retweeted most - min-by-min Analysis
################################################################################
#Time_Analysis=lapply(dt_tweets$created, FUN=function(x) strsplit(x,split = ":")[[1]][2])
Time_Analysis = lapply(dt_tweets$created, FUN=function(x) substr(x[[1]], 11, 16))
tmp=data.frame(table(unlist(Time_Analysis)))
# Plot
ggplot(data=tmp, aes(x=Var1, y=Freq, group=1)) +
geom_line(size=1, color="red")+
geom_point(size=3, color="red")+
ggtitle("Number of tweets by Minutes")+
labs(x="TimeStamp", y="Frequency - Tweets/Retweets/Replies")+
theme(plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(size=10, angle=35))
Twitter Analysis - Tweets by minutes

Tweets Origination Source

Let’s say you head the marketing campaign team, and your team launched a campaign a week ago targeting desktop users. When you analyzed the response report, you were surprised by the low turnaround. What happened? Had you forgotten to consider this important customer-device engagement metric?

Understanding this metric helps you to increase your reach to potential customers.

################################################################################  
# Tweets Source
################################################################################
source <- lapply(dt_tweets$statusSource,
          FUN=function(x) strsplit(strsplit(x, split = ">")[[1]][2],
                                   split = "<")[[1]][1])
source <- gsub('for ', '', gsub('Twitter ', '', source))
tmp <- data.frame(table(unlist(source)))
tmp <- tmp %>% arrange(-Freq)
tmp <- tmp[1:5,]
ggplot(tmp, aes(x=Var1, y=Freq, fill=Var1))+
geom_bar(stat="identity")+  
ggtitle("Tweets Source Frequency")+
labs(x="Tweets Source", y="Frequency")+
theme(plot.title = element_text(hjust = 0.5))

Hashtags Analysis

A hashtag starts with a “#” symbol and is used to index keywords. You can find trending topics or keywords by filtering on hashtags, which makes searching super-efficient.

You can extract hashtags using the grep function and then summarise them to find the most trending topics.

####################################################################
# Popular Hash Tags
####################################################################
tags <- function(x) toupper(grep("^#", strsplit(x, " +")[[1]], value = TRUE))
l=nrow(dt_tweets_organic)

# Create a list of the tag sets for each tweet
taglist <- vector(mode = "list", l)

# and populate it
for (i in 1:l) taglist[[i]] <- tags(dt_tweets_organic$text[i])
tmp=table(unlist(taglist))
tmp=sort(tmp, decreasing = T) %>% as.data.frame()
tmp$Var1=gsub("#","",tmp$Var1)
tmp$Var1=gsub(",","",tmp$Var1)
tmp=tmp[1:10,]
ggplot(tmp, aes(x=reorder(Var1, Freq), y=Freq))+
geom_bar(stat="identity", fill = "#377F97")+
ggtitle("Popular Hashtags")+labs(x="Hashtags", y="Frequency")+
theme(axis.text.x = element_text(angle = 70,face="bold",
color="black", size=12, hjust = 1),
plot.title = element_text(hjust = 0.5))+coord_flip()
Twitter Analysis - Popular Hashtags

Find Most Active Users

You may have noticed that a few people are super active and can spread information in no time. A similar logic applies on Twitter too.

You can find the most active users by aggregating the ‘screenName’ column.

################################################################################
# Most Active People - Tweets/Retweets/Replies
################################################################################
tmp=dt_tweets$screenName
tmp=sort(table(tmp), decreasing = T)[1:25]
tmp=data.frame(tmp)
ggplot(tmp, aes(x=reorder(tmp, Freq), y=Freq))+
geom_bar(stat="identity", fill =  "#377F97")+
ggtitle("Most Active People - Tweets/Retweets/Replies")+
labs(x="Active Users", y="Frequency")+
theme(axis.text.x = element_text(angle = 70,face="bold", 
      color="black", size=12, hjust = 1),
plot.title = element_text(hjust = 0.5))+
coord_flip()
Twitter Analysis - Most Active Users

Find Most Popular Users

You can combine ‘screenName’ and ‘retweetCount’ to find the most popular users in your network. These people likely have a strong follower base, and many Twitter users admire them.

################################################################################
# Most Popular Users on Twitter whose tweets trended
################################################################################
tmp = dt_tweets_organic # retweets are already excluded from this data frame
tmp=tmp %>%
select(screenName, retweetCount) %>% 
group_by(screenName) %>% 
summarise(retweetCount=sum(retweetCount)) %>% 
arrange(desc(retweetCount)) %>% 
as.data.frame()
ggplot(tmp[1:10,], aes(x=reorder(screenName, retweetCount), y=retweetCount))+
geom_bar(stat="identity", fill =  "#377F97")+coord_flip()+ggtitle("Most Popular Twitter Users")+
labs(x="Retweets", y="Frequency")+
theme(axis.text.x = element_text(angle = 0,color="black", size=12, hjust = 1),
      plot.title = element_text(hjust = 0.5))
Twitter Analysis - Most Popular Users

Find Influencers

An influencer is someone with a large follower base and a global presence. You can reach a larger audience through influencers. You can find them simply by analyzing the most retweeted and most mentioned users.

# Most Influential Twitter Handle - define a Twitter Handle extractor function
handles <- function(x) toupper(grep("^@", strsplit(x, " +")[[1]], value = TRUE))
# Collect the text of retweets only
rt_texts <- dt_tweets$text[dt_tweets$isRetweet == TRUE]
l = length(rt_texts)
# Create a list of the handle sets for each retweet
handleslist <- vector(mode = "list", l)
# ... and populate it
for (i in 1:l) handleslist[[i]] <- handles(rt_texts[i])
tmp=table(unlist(handleslist))
tmp=sort(tmp, decreasing = T) %>%
as.data.frame()
# Top Retweeted Users
Retweets = tmp[grep(":", tmp$Var1),]
Retweets$Var1=gsub("@","",Retweets$Var1)
Retweets$Var1=gsub(":","",Retweets$Var1)
ggplot(Retweets[1:15,], aes(x=reorder(Var1,Freq), y=Freq))+geom_bar(stat="identity",fill =  "#377F97")+
ggtitle("Top Retweeted Users")+labs(x="Twitter Handles", y="Frequency")+
theme(axis.text.x = element_text(angle = 70,face="bold", color="black", size=12, hjust = 1),
      plot.title = element_text(hjust = 0.5))+coord_flip()

# Top Mentioned Users
Mentions = tmp[-grep(":", tmp$Var1),]
Mentions$Var1=gsub("@","",Mentions$Var1)
ggplot(Mentions[1:15,], aes(x=reorder(Var1, Freq), y=Freq))+geom_bar(stat="identity", fill =  "#377F97")+
ggtitle("Most Mentioned ScreenName")+labs( x="Twitter Handles", y="Frequency")+
theme(axis.text.x = element_text(angle = 70,face="bold", color="black", size=12, hjust = 1),
      plot.title = element_text(hjust = 0.5))+
coord_flip()

Retweet Network

Retweet network diagrams are helpful in social and behavioral studies. They are similar to social networks and help us find clusters among users. You can also analyze users’ political and social orientation.

You can use the same concept to create many other useful networks to analyze the behavior.

################################################################################
# Retweet Network
################################################################################
alltweets<-dt_tweets[1:200,]
sp = split(alltweets, alltweets$isRetweet)
rt = mutate(sp[['TRUE']], sender = substr(text, 5, regexpr(':', text) - 1))
el = as.data.frame(cbind(sender = tolower(rt$sender), receiver = tolower(rt$screenName)))
el = count(el, sender, receiver)
el[1:5,] #show the first 5 edges in the edgelist

# Based on the edge-list, create a retweet network.
rt_graph <- graph_from_data_frame(d=el, directed=T)
glay = layout.fruchterman.reingold(rt_graph)
par(bg="gray15", mar=c(1,1,1,1))
plot(rt_graph, layout=glay,
   vertex.color="gray25",
   vertex.size=(degree(rt_graph, mode = "in")), #sized by in-degree centrality
   vertex.label.family="sans",
   vertex.shape="circle",  #can also try "square", "rectangle", etc. More in igraph manual
   vertex.label.color=hsv(h=0, s=0, v=.95, alpha=0.5),
   vertex.label.cex=(degree(rt_graph, mode = "in"))/300, #sized by in-degree centrality
   edge.arrow.size=0.8,
   edge.arrow.width=0.5,
   edge.width=edge_attr(rt_graph)$n/10, #sized by edge weight
   edge.color=hsv(h=.95, s=1, v=.7, alpha=0.5))
par(bg="white", mar=c(1,1,1,1))
Twitter Analysis - Retweet Network Diagram

Understand Context - Most Frequent Words

You cannot read through all the tweets to understand their context. Maybe you could for a small sample, but what will you do if you need to analyze millions of tweets?

Most-frequent-word analysis helps summarise all the tweets. This report gives you a snapshot of the content, and you can also derive various topics from it.

Tweets comprise numbers, URLs, links, non-English words, punctuation, and plenty of other irrelevant information. Make sure to clean your text data to get meaningful insights.

Another major concern in text cleaning is stopwords. Stopwords are common English words that appear in almost every sentence but have no analytical significance; examples are ‘is’, ‘but’, ‘shall’, and ‘by’. These words are removed by matching the corpus against the stopword list in R’s “tm” package.

Tweets may contain many non-English words that do not add any value. Analyze frequent non-English words and remove them from the study.

Words can have multiple forms that all derive from a common root word. For example, ‘like’, ‘liked’, ‘likely’, and ‘liking’ derive from the root word ‘like’ and are therefore stemmed to it.

Punctuation, whitespace, and stray characters make no difference to the analysis; you can remove all of these too.

############################################################################################################
# Most Frequent Words
############################################################################################################
# Clean Function
clean_text = function(x)
{
x = gsub("(RT|via)((?:\\b\\W*@\\w+)+)", " ", x) # remove Retweet
x = gsub("@\\w+", " ", x) # remove at(@)
x = gsub("[[:punct:]]", " ", x) # remove punctuation
x = gsub("[[:digit:]]", " ", x) # remove numbers/Digits
x = gsub('[[:cntrl:]]', ' ', x) # remove control characters
x = gsub("http[[:alnum:]]*", " ", x) # remove url links
x = gsub("http\\w+", " ", x) # remove links http
x = gsub("[ |\t]{2,}", " ", x) # remove tabs
x = gsub(" . ", " ", x) # remove single character words
x = gsub("^ +", "", x) # remove blank spaces at the beginning
x = gsub(" +$", "", x) # remove blank spaces at the end
x = gsub('\\s+', ' ', x) # collapse repeated whitespace into one space
x <- iconv(x, 'UTF-8', 'ASCII')
x = gsub("edUAUB\\S*"," ",x)
x = gsub('Tmp F.*', ' ', x, ignore.case=T)

try.error = function(z) #To convert the text in lowercase
{
y = NA
try_error = tryCatch(tolower(z), error=function(e) e)
if (!inherits(try_error, "error"))
y = tolower(z)
return(y)
}
x = sapply(x,try.error)
return(x)
}

# Creating Corpus
myCorpus <- Corpus(VectorSource(clean_text(dt_tweets_organic$text)))

# convert to lower case
myCorpus <- tm_map(myCorpus, content_transformer(tolower))

# Removing stop words
myStopwords <- c(stopwords("english"), "will","get", "tht","also","cha","trp",
"amp", "donald","trump", "can")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

## keep a copy of corpus to use later as a dictionary for stem completion
myCorpusCopy <- myCorpus

# stem words
myCorpus <- tm_map(myCorpus, stemDocument)

# Stem completetion
mycorpus_vec<-stemCompletion(myCorpus,myCorpusCopy,"prevalent")
mycorpus<-Corpus(VectorSource(mycorpus_vec))

# Covert corpus into term document
tdm <- TermDocumentMatrix(mycorpus,control = list(removePunctuation = TRUE,
stripWhitespace=TRUE,
stopwords=myStopwords,
removeNumbers = TRUE,
tolower = TRUE))

The data is cleaned and arranged in a term-document matrix, or TDM. A term-document matrix is a mathematical matrix that describes the frequency of terms occurring in a collection of documents: the rows of the matrix represent the terms, and the columns represent the documents (here, the tweets) to be analyzed.
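To make the structure concrete, here is a toy term-document matrix built with base R, using two invented "documents" (these sample strings are for illustration only; the real TDM above is built by the tm package):

```r
# Toy term-document matrix: rows are terms, columns are documents,
# and each cell holds the frequency of that term in that document.
docs <- c("make america great", "america votes today")
terms <- sort(unique(unlist(strsplit(docs, " "))))
tdm_toy <- sapply(docs, function(d) {
  words <- unlist(strsplit(d, " "))
  sapply(terms, function(t) sum(words == t))
})
rownames(tdm_toy) <- terms
```

Summing each row of such a matrix (as rowSums does in the code below) gives the total frequency of each term across all documents.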

term.freq <- rowSums(as.matrix(tdm))
term.freq <- subset(term.freq, term.freq >=10)
tmp <- data.frame(term = names(term.freq), freq = term.freq)
ggplot(tmp, aes(reorder(term, freq),freq))+
theme_bw()+ 
geom_bar(stat = "identity", fill =  "#377F97" )+ 
coord_flip()+
labs(title="Most Frequent Terms", y="Frequency", x="Terms")+
theme(plot.title = element_text(hjust = 0.5))
Twitter Analysis - Most Frequent Terms

Wordcloud

You can generate a word cloud using the R package wordcloud. It is one of the most popular ways to visualize and analyze qualitative data. It uses word-frequency pairs to generate an image in which the size of each word represents its frequency in the corpus.

################################################################################
# Wordcloud
################################################################################
word.freq <-sort(rowSums(as.matrix(tdm)), decreasing= F)
pal<- brewer.pal(8, "Dark2")
wordcloud(words = names(word.freq), freq = word.freq, min.freq = 3, random.order = F,
        colors = pal, max.words = 1000)

Step 7: Twitter Sentiment Analysis

Sentiment Analysis captures the overall tone of users. It refers to determining the opinions or sentiments expressed on different features or aspects of entities. 

Sentiment analysis helps to discover the overall sentiments around the topic and useful in understanding the review, survey responses, product perception, marketing, and many other fields.

################################################################################
# Sentiment Analysis
################################################################################
mysentiment <- get_nrc_sentiment(dt_tweets_organic$text) # from the syuzhet package
# Get the sentiment score for each emotion
mysentiment.positive =sum(mysentiment$positive)
mysentiment.anger =sum(mysentiment$anger)
mysentiment.anticipation =sum(mysentiment$anticipation)
mysentiment.disgust =sum(mysentiment$disgust)
mysentiment.fear =sum(mysentiment$fear)
mysentiment.joy =sum(mysentiment$joy)
mysentiment.sadness =sum(mysentiment$sadness)
mysentiment.surprise =sum(mysentiment$surprise)
mysentiment.trust =sum(mysentiment$trust)
mysentiment.negative =sum(mysentiment$negative)

# Create the bar chart
yAxis <- c(mysentiment.positive,
           mysentiment.anger,
           mysentiment.anticipation,
           mysentiment.disgust,
           mysentiment.fear,
           mysentiment.joy,
           mysentiment.sadness,
           mysentiment.surprise,
           mysentiment.trust,
           mysentiment.negative)
xAxis <- c("Positive","Anger","Anticipation","Disgust","Fear","Joy","Sadness",
           "Surprise","Trust","Negative")
sent = data.frame(xAxis = xAxis, yAxis = yAxis)
ggplot(sent, aes(x=xAxis, y=yAxis, fill=as.factor(xAxis)))+
geom_bar(stat = "identity")+
labs(title="Sentiment Analysis", y="Sentiment Score", x="Sentiment Category")+
theme(plot.title = element_text(hjust = 0.5))+
guides(fill=guide_legend(title="Sentiment"))

Step 8: Natural Language Processing - Advanced Twitter Analytics

Word Associations

You can explore the data in terms of associations by looking at collocations: terms that frequently co-occur. Using word associations, you can find how strongly certain terms are correlated and how they appear together in tweets. Sometimes this reveals interesting word associations.

################################################################################
# Word Associations
################################################################################
findAssocs(tdm, c("president", "economy", "racist", "obama"),
         c(0.2, 0.2, 0.2, 0.2))
Twitter Analysis - Word Associations
# Word Association Network (the term-network plot requires the Rgraphviz package)
freq.terms <- findFreqTerms(tdm, lowfreq=8)
plot(tdm, term = freq.terms, corThreshold =0.05, weighting = F,
   attrs=list(node=list(width=15, fontsize=40, 
   fontcolor="blue", color="red")))

Bi-Grams Analysis

N-grams are used extensively in text mining and natural language processing tasks. An n-gram is a set of co-occurring words within a given window. For example, take the sentence “The cow jumps over the moon”. With N=2 (known as bigrams), the n-grams would be:

  • the cow
  • cow jumps
  • jumps over
  • over the
  • the moon
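As a quick sanity check, the same bigrams can be produced in a couple of lines of base R. This sketch is illustrative only and is separate from the tidytext pipeline used on the real tweets:

```r
# Build bigrams from the example sentence by pairing each word with its successor
sentence <- "the cow jumps over the moon"
words <- unlist(strsplit(sentence, " "))
bigrams <- paste(words[-length(words)], words[-1])
# bigrams now holds: "the cow" "cow jumps" "jumps over" "over the" "the moon"
```

unnest_tokens with token = "ngrams" and n = 2 in the code below applies the same sliding-window idea to every tweet at once.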
################################################################################
# Bigram Analysis
################################################################################
tmp=dt_tweets_organic
tmp$text<-clean_text(dt_tweets_organic$text)
myStopwords <- c(stopwords("english"), "will","get", "tht","also","cha","trp","amp", "yomk8mxlx1",
               "oec9ymddlt")
# Remove Single letter words
tmp$text=gsub('\\b\\w{1,2}\\b', " ", tmp$text) 
tidy_descr_ngrams <- tmp %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>% 
separate(bigram, c("word1", "word2"), sep = " ") %>%
filter(!word1 %in% myStopwords) %>%
filter(!word2 %in% myStopwords) %>% 
select(word1, word2)

bigram_counts <- tidy_descr_ngrams %>%
dplyr::count(word1, word2, sort = TRUE)
bigram_graph <- bigram_counts[1:100,] %>%
filter(n > 1) %>%
graph_from_data_frame()
set.seed(1)
a <- grid::arrow(type = "closed", length = unit(.15, "inches"))
ggraph(bigram_graph, layout = "nicely") +
geom_edge_link(aes(edge_alpha = n), show.legend = FALSE,
               arrow = a, end_cap = circle(.07, 'inches')) +
geom_node_point(color =  "#377F97", size = 3, alpha = 0.9) +
geom_node_text(aes(label = name), vjust = 1, hjust = 1) +
theme_void()+
theme(plot.title = element_text(hjust = 0.5))

Topic Modelling - Latent Dirichlet Allocation

Topic modeling is a process that automatically deduces the themes of the text. It is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents. 

A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about subject A and 90% about subject B, there would probably be about 9 times more words about ‘B’ than words about ‘A’ (Wikipedia, n.d.).

Topic modeling can be implemented with various algorithms, but the most commonly used one is Latent Dirichlet Allocation (LDA).
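LDA operates on a document-term matrix: one row per document, one column per vocabulary term, with each cell holding a word count. The code below builds such a matrix from the tweet corpus with the tm package; as a toy illustration of what that input looks like, here is a base-R sketch for two made-up documents (`docs`, `dtm_toy`, and the document contents are my own illustrative names and data).

```r
# Illustrative sketch: the document-term count matrix that LDA consumes,
# built by hand in base R for two toy documents
docs <- c("cow jumps moon", "moon shines bright")
tokens <- strsplit(docs, " ")
vocab <- sort(unique(unlist(tokens)))
# one row per document, one column per term, cells are word counts
dtm_toy <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))
dtm_toy
```

The real pipeline below produces the same structure via `as.DocumentTermMatrix()`, just over thousands of tweets rather than two toy strings.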

################################################################################
# Topic Modelling
################################################################################
dtm <- as.DocumentTermMatrix(tdm)
rowTotals <- apply(dtm, 1, sum)
NullDocs <- dtm[rowTotals == 0, ]  # empty documents, set aside
dtm <- dtm[rowTotals > 0, ]        # keep only documents with at least one term

lda3 <- LDA(dtm, k = 3)  # fit a 3-topic model
term3 <- terms(lda3, 8)  # first 8 terms of every topic
(term <- apply(term3, MARGIN = 2, paste, collapse = ", "))
Twitter Analysis - Topic Modelling

Comparison Wordcloud

A comparison cloud is similar to a word cloud, but it contrasts two or more groups of documents. Here it lets us compare the words most strongly associated with each emotion.

##########################################################################################################
# Comparison Wordcloud
##########################################################################################################
# function to make the text suitable for analysis
clean.text = function(x)
{
  # tolower
  x = tolower(x)
  # remove rt
  x = gsub("rt", " ", x)
  # remove at
  x = gsub("@\\w+", " ", x)
  # remove punctuation
  x = gsub("[[:punct:]]", " ", x)
  # remove numbers
  x = gsub("[[:digit:]]", " ", x)
  # remove links http
  x = gsub("http\\w+", " ", x)
  # remove tabs
  x = gsub("[ |\t]{2,}", " ", x)
  # remove blank spaces at the beginning
  x = gsub("^ ", " ", x)
  # remove blank spaces at the end
  x = gsub(" $", " ", x)
  x = gsub("edUAUB\\S*", " ", x)
  x = gsub("eduaub\\S*", " ", x)
  return(x)
}

# function to get various sentiment scores, using the syuzhet package
scoreSentiment = function(tab)
{
  tab$syuzhet = get_sentiment(tab$text, method = "syuzhet")
  tab$bing = get_sentiment(tab$text, method = "bing")
  tab$afinn = get_sentiment(tab$text, method = "afinn")
  tab$nrc = get_sentiment(tab$text, method = "nrc")
  emotions = get_nrc_sentiment(tab$text)
  n = names(emotions)
  for (nn in n) tab[, nn] = emotions[nn]
  return(tab)
}

tweets = scoreSentiment(dt_tweets_organic)

# emotion analysis: anger, anticipation, disgust, fear, joy, sadness, surprise, trust
# put everything in a single vector
all = c(
  paste(tweets$text[tweets$anger > 0], collapse = " "),
  paste(tweets$text[tweets$anticipation > 0], collapse = " "),
  paste(tweets$text[tweets$disgust > 0], collapse = " "),
  paste(tweets$text[tweets$fear > 0], collapse = " "),
  paste(tweets$text[tweets$joy > 0], collapse = " "),
  paste(tweets$text[tweets$sadness > 0], collapse = " "),
  paste(tweets$text[tweets$surprise > 0], collapse = " "),
  paste(tweets$text[tweets$trust > 0], collapse = " ")
)
# clean the text
all = clean.text(all)
# remove stop-words, adding extra domain-specific stop words
all = removeWords(all, c(stopwords("english"), "will", "get", "tht", "also", "cha", "trp", "amp"))
# create corpus
corpus = Corpus(VectorSource(all))
# create term-document matrix
tdm = TermDocumentMatrix(corpus)
# convert to matrix
tdm = as.matrix(tdm)
# add column names
colnames(tdm) = c('anger', 'anticipation', 'disgust', 'fear',
                  'joy', 'sadness', 'surprise', 'trust')
# plot comparison word cloud
comparison.cloud(tdm,
                 colors = c("#00B2FF", "red", "#FF0099", "#6600CC",
                            "green", "orange", "blue", "brown"),
                 max.words = 3000, scale = c(3, .4),
                 random.order = FALSE, title.size = 1.5)

Conclusion

In this post, you have learned most of the core concepts of Twitter analysis using RStudio. You can perform a similar study for a user's timeline as well. This is just the beginning: you can analyze text data along many dimensions using the R programming language. I am leaving it up to you to experiment with the following concepts.

  • Apply Clustering techniques – K-Means and Hierarchical Clustering
  • Train a sentiment analysis model using TF-IDF, word2vec, and long short-term memory (LSTM) networks
  • Political alignment analysis

Twitter analytics is a popular way to understand public sentiment, emotions, and perception, and you can apply these methods in many business domains.

This post is already longer than I expected, but I am sure you have learned many important concepts of Twitter analysis.

Click here to read about the Twitter end-to-end web application.

Happy Learning.
