Tapping Twitter: Public Perception of Goods & Services Tax (GST) – Part 2

This is the second part of our series on Twitter analysis of GST. We recommend reading Part 1 first. Click here to read Part 1.

To demonstrate this approach, we have taken 12th November as the sample day for this section. This day had a total of 39,130 tweets. It was the day immediately after the GST Council meeting, at which major changes in the rate structure were announced. It also coincides with a peak in activity due to the Gujarat elections.

Tweets Vs Retweets

A retweet (‘RT’ for short) is a tweet by a user X that has been shared by user Y with all of Y’s followers. Retweets are thus a way of measuring how popular a tweet is, and retweeting is a lower-effort form of participation than writing a tweet.

As per the data, the number of retweets is four times the number of tweets. This trend is seen in most of the data we have collected. It shows that Twitter works on network effects and followers: an influential person posts his/her thoughts, and their followers retweet that particular tweet, sometimes with their own twist on it.
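A minimal Python sketch of how such a tweet/retweet split could be computed (the study itself used R). The sample texts and handles are hypothetical placeholders, and detecting retweets by the leading "RT @" convention is an assumption about how the data was labelled.

```python
# Hypothetical sample; the real data set for 12th November had 39,130 tweets.
sample_tweets = [
    "RT @handle_a: placeholder text about GST",
    "GST rates revised for restaurants",
    "RT @handle_b: placeholder text",
    "rt @handle_c: placeholder text",
    "RT @handle_a: another placeholder",
]

def is_retweet(text):
    """Assumption: a retweet starts with the 'RT @' convention (any case)."""
    return text.strip().lower().startswith("rt @")

retweets = sum(is_retweet(t) for t in sample_tweets)
originals = len(sample_tweets) - retweets
ratio = retweets / originals  # retweets per original tweet

print(f"retweets={retweets} originals={originals} ratio={ratio}")
```

In this toy sample the ratio happens to be 4, mirroring the four-to-one split reported above.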

Most Popular Twitter User

Social media has become a platform for political contest. On 12th Nov, Swamy39 comes out as the most influential Twitter user; this chart has been created using the sum of the retweet count for each user. Next is BJP4India, which is again a BJP Twitter handle. This also shows the BJP’s promotion of GST immediately after the council meet on 11th Nov.
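The ranking described above, the sum of retweet counts per user, can be sketched in Python as follows. The handles and counts are hypothetical; only the aggregation logic is the point.

```python
from collections import Counter

# Hypothetical (screen_name, retweet_count) records, one per tweet.
records = [
    ("handle_a", 120),
    ("handle_b", 80),
    ("handle_a", 60),
    ("handle_c", 10),
    ("handle_b", 15),
]

retweets_by_user = Counter()
for screen_name, retweet_count in records:
    # Sum the retweet counts of all tweets posted by the same user.
    retweets_by_user[screen_name] += retweet_count

# Users ordered from most to least retweeted.
ranking = retweets_by_user.most_common()
print(ranking)
```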

Most Retweeted Tweets

To find out which tweets resonate most with the Twitter community, we identified the most retweeted tweets. A retweet is similar to sharing on Facebook. When a user retweets a tweet, it is assumed that he/she agrees with or endorses the idea in the tweet. However, users sometimes retweet to showcase their disagreement or disgust, hence a retweet is highly contextual. This study shows the war of perception that is going on between Congress and BJP to win positive public sentiment. Congress has been vocal in the last two months in criticizing GST as “Gabbar Singh Tax” (@OfficeofRG), whereas Subramaniam Swamy (@Swamy39) is trying to criticize Congress for its hypocrisy. In addition, Prime Minister Narendra Modi is showcasing the openness of the decision-making process by highlighting “Jan Bhagidari” as the core principle of this government.

Most Replied/mentioned Twitter Handle

The idea is to find the Twitter handles whose tweets are being retweeted the most and which are therefore influencing opinion on Twitter. The top retweeted/mentioned Twitter handles on the day in question are mostly from major political parties and journalists.

Influential hash tags

Users create and use hashtags by placing the number sign or pound sign # (also known as the hash character) in front of a string of alphanumeric characters, usually a word or un-spaced phrase, in or at the end of a message. The hashtag may contain letters, digits, and underscores. Searching for that hashtag will yield each message that has been tagged with it. A hashtag archive is consequently collected into a single stream under the same hashtag.

As expected, #GST is the top hashtag among the top 15. Other important tags on this particular day are #DEMONETISATION, #CHITRAKOOT, #GSTCOUNCILMEET & #GSTCOUNCIL.

The Chitrakoot by-election was held on 9th November, and the seat was retained by Congress in a state that is currently ruled by BJP; hence it was one of the major discussion points on this date.
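A small Python sketch of hashtag counting, following the definition above (a # followed by letters, digits and underscores). The tweets are hypothetical, and upper-casing the tags before counting is an assumption made so that #gst and #GST are counted together.

```python
import re
from collections import Counter

# '#' followed by one or more letters, digits or underscores.
HASHTAG = re.compile(r"#(\w+)")

# Hypothetical tweets for illustration only.
tweets = [
    "New rates announced #GST #GSTCouncilMeet",
    "Still confused about filing #gst",
    "#Chitrakoot result and #GST debate",
]

hashtag_counts = Counter(
    tag.upper() for text in tweets for tag in HASHTAG.findall(text)
)
top_tags = hashtag_counts.most_common(15)
print(top_tags)
```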

Most replied Screen Names or Users

Replies are responses posted specifically to someone’s Twitter post. You can thus reply to posts to stay in touch with your followers as well as with people you follow on Twitter. Replying is a good and comfortable way to have conversations with many people and to share ideas as well as information.

The most-replied screen name analysis for this date again shows that most of them belong to political parties or to people from a journalistic background, the exceptions being “askGST_GoI”, “TVMohandasPai” (businessman) and “GST_Council”.

Hour by Hour analysis

The idea here is to analyse the Twitter traffic for a day and to understand how the traffic flows in each hour of the day. Is there a period of the day when people are more active? This helps us to understand Twitter user behavior. The same behavior has been seen on many other days too.
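An hour-by-hour count could be sketched in Python along these lines. The timestamps are hypothetical, and the "HH:MM:SS" format is an assumption about how the tweet times were stored.

```python
from collections import Counter
from datetime import datetime

# Hypothetical tweet times for a single day, assumed to be "HH:MM:SS" strings.
tweet_times = [
    "09:15:00", "09:47:10", "13:02:33",
    "21:30:00", "21:59:59", "21:05:05",
]

per_hour = Counter(datetime.strptime(t, "%H:%M:%S").hour for t in tweet_times)

# One count per hour of the day, including hours with no tweets.
hourly = [per_hour.get(hour, 0) for hour in range(24)]
peak_hour = max(range(24), key=lambda hour: hourly[hour])

print(hourly)
print("peak hour:", peak_hour)
```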

Retweet network analysis

This is an attempt to understand the nature of the relationships between people on Twitter who retweet each other’s tweets. The chart is a graphical representation of the retweet network. Since it was computationally very expensive to create, we have used only 150 tweets for the analysis.

Twitter user network analysis, and how these networks are linked with political ideology, is a very interesting field. There is a lot of research going on, and data scientists are using this as one of the key factors when they try to understand a person’s political profile.

Manual analysis of the user network of 12th November clearly shows how these networks work. Swamy39 is the Twitter handle of Subramaniam Swamy, who is currently a BJP Rajya Sabha MP and a critic of Congress. As he tweets, we can see that the people in his immediate retweet network are either direct BJP supporters or people who believe in the philosophy of BJP. Below is the analysis of a few people.
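As a rough illustration, a retweet network can be represented as a list of directed edges from the retweeting user to the original author. The Python sketch below assumes the original author can be read from the "RT @handle" prefix; the users, handles and texts are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a retweet's text starts with "RT @<original author>".
RT_PATTERN = re.compile(r"^RT @(\w+)", re.IGNORECASE)

# Hypothetical (retweeting user, tweet text) pairs.
tweets = [
    ("user_1", "RT @handle_a: placeholder text"),
    ("user_2", "RT @handle_a: placeholder text"),
    ("user_3", "RT @handle_b: placeholder text"),
    ("user_1", "an original tweet, not part of the network"),
]

# Directed edges: retweeter -> original author.
edges = []
for retweeter, text in tweets:
    match = RT_PATTERN.match(text)
    if match:
        edges.append((retweeter, match.group(1)))

# In-degree = how many retweets each author received in this sample.
in_degree = defaultdict(int)
for _, author in edges:
    in_degree[author] += 1

print(edges)
print(dict(in_degree))
```

An edge list like this is the usual input for a graph-drawing library, which is where most of the computational cost of the chart comes from.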

Most frequent word used

Next, we analysed the most frequent words used after the key event on 11th Nov, in which the GST Council reduced rates for many items. The most frequent terms were “council”, “rates”, “slab” and “afford”. We also saw words like “sanitary” and “pad”. There was a debate on social media about the rates revised by the GST Council on sanitary pads, and a lot of users tweeted about it.


  • Sanitary napkins are classifiable under heading 9619. In pre-GST, they attracted concessional excise duty of 6%
  • Lie DSouza, it’s u who needs to do the maths. U really think family running on Rs 33/Day can afford expensive sanitary pad
  • Isn’t it that govt shud encourage it be offering subsidy instead of further applying GST on sanitary napkins
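A word-frequency count of the kind used above can be sketched in Python as follows. The sample texts and the tiny stop-word list are hypothetical; a real analysis would use a full stop-word list and further cleaning.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the one used in the study).
STOPWORDS = {"the", "on", "for", "is", "a", "of", "are", "to"}

# Hypothetical tweet texts.
tweets = [
    "Council cuts rates for many items",
    "GST council revises the slab rates",
    "Rates on sanitary pads are debated",
]

def tokens(text):
    """Lower-case the text, keep alphabetic words, drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

word_freq = Counter(word for text in tweets for word in tokens(text))
print(word_freq.most_common(5))
```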

Word Association

We also explore the data in terms of associations by looking at collocations, i.e. terms that frequently co-occur. Using word associations, we can find the correlation between certain terms and how they appear in tweets (documents). We can perform word associations with the findAssocs() function.

Word association analysis shows that the major event during that time was the reduction of the tax slab for restaurants: when the word association is run for the word “tax”, “restaurant” and “service” are among the top associated words.

When the word association is run for “traders”, the major words that come up are “chaotic”, “pain” and “suffering”, showing negative sentiment in the trader community, who are unhappy with the tedious compliance process.
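To make the idea concrete, here is a rough Python analogue of a correlation-based word association: each term becomes a column of per-document counts, and terms are ranked by their Pearson correlation with the chosen term. The documents are hypothetical, and `find_assocs` is an illustrative helper written for this sketch, not the R function itself.

```python
import math

# Hypothetical documents (tweets), already cleaned and lower-cased.
docs = [
    "tax cut for restaurant service",
    "restaurant tax slab reduced",
    "traders face compliance pain",
    "service tax on restaurant bills",
]
vocab = sorted({word for doc in docs for word in doc.split()})

def column(term):
    """Count of `term` in each document (one column of a document-term matrix)."""
    return [doc.split().count(term) for doc in docs]

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y) if sd_x and sd_y else 0.0

def find_assocs(term, min_corr):
    """Terms whose correlation with `term` is at least `min_corr`, best first."""
    base = column(term)
    scored = [(t, round(pearson(base, column(t)), 2)) for t in vocab if t != term]
    return sorted((p for p in scored if p[1] >= min_corr), key=lambda p: -p[1])

print(find_assocs("tax", 0.5))
```

In this toy corpus "restaurant" appears in exactly the same documents as "tax", so it gets the highest association, with "service" next.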

What are N-Grams?

N-grams of texts are extensively used in text mining and natural language processing tasks. They are basically a set of co-occurring words within a given window, and when computing the n-grams you typically move one word forward (although you can move X words forward in more advanced scenarios). For example, take the sentence “The cow jumps over the moon”. If N=2 (known as bigrams), then the n-grams would be:

  • the cow
  • cow jumps
  • jumps over
  • over the
  • the moon

So you have 5 n-grams in this case. Notice that we moved from the->cow to cow->jumps to jumps->over, etc., essentially moving one word forward to generate the next bigram.

If N=3, the n-grams would be: 

  • the cow jumps
  • cow jumps over
  • jumps over the
  • over the moon

So you have 4 n-grams in this case. When N=1, these are referred to as unigrams, which are essentially the individual words in a sentence. When N=2, they are called bigrams, and when N=3, trigrams. When N>3, they are usually referred to as four-grams, five-grams and so on.
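The sliding-window idea can be sketched in a few lines of Python. The `ngrams` helper is written for this illustration and splits on whitespace only, which is an assumption; real tweets need more careful tokenisation.

```python
def ngrams(text, n):
    """Return the n-grams of `text`, moving one word forward at a time."""
    words = text.lower().split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

sentence = "The cow jumps over the moon"
bigrams = ngrams(sentence, 2)
trigrams = ngrams(sentence, 3)

print(bigrams)
print(trigrams)
```

For the example sentence this yields the five bigrams and four trigrams listed above.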

Bi-gram analysis

We have performed bi-gram analysis to get deeper insight. Bi-gram analysis shows many subplots in the broader picture, like expensive sanitary pads & commercial pads, simplified process filing, modi government bringing relief, gabbar singh tax & genuine simple tax. Each of these stories can be taken further and analysed.

Tri-gram analysis

Tri-gram analysis further shows the subplots in the story that can be seen in the bi-gram analysis as well; it reinforces some of the storylines that appear there. A few other stories that come up are high compliance laudable, final doc & draft doc (in which people talk about how a half-cooked GST was implemented and the government had to come up with so many changes later on), congress wins chitrakoot, etc.

Clustering and Topic Modelling

Hierarchical Clustering

Hierarchical clustering attempts to build different levels of clusters. Strategies for hierarchical clustering fall into two types:

Agglomerative: where we start out with each document in its own cluster. The algorithm iteratively merges documents or clusters that are closest to each other until the entire corpus forms a single cluster. Each merge happens at a different (increasing) distance.

Divisive: where we start out with the entire set of documents in a single cluster. At each step the algorithm splits the cluster recursively until each document is in its own cluster. This is basically the inverse of an agglomerative strategy.

The results of hierarchical clustering are usually presented in a dendrogram.

The R function hclust() was used to perform hierarchical clustering. It uses the agglomerative method. The following steps explain hierarchical clustering in simple terms:

  1. Assign each document to its own (single member) cluster
  2. Find the pair of clusters that are closest to each other and merge them, leaving us with one less cluster
  3. Compute distances between the new cluster and each of the old clusters
  4. Repeat steps 2 and 3 until you have a single cluster containing all documents

To perform this operation, the corpus was converted into a matrix with each tweet (or ‘document’) given an ID. Extremely sparse rows, i.e. rows with elements that are part of less than 2% of the entire corpus, were removed. Ward’s method for hierarchical clustering was used.

The dendrogram output is to be interpreted as follows: the farther apart two nodes are, the greater their dissimilarity; the closer they are, the weaker it is. The height of each node in the plot is proportional to the value of the inter-group dissimilarity between its two child clusters.
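The agglomerative steps can be illustrated with a small Python sketch. Note that it uses single linkage (the distance between the closest members of two clusters) for brevity, not Ward's method as used in the study, and the four 2-D points are hypothetical stand-ins for document vectors.

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def agglomerate(points):
    """Merge the two closest clusters until one remains; return the merge history."""
    clusters = [[i] for i in range(len(points))]  # step 1: one cluster per document
    merges = []
    while len(clusters) > 1:
        best = None
        # step 2: find the closest pair of clusters (single linkage)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(points[a], points[b])
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((sorted(clusters[i]), sorted(clusters[j]), d))
        clusters[i] = clusters[i] + clusters[j]  # merge j into i
        del clusters[j]                          # steps 3-4: repeat with one less cluster
    return merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.5)]
merges = agglomerate(points)
for left, right, height in merges:
    print(left, right, round(height, 2))
```

Each merge height corresponds to the height of a node in the dendrogram, which is why later merges sit higher in the plot.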

K-Means Clustering

As opposed to hierarchical clustering, where one does not arrive at the number of clusters until after the dendrogram, in K-means the number of clusters is decided beforehand. The algorithm then generates k document clusters in a way that ensures that the within-cluster distances from each cluster member to the centroid (the mean of the cluster’s members) are minimized.

A simplified description of the algorithm is as follows:

  1. Assign the documents randomly to k bins
  2. Compute the location of the centroid of each bin
  3. Compute the distance between each document and each centroid
  4. Assign each document to the bin corresponding to the centroid closest to it
  5. Stop if no document is moved to a new bin, else go to step 2

Choosing k – The most significant factor in employing k-means clustering is choosing the number of clusters, ‘k’. The ‘elbow method’ is widely applied to arrive at k: the Sum of Squared Errors (SSE, the sum of the squared distances between each member of a cluster and its centroid) is computed for a range of values of k, and the value at which the SSE stops decreasing abruptly is taken as the theoretically optimal value of k.

When the SSE is plotted against k, it will be seen that the error decreases as k gets larger. This is because when the number of clusters increases, the clusters become smaller, and hence the distortion is also smaller. We experimented with 4, 8 and 12 clusters in topic mode and got the best results
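The k-means loop and the SSE used by the elbow method can be sketched together in Python. Two things here are assumptions made for the illustration: the random initial assignment is replaced by a round-robin one so that the run is reproducible, and the six 2-D points are hypothetical stand-ins for document vectors.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100):
    """Return (labels, sse) for a basic k-means run."""
    # Round-robin initial assignment (stands in for the random one).
    labels = [i % k for i in range(len(points))]
    centroids = [None] * k
    for _ in range(max_iter):
        # Compute the centroid (mean) of each bin.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty bin keeps its previous centroid
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
        # Assign each point to the bin with the closest centroid.
        new_labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:  # nothing moved: converged
            break
        labels = new_labels
    sse = sum(sq_dist(p, centroids[lab]) for p, lab in zip(points, labels))
    return labels, sse

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]

labels_k2, _ = kmeans(points, 2)
sse_by_k = {k: kmeans(points, k)[1] for k in (1, 2, 3)}

print(labels_k2)
for k, sse in sse_by_k.items():
    print(k, round(sse, 2))
```

On this toy data the SSE drops sharply from k=1 to k=2 and only slightly from k=2 to k=3, which is the "elbow" that suggests k=2.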

To make it easy to find what the clusters are about, we then check the top few words in every cluster.

 •    cluster 1: congress bjp rahul gandhi modi india even 
 •    cluster 2: tax service charge vat gstcouncilmeet simple good 
 •    cluster 3: india congress true want singh gabbar tax 
 •    cluster 4: day afford lie needs really sanitary think 
 •    cluster 5: jaitley congis unanimous voted regarding raised bigger 
 •    cluster 6: rates tax government council high low items 
 •    cluster 7: rates items less cost daily new consumers 
 •    cluster 8: modi government big relief items consumers bjp 
 •    cluster 9: pads per won day even afford able 
 •    cluster 10: rahul gandhi says one rate slab just 
 •    cluster 11: people modi govt items says cut council 


The cluster at the top is the one in which the political war of words between Congress and BJP is being discussed. The next one is also distinct; its theme is probably the discussion about tax, BJP and Congress.

The rest of the topics that are clustered together are:

  1. Discussion around the affordability of sanitary pads, which came out as one of the major themes in another analysis as well.
  2. The cluster after that is also about sanitary pads.
  3. The next big cluster is about the rate cuts that happened a day before in the GST Council meeting, and the discussion around them.

Topic Modelling

Another technique that is employed to deduce the themes of text is topic modelling.

A topic model is a type of statistical model for discovering the abstract “topics” that occur in a collection of documents (tweets in this case). Intuitively, given that a document is about a particular topic, one would expect particular words to appear in the document more or less frequently: in this case, ‘help’ is quite common to almost every tweet.

A document typically concerns multiple topics in different proportions; thus, in a document that is 10% about subject A and 90% about subject B, there would probably be about 9 times more words about ‘B’ than words about ‘A’ (Wikipedia, n.d.).

Topic modelling can be implemented with various algorithms, but the most common one in use is Latent Dirichlet Allocation (LDA).

Latent Dirichlet Allocation

Latent Dirichlet Allocation (LDA) is a statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar. For example, if observations are words collected into documents, it posits that each document is a mixture of a small number of topics and that each word’s creation is attributable to one of the document’s topics.

LDA allows for the possibility that a document arises from a combination of topics.

The topic at the top is the one in which Congress is criticized for its hypocritical nature. The next one is also distinct; its theme is probably the discussion about tax, BJP and Congress.

The rest of the topics that are clustered together are:

  1. Discussion around the affordability of sanitary pads, which came out as one of the major themes in another analysis as well.
  2. The cluster after that is also about sanitary pads.

The next big cluster is about the rate cuts that happened a day before and the discussion around them. Gujarat features in this cluster as part of the active election campaign and the way GST-related incidents are being interpreted.

Sentiment Analysis

The ‘syuzhet’ package extracts sentiments from text using three sentiment dictionaries. The difference between this and the above approach is that this approach is based on a much wider range of sentiments.

After processing, we use the ‘get_nrc_sentiment’ function to extract sentiments from the tweets. This function calls the NRC sentiment dictionary to calculate the presence of different emotions and their corresponding valence in a text file.

The output is a data frame where each row represents a sentence from the original file. The columns include one for each emotion type as well as a positive or negative valence. The ten columns are as follows: “anger”, “anticipation”, “disgust”, “fear”, “joy”, “sadness”, “surprise”, “trust”, “negative”, “positive.”
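To illustrate the shape of that output, the Python sketch below counts emotion words per text and returns one row of ten columns per text. The four-word lexicon is a hypothetical toy, not the NRC dictionary, and `emotion_row` is an illustrative helper rather than the package's function.

```python
from collections import Counter

# Hypothetical mini-lexicon for illustration only; NOT the real NRC lexicon.
LEXICON = {
    "relief": ["joy", "positive"],
    "trust": ["trust", "positive"],
    "pain": ["sadness", "fear", "negative"],
    "chaotic": ["anger", "negative"],
}

COLUMNS = ["anger", "anticipation", "disgust", "fear", "joy",
           "sadness", "surprise", "trust", "negative", "positive"]

def emotion_row(text):
    """One row of emotion/valence counts for a single text."""
    counts = Counter()
    for word in text.lower().split():
        counts.update(LEXICON.get(word, []))
    return [counts[col] for col in COLUMNS]

texts = ["big relief for consumers", "chaotic filing brings pain"]
rows = [emotion_row(t) for t in texts]

# Column sums, i.e. the totals that a bar chart of emotions would show.
totals = [sum(col) for col in zip(*rows)]
print(dict(zip(COLUMNS, totals)))
```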

Looking at the bar chart and the sum of these emotions, we can see that the positive sentiments (‘positive’, ‘joy’, ‘trust’) comfortably outscore the negative emotions (‘negative’, ‘disgust’, ‘anger’). This may be a hint that the audience has received the move positively.

Overall, people were positive and have shown trust in the decision, but there are chinks in the armour: fear, disgust, anger and sadness are not as large as the positive sentiment, but they are still there. The Government should be on its guard and should try to address the issues people are having with GST.

Conclusion & Limitations

It is quite evident that Twitter is the new battleground where ideology wars are fought these days. It is a platform that is increasingly used to change public perception and to run propaganda. Apart from this, it is a good platform to interact with the public at large and to get feedback on major policy reforms and decisions.

It has been demonstrated that crucial information, like the pain areas of implementation, can be identified by analysing tweets and performing basic text analytics. This can help policymakers to sense the pulse of the public and to navigate the complex road of major policy reform implementation.

Government agencies and other policy-making agencies would do well to develop analytics capabilities focused on mining Twitter for real-time, tangible updates, so that they can take meaningful action, address the key issues arising out of the implementation of a policy, keep the people on their side and see major reforms implemented successfully.

Pain areas that were highlighted during sentiment analysis were as follows:

  • Trade & textile industry – A lot of trade in the textile industry was dependent on non-cash payments and kacha bills; however, with GST being implemented, these options were closed.
  • Compliance issues – One of the important reasons to implement GST was to ease the filing process and standardize taxes

We have observed that Twitter is very commonly used as a platform for deliberation by citizens of India. We conclude that social media is a powerful source of public opinion as far as a nation like India is concerned. The discussions on Twitter are comparable to traditional discussions and can give a fair idea of the emotions of the general public. Our sentiment analysis of people’s emotions shows acceptance of GST, but with a strong feeling of anticipation.

Hopefully by now you have realized the power of social media.

Happy Learning
