Tapping Twitter: Public Perception of Goods & Service Tax (GST) – Part 1

In the last couple of years, social media has become an influential place to express emotion. In this study, we analyzed Twitter to understand the sentiments and emotions around one of the major tax reforms in India, the GST, using about 4.5 months of tweets (July to November 2017). This is a long write-up, so for readability it is divided into two parts: 1. Timeline analysis and 2. One-Day analysis.



The growing popularity of social media has created an opportunity to explore and track the response to new reforms and policies in India. Social media has been used extensively all over the world to analyse political campaigns, stock market data, new product launches, movie releases and more. Many researchers analyse what the citizens of a nation post on Twitter, a micro-blogging website where users read and write millions of tweets on a variety of topics every day. In this study, Twitter has been used as a forum to understand the sentiments of citizens of India towards the Goods and Services Tax launched by the Indian Government on 1st July 2017. The emotions of the public in terms of anger, anticipation, disgust, fear, joy, sadness and surprise have been extracted from their live opinions.

The findings of this study explore the viability of analysing Twitter communication with the aim of understanding the effect of GST on the people of India.

Goods and Services tax (GST)

Goods and Services Tax (GST) is an indirect tax applicable throughout India which replaced multiple cascading taxes levied by the central and state governments. It was introduced as The Constitution (One Hundred and First Amendment) Act 2016, following the passage of the Constitution 122nd Amendment Bill. The GST is governed by a GST Council, whose Chairman is the Finance Minister of India. Under GST, goods and services are taxed at the following rates: 0%, 5%, 12%, 18% and 28%. There is a special rate of 0.25% on rough precious and semi-precious stones and 3% on gold. The Goods and Services Tax (GST), India’s biggest tax reform in 70 years of independence, was launched at midnight on 30 June 2017 [2a] by the Prime Minister of India, Narendra Modi. The launch was marked by a historic midnight (June 30-July 1, 2017) session of both houses of parliament convened at the Central Hall of the Parliament.

Members of the Congress boycotted the GST launch altogether. They were joined by members of the Trinamool Congress, the Communist Parties of India and the DMK, who reportedly found virtually no difference between GST and the existing taxation system. They therefore claimed that the government was merely rebranding the current system while making it worse for common people by increasing rates on common items and reducing rates on luxury items. GST was initially proposed to replace a slew of indirect taxes with a unified tax and was therefore set to dramatically reshape the country’s 2-trillion-dollar economy. However, it has been met with sharp criticism from various fronts.

What is Goods and Services Tax bill?

Goods and Services Tax (GST) is defined as the tax levied when a consumer buys a good or service. It is proposed to be a comprehensive indirect tax levy on the manufacture, sale and consumption of goods as well as services. GST aims to replace all indirect taxes levied on goods and services by the Indian Central and State governments. GST would subsume them into a single comprehensive tax, bringing them all under one umbrella and eliminating the cascading effect of taxes on the production and distribution prices of goods and services.

When Goods and Services Tax is implemented, there will be 3 kinds of applicable Goods and Service Taxes:

CGST: where the revenue will be collected by the central government

SGST: where the revenue will be collected by the state governments for intra-state sales

IGST: where the revenue will be collected by the central government for inter-state sales

Why is Goods and Services Tax So Important?

The Indian tax structure is divided into two – Direct and Indirect Taxes. Direct Taxes are levies where the liability cannot be passed on to someone else. An example of this is Income Tax where you earn the income and you alone are liable to pay the tax on it.

In the case of Indirect Taxes, the liability of the tax can be passed on to someone else. This means that when the shopkeeper must pay VAT on his sale, he can pass on the liability to the customer. So, in effect, the customer pays the price of the item as well as the VAT on it so the shopkeeper can deposit the VAT to the government. This means that the customer must pay not just the price of the product, but he also pays the tax liability, and therefore, he has a higher outlay when he buys an item.

This happens because the shopkeeper has paid a tax when he bought the item from the wholesaler. To recover that amount, as well as to make up for the VAT he must pay to the government, he passes the liability to the customer who has to pay the additional amount. There is currently no other way for the shopkeeper to recover whatever he pays from his own pocket during transactions and therefore, he has no choice but to pass on the liability to the customer.

This tax reform, widely perceived as path-breaking, took almost two decades to be conceptualized and implemented. Figure 2 shows the roller-coaster ride of GST and the major events that happened over the complete cycle of its implementation.


In this case study we analysed the reaction, communication and sentiment of users on Twitter after the launch of one of India’s most awaited, path-breaking tax reforms, which came into effect on 1st July 2017. The objective was to discover patterns and themes of communication, the way in which the platform was used to share information, and how it shaped the response to GST by various stakeholders including political parties, the government, the business community and the general public. The study achieved the following objectives:

  • Understand how the sentiment changed in response to the government’s policy announcements.
  • Topic analysis of social media interactions to understand the different subjects of interactions, information sharing and grouping geographies.
  • Understand and validate pain points which were discussed on social media and in print news.
  • Understand user profiles and how user behaviour changed over time (most active positive and negative users, network of users).
  • Understand how political parties responded to GST
  • Understand how government twitter handles communicated during GST implementation.
  • Grouping similar messages together with emphasis on predominant themes

This study is based on a set of social interactions covering approximately 4.5 months, from 1st July 2017 to 12th November 2017. The total number of tweets analysed is approximately 3 million. Because of limited computing power and the sheer volume of analysis output, we restricted the project to the following:

  • Analysis of twitter data on important events/dates.
  • Analysis of trend of sentiment/tweets/topics over 4.5 months.

In terms of analytical approach and tools, topic analysis of tweets was done using Latent Dirichlet Allocation (LDA). K-Means and hierarchical clustering were employed on the themes of tweets. Tableau and R were used to create the visualizations.


Below are the various limitations of this study:

  1. The base of any good sentiment analysis is the data and the section of society from which the data comes. It would be foolish on our part to say that Twitter analysis alone can measure public sentiment on GST. Only a very small cross-section of society is active on Twitter and expresses its thoughts there. The results should be carefully analysed alongside other sources to get the complete picture. Having said that, even though it is not complete, it is still an indicator and a great start for understanding the public reaction to a big-bang policy change, and the learnings should be analysed and used to make improvements in future.
  2. Garbage in, garbage out. It is very important that the data is cleansed properly. One of the biggest challenges is to separate bots from real users. Tweets from bots can seriously undermine the main objective of any social media analytics study.
  3. Other limitations are more technical, such as computational power, which proved to be a big hindrance in our case; many times we felt handicapped because of it.
  4. Even the unstructured data we have taken is only a subset of the data available on Twitter for GST. Pictures and videos were not analysed. Our study is currently limited to the English language, which restricts its scope and matters a great deal in our context, as many tweets are in Hinglish (Hindi written using English characters), Hindi or other regional languages.
  5. Sentiment analysis still has a long way to go in understanding the context in which a sentence is written. A sentence written in capitals can signal anger, but we converted everything to lowercase. Many such details can completely change the sentiment analysis. Hence any result that came out of this study should be taken with a pinch of salt.

The Typical Tweet

A tweet is a social media message posted on Twitter.com. It is restricted to 140 characters, though Twitter is currently experimenting with doubling the length of a tweet from 140 to 280 characters. Although most tweets are mainly text, it is possible to embed URLs, pictures, videos, Vines and GIFs.

Tweets contain components called hashtags, which are words that capture the subject of the tweet. They are prefixed by the ‘#’ character. Usernames or handles of those who post are recognized by the ‘@’ symbol. A user can direct a message to another user by adding the handle, with the ‘@’ symbol.

A retweet (‘rt’ for short) is a tweet by a user X that has been shared by user Y to all of Y’s followers. Thus, retweets are a way of measuring how popular the tweet is.

A user can ‘favorite’ a tweet; this is analogous to a ‘Like’ on Facebook. A Reply on twitter means responding to a message or tweet from a person while to retweet is to broadcast (like forwarding an email) a tweet or message posted by a person to others.

There are two ways to reply to tweets. There is a @ reply where you use “@username” in your message. Such replies are public i.e. visible on your Twitter page. If you want to reply privately, you can send a DM (direct message) which is sent only to the recipient like a private e-mail.

Data Analytic Approach

We had data for approximately 4.5 months, with a tweet count of approximately 3 million. The typical process of creating a term-document matrix or document-term matrix (TDM and DTM) for text analytics was not possible on such a huge data set in one pass.

In spite of this limitation, we wanted to build our analysis and insights on the full 4.5 months of data rather than on a subset. We split the data into daily files, each containing one day’s tweets, because we had already tested that our available hardware (a 16GB RAM, i7 machine) could process one day of tweets provided the count for that day was not too high (up to about 50,000 tweets). For days with more than 50,000 tweets we could not run the analysis on the whole file, so we divided that day’s file into two or more parts and analysed each part. The daily results were later merged and used for further analysis.
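The split described above can be sketched as follows. This is a minimal Python illustration rather than the R scripts used in the study; the record layout (a `created_at` ISO timestamp per tweet) and the function name are assumptions made for the example, and only the 50,000 threshold comes from the text.

```python
from collections import defaultdict
from datetime import datetime

MAX_TWEETS_PER_CHUNK = 50_000  # per-file limit reported in the text


def split_into_daily_chunks(tweets, max_per_chunk=MAX_TWEETS_PER_CHUNK):
    """Group tweets by calendar day, then split busy days into smaller chunks.

    `tweets` is an iterable of dicts with a 'created_at' ISO timestamp
    (a hypothetical layout, not the study's actual schema).
    Returns {day: [chunk, chunk, ...]} where each chunk is a list of tweets.
    """
    by_day = defaultdict(list)
    for tweet in tweets:
        day = datetime.fromisoformat(tweet["created_at"]).date().isoformat()
        by_day[day].append(tweet)

    chunks = {}
    for day, day_tweets in by_day.items():
        # A quiet day yields one chunk; a busy day yields two or more.
        chunks[day] = [
            day_tweets[i:i + max_per_chunk]
            for i in range(0, len(day_tweets), max_per_chunk)
        ]
    return chunks
```

Each chunk can then be processed independently and the per-chunk results merged per day, which is the same pattern the paragraph describes.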

We aggregated the data on certain key fields so that we could dive deep and run trend analysis over the 4.5-month period. The runtime of this code proved very challenging, so running the R scripts manually was not an option. To overcome this, we created an automated script which ran for 10 to 15 hours or more and produced a smaller dataset containing the key fields only.

Some of the key fields on which we ran aggregation are listed below; the results were stored in new files.

  • Aggregated Sentiment score on day to day basis.
  • Aggregated Sentiment score on per user/per day basis
  • Segregated users by daily sentiment score and sentiment polarity.
  • Created a TDM for each day and identified the top words for each day.
  • Collected data for retweets (count for each day).
  • Categorised and aggregated the source devices from which the tweets originated.
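The aggregations listed above can be sketched as follows, assuming each tweet has already been reduced to a small record. The field names (`day`, `user`, `score`, `is_retweet`) and both function names are hypothetical, and the study itself did this step in R.

```python
from collections import Counter, defaultdict


def aggregate_daily(records):
    """Roll per-tweet records up into per-day and per-user/per-day totals.

    Each record is a dict with 'day', 'user', 'score' and 'is_retweet'
    (hypothetical field names chosen for this sketch).
    """
    daily = defaultdict(lambda: {"score": 0.0, "tweets": 0, "retweets": 0})
    per_user_day = defaultdict(float)
    for r in records:
        d = daily[r["day"]]
        d["score"] += r["score"]
        d["tweets"] += 1
        d["retweets"] += 1 if r["is_retweet"] else 0
        per_user_day[(r["day"], r["user"])] += r["score"]
    return dict(daily), dict(per_user_day)


def top_words(texts, n=10):
    """Return the n most frequent whitespace-separated words in a day's tweets."""
    counts = Counter(word for text in texts for word in text.split())
    return counts.most_common(n)
```

Keeping only these small summaries per day is what makes a trend analysis over the whole period feasible on a single machine.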

On top of this data, the following analyses were done:

  • Sentiment timeline analysis
  • User profile analysis and sentiment mapping
  • User location analysis
  • Device used for Tweeting and trend across 4.5 months
  • Building word cloud based on intermediate word data dump
  • Important retweets and most retweeted tweets.

To further analyse keyword association, a sample of the most popular tweets was taken and a standard text analysis process was run on it to understand word association.

Data Extraction & Exploration

We collected data from Twitter only, using GST keywords, from 1st July to 12th November, covering the full 24 hours of each day. Twitter allows users to collect data only for the last 7 days, so gaps in collection could not be backfilled, and we missed data for 25th July, 25th to 28th September, 6th October and 10th November. Moreover, we analysed English-language tweets only. Twitter supports other Indian regional languages, but these are excluded from this study, which turns out to be a major limitation of this report.

Obtaining twitter data

To begin ingesting social media data from Twitter, a developer account on Twitter is required. Once you have an account, log in with your username and password, click the Create New Application button and enter the requested information. Note that these inputs are neither important nor binding. You simply need to provide a name, description and website (even just a personal blog) in the required fields. Once finished, you should see a page with a lot of information about your application. Included here is a section called OAuth settings. These are crucial in helping you authenticate your application with Twitter, thus allowing you to mine tweets. More specifically, these bits of information authenticate you with the Twitter application programming interface (API). You should copy the consumer key, consumer secret, access token and access token secret for future reference. These should never be shared with anyone else.

Setting up the handshake between twitter app and R

The handshake is done using the OAuth mechanism. For this, the consumer key, consumer secret, access token and access token secret were created in the Twitter app management portal, and the connection used to pull the data was then established with setup_twitter_oauth().

Data cleaning

As the data was extracted each day by using the twitter API with the search string “GST” it resulted in every tweet that contained the string GST in the tweet.

As part of the data cleaning, we had to remove tweets in which GST appeared in a context unrelated to the GST tax implementation in India. For example, GST is also a tax in New Zealand, and in some instances GST was related to weather. We tried to remove all tweets which were contextually not relevant to our study.
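A relevance filter of this kind could look roughly like the sketch below. The off-topic markers are hypothetical, since the study does not list the terms it actually used, and the sketch is in Python rather than the R used in the study.

```python
import re

# Hypothetical off-topic markers; the study does not publish its actual list.
OFF_TOPIC = re.compile(r"\b(new zealand|nz|weather|forecast)\b", re.IGNORECASE)
# Match GST as a whole word so that words such as "angst" are not picked up.
GST = re.compile(r"\bgst\b", re.IGNORECASE)


def is_relevant(text):
    """Keep a tweet only if it mentions GST and carries no off-topic marker."""
    return bool(GST.search(text)) and not OFF_TOPIC.search(text)
```

A keyword filter like this is blunt: it will drop an Indian tweet that happens to compare GST with New Zealand, so a manual review of the rejected tweets is a sensible complement.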

We also tried to filter out bot activity, and used the following heuristics to find candidate tweets or users that might belong to a bot:

  1. Unusually high activity and odd timings of posting of tweets
  2. High number of tweets posted very frequently
  3. Based on sentiment scores – Unusually high and low score users were further analysed

These bots were removed from the tweet data on which we did our analysis, so that the results are closer to reality and not biased by bots.
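The first two heuristics above can be sketched as a simple activity check. All thresholds here (`max_per_day`, the 1 a.m. to 4 a.m. window, the 80% share, the minimum of 10 tweets) are illustrative assumptions, not values from the study, and flagged users are only candidates for further manual review.

```python
from collections import Counter
from datetime import datetime


def flag_bot_candidates(tweets, max_per_day=100, odd_hours=range(1, 5),
                        odd_share=0.8, min_tweets=10):
    """Return users with unusually high daily volume or mostly odd-hour posting.

    `tweets` is an iterable of dicts with 'user' and an ISO 'created_at'
    timestamp (hypothetical field names). Thresholds are illustrative only.
    """
    per_user_day = Counter()
    total = Counter()
    odd = Counter()
    for t in tweets:
        ts = datetime.fromisoformat(t["created_at"])
        per_user_day[(t["user"], ts.date())] += 1
        total[t["user"]] += 1
        if ts.hour in odd_hours:
            odd[t["user"]] += 1

    # Heuristic 1: too many tweets on any single day.
    flagged = {user for (user, _), n in per_user_day.items() if n > max_per_day}
    # Heuristic 2: enough tweets overall, and most of them at odd hours.
    flagged |= {u for u, n in total.items()
                if n >= min_tweets and odd[u] / n >= odd_share}
    return flagged
```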

Text Preparation

The tweets were parsed into a corpus for text analysis. The following steps were executed to clean the corpus and prepare it for further analysis. Only the text portion of the tweet (the actual message) was considered.

Removing numbers: Tweet IDs are numbers generated by Twitter to identify each tweet. Numbers as such don’t serve any purpose for text analysis and hence they are discarded.

Removing URLs & links: Many tweets contained links to webpages and videos elsewhere on the Internet. These were removed with regular expressions.

Removing stopwords: Stopwords are words in English that are commonly used in every sentence, but have no analytical significance. Examples are ‘is’, ‘but’, ‘shall’, ‘by’ etc. These words were removed by matching the corpus with the stopwords list in the tm package of R. Expletives were also removed.

Removing non-English words: The corpus generated after the previous three steps was broken into its constituent words, and all frequent non-English words were identified and removed from the study.

Stemming words: In text analysis, stemming is ‘the process of reducing inflected (or sometimes derived) words to their word stem, base or root form—generally a written word form’. Stemming is done to reduce inflectional forms and sometimes derivationally related-forms of a word to a common base form. Many methods exist to stem words in a corpus.

Suffix-dropping algorithms: The last parts of all the words get truncated. For example, words like ‘programming’, ‘programmer’, ’programmed’, ‘programmable’ can all be stemmed to the root ‘program’. On the other hand, ‘rescuing’, ‘rescue’, ‘rescued’ are stemmed to form ‘rescu’, which is not a word or a root. This method was chosen for this study for simplicity.
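To make suffix dropping concrete, here is a toy Python stemmer that reproduces the examples above. The suffix list and the consonant-undoubling rule are assumptions chosen for this illustration; it is not the stemmer used in the study, and real stemmers such as Porter’s apply many more rules.

```python
# Longest suffixes first so that 'able' is tried before 'e'.
SUFFIXES = ("able", "ing", "ed", "er", "e")


def stem(word, min_stem=3):
    """Drop one known suffix, then collapse a doubled final consonant."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            word = word[:-len(suffix)]
            # 'programm' -> 'program': undo the doubling added before the suffix.
            if len(word) > min_stem and word[-1] == word[-2] and word[-1] not in "aeiou":
                word = word[:-1]
            break
    return word
```

With these rules, ‘programming’, ‘programmer’, ‘programmed’ and ‘programmable’ all reduce to ‘program’, while ‘rescuing’, ‘rescue’ and ‘rescued’ reduce to the non-word ‘rescu’, which is the trade-off the paragraph describes.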

Lemmatisation algorithms: Lemmatisation determines the lemma (dictionary form) of each word in the corpus. This is done with an understanding of the context, the part of speech and the lexicon of the language. For example, ‘better’ is related to ‘good’, ‘running’ is related to ‘run’ and so on.

N-gram analysis: An n-gram is a contiguous sequence of ‘n’ items taken from a text. With characters as the items and n=1 (uni-gram), the letters ‘f’, ‘l’, ‘o’, ‘o’, ‘d’ are individually parsed from ‘flood’. For a higher n (say n=5), ‘flood’ is obtained from ‘flooding’, and at n=4 so is ‘ding’, which can also be construed as a word. With words as the items, a bi-gram is a pair of adjacent words and a tri-gram is a run of three adjacent words. For this study we have considered bi-gram and tri-gram analysis only.
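Bi-grams and tri-grams can be generated with a few lines of code. The sketch below assumes word-level n-grams over whitespace-separated tokens, which is the usual choice for tweet text but is not stated explicitly in the study; the function name is hypothetical.

```python
def ngrams(text, n):
    """Return all runs of n adjacent whitespace-separated words in `text`."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


bigrams = ngrams("one nation one tax", 2)
trigrams = ngrams("one nation one tax", 3)
```

Counting these phrases across a day’s tweets surfaces multi-word themes such as ‘one nation one tax’ that single-word counts split apart.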

Removing punctuation: Punctuation marks have no impact on the analysis of text and are hence removed.

Stripping white-space: Words that have extra white-spaces at the beginning, middle or end are subjected to a regular expression that removes the white-space and retains only the words themselves.

Checking for impure characters: A check on the corpus after the modifications made thus far revealed that some URLs were left behind, due to the removal of white-spaces, numbers and punctuation. Regular expressions were used to remove them.
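The cleaning steps above can be chained into a single function. The sketch below is a Python approximation of the pipeline, not the tm-based R code used in the study; the stopword set is a tiny illustrative sample, and removing URLs before numbers and punctuation is a design choice that avoids the leftover URL fragments described in the last step.

```python
import re
import string

# Tiny illustrative sample; a real run would use a full stopword list.
STOPWORDS = {"is", "but", "shall", "by", "the", "a", "to", "of"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def clean_tweet(text):
    """Lowercase, then strip URLs, numbers, punctuation, stopwords and extra spaces."""
    text = text.lower()
    text = URL_RE.sub(" ", text)        # URLs first, while they are still intact
    text = re.sub(r"\d+", " ", text)    # numbers
    text = text.translate(PUNCT_TABLE)  # ASCII punctuation only
    tokens = [t for t in text.split() if t not in STOPWORDS]
    return " ".join(tokens)             # split/join also collapses white-space
```

For example, `clean_tweet("GST is good, it benefits 125 crore Indians. https://t.co/9guReCaEWL")` keeps only the words `gst good it benefits crore indians`.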

Interesting facts about the data

Here are a few interesting facts about the data:

  • 3 million tweets: extracted using the Twitter API, with the extraction run on a daily basis. On a few days the data was corrupted, so we eliminated those days from the study.
  • 0.4 million unique Twitter users: we analysed more than 0.4 million users; the resulting insights are presented in later sections.
  • 135 days of data: data for approximately 4.5 months (135 days, 1st July to 12th November) was extracted.

Weekly Trends on Number of tweets

The weekly trend of tweets and retweets is given below. We noticed an increase in activity after the 1st week of October and the 2nd week of November. One reason may be the major policy announcements and restructuring of rates announced by the government in GST Council meetings. Another may be that the Gujarat elections were around the corner, and Twitter became a hot ground for mudslinging as political parties tried to gain public support.

Retweets Vs Tweets

Retweets outnumbered original tweets, and most of them were pushed by political parties, media houses and other influential personalities.

Top Trending Retweets Over the period of time

Gurmeetramrahim: His tweets trended for a considerable amount of time thanks to his large follower base. We observed that these trending tweets then suddenly disappeared. On further analysis we found that his Twitter handle was suspended after his arrest.

OfficeOfRG: Congress leader Rahul Gandhi gave GST a new expansion (Gabbar Singh Tax) and called his own version of the tax Genuine Simple Tax.

A large follower base ensures that your tweets spread across the platform, and these examples also show how social media was used by political parties.

Top Retweeted Tweets
RT @Gurmeetramrahim: GST is a big reform in tax governance to ensure transparency & help fuel the economic growth. Great initiative by hon.…
RT @OfficeOfRG: Congress GST= Genuine Simple Tax
Modi ji’s GST= Gabbar Singh Tax
RT @narendramodi: GST – it is good, it is simple and it benefits 125 crore Indians. https://t.co/9guReCaEWL
RT @narendramodi: I would like to add that GST will also be the best example of cooperative federalism. Together we will take India to new…
RT @OfficeOfRG: Some Suggestions
1. Correct the fundamental flaw in GST architecture to give India a Genuine Simple Tax.
2. Don’t waste…

We further analysed the trend of the popular retweets over the four months. We can clearly see that most of the popular retweets are either political statements or come from official government handles. These tweets were written either to influence public opinion in a politically charged environment or to communicate policy benefits and changes to the general public.

Devices Usages for Tweets

The web client is the most used source overall, so we have left it out of the chart. Among mobile devices, Android phones lead, followed by the iPhone. On key event days, tweeting from Android devices also increased.

Word Cloud

The word cloud shows interesting words such as tax, Modi and demonetization. It also contains some interesting expansions of the acronym coined by the opposition, for example “GST” = “Gabbar Singh Tax”, and another, “Genuine Simple Tax”. Words like “help”, “reduce” and “pls” should definitely raise alarms for the government to address the concerns raised on social media.

Some other words that appear in the word cloud don’t make sense at first sight, but on closer inspection they turn out to be related to GST in other ways.

For example, “mersal” is a Tamil movie released on 18th Oct, and tweets about it trended over this period, such as: “By consolidating taxes,GST is making ppl aware of their https://t.co/c3b9RS4Tlw people will ask the govt for their due.What’s wrong?#Mersal”.

Another tweet which trended during this period was “RT @SirJadeja: Tamil Film #Mersal Criticizes #GST & #DeMonetisation. BJP Wants Scenes Deleted. What Happened To Freedom Of Expression Now?…”

Below is the word cloud that was created by using the methodology discussed above.

TimeLine Analysis

Raw data from 1st July to 12th November 2017 was broken into smaller day-wise sets to understand people’s emotions over time. This helped us study people’s perception of GST and the steps taken by the Government to create awareness and build acceptance of one of the largest reforms in the country. The Government’s frequent tweets were a step to spread awareness of the reform, manage user sentiment and create a positive environment.

Since the data for the whole duration was very large, computing on all of it at once and drawing inferences was a technological challenge. Studying the data over time helped us analyse trends, people’s reactions to GST refinements, and how their emotions and focus areas shifted.

Sentiment timeline analysis

Blue (positive) and orange (negative) sentiment scores were calculated for each day using an R sentiment package. It is clearly seen that overall sentiment turns negative in September and October, but with consistent government effort it starts increasing again.
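A lexicon-based score of the kind plotted here can be sketched as follows. The two word lists are tiny hypothetical samples, and the scoring rule (count of positive words minus count of negative words) is an assumption for illustration; it is not the R package or lexicon used in the study.

```python
# Tiny hypothetical lexicons, for illustration only.
POSITIVE = {"good", "great", "simple", "benefit", "help"}
NEGATIVE = {"bad", "flaw", "waste", "anger", "fear"}


def tweet_score(text):
    """Positive-word count minus negative-word count for one tweet."""
    tokens = text.lower().split()
    return sum(t in POSITIVE for t in tokens) - sum(t in NEGATIVE for t in tokens)


def daily_sentiment(tweets):
    """Sum positive and negative scores per day.

    `tweets` is an iterable of (day, text) pairs; the result maps each day
    to {'positive': total, 'negative': total}, i.e. the two plotted series.
    """
    out = {}
    for day, text in tweets:
        score = tweet_score(text)
        bucket = out.setdefault(day, {"positive": 0, "negative": 0})
        if score > 0:
            bucket["positive"] += score
        elif score < 0:
            bucket["negative"] += -score
    return out
```

As noted in the limitations, a word-count score like this ignores context, sarcasm and capitalisation, so the daily totals are best read as a trend rather than a precise measurement.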

Key Findings

  1. The first month of GST implementation was the honeymoon period: people were generally positive about it.
  2. The two announcements from GST Council meetings on 18th July and 5th August had a positive impact on opinion.
  3. From the middle of August, as bad news about the economy started coming in, especially from the Government of India (RBI and the Economic Survey of India), moods started to dampen.
  4. The real shift in opinion is visible in Twitter emotion from the first week of September, when negative emotion rises sharply. Faced with this mood about the economy, its possible impact on the upcoming election and a possible dent in its image, the Government responded with massive changes to GST in October and November.
  5. The Government tried to respond in the 2nd week of October, and we can see some change in emotion on 8th October. The 2nd big change was announced on 10th November.
  6. One possible reason for more activity in October and November is the Gujarat election, with a visibly aggressive Congress party sensing discontent on the ground in Gujarat: the Patidar unrest, and the impact of demonetization and GST on the textile and diamond industries, whose hubs are situated in Gujarat.

Monthly Word cloud analysis

With the month wise word cloud, we can clearly see the shifting focus of discussion and the main point of discussion at that point of time.

  1. The July word cloud shows the initial euphoria of the public, with words like governance, initiative, reforms, great, impact and one nation one tax taking centre stage.
  2. In August the focus shifted to implementation, with topics ranging from total tax collection to workshops. Another focus area is cars, as luxury cars became cheaper in the early phase of GST, which was later corrected. One interesting entry is Vivegam, which became the first film to cross 100 Cr. after the implementation of GST.
  3. September features negative words like depression and demo, and people show anxiety about the filing process and the refund process under GST. The focus shifts to the economy, as major economic indicators for the period after GST implementation were released and analysed this month.
  4. October shows a lot of political sparring over GST as the Gujarat elections were closing in. Rahul Gandhi, vice president of the INC, expanded GST as Gabbar Singh Tax, which became the focal point of discussion this month. Mersal, a movie from the south, criticized GST in one of its dialogues, and this got a lot of traction on Twitter. On the other side of the political spectrum, the BJP described GST as a good and simple tax, which is also highlighted in the October word cloud.
  5. In November we again see a lot of political activity, along with the reduction of tax slabs on many items, which is reflected in the word cloud for this month. Restaurants were also a talking point, as the tax rate for them was drastically slashed.

Monthly Word trend

The top-25 word trend gives the same insight that we can draw from the month-by-month word clouds. July is all about the initial euphoria of GST, with BJP and one tax one nation at centre stage. As time passes, criticism of GST grows and the waters get murkier: opposition parties smelled blood in the water and wanted to take advantage of the situation, and with the Gujarat elections coming in December, GST on Twitter becomes a free-for-all.


Thank you for reading this far. This long post shares several interesting insights from social media and shows its importance in our day-to-day life. We will cover more in-depth analysis in our next post, where we will share the complete one-day Twitter analysis.

Happy Learning!!!
