Sentiment Analysis on Masks Using Tweet Data

people photo created by rawpixel.com - www.freepik.com


Intent

It has been more than a year, and the world, especially Indonesia, is still not doing well. The COVID-19 outbreak has yet to be brought under control; in fact, a new variant of COVID-19 was recently discovered that spreads faster and carries more risk than the previous ones. One of the ways we can protect ourselves from the spread of the virus is by wearing a mask. It is undeniable that some people are starting to feel uncomfortable wearing masks, while others have become even more conscientious about them. Therefore, in this notebook I analyze how people feel about the use of masks: positive, negative, or perhaps neutral.

Import and Install Some Libraries

In this project, I use several libraries, including nltk, Sastrawi, google_trans_new, textblob, and others. For more details, the libraries I use are listed in the cells below.

pip install snscrape
pip install Sastrawi
pip install google_trans_new

import pandas as pd
import numpy as np
import csv
import snscrape.modules.twitter as sntwitter
import itertools
import seaborn as sns
import matplotlib.pyplot as plt
import nltk
import re
import string
nltk.download('stopwords')
from nltk.corpus import stopwords
nltk.download('punkt')
from Sastrawi.Stemmer.StemmerFactory import StemmerFactory
from textblob import TextBlob
from google_trans_new import google_translator
from sklearn.metrics import accuracy_score,classification_report,confusion_matrix
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from nltk.sentiment.util import *
nltk.download('vader_lexicon')

Scraping Data

To keep the data collection process light, I limit the sample to 100 tweets. For each tweet I collect the username, date, and content. I use the keyword 'masker' so that the collected tweets are relevant to the topic I want to analyze; since the tweets I want to analyze are in Indonesian, the keyword is the Indonesian word 'masker' rather than the English 'mask'.

df = pd.DataFrame(itertools.islice(sntwitter.TwitterSearchScraper(
   'masker').get_items(), 100))[['username', 'date', 'content']]
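
If you want the query itself to restrict the language, Twitter's search operators can be embedded in the string passed to TwitterSearchScraper. A small optional sketch (not part of the run above; the query and variable name are only illustrative):

# Sketch: same scrape, but with the language restricted to Indonesian via the
# standard 'lang:id' Twitter search operator embedded in the query string.
query = 'masker lang:id'
df_id = pd.DataFrame(itertools.islice(
    sntwitter.TwitterSearchScraper(query).get_items(), 100))[['username', 'date', 'content']]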

Exploratory Data Analysis

The next step is to carry out Exploratory Data Analysis (EDA) to get an overview of the collected dataset: general information about it, a description of its contents, its data types, and other details as needed.

Showing the top 5 rows
Below are the first 5 rows of the collected dataset.

df.head()

Display data information
View the information of the collected dataset. From the output below, it can be seen that the dataset has 3 columns, namely username, date, and content, each with 100 non-null rows, which means there are no null values. One column has a DateTime data type (date), and two columns have an object data type (username and content).

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype              
---  ------    --------------  -----              
 0   username  100 non-null    object             
 1   date      100 non-null    datetime64[ns, UTC]
 2   content   100 non-null    object             
dtypes: datetime64[ns, UTC](1), object(2)
memory usage: 2.5+ KB

Data Preprocessing

Before the data is used, I preprocess the dataset: unused columns are dropped, and the content column is prepared for sentiment analysis. Preprocessing of the content column consists of cleaning the text, removing stopwords, stemming, removing punctuation, and translating the text into English. After that, the dataset is labeled manually.

Delete unused columns
Since the analysis will only use the content column, I drop the unused columns, namely date and username, so that only content remains.

df = df.drop(columns=['username', 'date'])
df.content=df.content.astype(str)
df
0                          @kyuhee___ Pake masker dong.
1     Gw bela2in pake double masker tp masih aja ada...
2     Personil Bhabinkamtibmas Desa Baturiti Polsek ...
3     Itu baru planning ke dpn, coba lu baca\n\nSkrg...
4     @rahasiarif kak arif juga stay safe yah, kalo ...
                            ...                        
95    @Larasati_Nikken Benerr. tp mau gmn lg ya samp...
96    pake masker medis trus di dobel di luar perasa...
97                           Sesek juga pake dua masker
98    @Askrlfess Gapapa. Asal jangan kontak langsung...
99    5M: harus tetap\nMenjaga jarak,\nMencuci tanga...
Name: content, Length: 100, dtype: object
df.to_csv('/content/drive/MyDrive/final-project/dataset/dataset_masker.csv', index=None)

Cleaning
This step converts the text to lowercase, then removes usernames (mentions), text in square brackets, URLs, punctuation, words containing digits, curly quotes, and newlines.

def clean_text(tweet):
  tweet = tweet.lower()                                              # lowercase
  tweet = re.sub(r'@[^\s]+', '', tweet)                              # remove mentions
  tweet = re.sub(r'\[.*?\]', '', tweet)                              # remove text in square brackets
  tweet = re.sub(r'((www\.[^\s]+)|(https?://[^\s]+))', '', tweet)    # remove URLs
  tweet = re.sub('[%s]' % re.escape(string.punctuation), '', tweet)  # remove punctuation
  tweet = re.sub(r'\w*\d\w*', '', tweet)                             # remove words containing digits
  tweet = re.sub('[‘’“”…]', '', tweet)                               # remove curly quotes and ellipsis
  tweet = re.sub('\n', '', tweet)                                    # remove newlines
  return tweet

df['clean1'] = df.content.apply(clean_text)
df.head()
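
As a quick sanity check, clean_text can be applied to a single made-up tweet (a hypothetical example, not taken from the dataset):

# Hypothetical tweet, only to illustrate what clean_text strips out
sample = "@kemenkes Pakai MASKER dong!! [info] https://contoh.id/x covid19"
print(clean_text(sample))
# mentions, bracketed text, URLs, punctuation, and words containing digits are removed,
# leaving roughly 'pakai masker dong' plus leftover whitespace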

Removing Twitter-specific words and stopwords
Next, remove the words rt, rts, and retweet, which are not needed in this analysis; they are commonly used on Twitter when someone quotes a tweet. After that, the Indonesian stopwords are removed.

additional  = ['rt','rts','retweet']
swords = set().union(stopwords.words('indonesian'), additional)

df['clean2'] = (df['clean1'].apply(lambda x: ' '.join([word for word in x.split() if word not in (swords)])))
df.head()

Stemming
The next step is stemming: the words in the collected dataset are reduced to their base (root) forms.

factory = StemmerFactory()
stemmer = factory.create_stemmer()

df['clean3'] = df['clean2'].apply(stemmer.stem)

df
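
For a sense of what the stemmer does, it can be run on a short made-up phrase (illustrative only; Sastrawi removes Indonesian affixes such as the meN- prefix):

# Illustration: 'memakai' (to wear/use) is expected to be reduced to its root 'pakai'
print(stemmer.stem('memakai masker itu penting'))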

Tokenization
Tokenization is conducted to separate words, symbols, phrases, and other important entities (called tokens) from a text for later analysis.

df['tokens'] = pd.DataFrame(df['clean3'].apply(nltk.word_tokenize))
df
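
To make the token structure concrete, word_tokenize can be called on a short cleaned string (illustrative only):

print(nltk.word_tokenize('pakai masker dong'))  # ['pakai', 'masker', 'dong']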

Translate into English
Sentiment analysis will be carried out using TextBlob and VADER. Therefore, the collected and cleaned data is translated into English so that sentiment analysis can then be performed on it.

translator = google_translator()
def translate_column(text, target_language):
  return translator.translate(text, lang_tgt=target_language)
df['clean_english'] = df['clean3'].apply(lambda x: translate_column(x, 'en'))
df.head()
df.to_csv('/content/drive/MyDrive/final-project/dataset/dataset_masker_clean.csv', index=None)
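
The manual labeling mentioned in the preprocessing overview is done outside the notebook on this exported CSV. A sketch of how the labels could be attached back to the dataframe (the file name and the 'label' column are assumptions for illustration, not part of the original output):

# Sketch only: after adding a 'label' column ('positive'/'neutral'/'negative') by hand
# to the exported CSV, read it back and attach the labels to the working dataframe.
labeled = pd.read_csv('/content/drive/MyDrive/final-project/dataset/dataset_masker_labeled.csv')
df['label'] = labeled['label']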

Modelling using Textblob

The first model used is the TextBlob library, which provides out-of-the-box sentiment analysis. Unfortunately, TextBlob does not support Indonesian, so sentiment analysis cannot be performed on the Indonesian text directly; that is why the translation was done in the previous step.
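
TextBlob exposes sentiment as a polarity score in [-1, 1] (plus a subjectivity score in [0, 1]); a quick illustration on a made-up English sentence:

# Illustration: polarity > 0 means positive, < 0 negative, 0 neutral
example = TextBlob('Wearing a mask is a great way to stay safe')
print(example.sentiment.polarity)      # expected to be > 0 (positive)
print(example.sentiment.subjectivity)  # between 0 and 1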

df = df.drop(columns=['clean1', 'clean2', 'clean3','tokens'])

df['clean_english'] = df['clean_english'].astype('str')

def get_polarity(text):
  return TextBlob(text).sentiment.polarity

df['polarity'] = df['clean_english'].apply(get_polarity)

df['sentimen_textblob']=''
df.loc[df.polarity>0,'sentimen_textblob']='positive'
df.loc[df.polarity==0,'sentimen_textblob']='neutral'
df.loc[df.polarity<0,'sentimen_textblob']='negative'

positive=df[df['sentimen_textblob']=="positive"]
print("Positive\t: "+str(positive.shape[0]/(df.shape[0])*100)+"%")
pos=positive.shape[0]/df.shape[0]*100

negative=df[df['sentimen_textblob']=="negative"]
print("Negative\t: "+str(negative.shape[0]/(df.shape[0])*100)+"%")
neg=negative.shape[0]/df.shape[0]*100

neutral=df[df['sentimen_textblob']=="neutral"]
print("Neutral\t\t: "+str(neutral.shape[0]/(df.shape[0])*100)+"%")
net=neutral.shape[0]/df.shape[0]*100
Positive	: 42.0%
Negative	: 15.0%
Neutral		: 43.0%
def plot_bar(a,b):
  plt.figure(figsize = (7,5))
  plt.title('Tweet Sentiment', fontsize = 18)
  colors=('#DF3D0E','#88A8E5','#5AE27F')
  plt.bar(a,b, color = colors, edgecolor = 'black', linewidth = 1)
  plt.xlabel('Sentiment', fontsize = 15)
  plt.ylabel('Value', fontsize = 15)
  plt.xticks(fontsize = 12)
  plt.yticks(fontsize = 12)
  for k, v in b.items():
      plt.text(k, v-5, str(v), fontsize = 12, color = 'black', ha = 'center')

st = df.groupby(['sentimen_textblob']).size().reset_index(name='value')
plot_bar(st['sentimen_textblob'], st['value'])
explode=(0.1,0,0)
labels = 'positive', 'negative','neutral'
sizes=(pos,neg,net)
colors=('#5AE27F','#DF3D0E','#88A8E5')
  
plt.pie(sizes,explode=explode,colors=colors,autopct='%1.1f%%',startangle=120)
plt.legend(labels,loc=(-0.05,0.05),shadow=True)
plt.axis('equal')
plt.savefig("Sentiment_Analysis TextBlob.png")

Modelling with Vader

VADER uses a list of lexical features (e.g., words) labeled as positive or negative according to their semantic orientation to calculate text sentiment. For a given input sentence, it returns the proportions of positive, negative, and neutral content, along with a compound score that summarizes them. VADER is optimized for social media data and can yield good results with data from Twitter, Facebook, etc. In the output below, each tweet gets its pos, neg, neu, and compound scores.
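
For reference, the analyzer's polarity_scores method returns a dict with neg, neu, pos, and compound keys; a quick illustration on a made-up sentence (exact values depend on the VADER lexicon):

# Illustration only: a clearly negative sentence should get a compound score below 0
sid = SentimentIntensityAnalyzer()
print(sid.polarity_scores('I hate wearing two masks, it is so uncomfortable'))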

sid = SentimentIntensityAnalyzer()
df['scores'] = df['clean_english'].apply(lambda new_text: sid.polarity_scores(new_text))

df['compound'] = df['scores'].apply(lambda score_dict: score_dict['compound'])
df['sentimen_vader']=''
df.loc[df.compound>0,'sentimen_vader']='positive'
df.loc[df.compound==0,'sentimen_vader']='neutral'
df.loc[df.compound<0,'sentimen_vader']='negative'

positive=df[df['sentimen_vader']=="positive"]
print("Positive\t: "+str(positive.shape[0]/(df.shape[0])*100)+"%")
pos=positive.shape[0]/df.shape[0]*100
 
negative=df[df['sentimen_vader']=="negative"]
print("Negative\t: "+str(negative.shape[0]/(df.shape[0])*100)+"%")
neg=negative.shape[0]/df.shape[0]*100
  
neutral=df[df['sentimen_vader']=="neutral"]
print("Neutral\t\t: "+str(neutral.shape[0]/(df.shape[0])*100)+"%")
net=neutral.shape[0]/df.shape[0]*100
Positive	: 45.0%
Negative	: 20.0%
Neutral		: 35.0%
st = df.groupby(['sentimen_vader']).size().reset_index(name='value')
plot_bar(st['sentimen_vader'], st['value'])

explode=(0,0.1,0)
labels = 'positive', 'negative','neutral'
sizes=(pos,neg,net)
colors=('#5AE27F','#DF3D0E','#88A8E5')
  
plt.pie(sizes,explode=explode,colors=colors,autopct='%1.1f%%',startangle=120)
plt.legend(labels,loc=(-0.05,0.05),shadow=True)
plt.axis('equal')
plt.savefig("Sentiment_Analysis Vader.png")

Accuracy

The graph below shows a comparison between the manual labels and the labels produced by the TextBlob and VADER models. It can be seen in the graph that the results of the VADER model are closer to the manual labels. In terms of accuracy, the VADER model also scores higher than the TextBlob model: 79% versus 74%.


print(classification_report(df['label'],df['sentimen_textblob']))

Textblob Accuracy
              precision    recall  f1-score   support

    negative       0.53      0.57      0.55        14
     neutral       0.70      0.83      0.76        36
    positive       0.86      0.72      0.78        50

    accuracy                           0.74       100
   macro avg       0.70      0.71      0.70       100
weighted avg       0.75      0.74      0.74       100
print(classification_report(df['label'],df['sentimen_vader']))

Vader Accuracy
              precision    recall  f1-score   support

    negative       0.50      0.71      0.59        14
     neutral       0.83      0.81      0.82        36
    positive       0.89      0.80      0.84        50

    accuracy                           0.79       100
   macro avg       0.74      0.77      0.75       100
weighted avg       0.81      0.79      0.80       100
# Build a long-format dataframe: one row per (label, method) pair
new_df = pd.DataFrame(df['label'])
new_df.insert(1, 'method', 'manual')

for i in df['sentimen_textblob']:
  new_df.loc[len(new_df)] = [i, 'textblob']

for i in df['sentimen_vader']:
  new_df.loc[len(new_df)] = [i, 'vader']

fig1, ax1 = plt.subplots(figsize=(10,5))
sns.countplot(x='label', hue='method', data=new_df, ax=ax1)

Analysis

Because the VADER model achieved the highest accuracy, the sentiment analysis below uses its results.

From the graph above, it can be seen that of the 100 collected tweets, 20 have negative sentiment, 35 are neutral, and 45 are positive. Since the total is only 100, each tweet carries a weight of 1% in these figures. Based on these results, I suggest that the related parties (the Ministry of Health and others) provide education about the use of masks and their benefits. This could be done by increasing public service advertisements about mask use, or by creating creative content on the topic, so that people pay more attention to wearing masks during this pandemic.