Lecture 12: Sentiment Analysis
In the previous lecture we learned how to transform text into numerical data, which opens the door to using all of the tools of ML to answer important business and policy questions. In this lecture we're going to use this approach to learn about the sentiment embodied in the words. That is, we're going to learn about sentiment analysis, a technique that detects the underlying sentiment in a piece of text.
Discovering the sentiment of a text will enable us to interpret the text as positive, negative, or neutral. Conceptually, what we'll be doing is looking for key words which are predictive of a text being perceived as positive, negative, or neutral. In econometrics we would call this identification. Note that we did this when we looked at the word clouds in the previous lecture and noticed that the word "Free" was predictive of a text being spam.
The agenda for today's lecture is as follows:
1. Why is Sentiment Analysis Useful?
2. An Example: Reviews of Amazon Fine Foods
3. Classifying Reviews
4. Building an ML Model
5. Identifying Pivotal Words
1. Why is Sentiment Analysis Useful?
Sentiment analysis is essential for businesses to gauge customer response. For example, suppose a company has just released a new product that is being advertised on a number of different channels. To gauge their customers' response to this product, the firm could do sentiment analysis based on online reviews and social media since customers often use these media to voice their opinions and experience. These data can be collected and analyzed to gauge overall customer response.
Now take this a step further: We can also examine trends in the data. For example, customers of a certain age group and demographic may respond more favorably to a certain product or a product's characteristics than others.
By collecting and analyzing these data, companies can develop and position their products better. Note that consumers can be better off as well since the firms are delivering better products (ie, products which better address their interests).
2. An Example: Reviews of Amazon Fine Foods
We'll illustrate sentiment analysis via an example: Amazon Fine Food Reviews.
Reviews.csv consists of online reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain-text review; the file also includes reviews from all other Amazon categories. Let's load the data and see what we have:
import pandas as pd
df = pd.read_csv('./Data/Reviews.csv', nrows=10000)  # read only the first 10,000 reviews to keep the lecture quick
df.head()
 | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text
---|---|---|---|---|---|---|---|---|---|---
0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |
We can see that the dataframe contains some product, user, and review information.
The most useful data for us will be:
- Text: This variable contains the complete product review information.
- Summary: This is a summary of the entire review.
- Score: The product rating provided by the customer.
df = df[['Text','Summary','Score']]
Let's do some EDA to see how the scores vary:
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import StrMethodFormatter

fig, ax = plt.subplots(figsize=(10,5))

# histogram of review scores
sns.countplot(x='Score', data=df, ax=ax, color='silver')

# label each bar with its height
for p in ax.patches:
    ax.annotate(format(p.get_height(), ',.0f'),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center',
                size=12,
                xytext=(0, 8),
                textcoords='offset points')

sns.despine(ax=ax)
ax.set_xlabel('Review Rating', size=16)
ax.set_ylabel('Frequency', size=16)
ax.set_title('Distribution of Review Scores', size=20)

# format the y-axis tick labels with thousands separators
ax.yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
plt.show()
Most of the customer ratings are positive, though there is some variation, which is good (ie, needed): we need variation in the scores to delineate positive from negative reviews. Let's take a look at the words themselves using word clouds.
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords

# Let's eliminate stopwords:
stopwords = set(stopwords.words('english'))
# "br" and "href" are html leftovers we should get rid of. "good" and "great"
# also show up in negative sentiment (eg, "not good"), so I'm dropping those too.
stopwords.update(["br", "href", "good", "great"])
# Positive Sentiment
words = ' '.join(list(df[df['Score'] == 5]['Text']))
pos = WordCloud(stopwords=stopwords).generate(words)
# Negative Sentiment
words = ' '.join(list(df[df['Score'] == 1]['Text']))
neg = WordCloud(stopwords=stopwords).generate(words)
fig, ax = plt.subplots(1,2,figsize = (15, 10))
ax[0].imshow(pos)
ax[0].axis('off')
ax[0].set_title('Score = 5',size=18)
ax[1].imshow(neg)
ax[1].axis('off')
ax[1].set_title('Score = 1',size=18)
plt.subplots_adjust(hspace=.0)
plt.show()
Some popular positive words observed here include "taste," "product," "love," and "Amazon." Negative reviews are definitely different, but nothing in them is overtly negative.
3. Classifying Reviews
Our next step is to classify reviews into "positive" and "negative" using the customer scores.
We'll call any review with a Score greater than three positive. Negative reviews will be any review with a Score less than three. We'll drop neutral reviews (a Score of exactly three).
print(df.shape[0])
df = df[df['Score'] != 3]
df['Sentiment'] = df['Score'].apply(lambda rating : 1 if rating > 3 else 0)
print(df.shape[0])
10000
9138
Let's do the word cloud thing again with our new sentiment indicator.
# Positive Sentiment
words = ' '.join(list(df[df['Sentiment'] == 1]['Text']))
pos = WordCloud(stopwords=stopwords).generate(words)
# Negative Sentiment
words = ' '.join(list(df[df['Sentiment'] == 0]['Text']))
neg = WordCloud(stopwords=stopwords).generate(words)
fig, ax = plt.subplots(1,2,figsize = (15, 10))
ax[0].imshow(pos)
ax[0].axis('off')
ax[0].set_title('Positive Sentiment',size=18)
ax[1].imshow(neg)
ax[1].axis('off')
ax[1].set_title('Negative Sentiment',size=18)
plt.subplots_adjust(hspace=.0)
plt.show()
Comments: As seen above, the positive sentiment word cloud is full of positive words, such as "love," "best," and "delicious." The negative sentiment word cloud is filled with mostly negative words, such as "disappointed" and "yuck."
Note that the words "good" and "great" initially appeared in the negative sentiment word cloud, despite being positive words. This is probably because they were used in a negative context; eg, "not good." If you look at the previous cells, I added these to the stopword list to exclude them from the word clouds.
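One common way to handle negation like "not good" is to count word pairs (bigrams) as well as single words, so that "not good" becomes its own feature, distinct from "good." A minimal sketch using CountVectorizer's ngram_range option (illustrative only; the models below stick with single words, and on older scikit-learn versions get_feature_names_out is called get_feature_names):
from sklearn.feature_extraction.text import CountVectorizer
# Count single words and word pairs so "not good" is a feature distinct from "good".
vec = CountVectorizer(ngram_range=(1, 2))
bigram_counts = vec.fit_transform(df['Text'])
print([w for w in vec.get_feature_names_out() if w.startswith('not ')][:10])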
Let's take a look at the distribution of sentiment across the dataset:
fig, ax = plt.subplots(figsize=(10,5))

temp = df.copy()
temp['Sentiment'] = temp['Sentiment'].replace({1: 'Positive', 0: 'Negative'})

sns.countplot(x='Sentiment', data=temp, ax=ax, color='silver', order=['Negative', 'Positive'])

# label each bar with its height
for p in ax.patches:
    ax.annotate(format(p.get_height(), ',.0f'),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center',
                size=12,
                xytext=(0, 8),
                textcoords='offset points')

sns.despine(ax=ax)
ax.set_xlabel('Review Rating', size=16)
ax.set_ylabel('Frequency', size=16)
ax.set_title('Distribution of Customer Sentiment', size=20)
ax.yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
plt.show()
4. Building an ML Model
Let's build a sentiment analysis model so we can categorize the reviews. In so doing, we'll add some methodology and rigor to our wordcloud intuition.
Our model will take reviews in as input and will predict whether a review is positive or negative. We did something similar with the spam filter we developed in the previous lecture.
The first thing to do is build a couple of useful functions.
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize as wt
from nltk.corpus import stopwords

def stem_it(x):
    # Reduce each word to its stem (eg, "delicious" -> "delici").
    stemmer = PorterStemmer()
    return [stemmer.stem(w) for w in x]

def remove_stops(x):
    # Drop common English stopwords from a list of tokens.
    stop_wrds = stopwords.words('english')
    temp = []
    for word in x:
        if word not in stop_wrds:
            temp.append(word)
    return temp
Now let's do our pre-processing on the text:
df['Text'] = df['Text'].str.lower()
df['Text'] = df['Text'].str.replace('[^A-Za-z]', ' ', regex=True)
df['Text'] = df['Text'].apply(wt)
df['Text'] = df['Text'].apply(remove_stops)
df['Text'] = df['Text'].apply(stem_it)
# Join the list we created in each observation of the "Text" field
df['Text'] = df['Text'].str.join(' ')
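Since we'll repeat these exact steps on another dataset in the practice section, it may be worth wrapping them in a helper. A sketch (the function name preprocess is my own):
def preprocess(text_series):
    # Lower-case, strip non-letters, tokenize, drop stopwords, stem, and re-join.
    s = text_series.str.lower()
    s = s.str.replace('[^A-Za-z]', ' ', regex=True)
    s = s.apply(wt).apply(remove_stops).apply(stem_it)
    return s.str.join(' ')

# Equivalent to the cell above:
# df['Text'] = preprocess(df['Text'])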
Now we'll convert the text to bag-of-words via CountVectorizer. First, I'll create a vectorizer object and then create the numpy X matrix using fit_transform and the y vector from the original dataframe. I'm doing it this way because we'll need the vectorizer object later to explain the results.
from sklearn.feature_extraction.text import CountVectorizer
# The matrix of word counts. I am limiting the feature matrix to 1000 columns.
matrix = CountVectorizer(max_features=1000)
X = matrix.fit_transform(df['Text']).toarray()
# The outcome data.
y = df['Sentiment']
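It's worth a quick peek at what the vectorizer produced before we model anything. A sketch (on older scikit-learn versions, get_feature_names_out is called get_feature_names):
# Inspect the feature matrix and the learned vocabulary.
print(X.shape)                              # (reviews, 1000 words)
print(matrix.get_feature_names_out()[:10])  # first few vocabulary words, alphabetically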
Split the data into training and testing data using train_test_split.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=0,test_size=0.2)
Let's train some models to make predictions based on the review text. First, we'll train a simple Naive Bayes model:
from sklearn.naive_bayes import GaussianNB

NB = GaussianNB()
model_NB = NB.fit(X_train, y_train)  # GaussianNB needs dense arrays, which is why we called '.toarray()' when building X above.
print('The accuracy of the Naive-Bayes ML model is {0:4.2f} percent.'.format(model_NB.score(X_test, y_test)*100))
The accuracy of the Naive-Bayes ML model is 67.23 percent.
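Accuracy alone can mislead when most reviews are positive, so it's worth seeing how the errors break down by class. A quick sketch with scikit-learn's standard metrics:
from sklearn.metrics import confusion_matrix, classification_report
# Rows of the confusion matrix are true classes; columns are predicted classes.
y_pred = model_NB.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['negative', 'positive']))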
That was okay. There wasn't much to tune with that model, so let's try something more flexible and sophisticated: tune a logistic model using cross-validation. There are many ways to tune the model (see the manual page via LogisticRegression?) but we'll focus on the regularization parameter "C", where smaller values specify stronger regularization. Recall, regularization penalizes variables which are not very informative (eg, Ridge, LASSO).
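To see what C does in practice, here's a quick sketch: refit the model at a few values of C and compare the average size of the word weights. Smaller C should shrink the weights toward zero. The values of C here are illustrative, not part of the tuning below.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Smaller C = stronger regularization = smaller coefficients on the words.
for C in [0.01, 1, 100]:
    lr = LogisticRegression(C=C, max_iter=1000, random_state=0).fit(X_train, y_train)
    print(C, np.abs(lr.coef_).mean())
Now let's tune C properly with cross-validation: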
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(LogisticRegression(random_state=0,max_iter=1000), {'C': [0.001, 0.01, 0.1, 1, 10, 100]}) # 5-fold CV used as default (ie, 'cv=5')
grid.fit(X_train, y_train) # Do the CV to find the optimal 'C'
print('The best regularization parameter is: ', grid.best_params_) # Print the best results
model_logit = grid.best_estimator_ # recover the best estimator / model
model_logit.fit(X_train, y_train) # train the model
print('The accuracy of the tuned logistic ML model is {0:4.2f} percent.'.format(model_logit.score(X_test, y_test)*100))
The best regularization parameter is: {'C': 0.1}
The accuracy of the tuned logistic ML model is 90.26 percent.
Pretty damn good! I wonder which words were important. Let's dig into the model to understand "why" it works.
5. Identifying Pivotal Words
An important step when working with machine learning models is debugging. For example, when working with text we have to check whether noise in our features, like unwanted symbols or numbers, affects the predictions. We need to know what is responsible for a prediction and be able to explain the model's output. In the past we talked about feature importances, which can also help debug a machine learning model, but there is an easier and more functional way to do this.
ELI5 is a python package used to inspect ML classifiers and explain their predictions. Install it via:
pip install eli5
Let's get to work by first finding the weights the model places on each word. For a linear model like our logistic regression, these weights are just the fitted coefficients: words with large positive weights push a review toward the positive class, and words with large negative weights push it toward the negative class. (For black-box models, ELI5 can instead estimate importance by permuting features with noise and checking whether the classification changes.)
import eli5
eli5.show_weights(model_logit, vec=matrix) # we also need to set the vectorizer we have used.
y=1 top features
Weight? | Feature |
---|---|
+1.147 | <BIAS> |
+1.054 | delici |
+1.012 | great |
+1.010 | perfect |
+0.925 | best |
+0.902 | excel |
+0.900 | smooth |
+0.878 | nice |
+0.681 | amaz |
+0.673 | without |
+0.666 | happi |
… 534 more positive … | |
… 447 more negative … | |
-0.660 | lack |
-0.680 | away |
-0.706 | bad |
-0.838 | weak |
-0.845 | terribl |
-0.902 | worst |
-0.928 | wast |
-0.968 | return |
-1.499 | disappoint |
Hmmm... just as we saw in the word clouds: stems like "delici" (delicious) and "perfect" are associated with positive reviews while "disappoint" and "terribl" are associated with negative ones. These reviews are about food, so these word associations make sense.
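Under the hood, these weights are the logistic regression coefficients, so we can recover the same ranking by hand. A sketch (again, on older scikit-learn get_feature_names_out is get_feature_names):
import numpy as np
# Sort the fitted coefficients to find the most negative and most positive words.
words = matrix.get_feature_names_out()
order = np.argsort(model_logit.coef_[0])
print('most negative:', [words[i] for i in order[:5]])
print('most positive:', [words[i] for i in order[-5:]])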
ELI5 goes even further. We can assess the importance of words within a review towards the overall sentiment.
review="I love this vendor. The food is great and never stale."
eli5.explain_prediction(model_logit, review, vec=matrix)
y=1 (probability 0.900, score 2.196) top features
Contribution? | Feature |
---|---|
+1.147 | <BIAS> |
+1.049 | Highlighted in text (sum) |
i love this vendor. the food is great and never stale.
C'mon, that is seriously cool!
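As a sanity check, the "probability 0.900" and "score 2.196" in the output above are linked by the logistic function: 1/(1 + e^(-2.196)) is roughly 0.900.
import numpy as np
# The probability eli5 reports is the logistic function applied to the reported score.
print(1 / (1 + np.exp(-2.196)))  # ~0.900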
Practice
The file GOPdebate.csv has thousands of tweets about the August 2015 GOP Presidential debate in Ohio. Contributors were asked to do both sentiment analysis and data categorization: whether the tweet was relevant, which candidate was mentioned, what subject was mentioned, and what the sentiment of the tweet was. Your task is to do sentiment analysis on these tweets.
- Load GOPdebate.csv. There's a lot of stuff there but we'll only need "text" and "sentiment".
debate = pd.read_csv('./Data/GOPdebate.csv')
print(debate.shape)
debate.head()
(13871, 21)
 | id | candidate | candidate_confidence | relevant_yn | relevant_yn_confidence | sentiment | sentiment_confidence | subject_matter | subject_matter_confidence | candidate_gold | ... | relevant_yn_gold | retweet_count | sentiment_gold | subject_matter_gold | text | tweet_coord | tweet_created | tweet_id | tweet_location | user_timezone
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | No candidate mentioned | 1.0 | yes | 1.0 | Neutral | 0.6578 | None of the above | 1.0000 | NaN | ... | NaN | 5 | NaN | NaN | RT @NancyLeeGrahn: How did everyone feel about... | NaN | 2015-08-07 09:54:46 -0700 | 629697200650592256 | NaN | Quito |
1 | 2 | Scott Walker | 1.0 | yes | 1.0 | Positive | 0.6333 | None of the above | 1.0000 | NaN | ... | NaN | 26 | NaN | NaN | RT @ScottWalker: Didn't catch the full #GOPdeb... | NaN | 2015-08-07 09:54:46 -0700 | 629697199560069120 | NaN | NaN |
2 | 3 | No candidate mentioned | 1.0 | yes | 1.0 | Neutral | 0.6629 | None of the above | 0.6629 | NaN | ... | NaN | 27 | NaN | NaN | RT @TJMShow: No mention of Tamir Rice and the ... | NaN | 2015-08-07 09:54:46 -0700 | 629697199312482304 | NaN | NaN |
3 | 4 | No candidate mentioned | 1.0 | yes | 1.0 | Positive | 1.0000 | None of the above | 0.7039 | NaN | ... | NaN | 138 | NaN | NaN | RT @RobGeorge: That Carly Fiorina is trending ... | NaN | 2015-08-07 09:54:45 -0700 | 629697197118861312 | Texas | Central Time (US & Canada) |
4 | 5 | Donald Trump | 1.0 | yes | 1.0 | Positive | 0.7045 | None of the above | 1.0000 | NaN | ... | NaN | 156 | NaN | NaN | RT @DanScavino: #GOPDebate w/ @realDonaldTrump... | NaN | 2015-08-07 09:54:45 -0700 | 629697196967903232 | NaN | Arizona |
5 rows × 21 columns
# Keeping only the necessary columns
debate = debate[['text','sentiment']]
debate.head()
 | text | sentiment
---|---|---
0 | RT @NancyLeeGrahn: How did everyone feel about... | Neutral |
1 | RT @ScottWalker: Didn't catch the full #GOPdeb... | Positive |
2 | RT @TJMShow: No mention of Tamir Rice and the ... | Neutral |
3 | RT @RobGeorge: That Carly Fiorina is trending ... | Positive |
4 | RT @DanScavino: #GOPDebate w/ @realDonaldTrump... | Positive |
- Create a figure to visualize how many tweets are in each category.
fig, ax = plt.subplots(figsize=(10,5))

sns.countplot(x='sentiment', data=debate, ax=ax, color='silver', order=['Negative', 'Neutral', 'Positive'])

# label each bar with its height
for p in ax.patches:
    ax.annotate(format(p.get_height(), ',.0f'),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center',
                size=12,
                xytext=(0, 8),
                textcoords='offset points')

ax.set_ylim(0, 10000)
sns.despine(ax=ax)
ax.set_xlabel('\nTweet Rating', size=16)
ax.set_ylabel('Frequency', size=16)
ax.set_title('Distribution of Viewer Sentiment', size=20)
ax.yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
plt.show()
- Drop the "neutral" tweets. Convert "positive" and "negative" to one and zero, respectively.
debate = debate[debate['sentiment']!='Neutral']
debate['sentiment'] = debate['sentiment'].replace({'Positive' : 1})
debate['sentiment'] = debate['sentiment'].replace({'Negative' : 0})
debate.head()
 | text | sentiment
---|---|---
1 | RT @ScottWalker: Didn't catch the full #GOPdeb... | 1 |
3 | RT @RobGeorge: That Carly Fiorina is trending ... | 1 |
4 | RT @DanScavino: #GOPDebate w/ @realDonaldTrump... | 1 |
5 | RT @GregAbbott_TX: @TedCruz: "On my first day ... | 1 |
6 | RT @warriorwoman91: I liked her and was happy ... | 0 |
- Compare word clouds for positive and negative tweets.
from nltk.corpus import stopwords
# Let's eliminate stopwords:
stopwords = set(stopwords.words('english'))
stopwords.update(["http", "https","RT","co","GOPDebate","GOPDebates","debate","RWSurferGirl"])
# Positive Sentiment
words = ' '.join(list(debate[debate['sentiment'] == 1]['text']))
pos = WordCloud(stopwords=stopwords).generate(words)
# Negative Sentiment
words = ' '.join(list(debate[debate['sentiment'] == 0]['text']))
neg = WordCloud(stopwords=stopwords).generate(words)
fig, ax = plt.subplots(1,2,figsize = (15, 10))
ax[0].imshow(pos)
ax[0].axis('off')
ax[0].set_title('Positive Sentiment',size=18)
ax[1].imshow(neg)
ax[1].axis('off')
ax[1].set_title('Negative Sentiment',size=18)
plt.subplots_adjust(hspace=.0)
plt.show()
- Process the text data:
- convert to lower case
- remove non-alphabetic characters
- tokenize the strings
- remove stopwords and stem the tokens
- join the tokenized strings back together
from nltk.corpus import stopwords # reload "stopwords" since we overwrote the name with a set of words earlier
debate['text'] = debate['text'].str.lower()
debate['text'] = debate['text'].str.replace('[^A-Za-z]', ' ', regex=True)
debate['text'] = debate['text'].apply(wt)
debate['text'] = debate['text'].apply(remove_stops)
debate['text'] = debate['text'].apply(stem_it)
# Join the list we created in each observation of the "Text" field
debate['text'] = debate['text'].str.join(' ')
- Convert the text to bag-of-words via CountVectorizer (or maybe try TF-IDF using TfidfVectorizer). Create your X matrix using fit_transform and your y vector using the original df. If you selected a sub-sample as a proof-of-concept, make sure that's reflected in the size of y (ie, you'll select a subset of the original df rows).
from sklearn.feature_extraction.text import CountVectorizer # bag of words
from sklearn.feature_extraction.text import TfidfVectorizer # tf-idf
matrix = CountVectorizer(max_features=10000) # bag-of-words
# matrix = TfidfVectorizer(max_features=10000) # tf-idf
X = matrix.fit_transform(debate['text']).toarray()
y = debate['sentiment'] # numerical category
- Split the data into training and testing data using train_test_split.
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=0,test_size=0.2)
- Tune and fit a Support Vector Machine (SVM) ML model and evaluate the fit. First, try different values of the regularization parameter "C" and then add to this by trying different values of gamma and different kernels.
from sklearn.svm import SVC

# Define grid of values to do CV. Note: 'gamma' only matters for non-linear
# kernels (eg, 'rbf'); with kernel='linear' it has no effect on the fit.
param_grid = {'C': [1, 5, 10, 50],
              'gamma': [0.0001, 0.0005, 0.001, 0.005],
              'kernel': ['linear']}

grid = GridSearchCV(SVC(), param_grid)  # 5-fold CV used as default (ie, 'cv=5')

# Fit the model via cross-validation on the training data
grid.fit(X_train, y_train)
print('The optimal parameters are:')
print(grid.best_params_)
The optimal parameters are:
{'C': 10, 'gamma': 0.005, 'kernel': 'linear'}
# Fit the model at the best parameters
model_svm = grid.best_estimator_ # recover the best estimator / model
model_svm.fit(X_train, y_train) # train the model
print('The accuracy of the tuned SVM ML model is {0:4.2f} percent.'.format(model_svm.score(X_test, y_test)*100))
The accuracy of the tuned SVM ML model is 82.85 percent.
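Accuracy is easier to interpret next to a no-skill baseline, especially if the positive/negative classes are unbalanced (as the count plot above suggests). A sketch using scikit-learn's DummyClassifier:
from sklearn.dummy import DummyClassifier
# Always predict the most frequent class in the training data.
dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
print('Baseline accuracy: {0:4.2f} percent.'.format(dummy.score(X_test, y_test)*100))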
- What words (features) are important in determining whether a tweet is positive or negative?
eli5.show_weights(model_svm, vec=matrix) # we also need to set the vectorizer we have used.
y=1 top features
Weight? | Feature |
---|---|
+4.290 | malbec |
+3.599 | makeupbylivixx |
+3.024 | slone |
+2.986 | succeed |
+2.965 | doriabiddl |
+2.727 | inadvert |
+2.672 | persian |
+2.671 | applaus |
+2.604 | six |
+2.485 | zjq |
+2.467 | romney |
+2.460 | ultim |
… 1950 more positive … | |
… 3656 more negative … | |
-2.402 | gopdebacl |
-2.431 | showman |
-2.562 | disappoint |
-2.708 | thepatriot |
-2.811 | enforc |
-2.886 | sobertaci |
-3.106 | ugh |
-3.180 | congress |
We can use regular expressions and raw string notation to identify tweets which contain the words above. This is good practice for checking whether we should exclude certain words and redo the analysis; whether we do so depends on the objective of the modeling exercise.
Raw strings take the form r'some text', so the only difference between writing a raw string and a regular string is the r in front. Raw strings are very useful in programming, so they're worth knowing in their own right.
How do we do this? We put our expression(s) in parentheses. For example, if I want to find the word 'loss' I write r'(loss)'. The r and the quotes make it a raw string, the () encapsulate the regex, and the regex itself is loss.
The code below identifies and prints tweets which contain any text we're interested in.
cnt = debate['text'].str.findall(r'(allagh)')
debate[cnt.map(len)!=0]['text']
9032     rt gallagherpreach anyon els wait snl weekend ...
10645    rt gallagherpreach anyon els wait snl weekend ...
10704    rt gallagherpreach anyon els wait snl weekend ...
Name: text, dtype: object
cnt = debate['text'].str.findall(r'(spray)')
debate[cnt.map(len)!=0]['text']
1418    gopdeb keep spray tan industri aliv thrive bri...
2835    trump like napalm spray candid gopdeb
6220    rt saladinahm part huckabe start scream fetus ...
Name: text, dtype: object
cnt = debate['text'].str.findall(r'(enjoy)')
debate[cnt.map(len)!=0]['text']
1214 realdonaldtrump treat us enjoy gopdeb 1443 rt grneyedmandi grown ass adult women right se... 2673 grown ass adult women right seem enjoy treat l... 3378 biggest megynkelli fan world fox gener realli ... 3895 gopdeb great enjoy watch men forgot took lax e... ... 9375 rt donniewahlberg enjoy gopdeb look forward de... 9376 rt donniewahlberg enjoy gopdeb look forward de... 9377 enjoy gopdeb look forward democraticdeb next 13336 gal enjoy watch gopdeb without toler peopl att... 13849 rt kaylasmith realli enjoy everyth marcorubio ... Name: text, Length: 76, dtype: object