Lecture 12: Sentiment Analysis
In the previous lecture we learned how to transform text into numerical data, which opens the door to using all of the tools of ML to answer important business and policy questions. In this lecture we're going to use this approach to learn about the sentiment embodied in the words. That is, we're going to learn about sentiment analysis, a technique that detects the underlying sentiment in a piece of text.
Discovering the sentiment of a text will enable us to interpret the text as positive, negative, or neutral. Conceptually, what we'll be doing is looking for key words which are predictive of a text being perceived as positive, negative, or neutral. In econometrics we would call this identification. Note that we did this when we looked at the word clouds in the previous lecture and noticed that the word "Free" was predictive of a text being spam.
The agenda for today's lecture is as follows:
1. Why is Sentiment Analysis Useful?
2. An Example: Reviews of Amazon Fine Foods
3. Classifying Reviews
4. Building an ML Model
5. Identifying Pivotal Words
1. Why is Sentiment Analysis Useful?
Sentiment analysis is essential for businesses to gauge customer response. For example, suppose a company has just released a new product that is being advertised on a number of different channels. To gauge their customers' response to this product, the firm could do sentiment analysis based on online reviews and social media since customers often use these media to voice their opinions and experience. These data can be collected and analyzed to gauge overall customer response.
Now take this a step further: We can also examine trends in the data. For example, customers of a certain age group and demographic may respond more favorably to a certain product or a product's characteristics than others.
By collecting and analyzing these data, companies can develop and position their products better. Note that consumers can be better off as well since the firms are delivering better products (ie, products which better address their interests).
2. An Example: Reviews of Amazon Fine Foods
We'll illustrate sentiment analysis via an example: Amazon Fine Food Reviews.
Reviews.csv consists of online reviews of fine foods from Amazon. The data span a period of more than 10 years, including all ~500,000 reviews up to October 2012. Reviews include product and user information, ratings, and a plain-text review; the file also includes reviews from all other Amazon categories. Let's load the data and see what we have:
import pandas as pd
df = pd.read_csv('./Data/Reviews.csv', nrows=10000)  # read only the first 10,000 reviews to keep the lecture quick
df.head()
 | Id | ProductId | UserId | ProfileName | HelpfulnessNumerator | HelpfulnessDenominator | Score | Time | Summary | Text
---|---|---|---|---|---|---|---|---|---|---
0 | 1 | B001E4KFG0 | A3SGXH7AUHU8GW | delmartian | 1 | 1 | 5 | 1303862400 | Good Quality Dog Food | I have bought several of the Vitality canned d... |
1 | 2 | B00813GRG4 | A1D87F6ZCVE5NK | dll pa | 0 | 0 | 1 | 1346976000 | Not as Advertised | Product arrived labeled as Jumbo Salted Peanut... |
2 | 3 | B000LQOCH0 | ABXLMWJIXXAIN | Natalia Corres "Natalia Corres" | 1 | 1 | 4 | 1219017600 | "Delight" says it all | This is a confection that has been around a fe... |
3 | 4 | B000UA0QIQ | A395BORC6FGVXV | Karl | 3 | 3 | 2 | 1307923200 | Cough Medicine | If you are looking for the secret ingredient i... |
4 | 5 | B006K2ZZ7K | A1UQRSCLF8GW1T | Michael D. Bigham "M. Wassir" | 0 | 0 | 5 | 1350777600 | Great taffy | Great taffy at a great price. There was a wid... |
We can see that the dataframe contains some product, user, and review information.
The most useful data for us will be:
- Text: This variable contains the complete product review information.
- Summary: This is a summary of the entire review.
- Score: The product rating provided by the customer.
df = df[['Text','Summary','Score']]
Let's do some EDA to see how the scores vary:
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import StrMethodFormatter

fig, ax = plt.subplots(figsize=(10,5))

# histogram of review scores
sns.countplot(x='Score', data=df, ax=ax, color='silver')

# label each bar with its height
for p in ax.patches:
    ax.annotate(format(p.get_height(), ',.0f'),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center',
                size=12,
                xytext=(0, 8),
                textcoords='offset points')

sns.despine(ax=ax)
ax.set_xlabel('Review Rating', size=16)
ax.set_ylabel('Frequency', size=16)
ax.set_title('Distribution of Review Scores', size=20)

# format the y-axis tick labels with thousands separators
ax.yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
plt.show()
Most of the customer ratings are positive, though there is some variation, which is good (ie, needed): we need variation in the scores to delineate positive from negative reviews. Let's take a look at the words themselves using word clouds.
from wordcloud import WordCloud
import nltk
from nltk.corpus import stopwords

# Let's eliminate stopwords:
stopwords = set(stopwords.words('english'))
# "br" and "href" are html leftovers we should get rid of. "good" and "great"
# also show up in negative sentiment (eg, "not good"), so I'm dropping those too.
stopwords.update(["br", "href", "good", "great"])
# Positive Sentiment
words = ' '.join(list(df[df['Score'] == 5]['Text']))
pos = WordCloud(stopwords=stopwords).generate(words)
# Negative Sentiment
words = ' '.join(list(df[df['Score'] == 1]['Text']))
neg = WordCloud(stopwords=stopwords).generate(words)
fig, ax = plt.subplots(1,2,figsize = (15, 10))
ax[0].imshow(pos)
ax[0].axis('off')
ax[0].set_title('Score = 5',size=18)
ax[1].imshow(neg)
ax[1].axis('off')
ax[1].set_title('Score = 1',size=18)
plt.subplots_adjust(hspace=.0)
plt.show()
Some popular positive words observed here include "taste," "product," "love," and "Amazon." Negative reviews are definitely different, but nothing in them is overtly negative.
3. Classifying Reviews
Our next step is to classify reviews into "positive" and "negative" using the customer scores.
We'll call any review with a Score greater than three positive. Negative reviews will be any review with a Score less than three. We'll drop neutral reviews (a Score of exactly three).
print(df.shape[0])
df = df[df['Score'] != 3]
df['Sentiment'] = df['Score'].apply(lambda rating : 1 if rating > 3 else 0)
print(df.shape[0])
10000
9138
Let's do the word cloud thing again with our new sentiment indicator.
# Positive Sentiment
words = ' '.join(list(df[df['Sentiment'] == 1]['Text']))
pos = WordCloud(stopwords=stopwords).generate(words)
# Negative Sentiment
words = ' '.join(list(df[df['Sentiment'] == 0]['Text']))
neg = WordCloud(stopwords=stopwords).generate(words)
fig, ax = plt.subplots(1,2,figsize = (15, 10))
ax[0].imshow(pos)
ax[0].axis('off')
ax[0].set_title('Positive Sentiment',size=18)
ax[1].imshow(neg)
ax[1].axis('off')
ax[1].set_title('Negative Sentiment',size=18)
plt.subplots_adjust(hspace=.0)
plt.show()
Comments: As seen above, the positive sentiment word cloud is full of positive words, such as "love," "best," and "delicious." The negative sentiment word cloud is filled with mostly negative words, such as "disappointed" and "yuck."
Note that the words "good" and "great" initially appeared in the negative sentiment word cloud, despite being positive words. This is probably because they were used in a negative context; eg, "not good." If you look at the previous cells, I added these to the stopword list to exclude them from the word clouds.
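One common way to handle negation like "not good" is to count word pairs (bigrams) as well as single words, so that "not good" becomes its own feature, distinct from "good." A minimal sketch using CountVectorizer's ngram_range option (illustrative only; the models below stick with single words, and on older scikit-learn versions get_feature_names_out is called get_feature_names):
from sklearn.feature_extraction.text import CountVectorizer
# Count single words and word pairs so "not good" is a feature distinct from "good".
vec = CountVectorizer(ngram_range=(1, 2))
bigram_counts = vec.fit_transform(df['Text'])
print([w for w in vec.get_feature_names_out() if w.startswith('not ')][:10])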
Let's take a look at the distribution of sentiment across the dataset:
fig, ax = plt.subplots(figsize=(10,5))

temp = df.copy()
temp['Sentiment'] = temp['Sentiment'].replace({1: 'Positive', 0: 'Negative'})

sns.countplot(x='Sentiment', data=temp, ax=ax, color='silver', order=['Negative', 'Positive'])

# label each bar with its height
for p in ax.patches:
    ax.annotate(format(p.get_height(), ',.0f'),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center',
                size=12,
                xytext=(0, 8),
                textcoords='offset points')

sns.despine(ax=ax)
ax.set_xlabel('Review Rating', size=16)
ax.set_ylabel('Frequency', size=16)
ax.set_title('Distribution of Customer Sentiment', size=20)
ax.yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
plt.show()
4. Building an ML Model
Let's build a sentiment analysis model so we can categorize the reviews. In so doing, we'll add some methodology and rigor to our wordcloud intuition.
Our model will take reviews in as input and will predict whether a review is positive or negative. We did something similar with the spam filter we developed in the previous lecture.
The first thing to do is build a couple of useful functions.
from nltk.stem.porter import PorterStemmer
from nltk.tokenize import word_tokenize as wt
from nltk.corpus import stopwords

def stem_it(x):
    # Reduce each word to its stem (eg, "delicious" -> "delici").
    stemmer = PorterStemmer()
    return [stemmer.stem(w) for w in x]

def remove_stops(x):
    # Drop common English stopwords from a list of tokens.
    stop_wrds = stopwords.words('english')
    temp = []
    for word in x:
        if word not in stop_wrds:
            temp.append(word)
    return temp
Now let's do our pre-processing on the text:
df['Text'] = df['Text'].str.lower()
df['Text'] = df['Text'].str.replace('[^A-Za-z]', ' ', regex=True)
df['Text'] = df['Text'].apply(wt)
df['Text'] = df['Text'].apply(remove_stops)
df['Text'] = df['Text'].apply(stem_it)
# Join the list we created in each observation of the "Text" field
df['Text'] = df['Text'].str.join(' ')
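Since we'll repeat these exact steps on another dataset in the practice section, it may be worth wrapping them in a helper. A sketch (the function name preprocess is my own):
def preprocess(text_series):
    # Lower-case, strip non-letters, tokenize, drop stopwords, stem, and re-join.
    s = text_series.str.lower()
    s = s.str.replace('[^A-Za-z]', ' ', regex=True)
    s = s.apply(wt).apply(remove_stops).apply(stem_it)
    return s.str.join(' ')

# Equivalent to the cell above:
# df['Text'] = preprocess(df['Text'])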
Now we'll convert the text to bag-of-words via CountVectorizer. First, I'll create a vectorizer object and then create the numpy X matrix using fit_transform and the y vector from the original dataframe. I'm doing it this way because we'll need the vectorizer object later to explain the results.
from sklearn.feature_extraction.text import CountVectorizer
# The matrix of word counts. I am limiting the feature matrix to 1000 columns.
matrix = CountVectorizer(max_features=1000)
X = matrix.fit_transform(df['Text']).toarray()
# The outcome data.
y = df['Sentiment']
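It's worth a quick peek at what the vectorizer produced before we model anything. A sketch (on older scikit-learn versions, get_feature_names_out is called get_feature_names):
# Inspect the feature matrix and the learned vocabulary.
print(X.shape)                              # (reviews, 1000 words)
print(matrix.get_feature_names_out()[:10])  # first few vocabulary words, alphabetically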
Split the data into training and testing data using train_test_split.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=0,test_size=0.2)
Let's train some models to make predictions based on the review text. First, we'll train a simple Naive Bayes model:
from sklearn.naive_bayes import GaussianNB

NB = GaussianNB()
model_NB = NB.fit(X_train, y_train)  # GaussianNB needs dense arrays, which is why we called '.toarray()' when building X above.
print('The accuracy of the Naive-Bayes ML model is {0:4.2f} percent.'.format(model_NB.score(X_test, y_test)*100))
The accuracy of the Naive-Bayes ML model is 67.23 percent.
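Accuracy alone can mislead when most reviews are positive, so it's worth seeing how the errors break down by class. A quick sketch with scikit-learn's standard metrics:
from sklearn.metrics import confusion_matrix, classification_report
# Rows of the confusion matrix are true classes; columns are predicted classes.
y_pred = model_NB.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=['negative', 'positive']))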
That was okay. There wasn't much to tune with that model, so let's try something more flexible and sophisticated: tune a logistic model using cross-validation. There are many ways to tune the model (see the manual page via LogisticRegression?) but we'll focus on the regularization parameter "C", where smaller values specify stronger regularization. Recall, regularization penalizes variables which are not very informative (eg, Ridge, LASSO).
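To see what C does in practice, here's a quick sketch: refit the model at a few values of C and compare the average size of the word weights. Smaller C should shrink the weights toward zero. The values of C here are illustrative, not part of the tuning below.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Smaller C = stronger regularization = smaller coefficients on the words.
for C in [0.01, 1, 100]:
    lr = LogisticRegression(C=C, max_iter=1000, random_state=0).fit(X_train, y_train)
    print(C, np.abs(lr.coef_).mean())
Now let's tune C properly with cross-validation: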
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(LogisticRegression(random_state=0,max_iter=1000), {'C': [0.001, 0.01, 0.1, 1, 10, 100]}) # 5-fold CV used as default (ie, 'cv=5')
grid.fit(X_train, y_train) # Do the CV to find the optimal 'C'
print('The best regularization parameter is: ', grid.best_params_) # Print the best results
model_logit = grid.best_estimator_ # recover the best estimator / model
model_logit.fit(X_train, y_train) # train the model
print('The accuracy of the tuned logistic ML model is {0:4.2f} percent.'.format(model_logit.score(X_test, y_test)*100))
The best regularization parameter is: {'C': 0.1}
The accuracy of the tuned logistic ML model is 90.26 percent.
Pretty damn good! I wonder which words were important. Let's dig into the model to understand "why" it works.
5. Identifying Pivotal Words
An important step when working with machine learning models is debugging. For example, when working with text we have to check whether noise in our features, like unwanted symbols or numbers, affects the predictions. We need to know what is responsible for a prediction and be able to explain the model's output. In the past we talked about feature importances, which can also help debug a machine learning model, but there is an easier and more functional way to do this.
ELI5 is a python package used to inspect ML classifiers and explain their predictions. Install it via:
pip install eli5
Let's get to work by first finding the weights the model places on each word. For a linear model like our logistic regression, these weights are just the fitted coefficients: words with large positive weights push a review toward the positive class, and words with large negative weights push it toward the negative class. (For black-box models, ELI5 can instead estimate importance by permuting features with noise and checking whether the classification changes.)
import eli5
eli5.show_weights(model_logit, vec=matrix) # we also need to set the vectorizer we have used.
y=1 top features
Weight? | Feature |
---|---|
+1.147 | <BIAS> |
+1.054 | delici |
+1.012 | great |
+1.010 | perfect |
+0.925 | best |
+0.902 | excel |
+0.900 | smooth |
+0.878 | nice |
+0.681 | amaz |
+0.673 | without |
+0.666 | happi |
… 534 more positive … | |
… 447 more negative … | |
-0.660 | lack |
-0.680 | away |
-0.706 | bad |
-0.838 | weak |
-0.845 | terribl |
-0.902 | worst |
-0.928 | wast |
-0.968 | return |
-1.499 | disappoint |
Hmmm... just as we saw in the word clouds: stems like "delici" (delicious) and "perfect" are associated with positive reviews while "disappoint" and "terribl" are associated with negative ones. These reviews are about food, so these word associations make sense.
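Under the hood, these weights are the logistic regression coefficients, so we can recover the same ranking by hand. A sketch (again, on older scikit-learn get_feature_names_out is get_feature_names):
import numpy as np
# Sort the fitted coefficients to find the most negative and most positive words.
words = matrix.get_feature_names_out()
order = np.argsort(model_logit.coef_[0])
print('most negative:', [words[i] for i in order[:5]])
print('most positive:', [words[i] for i in order[-5:]])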
ELI5 goes even further. We can assess the importance of words within a review towards the overall sentiment.
review="I love this vendor. The food is great and never stale."
eli5.explain_prediction(model_logit, review, vec=matrix)
y=1 (probability 0.900, score 2.196) top features
Contribution? | Feature |
---|---|
+1.147 | <BIAS> |
+1.049 | Highlighted in text (sum) |
i love this vendor. the food is great and never stale.
C'mon, that is seriously cool!
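As a sanity check, the "probability 0.900" and "score 2.196" in the output above are linked by the logistic function: 1/(1 + e^(-2.196)) is roughly 0.900.
import numpy as np
# The probability eli5 reports is the logistic function applied to the reported score.
print(1 / (1 + np.exp(-2.196)))  # ~0.900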
Practice
The file GOPdebate.csv has thousands of tweets about the August 2015 GOP Presidential debate in Ohio. Contributors were asked to do both sentiment analysis and data categorization: whether the tweet was relevant, which candidate was mentioned, what subject was mentioned, and what the sentiment of the tweet was. Your task is to do sentiment analysis on these tweets.
- Load GOPdebate.csv. There's a lot of stuff there but we'll only need "text" and "sentiment".
debate = pd.read_csv('./Data/GOPdebate.csv')
print(debate.shape)
debate.head()
(13871, 21)
 | id | candidate | candidate_confidence | relevant_yn | relevant_yn_confidence | sentiment | sentiment_confidence | subject_matter | subject_matter_confidence | candidate_gold | ... | relevant_yn_gold | retweet_count | sentiment_gold | subject_matter_gold | text | tweet_coord | tweet_created | tweet_id | tweet_location | user_timezone
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 1 | No candidate mentioned | 1.0 | yes | 1.0 | Neutral | 0.6578 | None of the above | 1.0000 | NaN | ... | NaN | 5 | NaN | NaN | RT @NancyLeeGrahn: How did everyone feel about... | NaN | 2015-08-07 09:54:46 -0700 | 629697200650592256 | NaN | Quito |
1 | 2 | Scott Walker | 1.0 | yes | 1.0 | Positive | 0.6333 | None of the above | 1.0000 | NaN | ... | NaN | 26 | NaN | NaN | RT @ScottWalker: Didn't catch the full #GOPdeb... | NaN | 2015-08-07 09:54:46 -0700 | 629697199560069120 | NaN | NaN |
2 | 3 | No candidate mentioned | 1.0 | yes | 1.0 | Neutral | 0.6629 | None of the above | 0.6629 | NaN | ... | NaN | 27 | NaN | NaN | RT @TJMShow: No mention of Tamir Rice and the ... | NaN | 2015-08-07 09:54:46 -0700 | 629697199312482304 | NaN | NaN |
3 | 4 | No candidate mentioned | 1.0 | yes | 1.0 | Positive | 1.0000 | None of the above | 0.7039 | NaN | ... | NaN | 138 | NaN | NaN | RT @RobGeorge: That Carly Fiorina is trending ... | NaN | 2015-08-07 09:54:45 -0700 | 629697197118861312 | Texas | Central Time (US & Canada) |
4 | 5 | Donald Trump | 1.0 | yes | 1.0 | Positive | 0.7045 | None of the above | 1.0000 | NaN | ... | NaN | 156 | NaN | NaN | RT @DanScavino: #GOPDebate w/ @realDonaldTrump... | NaN | 2015-08-07 09:54:45 -0700 | 629697196967903232 | NaN | Arizona |
5 rows × 21 columns
# Keeping only the necessary columns
debate = debate[['text','sentiment']]
debate.head()
 | text | sentiment
---|---|---
0 | RT @NancyLeeGrahn: How did everyone feel about... | Neutral |
1 | RT @ScottWalker: Didn't catch the full #GOPdeb... | Positive |
2 | RT @TJMShow: No mention of Tamir Rice and the ... | Neutral |
3 | RT @RobGeorge: That Carly Fiorina is trending ... | Positive |
4 | RT @DanScavino: #GOPDebate w/ @realDonaldTrump... | Positive |
- Create a figure to visualize how many tweets are in each category.
fig, ax = plt.subplots(figsize=(10,5))

sns.countplot(x='sentiment', data=debate, ax=ax, color='silver', order=['Negative', 'Neutral', 'Positive'])

# label each bar with its height
for p in ax.patches:
    ax.annotate(format(p.get_height(), ',.0f'),
                (p.get_x() + p.get_width() / 2., p.get_height()),
                ha='center', va='center',
                size=12,
                xytext=(0, 8),
                textcoords='offset points')

ax.set_ylim(0, 10000)
sns.despine(ax=ax)
ax.set_xlabel('\nTweet Rating', size=16)
ax.set_ylabel('Frequency', size=16)
ax.set_title('Distribution of Viewer Sentiment', size=20)
ax.yaxis.set_major_formatter(StrMethodFormatter('{x:,.0f}'))
plt.show()
- Drop the "neutral" tweets. Convert "positive" and "negative" to one and zero, respectively.
debate = debate[debate['sentiment']!='Neutral']
debate['sentiment'] = debate['sentiment'].replace({'Positive' : 1})
debate['sentiment'] = debate['sentiment'].replace({'Negative' : 0})
debate.head()
 | text | sentiment
---|---|---
1 | RT @ScottWalker: Didn't catch the full #GOPdeb... | 1 |
3 | RT @RobGeorge: That Carly Fiorina is trending ... | 1 |
4 | RT @DanScavino: #GOPDebate w/ @realDonaldTrump... | 1 |
5 | RT @GregAbbott_TX: @TedCruz: "On my first day ... | 1 |
6 | RT @warriorwoman91: I liked her and was happy ... | 0 |
- Compare word clouds for positive and negative tweets.
from nltk.corpus import stopwords
# Let's eliminate stopwords:
stopwords = set(stopwords.words('english'))
stopwords.update(["http", "https","RT","co","GOPDebate","GOPDebates","debate","RWSurferGirl"])
# Positive Sentiment
words = ' '.join(list(debate[debate['sentiment'] == 1]['text']))
pos = WordCloud(stopwords=stopwords).generate(words)
# Negative Sentiment
words = ' '.join(list(debate[debate['sentiment'] == 0]['text']))
neg = WordCloud(stopwords=stopwords).generate(words)
fig, ax = plt.subplots(1,2,figsize = (15, 10))
ax[0].imshow(pos)
ax[0].axis('off')
ax[0].set_title('Positive Sentiment',size=18)
ax[1].imshow(neg)
ax[1].axis('off')
ax[1].set_title('Negative Sentiment',size=18)
plt.subplots_adjust(hspace=.0)
plt.show()
- Process the text data:
- convert to lower case
- remove non-alphabetic characters
- tokenize the strings
- remove stopwords and stem the tokens
- join the tokenized strings back together
from nltk.corpus import stopwords # reload "stopwords" since we overwrote the name with a set of words earlier
debate['text'] = debate['text'].str.lower()
debate['text'] = debate['text'].str.replace('[^A-Za-z]', ' ', regex=True)
debate['text'] = debate['text'].apply(wt)
debate['text'] = debate['text'].apply(remove_stops)
debate['text'] = debate['text'].apply(stem_it)
# Join the list we created in each observation of the "Text" field
debate['text'] = debate['text'].str.join(' ')
- Convert the text to bag-of-words via CountVectorizer (or maybe try TF-IDF using TfidfVectorizer). Create your X matrix using fit_transform and your y vector using the original df. If you selected a sub-sample as a proof-of-concept, make sure that's reflected in the size of y (ie, you'll select a subset of the original df rows).
from sklearn.feature_extraction.text import CountVectorizer # bag of words
from sklearn.feature_extraction.text import TfidfVectorizer # tf-idf
matrix = CountVectorizer(max_features=10000) # bag-of-words
# matrix = TfidfVectorizer(max_features=10000) # tf-idf
X = matrix.fit_transform(debate['text']).toarray()
y = debate['sentiment'] # numerical category
- Split the data into training and testing data using train_test_split.
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=0,test_size=0.2)
- Tune and fit a Support Vector Machine (SVM) ML model and evaluate the fit. First, try different values of the regularization parameter "C" and then add to this by trying different values of gamma and different kernels.
from sklearn.svm import SVC

# Define grid of values to do CV. Note: 'gamma' only matters for non-linear
# kernels (eg, 'rbf'); with kernel='linear' it has no effect on the fit.
param_grid = {'C': [1, 5, 10, 50],
              'gamma': [0.0001, 0.0005, 0.001, 0.005],
              'kernel': ['linear']}

grid = GridSearchCV(SVC(), param_grid)  # 5-fold CV used as default (ie, 'cv=5')

# Fit the model via cross-validation on the training data
grid.fit(X_train, y_train)
print('The optimal parameters are:')
print(grid.best_params_)
The optimal parameters are:
{'C': 10, 'gamma': 0.005, 'kernel': 'linear'}
# Fit the model at the best parameters
model_svm = grid.best_estimator_ # recover the best estimator / model
model_svm.fit(X_train, y_train) # train the model
print('The accuracy of the tuned SVM ML model is {0:4.2f} percent.'.format(model_svm.score(X_test, y_test)*100))
The accuracy of the tuned SVM ML model is 82.85 percent.
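Accuracy is easier to interpret next to a no-skill baseline, especially if the positive/negative classes are unbalanced (as the count plot above suggests). A sketch using scikit-learn's DummyClassifier:
from sklearn.dummy import DummyClassifier
# Always predict the most frequent class in the training data.
dummy = DummyClassifier(strategy='most_frequent').fit(X_train, y_train)
print('Baseline accuracy: {0:4.2f} percent.'.format(dummy.score(X_test, y_test)*100))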
- What words (features) are important in determining whether a tweet is positive or negative?
eli5.show_weights(model_svm, vec=matrix) # we also need to set the vectorizer we have used.
y=1 top features
Weight? | Feature |
---|---|
+4.290 | malbec |
+3.599 | makeupbylivixx |
+3.024 | slone |
+2.986 | succeed |
+2.965 | doriabiddl |
+2.727 | inadvert |
+2.672 | persian |
+2.671 | applaus |
+2.604 | six |
+2.485 | zjq |
+2.467 | romney |
+2.460 | ultim |
… 1950 more positive … | |
… 3656 more negative … | |
-2.402 | gopdebacl |
-2.431 | showman |
-2.562 | disappoint |
-2.708 | thepatriot |
-2.811 | enforc |
-2.886 | sobertaci |
-3.106 | ugh |
-3.180 | congress |
We can use regular expressions and raw string notation to identify tweets which contain the words above. This is good practice for checking whether we should exclude certain words and redo the analysis; whether we do so depends on the objective of the modeling exercise.
Raw strings take the form r'some text', so the only difference between writing a raw string and a regular string is the r in front. Raw strings are very useful in programming, so they're worth knowing in their own right.
How do we do this? We put our expression(s) in parentheses. For example, if I want to find the word 'loss' I write r'(loss)'. The r and the quotes make it a raw string, the () encapsulate the regex, and the regex itself is loss.
The code below identifies and prints tweets which contain any text we're interested in.
cnt = debate['text'].str.findall(r'(allagh)')
debate[cnt.map(len)!=0]['text']
9032     rt gallagherpreach anyon els wait snl weekend ...
10645    rt gallagherpreach anyon els wait snl weekend ...
10704    rt gallagherpreach anyon els wait snl weekend ...
Name: text, dtype: object
cnt = debate['text'].str.findall(r'(spray)')
debate[cnt.map(len)!=0]['text']
1418    gopdeb keep spray tan industri aliv thrive bri...
2835    trump like napalm spray candid gopdeb
6220    rt saladinahm part huckabe start scream fetus ...
Name: text, dtype: object
cnt = debate['text'].str.findall(r'(enjoy)')
debate[cnt.map(len)!=0]['text']
1214 realdonaldtrump treat us enjoy gopdeb 1443 rt grneyedmandi grown ass adult women right se... 2673 grown ass adult women right seem enjoy treat l... 3378 biggest megynkelli fan world fox gener realli ... 3895 gopdeb great enjoy watch men forgot took lax e... ... 9375 rt donniewahlberg enjoy gopdeb look forward de... 9376 rt donniewahlberg enjoy gopdeb look forward de... 9377 enjoy gopdeb look forward democraticdeb next 13336 gal enjoy watch gopdeb without toler peopl att... 13849 rt kaylasmith realli enjoy everyth marcorubio ... Name: text, Length: 76, dtype: object