Lecture 11: Natural Language Processing (NLP) [FINISHED]
In this lecture we're going to shift gears from dealing with numerical data to text data.
Working with text as data is known as Natural Language Processing (NLP). A common use of NLP is categorizing a set of text. Perhaps the most ubiquitous example is a spam filter. It reads the text of a message and determines if it is "spam" or "ham."
We'll employ one simple NLP algorithm, known as the bag-of-words algorithm, to classify SMS messages as spam. There are more sophisticated methods (see the end of the lecture), but this will give us the big idea. More sophisticated methods typically amount to tweaks to how we process the data or more complex classifier models.
We will use the Natural Language Toolkit (nltk) to help us process the text data. It comes with Anaconda, but if you need to install it:
pip install nltk
We'll also be using word clouds to visualize the most common words, so install this as well:
pip install wordcloud
The agenda for today's lecture is as follows:
1. Motivation: Spam Messages
2. The bag-of-words model
3. Developing a model to identify spam messages
4. Term Frequency Inverse Document Frequency (TF-IDF)
1. Motivation: Spam Messages
Spam emails/messages belong to the broad category of unsolicited messages received by a user. Spam occupies unwanted space and bandwidth, amplifies the threat of viruses, and in general exploits a user’s connection to social networks. Plus, they're annoying.
Our goal is to classify a message as spam (unwanted message) or ham (wanted message).
Languages are harder for algorithms to interpret and analyze than numeric data since:
- Sentences are not of fixed lengths, but most algorithms require a standard input vector size.
- Most algorithms cannot understand words as input; hence, each word needs to be represented by some numeric value.
So our method is:
- Pre-processing: clean up the text. This is the new stuff.
- Estimate a classifier model on the training data: let $\text{word}_{ji}$ be the number of times word $j$ occurs in message $i$.
- Test the model on the testing data.
- Use the estimated model to filter incoming messages.
2. The bag-of-words model
A bag-of-words model allows us to extract features from textual data. Since an algorithm doesn't understand language, we need to use a numeric representation for the words. This numeric representation can later be fed to any algorithm for further analysis. There are many ways to do this; bag-of-words is the simplest and provides a good foundation for working with text.
The model is called "bag-of-words" because the order of the words and the structure of the sentence are lost in this model. Only the occurrence or presence of a word matters. Hence, we can think of the model this way:
- we have a big empty bag
- we have a vocabulary (i.e., a text or corpus).
We pick up words one by one and put them in the bag, adding to the frequency of their occurrence. We then select the most common words as features for passing through our algorithm of choice. We can therefore view our approach as identifying documents which share similar kinds of words.
Here is an example:
import numpy as np
import pandas as pd
# Corpus is a fancy word for a collection (or a body) of text.
# Label marks a message as spam (1) or not spam (0).
corpus = [('Text 1', 'You have won a prize. Call today to claim.', 1),
('Text 2', 'It is your mother. Call me.', 0),
('Text 3', 'Are you around today? I need a favor.', 0)]
data = pd.DataFrame(corpus, columns=['Document Number','Text of Documents', 'Label'])
data.head()
Document Number | Text of Documents | Label | |
---|---|---|---|
0 | Text 1 | You have won a prize. Call today to claim. | 1 |
1 | Text 2 | It is your mother. Call me. | 0 |
2 | Text 3 | Are you around today? I need a favor. | 0 |
Even though Python is good with text, we will still need to convert our text into numeric data to get a classifier model to analyze it. Let's create a matrix with the word counts. Each row of the matrix is an observation (a message) and each column is a word. The cells in the matrix are the number of times that word is found in the message.
scikit gives us the CountVectorizer to do this for us.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['Text of Documents'])
print(X.toarray())
[[0 0 1 1 0 1 0 0 0 0 0 1 1 1 1 1 0]
 [0 0 1 0 0 0 1 1 1 1 0 0 0 0 0 0 1]
 [1 1 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0]]
Note that X is an array-like object. We are in the realm of scikit, which doesn't use DataFrames. Let's turn this back into a DataFrame, though, so we can see things clearly.
cols = vectorizer.get_feature_names_out()
count = pd.DataFrame(X.toarray(), columns=cols, index=data['Document Number'])
count.head()
are | around | call | claim | favor | have | is | it | me | mother | need | prize | to | today | won | you | your | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Document Number | |||||||||||||||||
Text 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
Text 2 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Text 3 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
The X array contains our features and the data['Label'] column contains our outcome variable. We now have the data ready to estimate a classifier model (e.g., a logit regression).
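Just to confirm the shapes line up (one row per message, one column for each of the 17 unique words), a quick check:
print(X.shape)               # (3, 17): 3 messages, 17 unique words
print(data['Label'].values)  # the outcome for each message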
Note that this methodology of turning text into data is not limited to classification problems. For example, we could use this approach to connect stock performance with FOMC statements to predict how the Federal Reserve's statements on the economy influence the S&P 500, the Dow Jones, individual stocks, and government treasury prices. NLP is a broad topic and a lot of fun.
This dataset is too small to actually fit a model, so let's move on to something bigger.
3. Developing a model to identify spam messages
The dataset that we are using is an SMS spam collection dataset. It contains over 5,500 messages in English. There are two columns. The first column corresponds to the actual text message. The second column tells us whether the text is 'ham' or 'spam'.
dataset = pd.read_csv('./Data/spam.csv')
dataset.rename(columns = {'v1': 'labels', 'v2': 'message'}, inplace = True)
dataset['label'] = dataset['labels'].map({'ham': 0, 'spam': 1})
dataset
labels | message | label | |
---|---|---|---|
0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
1 | ham | Ok lar... Joking wif u oni... | 0 |
2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
3 | ham | U dun say so early hor... U c already then say... | 0 |
4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |
... | ... | ... | ... |
5567 | spam | This is the 2nd time we have tried 2 contact u... | 1 |
5568 | ham | Will Ì_ b going to esplanade fr home? | 0 |
5569 | ham | Pity, * was in mood for that. So...any other s... | 0 |
5570 | ham | The guy did some bitching but I acted like i'd... | 0 |
5571 | ham | Rofl. Its true to its name | 0 |
5572 rows × 3 columns
Data Pre-Processing
This is the part that makes NLP different from working with numeric data. We need to clean up the text and turn it into a feature matrix.
- 'I am helping raise $100 for UGA Athens' (original)
- 'i am helping raise $100 for uga athens' (homogenize the capitalization)
- 'i am helping raise for uga athens' (remove non-alphabetic characters)
- ['i', 'am', 'helping', 'raise', 'for', 'uga', 'athens'] (tokenize)
- ['helping', 'raise', 'uga', 'athens'] (remove stop words)
- ['help', 'raise', 'uga', 'athens'] (stem and lemmatize)
- 'help raise uga athens' (back to a single string)
Then create the feature matrix
help | raise | uga | athens |
---|---|---|---|
1 | 1 | 1 | 1 |
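Before walking through each step on the real data, here is a minimal sketch of the whole pipeline applied to that one example string (it assumes the nltk 'punkt' and 'stopwords' data have been downloaded, which we do below):
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

text = 'I am helping raise $100 for UGA Athens'
text = text.lower()                                                 # 1. homogenize the capitalization
text = re.sub('[^A-Za-z]', ' ', text)                               # 2. remove non-alphabetic characters
words = word_tokenize(text)                                         # 3. tokenize
words = [w for w in words if w not in stopwords.words('english')]   # 4. remove stop words
words = [PorterStemmer().stem(w) for w in words]                    # 5. stem
print(' '.join(words))                                              # back to a single string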
1. Homogenize the capitalization
We don't want to worry about 'Hello' not being equal to 'hello'. Let's make everything lowercase.
# 1. Homogenize capitalization
dataset['message'] = dataset['message'].str.lower()
dataset.tail()
labels | message | label | |
---|---|---|---|
5567 | spam | this is the 2nd time we have tried 2 contact u... | 1 |
5568 | ham | will ì_ b going to esplanade fr home? | 0 |
5569 | ham | pity, * was in mood for that. so...any other s... | 0 |
5570 | ham | the guy did some bitching but i acted like i'd... | 0 |
5571 | ham | rofl. its true to its name | 0 |
2. Remove non-alphabetic characters
Our algorithm will only use words to characterize messages. This is not strictly necessary (perhaps messages with numbers in them are more likely to be spam?), but it simplifies our approach today.
We will remove them using a regular expression. We have not covered 'regex' (there is never enough time!), but regex is a powerful string-search language that is a part of Python and most other programming languages. I have a notebook on regex here which you can work through if you are interested.
The code to remove the non-alphabetic characters is
dataset['message'].str.replace('[^A-Za-z]', ' ', regex=True)
The regex part is the '[^A-Za-z]'. It says: "find everything that is not the letters A through Z or a through z." We replace the non-alphabetic stuff with a space.
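For instance, a quick illustration on a made-up string, using Python's built-in re module (which does the same thing as the pandas .str.replace call):
import re
print(re.sub('[^A-Za-z]', ' ', 'call 555-1234 now!'))   # digits and punctuation turn into spaces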
# 2. Remove non-alphabetic characters
dataset['message'] = dataset['message'].str.replace('[^A-Za-z]', ' ', regex=True)
dataset.tail()
labels | message | label | |
---|---|---|---|
5567 | spam | this is the nd time we have tried contact u... | 1 |
5568 | ham | will b going to esplanade fr home | 0 |
5569 | ham | pity was in mood for that so any other s... | 0 |
5570 | ham | the guy did some bitching but i acted like i d... | 0 |
5571 | ham | rofl its true to its name | 0 |
3. Tokenize the strings
Break the strings up into lists of words, which are easier to process. This is very similar to using .str.split(' '). Here we use the tokenizer method from nltk. It is a bit more sophisticated than a simple split.
from nltk.tokenize import word_tokenize as wt
We also need to download nltk's 'punkt' tokenizer models.
import nltk
# 3. Tokenize the strings.
# Download the 'punkt' tokenizer models.
nltk.download('punkt')
from nltk.tokenize import word_tokenize as wt
dataset['message'] = dataset['message'].apply(wt)
dataset.tail()
[nltk_data] Downloading package punkt to [nltk_data] C:\Users\jt83241\AppData\Roaming\nltk_data... [nltk_data] Package punkt is already up-to-date!
labels | message | label | |
---|---|---|---|
5567 | spam | [this, is, the, nd, time, we, have, tried, con... | 1 |
5568 | ham | [will, b, going, to, esplanade, fr, home] | 0 |
5569 | ham | [pity, was, in, mood, for, that, so, any, othe... | 0 |
5570 | ham | [the, guy, did, some, bitching, but, i, acted,... | 0 |
5571 | ham | [rofl, its, true, to, its, name] | 0 |
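To see what the nltk tokenizer adds over a plain split, here is a small illustrative comparison (the string is made up):
from nltk.tokenize import word_tokenize

print("don't worry, call me.".split(' '))       # a plain split leaves punctuation glued to the words
print(word_tokenize("don't worry, call me."))   # the tokenizer splits off punctuation and contractions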
4. Removing stop words
Now we eliminate stop words: words in the text that add no specific meaning. They often involve prepositions, helping verbs, and articles (e.g., in, the, an, is). Since these add no value to our model, let's get rid of them.
Fortunately, linguists have already compiled lists of stop words, so we can readily identify and exclude them:
from nltk.corpus import stopwords
stop_wrds = stopwords.words('english')
stop_wrds is a list of English-language stop words.
We need to loop through the lists and check for stop words. I will write a small function that does the looping and then apply it to the DataFrame's column using .apply().
Again, we need to download the stop words first.
# 4. Remove stop words.
nltk.download('stopwords')
from nltk.corpus import stopwords
def remove_stops(x):
stop_wrds = stopwords.words('english')
temp = []
for word in x:
if word not in stop_wrds:
temp.append(word)
return temp
dataset['message'] = dataset['message'].apply(remove_stops)
dataset.tail()
[nltk_data] Downloading package stopwords to [nltk_data] C:\Users\jt83241\AppData\Roaming\nltk_data... [nltk_data] Package stopwords is already up-to-date!
labels | message | label | |
---|---|---|---|
5567 | spam | [nd, time, tried, contact, u, u, pound, prize,... | 1 |
5568 | ham | [b, going, esplanade, fr, home] | 0 |
5569 | ham | [pity, mood, suggestions] | 0 |
5570 | ham | [guy, bitching, acted, like, interested, buyin... | 0 |
5571 | ham | [rofl, true, name] | 0 |
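In case you are curious about what just got removed, here is a quick peek at the list (the data was downloaded in the cell above):
from nltk.corpus import stopwords

stop_wrds = stopwords.words('english')
print(len(stop_wrds))    # around 180-200 entries, depending on the nltk version
print(stop_wrds[:10])    # starts with very common words like 'i', 'me', 'my', ...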
5. Stemming and Lemmatization
Words like act, actor, and acting are all versions of the same root word (act). Stemming and lemmatization are techniques used to truncate words in order to get the stem or base word. The difference between the two is that after stemming, the stem may not be an actual word, whereas lemmatization always produces a real word, which makes the corpora easier for humans to interpret.
For example, studies could be stemmed as studi (not a word), but would be lemmatized as study (an existing word). To be honest, this feels like a rabbit hole, so I'm treating this stuff as a black box and trusting that the linguists are doing a good job.
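A quick way to see the difference, using the studies example (a small sketch; the lemmatizer needs nltk's WordNet data, an extra download we don't otherwise use today):
import nltk
nltk.download('wordnet')   # the lemmatizer needs the WordNet data
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

print(PorterStemmer().stem('studies'))            # 'studi' (not a word)
print(WordNetLemmatizer().lemmatize('studies'))   # 'study' (a real word)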
Let's stem these words.
# 5. Stemming and lemmatization
from nltk.stem.porter import PorterStemmer
def stem_it(x):
stemmer = PorterStemmer()
return [stemmer.stem(w) for w in x]
dataset['message'] = dataset['message'].apply(stem_it)
dataset.tail()
labels | message | label | |
---|---|---|---|
5567 | spam | [nd, time, tri, contact, u, u, pound, prize, c... | 1 |
5568 | ham | [b, go, esplanad, fr, home] | 0 |
5569 | ham | [piti, mood, suggest] | 0 |
5570 | ham | [guy, bitch, act, like, interest, buy, someth,... | 0 |
5571 | ham | [rofl, true, name] | 0 |
That seemed like a lot of work, but it always does when we are first learning something. Putting all the code together, the processing is simply:
dataset['message'] = dataset['message'].str.lower()
dataset['message'] = dataset['message'].str.replace('[^A-Za-z]', ' ', regex=True)
dataset['message'] = dataset['message'].apply(wt)
dataset['message'] = dataset['message'].apply(remove_stops)
dataset['message'] = dataset['message'].apply(stem_it)
You could even wrap all that up in a function, too...
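For example, a minimal sketch of such a wrapper (the name preprocess is just an illustration; it reuses the wt, remove_stops, and stem_it functions defined above):
def preprocess(messages):
    # Run the whole cleaning pipeline on a Series of raw text messages.
    messages = messages.str.lower()
    messages = messages.str.replace('[^A-Za-z]', ' ', regex=True)
    messages = messages.apply(wt)
    messages = messages.apply(remove_stops)
    return messages.apply(stem_it)

# dataset['message'] = preprocess(dataset['message'])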
Create the feature matrix
We are done pre-processing.
- Turn the lists of words back into strings.
- Create the feature matrix using CountVectorizer.
# The matrix of word counts. I am limiting the feature matrix to 1000 columns.
dataset['message'] = dataset['message'].str.join(' ')
X = CountVectorizer(max_features=1000).fit_transform(dataset['message'])
# The outcome data.
y = dataset['label']
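A quick sanity check on the dimensions (a sketch: with max_features=1000 we expect one row per message and at most 1,000 word columns):
print(X.shape)   # should be (5572, 1000): one row per message, up to 1000 word columns
print(y.shape)   # (5572,)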
Visualizing Keywords
We've seen that data visualization is a handy way of better understanding variation in the data, and that's still true here, though the visualization tools will be different. For example, we can make a word cloud, which displays the most common words with the size of each word proportional to the frequency of its occurrence. To do so, we need to add a new package:
pip install wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns
spam_words = ' '.join(list(dataset[dataset['label'] == 1]['message']))
spam_wc = WordCloud(width = 600,height = 512).generate(spam_words)
plt.figure(figsize = (12, 8), facecolor = 'k')
plt.imshow(spam_wc)
plt.axis('off')
plt.tight_layout(pad = 0)
plt.show()
Estimate the logit model
Nothing new here.
- Split data into test/train
- Estimate on training data
- Test on, well, testing data
# Create the train-test split.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=0)
# Estimate the model
from sklearn.linear_model import LogisticRegression
model_logit = LogisticRegression(random_state=0).fit(X_train, y_train)
accuracy = model_logit.score(X_test, y_test)
print('The accuracy of the logit model is {0:4.2f} percent.'.format(accuracy*100))
The accuracy of the logit model is 97.99 percent.
Not too shabby!
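To actually use this as a spam filter (step 4 from the outline at the top), you would keep the fitted vectorizer around, push any new message through the same pre-processing, and call .predict(). A minimal sketch, assuming we refit while keeping a reference to the vectorizer (the example message is made up):
# Refit, keeping the vectorizer so new messages get mapped to the same columns.
vec = CountVectorizer(max_features=1000)
X_all = vec.fit_transform(dataset['message'])
filter_model = LogisticRegression(random_state=0).fit(X_all, y)

new_msg = 'free prize call today claim'                 # a made-up message, already pre-processed
print(filter_model.predict(vec.transform([new_msg])))   # 1 = spam, 0 = ham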
Practice
We're going to practice using the "20 Newsgroup" data set which is
a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his "Newsweeder: Learning to filter netnews" paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
This data set is located within scikit learn but I've created a csv file you can load.
- Download, unzip, and load the file 'newsgroups.csv'. Only import the first 500 rows. Try the nrows option of .read_csv().
  - article is the message.
  - category code is the newsgroup category code.
  - category is the newsgroup category name.
Our goal: Create a classifier that predicts the category code of an article.
df = pd.read_csv('./Data/newsgroups.csv', nrows=500)
df.head(2)
article | category code | category | |
---|---|---|---|
0 | I was wondering if anyone out there could enli... | 7 | rec.autos |
1 | A fair number of brave souls who upgraded thei... | 4 | comp.sys.mac.hardware |
- Make sure "article" is a string. Use
.astype(str)
.
df['article'] = df['article'].astype('str')
- How many articles are there in each category? Looks like it's time for .groupby().
df.groupby('category')['article'].count()
category
alt.atheism                 26
comp.graphics               23
comp.os.ms-windows.misc     34
comp.sys.ibm.pc.hardware    16
comp.sys.mac.hardware       35
comp.windows.x              22
misc.forsale                27
rec.autos                   21
rec.motorcycles             29
rec.sport.baseball          30
rec.sport.hockey            25
sci.crypt                   23
sci.electronics             30
sci.med                     29
sci.space                   27
soc.religion.christian      33
talk.politics.guns          18
talk.politics.mideast       24
talk.politics.misc          18
talk.religion.misc          10
Name: article, dtype: int64
- Process the text data. All the code to do this is gathered in the cell above the Create feature matrix heading.
df['article'] = df['article'].str.lower()
df['article'] = df['article'].str.replace('[^A-Za-z]', ' ', regex=True)
df['article'] = df['article'].apply(wt)
df['article'] = df['article'].apply(remove_stops)
df['article'] = df['article'].apply(stem_it)
df.head()
article | category code | category | |
---|---|---|---|
0 | [wonder, anyon, could, enlighten, car, saw, da... | 7 | rec.autos |
1 | [fair, number, brave, soul, upgrad, si, clock,... | 4 | comp.sys.mac.hardware |
2 | [well, folk, mac, plu, final, gave, ghost, wee... | 4 | comp.sys.mac.hardware |
3 | [weitek, address, phone, number, like, get, in... | 1 | comp.graphics |
4 | [articl, c, owcb, n, p, world, std, com, tomba... | 14 | sci.space |
- Turn the lists of words in article into strings using .str.join(' ').
df['article'] = df['article'].str.join(' ')
- Create the feature matrix.
matrix = CountVectorizer(max_features=10000)
X = matrix.fit_transform(df['article']).toarray()
- Create the outcome variable. (the Series that contains the category codes)
y = df['category code']
- Create your testing and training datasets.
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=0,test_size=0.2)
- Estimate a logit model.
model_logit = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train)
- Check the accuracy of your model by computing the .score().
print('The accuracy of the logit model is {0:4.2f} percent.'.format(model_logit.score(X_test, y_test)*100))
The accuracy of the logit model is 38.00 percent.
- Go back to step 1. and increase the number of rows to 1000. Rerun your code. Does the accuracy improve?
df = pd.read_csv('./Data/newsgroups.csv', nrows=1000)
# Data preparation
df['article'] = df['article'].astype('str')
df['article'] = df['article'].str.lower()
df['article'] = df['article'].str.replace('[^A-Za-z]', ' ', regex=True)
df['article'] = df['article'].apply(wt)
df['article'] = df['article'].apply(remove_stops)
df['article'] = df['article'].apply(stem_it)
df['article'] = df['article'].str.join(' ')
# Create the features and outcome variable
matrix = CountVectorizer(max_features=10000)
X = matrix.fit_transform(df['article']).toarray()
y = df['category code']
# Train and test the model
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=0,test_size=0.2)
model_logit = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train)
print('The accuracy of the logit model is {0:4.2f} percent.'.format(model_logit.score(X_test, y_test)*100))
The accuracy of the logit model is 51.50 percent.
- Try a Random Forest model. Set n_estimators=100 and random_state=0.
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
clf = RandomForestClassifier(n_estimators=100,random_state=0)
# Train the model using the training data, then predict on the test data.
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('The accuracy of the random forest model is {0:4.2f} percent.'.format(metrics.accuracy_score(y_test, y_pred)*100))
The accuracy of the random forest model is 49.00 percent.
- There are about 20,000 rows in the data file. You can keep adding data, but things are going to get slow. This is why big data projects need special techniques and big computers. That being said, the "big" ideas are the same -- we're just operating at a bigger scale.
import time # used below as a timer
for nrow in [1000, 5000, 10000, 15000, 'All']:
start_time = time.time()
if nrow != 'All':
print('Loading sample with ' + str(nrow) + ' observations...')
df = pd.read_csv('./Data/newsgroups.csv', nrows=nrow)
else:
print('Loading full sample...')
df = pd.read_csv('./Data/newsgroups.csv')
# Data preparation
df['article'] = df['article'].astype('str')
df['article'] = df['article'].str.lower()
df['article'] = df['article'].str.replace('[^A-Za-z]', ' ', regex=True)
df['article'] = df['article'].apply(wt)
df['article'] = df['article'].apply(remove_stops)
df['article'] = df['article'].apply(stem_it)
df['article'] = df['article'].str.join(' ')
# Create the features and outcome variable
matrix = CountVectorizer(max_features=10000)
X = matrix.fit_transform(df['article']).toarray()
y = df['category code']
# Train and test the model
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=0,test_size=0.2)
model_logit = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train)
print('The accuracy of the logit model is {0:4.2f} percent.'.format(model_logit.score(X_test, y_test)*100))
print("Time to run: " + str(round((time.time() - start_time)/60,2)) + ' minutes.' )
print('\n')
Loading sample with 1000 observations...
The accuracy of the logit model is 51.50 percent.
Time to run: 0.23 minutes.

Loading sample with 5000 observations...
The accuracy of the logit model is 62.50 percent.
Time to run: 1.15 minutes.

Loading sample with 10000 observations...
The accuracy of the logit model is 66.00 percent.
Time to run: 2.38 minutes.

Loading sample with 15000 observations...
The accuracy of the logit model is 66.40 percent.
Time to run: 3.5 minutes.

Loading full sample...
The accuracy of the logit model is 68.44 percent.
Time to run: 5.05 minutes.
4. Term Frequency Inverse Document Frequency (TF-IDF)
Let's try adding a little more intelligence to the word-count methodology by replacing bag-of-words with TF-IDF (Term Frequency Inverse Document Frequency) to account for the importance of words. This is a very common way to transform text into a meaningful numeric representation that can then be used to fit a machine learning model. The statistic is designed to capture how important a word is to a document while also taking into account its relation to the other documents in the same corpus. It does this by looking at how many times a word appears in a document while also paying attention to how many times the same word appears in the other documents of the corpus.
Let's dig into the details a bit.
TF-IDF consists of two parts:
- Term Frequency (TF): how many times a word appears in a document.
- Inverse Document Frequency (IDF): a measure of how rare the word is across the set of documents.
Let's demonstrate the idea with the example from before:
corpus = [('Document 1', 'Alot of people like to play football'),
('Document 2', 'many like to eat'),
('Document 3', 'According to data, many like to sing')]
data = pd.DataFrame(corpus,columns=['Document Number','text of Documents'])
data.head()
Document Number | text of Documents | |
---|---|---|
0 | Document 1 | Alot of people like to play football |
1 | Document 2 | many like to eat |
2 | Document 3 | According to data, many like to sing |
To find "term frequency" count the number of times each unique word occurs in the text. This is where we stopped with Bag-of-Words.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['text of Documents'])
cols = vectorizer.get_feature_names_out()
count = pd.DataFrame(X.toarray(), columns=cols, index=['Document 1','Document 2','Document 3'])
count.head()
according | alot | data | eat | football | like | many | of | people | play | sing | to | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Document 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 |
Document 2 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
Document 3 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 2 |
Term frequency tends to give more weight to common words (e.g., "is", "if", "the"), so we need something to reduce this effect, since common words won't be useful for identifying a pattern in the documents. For example, if we had a document on sports and another on medicine, words like "football" and "hypertension" would be rare and would separate (identify) the documents. Zipf's Law tells us that the frequency of any word is inversely proportional to its rank in a frequency table: the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
Accordingly, IDF applies more weight to words which occur rarely.
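Concretely, with the smoothing used in the code below (which matches scikit-learn's defaults), the weights are
$$\text{idf}(t) = \ln\!\left(\frac{1 + n}{1 + \text{df}(t)}\right) + 1, \qquad \text{tfidf}(t, d) = \text{tf}(t, d) \times \text{idf}(t),$$
where $n$ is the number of documents and $\text{df}(t)$ is the number of documents containing word $t$; each document's vector is then scaled to unit (l2) length. For example, a word appearing in only one of our three documents gets $\ln(4/2) + 1 \approx 1.693$, which is exactly the value you will see in the table below.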
# df is the document frequency (how many documents each word occurs in)
df = np.array(count.astype(bool).sum())
n_samples = len(data)
smooth_idf = True # smooth_idf is used to avoid divide by zero
df += int(smooth_idf)
n_samples += int(smooth_idf)
idf = np.log(n_samples / df) + 1 # length-12 vector; the log dampens the effect of words that occur often
The TF-IDF score is then simply the product of the two.
tfidf_before_normalization = count*idf
pd.DataFrame(tfidf_before_normalization, columns=cols, index=['Document 1','Document 2','Document 3'])
according | alot | data | eat | football | like | many | of | people | play | sing | to | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Document 1 | 0.000000 | 1.693147 | 0.000000 | 0.000000 | 1.693147 | 1.0 | 0.000000 | 1.693147 | 1.693147 | 1.693147 | 0.000000 | 1.0 |
Document 2 | 0.000000 | 0.000000 | 0.000000 | 1.693147 | 0.000000 | 1.0 | 1.287682 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
Document 3 | 1.693147 | 0.000000 | 1.693147 | 0.000000 | 0.000000 | 1.0 | 1.287682 | 0.000000 | 0.000000 | 0.000000 | 1.693147 | 2.0 |
Scikit-learn then normalizes these values using the l2 norm:
from sklearn.preprocessing import normalize
tf_idf = normalize(tfidf_before_normalization, norm='l2', axis=1)
pd.DataFrame(tf_idf, columns=cols, index=['Document 1','Document 2','Document 3'])
according | alot | data | eat | football | like | many | of | people | play | sing | to | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Document 1 | 0.000000 | 0.41894 | 0.000000 | 0.00000 | 0.41894 | 0.247433 | 0.000000 | 0.41894 | 0.41894 | 0.41894 | 0.000000 | 0.247433 |
Document 2 | 0.000000 | 0.00000 | 0.000000 | 0.66284 | 0.00000 | 0.391484 | 0.504107 | 0.00000 | 0.00000 | 0.00000 | 0.000000 | 0.391484 |
Document 3 | 0.433452 | 0.00000 | 0.433452 | 0.00000 | 0.00000 | 0.256004 | 0.329651 | 0.00000 | 0.00000 | 0.00000 | 0.433452 | 0.512007 |
That's it. Now use these features to understand some outcome variable of interest in a ML model.
Important: We did TF-IDF by hand in the code above, but you can incorporate TF-IDF easily by importing TfidfVectorizer via
from sklearn.feature_extraction.text import TfidfVectorizer
Then use TfidfVectorizer() instead of CountVectorizer in your code.
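For example, a short sketch that should reproduce the normalized table above (smooth_idf=True and norm='l2' are the defaults, matching the hand calculation):
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(data['text of Documents'])
print(pd.DataFrame(X_tfidf.toarray(),
                   columns=tfidf_vec.get_feature_names_out(),
                   index=['Document 1', 'Document 2', 'Document 3']))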
Practice
- Develop the newsgroup model again using the TF-IDF algorithm and a logistic model for the first 500 rows. Does this technique improve the model's effectiveness at classifying the articles?
import time # used below as a timer
from sklearn.feature_extraction.text import TfidfVectorizer
for nrow in [1000, 5000, 10000, 15000, 'All']:
start_time = time.time()
if nrow != 'All':
print('Loading sample with ' + str(nrow) + ' observations...')
df = pd.read_csv('./Data/newsgroups.csv', nrows=nrow)
else:
print('Loading full sample...')
df = pd.read_csv('./Data/newsgroups.csv')
# Data preparation
df['article'] = df['article'].astype('str')
df['article'] = df['article'].str.lower()
df['article'] = df['article'].str.replace('[^A-Za-z]', ' ', regex=True)
df['article'] = df['article'].apply(wt)
df['article'] = df['article'].apply(remove_stops)
df['article'] = df['article'].apply(stem_it)
df['article'] = df['article'].str.join(' ')
# Create the features and outcome variable
matrix = TfidfVectorizer(max_features=10000) # TF-IDF
X = matrix.fit_transform(df['article']).toarray()
y = df['category code']
# Train and test the model
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=0,test_size=0.2)
model_logit = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train)
print('The accuracy of the logit model is {0:4.2f} percent.'.format(model_logit.score(X_test, y_test)*100))
print("Time to run: " + str(round((time.time() - start_time)/60,2)) + ' minutes.' )
print('\n')
Loading sample with 1000 observations...
The accuracy of the logit model is 49.00 percent.
Time to run: 0.2 minutes.

Loading sample with 5000 observations...
The accuracy of the logit model is 71.20 percent.
Time to run: 0.96 minutes.

Loading sample with 10000 observations...
The accuracy of the logit model is 73.95 percent.
Time to run: 1.94 minutes.

Loading sample with 15000 observations...
The accuracy of the logit model is 71.67 percent.
Time to run: 2.9 minutes.

Loading full sample...
The accuracy of the logit model is 73.69 percent.
Time to run: 3.81 minutes.