Lecture 11: Natural Language Processing (NLP) [FINISHED]
In this lecture we're going to shift gears from dealing with numerical data to text data.
Working with text as data is known as Natural Language Processing (NLP). A common use of NLP is categorizing a set of text. Perhaps the most ubiquitous example is a spam filter. It reads the text of a message and determines if it is "spam" or "ham."
We'll employ one simple NLP algorithm, known as the bag-of-words algorithm, to classify SMS messages as spam. There are more sophisticated methods (see the end of the lecture), but this will give us the big idea. More sophisticated methods typically amount to tweaks to how we process the data or more complex classifier models.
We will use the Natural Language Toolkit (nltk) to help us process the text data. It comes with Anaconda, but if you need to install it:
pip install nltk
We'll also be using word clouds to visualize the most common words, so install this as well:
pip install wordcloud
The agenda for today's lecture is as follows:
1. Motivation: Spam Messages
2. The bag-of-words model
3. Developing a model to identify spam messages
4. Term Frequency Inverse Document Frequency (TF-IDF)
1. Motivation: Spam Messages
Spam emails/messages belong to the broad category of unsolicited messages received by a user. Spam occupies unwanted space and bandwidth, amplifies the threat of viruses, and in general exploits a user’s connection to social networks. Plus, they're annoying.
Our goal is to classify a message as spam (unwanted message) or ham (wanted message).
Languages are harder for algorithms to interpret and analyze than numeric data since:
- Sentences are not of fixed lengths, but most algorithms require a standard input vector size.
- Most algorithms cannot understand words as input; hence, each word needs to be represented by some numeric value.
So our method is:
- Pre-processing: clean up the text. This is the new stuff.
- Estimate a classifier model on the training data: let $\text{word}_{ji}$ be the number of times word $j$ occurs in message $i$.
- Test the model on the testing data.
- Use the estimated model to filter incoming messages.
2. The bag-of-words model
A bag-of-words model allows us to extract features from textual data. Since an algorithm doesn't understand language, we need to use a numeric representation for the words. This numeric representation can later be fed to any algorithm for further analysis. There are many ways to do this; bag-of-words is the simplest and provides a good foundation for working with text.
The model is called "bag-of-words" because the order of the words and the structure of the sentence are lost in this model. Only the occurrence or presence of a word matters. Hence, we can think of the model this way:
- we have a big empty bag
- we have a vocabulary (i.e., a text or corpus).
We pick up words one by one and put them in the bag, adding to the frequency of their occurrence. We then select the most common words as features for passing through our algorithm of choice. We can therefore view our approach as identifying documents which share similar kinds of words.
Here is an example:
import numpy as np
import pandas as pd
# Corpus is a fancy word for a collection (or a body) of text.
# Label marks a message as spam (1) or not spam (0).
corpus = [('Text 1', 'You have won a prize. Call today to claim.', 1),
('Text 2', 'It is your mother. Call me.', 0),
('Text 3', 'Are you around today? I need a favor.', 0)]
data = pd.DataFrame(corpus, columns=['Document Number','Text of Documents', 'Label'])
data.head()
Document Number | Text of Documents | Label | |
---|---|---|---|
0 | Text 1 | You have won a prize. Call today to claim. | 1 |
1 | Text 2 | It is your mother. Call me. | 0 |
2 | Text 3 | Are you around today? I need a favor. | 0 |
Even though Python is good with text, we will still need to convert our text into numeric data to get a classifier model to analyze it. Let's create a matrix with the word counts. Each row of the matrix is an observation (a message) and each column is a word. The cells in the matrix are the number of times that word is found in the message.
scikit gives us the CountVectorizer to do this for us.
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['Text of Documents'])
print(X.toarray())
[[0 0 1 1 0 1 0 0 0 0 0 1 1 1 1 1 0]
 [0 0 1 0 0 0 1 1 1 1 0 0 0 0 0 0 1]
 [1 1 0 0 1 0 0 0 0 0 1 0 0 1 0 1 0]]
Note that X is an array-like object. We are in the realm of scikit, which doesn't use DataFrames. Let's turn this back into a DataFrame, though, so we can see things clearly.
cols = vectorizer.get_feature_names_out()
count = pd.DataFrame(X.toarray(), columns=cols, index=data['Document Number'])
count.head()
are | around | call | claim | favor | have | is | it | me | mother | need | prize | to | today | won | you | your | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Document Number | |||||||||||||||||
Text 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 0 |
Text 2 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
Text 3 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
The X array contains our features and the data['Label'] column contains our outcome variable. We now have the data ready to estimate a classifier model (e.g., a logit regression).
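Just to confirm the shapes line up (one row per message, one column for each of the 17 unique words), a quick check:
print(X.shape)               # (3, 17): 3 messages, 17 unique words
print(data['Label'].values)  # the outcome for each message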
Note that this methodology of turning text into data is not limited to classification problems. For example, we could use this approach to connect stock performance with FOMC statements to predict how the Federal Reserve's statements on the economy influence the S&P 500, the Dow Jones, individual stocks, and government treasury prices. NLP is a broad topic and a lot of fun.
This dataset is too small to actually fit a model, so let's move on to something bigger.
3. Developing a model to identify spam messages
The dataset that we are using is an SMS spam collection dataset. It contains over 5,500 messages in English. There are two columns. The first column corresponds to the actual text message. The second column tells us whether the text is 'ham' or 'spam'.
dataset = pd.read_csv('./Data/spam.csv')
dataset.rename(columns = {'v1': 'labels', 'v2': 'message'}, inplace = True)
dataset['label'] = dataset['labels'].map({'ham': 0, 'spam': 1})
dataset
labels | message | label | |
---|---|---|---|
0 | ham | Go until jurong point, crazy.. Available only ... | 0 |
1 | ham | Ok lar... Joking wif u oni... | 0 |
2 | spam | Free entry in 2 a wkly comp to win FA Cup fina... | 1 |
3 | ham | U dun say so early hor... U c already then say... | 0 |
4 | ham | Nah I don't think he goes to usf, he lives aro... | 0 |
... | ... | ... | ... |
5567 | spam | This is the 2nd time we have tried 2 contact u... | 1 |
5568 | ham | Will Ì_ b going to esplanade fr home? | 0 |
5569 | ham | Pity, * was in mood for that. So...any other s... | 0 |
5570 | ham | The guy did some bitching but I acted like i'd... | 0 |
5571 | ham | Rofl. Its true to its name | 0 |
5572 rows × 3 columns
Data Pre-Processing
This is the part that makes NLP different from working with numeric data. We need to clean up the text and turn it into a feature matrix.
- 'I am helping raise $100 for UGA Athens' (original)
- 'i am helping raise $100 for uga athens' (homogenize the capitalization)
- 'i am helping raise for uga athens' (remove non-alphabetic characters)
- ['i', 'am', 'helping', 'raise', 'for', 'uga', 'athens'] (tokenize)
- ['helping', 'raise', 'uga', 'athens'] (remove stop words)
- ['help', 'raise', 'uga', 'athens'] (stem and lemmatize)
- 'help raise uga athens' (back to a single string)
Then create the feature matrix
help | raise | uga | athens |
---|---|---|---|
1 | 1 | 1 | 1 |
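Before walking through each step on the real data, here is a minimal sketch of the whole pipeline applied to that one example string (it assumes the nltk 'punkt' and 'stopwords' data have been downloaded, which we do below):
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem.porter import PorterStemmer

text = 'I am helping raise $100 for UGA Athens'
text = text.lower()                                                 # 1. homogenize the capitalization
text = re.sub('[^A-Za-z]', ' ', text)                               # 2. remove non-alphabetic characters
words = word_tokenize(text)                                         # 3. tokenize
words = [w for w in words if w not in stopwords.words('english')]   # 4. remove stop words
words = [PorterStemmer().stem(w) for w in words]                    # 5. stem
print(' '.join(words))                                              # back to a single string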
1. Homogenize the capitalization
We don't want to worry about 'Hello' not being equal to 'hello'. Let's make everything lowercase.
# 1. Homogenize capitalization
dataset['message'] = dataset['message'].str.lower()
dataset.tail()
labels | message | label | |
---|---|---|---|
5567 | spam | this is the 2nd time we have tried 2 contact u... | 1 |
5568 | ham | will ì_ b going to esplanade fr home? | 0 |
5569 | ham | pity, * was in mood for that. so...any other s... | 0 |
5570 | ham | the guy did some bitching but i acted like i'd... | 0 |
5571 | ham | rofl. its true to its name | 0 |
2. Remove non-alphabetic characters
Our algorithm will only use words to characterize messages. This is not strictly necessary (perhaps messages with numbers in them are more likely to be spam?), but it simplifies our approach today.
We will remove them using a regular expression. We have not covered 'regex' (there is never enough time!), but regex is a powerful string-search language that is a part of Python and most other programming languages. I have a notebook on regex here which you can work through if you are interested.
The code to remove the non-alphabetic characters is
dataset['message'].str.replace('[^A-Za-z]', ' ', regex=True)
The regex part is the '[^A-Za-z]'. It says: "find everything that is not the letters A through Z or a through z." We replace the non-alphabetic stuff with a space.
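For instance, a quick illustration on a made-up string, using Python's built-in re module (which does the same thing as the pandas .str.replace call):
import re
print(re.sub('[^A-Za-z]', ' ', 'call 555-1234 now!'))   # digits and punctuation turn into spaces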
# 2. Remove non-alphabetic characters
dataset['message'] = dataset['message'].str.replace('[^A-Za-z]', ' ', regex=True)
dataset.tail()
labels | message | label | |
---|---|---|---|
5567 | spam | this is the nd time we have tried contact u... | 1 |
5568 | ham | will b going to esplanade fr home | 0 |
5569 | ham | pity was in mood for that so any other s... | 0 |
5570 | ham | the guy did some bitching but i acted like i d... | 0 |
5571 | ham | rofl its true to its name | 0 |
3. Tokenize the strings
Break the strings up into lists of words, which are easier to process. This is very similar to using .str.split(' '). Here we use the tokenizer method from nltk. It is a bit more sophisticated than a simple split.
from nltk.tokenize import word_tokenize as wt
We also need to download nltk's 'punkt' tokenizer models.
import nltk
# 3. Tokenize the strings.
# Download the 'punkt' tokenizer models.
nltk.download('punkt')
from nltk.tokenize import word_tokenize as wt
dataset['message'] = dataset['message'].apply(wt)
dataset.tail()
[nltk_data] Downloading package punkt to [nltk_data] C:\Users\jt83241\AppData\Roaming\nltk_data... [nltk_data] Package punkt is already up-to-date!
labels | message | label | |
---|---|---|---|
5567 | spam | [this, is, the, nd, time, we, have, tried, con... | 1 |
5568 | ham | [will, b, going, to, esplanade, fr, home] | 0 |
5569 | ham | [pity, was, in, mood, for, that, so, any, othe... | 0 |
5570 | ham | [the, guy, did, some, bitching, but, i, acted,... | 0 |
5571 | ham | [rofl, its, true, to, its, name] | 0 |
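To see what the nltk tokenizer adds over a plain split, here is a small illustrative comparison (the string is made up):
from nltk.tokenize import word_tokenize

print("don't worry, call me.".split(' '))       # a plain split leaves punctuation glued to the words
print(word_tokenize("don't worry, call me."))   # the tokenizer splits off punctuation and contractions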
4. Removing stop words
Now we eliminate stop words: words in the text that add no specific meaning. They often involve prepositions, helping verbs, and articles (e.g., in, the, an, is). Since these add no value to our model, let's get rid of them.
Fortunately, linguists have already compiled lists of stop words, so we can readily identify and exclude them:
from nltk.corpus import stopwords
stop_wrds = stopwords.words('english')
stop_wrds is a list of English-language stop words.
We need to loop through the lists and check for stop words. I will write a small function that does the looping and then apply it to the DataFrame's column using .apply().
Again, we need to download the stop words first.
# 4. Remove stop words.
nltk.download('stopwords')
from nltk.corpus import stopwords
def remove_stops(x):
stop_wrds = stopwords.words('english')
temp = []
for word in x:
if word not in stop_wrds:
temp.append(word)
return temp
dataset['message'] = dataset['message'].apply(remove_stops)
dataset.tail()
[nltk_data] Downloading package stopwords to [nltk_data] C:\Users\jt83241\AppData\Roaming\nltk_data... [nltk_data] Package stopwords is already up-to-date!
labels | message | label | |
---|---|---|---|
5567 | spam | [nd, time, tried, contact, u, u, pound, prize,... | 1 |
5568 | ham | [b, going, esplanade, fr, home] | 0 |
5569 | ham | [pity, mood, suggestions] | 0 |
5570 | ham | [guy, bitching, acted, like, interested, buyin... | 0 |
5571 | ham | [rofl, true, name] | 0 |
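In case you are curious about what just got removed, here is a quick peek at the list (the data was downloaded in the cell above):
from nltk.corpus import stopwords

stop_wrds = stopwords.words('english')
print(len(stop_wrds))    # around 180-200 entries, depending on the nltk version
print(stop_wrds[:10])    # starts with very common words like 'i', 'me', 'my', ...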
5. Stemming and Lemmatization
Words like act, actor, and acting are all versions of the same root word (act). Stemming and lemmatization are techniques used to truncate words in order to get the stem or base word. The difference between the two is that after stemming, the stem may not be an actual word, whereas lemmatization always produces a real word, which makes the corpora easier for humans to interpret.
For example, studies could be stemmed as studi (not a word), but would be lemmatized as study (an existing word). To be honest, this feels like a rabbit hole, so I'm treating this stuff as a black box and trusting that the linguists are doing a good job.
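A quick way to see the difference, using the studies example (a small sketch; the lemmatizer needs nltk's WordNet data, an extra download we don't otherwise use today):
import nltk
nltk.download('wordnet')   # the lemmatizer needs the WordNet data
from nltk.stem.porter import PorterStemmer
from nltk.stem import WordNetLemmatizer

print(PorterStemmer().stem('studies'))            # 'studi' (not a word)
print(WordNetLemmatizer().lemmatize('studies'))   # 'study' (a real word)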
Let's stem these words.
# 5. Stemming and lemmatization
from nltk.stem.porter import PorterStemmer
def stem_it(x):
stemmer = PorterStemmer()
return [stemmer.stem(w) for w in x]
dataset['message'] = dataset['message'].apply(stem_it)
dataset.tail()
labels | message | label | |
---|---|---|---|
5567 | spam | [nd, time, tri, contact, u, u, pound, prize, c... | 1 |
5568 | ham | [b, go, esplanad, fr, home] | 0 |
5569 | ham | [piti, mood, suggest] | 0 |
5570 | ham | [guy, bitch, act, like, interest, buy, someth,... | 0 |
5571 | ham | [rofl, true, name] | 0 |
That seemed like a lot of work, but it always does when we are first learning something. Putting all the code together, the processing is simply:
dataset['message'] = dataset['message'].str.lower()
dataset['message'] = dataset['message'].str.replace('[^A-Za-z]', ' ', regex=True)
dataset['message'] = dataset['message'].apply(wt)
dataset['message'] = dataset['message'].apply(remove_stops)
dataset['message'] = dataset['message'].apply(stem_it)
You could even wrap all that up in a function, too...
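For example, a minimal sketch of such a wrapper (the name preprocess is just an illustration; it reuses the wt, remove_stops, and stem_it functions defined above):
def preprocess(messages):
    # Run the whole cleaning pipeline on a Series of raw text messages.
    messages = messages.str.lower()
    messages = messages.str.replace('[^A-Za-z]', ' ', regex=True)
    messages = messages.apply(wt)
    messages = messages.apply(remove_stops)
    return messages.apply(stem_it)

# dataset['message'] = preprocess(dataset['message'])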
Create the feature matrix
We are done pre-processing.
- Turn the lists of words back into strings.
- Create the feature matrix using CountVectorizer.
# The matrix of word counts. I am limiting the feature matrix to 1000 columns.
dataset['message'] = dataset['message'].str.join(' ')
X = CountVectorizer(max_features=1000).fit_transform(dataset['message'])
# The outcome data.
y = dataset['label']
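A quick sanity check on the dimensions (a sketch: with max_features=1000 we expect one row per message and at most 1,000 word columns):
print(X.shape)   # should be (5572, 1000): one row per message, up to 1000 word columns
print(y.shape)   # (5572,)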
Visualizing Keywords
We've seen that data visualization is a handy way of better understanding variation in the data, and that's still true here, though the visualization tools will be different. For example, we can make a word cloud, which displays the most common words with the size of each word proportional to the frequency of its occurrence. To do so, we need to add a new package:
pip install wordcloud
from wordcloud import WordCloud
import matplotlib.pyplot as plt
import seaborn as sns
spam_words = ' '.join(list(dataset[dataset['label'] == 1]['message']))
spam_wc = WordCloud(width = 600,height = 512).generate(spam_words)
plt.figure(figsize = (12, 8), facecolor = 'k')
plt.imshow(spam_wc)
plt.axis('off')
plt.tight_layout(pad = 0)
plt.show()
Estimate the logit model
Nothing new here.
- Split data into test/train
- Estimate on training data
- Test on, well, testing data
# Create the train-test split.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=0)
# Estimate the model
from sklearn.linear_model import LogisticRegression
model_logit = LogisticRegression(random_state=0).fit(X_train, y_train)
accuracy = model_logit.score(X_test, y_test)
print('The accuracy of the logit model is {0:4.2f} percent.'.format(accuracy*100))
The accuracy of the logit model is 97.99 percent.
Not too shabby!
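To actually use this as a spam filter (step 4 from the outline at the top), you would keep the fitted vectorizer around, push any new message through the same pre-processing, and call .predict(). A minimal sketch, assuming we refit while keeping a reference to the vectorizer (the example message is made up):
# Refit, keeping the vectorizer so new messages get mapped to the same columns.
vec = CountVectorizer(max_features=1000)
X_all = vec.fit_transform(dataset['message'])
filter_model = LogisticRegression(random_state=0).fit(X_all, y)

new_msg = 'free prize call today claim'                 # a made-up message, already pre-processed
print(filter_model.predict(vec.transform([new_msg])))   # 1 = spam, 0 = ham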
Practice
We're going to practice using the "20 Newsgroup" data set which is
a collection of approximately 20,000 newsgroup documents, partitioned (nearly) evenly across 20 different newsgroups. To the best of my knowledge, it was originally collected by Ken Lang, probably for his "Newsweeder: Learning to filter netnews" paper, though he does not explicitly mention this collection. The 20 newsgroups collection has become a popular data set for experiments in text applications of machine learning techniques, such as text classification and text clustering.
This data set is located within scikit learn but I've created a csv file you can load.
- Download, unzip, and load the file 'newsgroups.csv'. Only import the first 500 rows. Try the nrows option of .read_csv().
  - article is the message.
  - category code is the newsgroup category code.
  - category is the newsgroup category name.
Our goal: Create a classifier that predicts the category code of an article.
df = pd.read_csv('./Data/newsgroups.csv', nrows=500)
df.head(2)
article | category code | category | |
---|---|---|---|
0 | I was wondering if anyone out there could enli... | 7 | rec.autos |
1 | A fair number of brave souls who upgraded thei... | 4 | comp.sys.mac.hardware |
- Make sure "article" is a string. Use
.astype(str)
.
df['article'] = df['article'].astype('str')
- How many articles are there in each category? Looks like it's time for .groupby().
df.groupby('category')['article'].count()
category
alt.atheism                 26
comp.graphics               23
comp.os.ms-windows.misc     34
comp.sys.ibm.pc.hardware    16
comp.sys.mac.hardware       35
comp.windows.x              22
misc.forsale                27
rec.autos                   21
rec.motorcycles             29
rec.sport.baseball          30
rec.sport.hockey            25
sci.crypt                   23
sci.electronics             30
sci.med                     29
sci.space                   27
soc.religion.christian      33
talk.politics.guns          18
talk.politics.mideast       24
talk.politics.misc          18
talk.religion.misc          10
Name: article, dtype: int64
- Process the text data. All the code to do this is gathered in the cell above the Create feature matrix heading.
df['article'] = df['article'].str.lower()
df['article'] = df['article'].str.replace('[^A-Za-z]', ' ', regex=True)
df['article'] = df['article'].apply(wt)
df['article'] = df['article'].apply(remove_stops)
df['article'] = df['article'].apply(stem_it)
df.head()
article | category code | category | |
---|---|---|---|
0 | [wonder, anyon, could, enlighten, car, saw, da... | 7 | rec.autos |
1 | [fair, number, brave, soul, upgrad, si, clock,... | 4 | comp.sys.mac.hardware |
2 | [well, folk, mac, plu, final, gave, ghost, wee... | 4 | comp.sys.mac.hardware |
3 | [weitek, address, phone, number, like, get, in... | 1 | comp.graphics |
4 | [articl, c, owcb, n, p, world, std, com, tomba... | 14 | sci.space |
- Turn the lists of words in article into strings using .str.join(' ').
df['article'] = df['article'].str.join(' ')
- Create the feature matrix.
matrix = CountVectorizer(max_features=10000)
X = matrix.fit_transform(df['article']).toarray()
- Create the outcome variable. (the Series that contains the category codes)
y = df['category code']
- Create your testing and training datasets.
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=0,test_size=0.2)
- Estimate a logit model.
model_logit = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train)
- Check the accuracy of your model by computing the .score().
print('The accuracy of the logit model is {0:4.2f} percent.'.format(model_logit.score(X_test, y_test)*100))
The accuracy of the logit model is 38.00 percent.
- Go back to step 1. and increase the number of rows to 1000. Rerun your code. Does the accuracy improve?
df = pd.read_csv('./Data/newsgroups.csv', nrows=1000)
# Data preparation
df['article'] = df['article'].astype('str')
df['article'] = df['article'].str.lower()
df['article'] = df['article'].str.replace('[^A-Za-z]', ' ', regex=True)
df['article'] = df['article'].apply(wt)
df['article'] = df['article'].apply(remove_stops)
df['article'] = df['article'].apply(stem_it)
df['article'] = df['article'].str.join(' ')
# Create the features and outcome variable
matrix = CountVectorizer(max_features=10000)
X = matrix.fit_transform(df['article']).toarray()
y = df['category code']
# Train and test the model
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=0,test_size=0.2)
model_logit = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train)
print('The accuracy of the logit model is {0:4.2f} percent.'.format(model_logit.score(X_test, y_test)*100))
The accuracy of the logit model is 51.50 percent.
- Try a Random Forest model. Set n_estimators=100 and random_state=0.
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics
clf = RandomForestClassifier(n_estimators=100,random_state=0)
# Train the model using the training data, then predict on the test data.
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print('The accuracy of the random forest model is {0:4.2f} percent.'.format(metrics.accuracy_score(y_test, y_pred)*100))
The accuracy of the random forest model is 49.00 percent.
- There are about 20,000 rows in the data file. You can keep adding data, but things are going to get slow. This is why big data projects need special techniques and big computers. That being said, the "big" ideas are the same -- we're just operating at a bigger scale.
import time # used below as a timer
for nrow in [1000, 5000, 10000, 15000, 'All']:
start_time = time.time()
if nrow != 'All':
print('Loading sample with ' + str(nrow) + ' observations...')
df = pd.read_csv('./Data/newsgroups.csv', nrows=nrow)
else:
print('Loading full sample...')
df = pd.read_csv('./Data/newsgroups.csv')
# Data preparation
df['article'] = df['article'].astype('str')
df['article'] = df['article'].str.lower()
df['article'] = df['article'].str.replace('[^A-Za-z]', ' ', regex=True)
df['article'] = df['article'].apply(wt)
df['article'] = df['article'].apply(remove_stops)
df['article'] = df['article'].apply(stem_it)
df['article'] = df['article'].str.join(' ')
# Create the features and outcome variable
matrix = CountVectorizer(max_features=10000)
X = matrix.fit_transform(df['article']).toarray()
y = df['category code']
# Train and test the model
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=0,test_size=0.2)
model_logit = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train)
print('The accuracy of the logit model is {0:4.2f} percent.'.format(model_logit.score(X_test, y_test)*100))
print("Time to run: " + str(round((time.time() - start_time)/60,2)) + ' minutes.' )
print('\n')
Loading sample with 1000 observations...
The accuracy of the logit model is 51.50 percent.
Time to run: 0.23 minutes.

Loading sample with 5000 observations...
The accuracy of the logit model is 62.50 percent.
Time to run: 1.15 minutes.

Loading sample with 10000 observations...
The accuracy of the logit model is 66.00 percent.
Time to run: 2.38 minutes.

Loading sample with 15000 observations...
The accuracy of the logit model is 66.40 percent.
Time to run: 3.5 minutes.

Loading full sample...
The accuracy of the logit model is 68.44 percent.
Time to run: 5.05 minutes.
4. Term Frequency Inverse Document Frequency (TF-IDF)
Let's try adding a little more intelligence to the word-count methodology by replacing bag-of-words with TF-IDF (Term Frequency Inverse Document Frequency) to account for the importance of words. This is a very common way to transform text into a meaningful numeric representation that can then be used to fit a machine learning model. The statistic is designed to capture how important a word is to a document while also taking into account its relation to the other documents in the same corpus. It does this by looking at how many times a word appears in a document while also paying attention to how many times the same word appears in the other documents of the corpus.
Let's dig into the details a bit.
TF-IDF consists of two parts:
- Term Frequency (TF): how many times a word appears in a document.
- Inverse Document Frequency (IDF): a measure of how rare the word is across the set of documents.
Let's demonstrate the idea with the example from before:
corpus = [('Document 1', 'Alot of people like to play football'),
('Document 2', 'many like to eat'),
('Document 3', 'According to data, many like to sing')]
data = pd.DataFrame(corpus,columns=['Document Number','text of Documents'])
data.head()
Document Number | text of Documents | |
---|---|---|
0 | Document 1 | Alot of people like to play football |
1 | Document 2 | many like to eat |
2 | Document 3 | According to data, many like to sing |
To find "term frequency" count the number of times each unique word occurs in the text. This is where we stopped with Bag-of-Words.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(data['text of Documents'])
cols = vectorizer.get_feature_names_out()
count = pd.DataFrame(X.toarray(), columns=cols, index=['Document 1','Document 2','Document 3'])
count.head()
according | alot | data | eat | football | like | many | of | people | play | sing | to | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Document 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 | 1 |
Document 2 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 |
Document 3 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 2 |
Term frequency tends to give more weight to common words (e.g., "is", "if", "the"), so we need something to reduce this effect, since common words won't be useful for identifying a pattern in the documents. For example, if we had a document on sports and another on medicine, words like "football" and "hypertension" would be rare and would separate (identify) the documents. Zipf's Law tells us that the frequency of any word is inversely proportional to its rank in a frequency table: the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc.
Accordingly, IDF applies more weight to words which occur rarely.
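Concretely, with the smoothing used in the code below (which matches scikit-learn's defaults), the weights are
$$\text{idf}(t) = \ln\!\left(\frac{1 + n}{1 + \text{df}(t)}\right) + 1, \qquad \text{tfidf}(t, d) = \text{tf}(t, d) \times \text{idf}(t),$$
where $n$ is the number of documents and $\text{df}(t)$ is the number of documents containing word $t$; each document's vector is then scaled to unit (l2) length. For example, a word appearing in only one of our three documents gets $\ln(4/2) + 1 \approx 1.693$, which is exactly the value you will see in the table below.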
# df is the document frequency (how many documents each word occurs in)
df = np.array(count.astype(bool).sum())
n_samples = len(data)
smooth_idf = True # smooth_idf is used to avoid divide by zero
df += int(smooth_idf)
n_samples += int(smooth_idf)
idf = np.log(n_samples / df) + 1 # length-12 vector; the log dampens the effect of words that occur often
The TF-IDF score is then simply the product of the two.
tfidf_before_normalization = count*idf
pd.DataFrame(tfidf_before_normalization, columns=cols, index=['Document 1','Document 2','Document 3'])
according | alot | data | eat | football | like | many | of | people | play | sing | to | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Document 1 | 0.000000 | 1.693147 | 0.000000 | 0.000000 | 1.693147 | 1.0 | 0.000000 | 1.693147 | 1.693147 | 1.693147 | 0.000000 | 1.0 |
Document 2 | 0.000000 | 0.000000 | 0.000000 | 1.693147 | 0.000000 | 1.0 | 1.287682 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.0 |
Document 3 | 1.693147 | 0.000000 | 1.693147 | 0.000000 | 0.000000 | 1.0 | 1.287682 | 0.000000 | 0.000000 | 0.000000 | 1.693147 | 2.0 |
Scikit-learn then normalizes these values using the l2 norm:
from sklearn.preprocessing import normalize
tf_idf = normalize(tfidf_before_normalization, norm='l2', axis=1)
pd.DataFrame(tf_idf, columns=cols, index=['Document 1','Document 2','Document 3'])
according | alot | data | eat | football | like | many | of | people | play | sing | to | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
Document 1 | 0.000000 | 0.41894 | 0.000000 | 0.00000 | 0.41894 | 0.247433 | 0.000000 | 0.41894 | 0.41894 | 0.41894 | 0.000000 | 0.247433 |
Document 2 | 0.000000 | 0.00000 | 0.000000 | 0.66284 | 0.00000 | 0.391484 | 0.504107 | 0.00000 | 0.00000 | 0.00000 | 0.000000 | 0.391484 |
Document 3 | 0.433452 | 0.00000 | 0.433452 | 0.00000 | 0.00000 | 0.256004 | 0.329651 | 0.00000 | 0.00000 | 0.00000 | 0.433452 | 0.512007 |
That's it. Now use these features to understand some outcome variable of interest in a ML model.
Important: We did TF-IDF by hand in the code above, but you can incorporate TF-IDF easily by importing TfidfVectorizer via
from sklearn.feature_extraction.text import TfidfVectorizer
Then use TfidfVectorizer() instead of CountVectorizer in your code.
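For example, a short sketch that should reproduce the normalized table above (smooth_idf=True and norm='l2' are the defaults, matching the hand calculation):
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(data['text of Documents'])
print(pd.DataFrame(X_tfidf.toarray(),
                   columns=tfidf_vec.get_feature_names_out(),
                   index=['Document 1', 'Document 2', 'Document 3']))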
Practice
- Develop the newsgroup model again using the TF-IDF algorithm and a logistic model for the first 500 rows. Does this technique improve the model's effectiveness at classifying the articles?
import time # used below as a timer
from sklearn.feature_extraction.text import TfidfVectorizer
for nrow in [1000, 5000, 10000, 15000, 'All']:
start_time = time.time()
if nrow != 'All':
print('Loading sample with ' + str(nrow) + ' observations...')
df = pd.read_csv('./Data/newsgroups.csv', nrows=nrow)
else:
print('Loading full sample...')
df = pd.read_csv('./Data/newsgroups.csv')
# Data preparation
df['article'] = df['article'].astype('str')
df['article'] = df['article'].str.lower()
df['article'] = df['article'].str.replace('[^A-Za-z]', ' ', regex=True)
df['article'] = df['article'].apply(wt)
df['article'] = df['article'].apply(remove_stops)
df['article'] = df['article'].apply(stem_it)
df['article'] = df['article'].str.join(' ')
# Create the features and outcome variable
matrix = TfidfVectorizer(max_features=10000) # TF-IDF
X = matrix.fit_transform(df['article']).toarray()
y = df['category code']
# Train and test the model
X_train, X_test, y_train, y_test = train_test_split(X, y,random_state=0,test_size=0.2)
model_logit = LogisticRegression(random_state=0, max_iter=1000).fit(X_train, y_train)
print('The accuracy of the logit model is {0:4.2f} percent.'.format(model_logit.score(X_test, y_test)*100))
print("Time to run: " + str(round((time.time() - start_time)/60,2)) + ' minutes.' )
print('\n')
Loading sample with 1000 observations...
The accuracy of the logit model is 49.00 percent.
Time to run: 0.2 minutes.

Loading sample with 5000 observations...
The accuracy of the logit model is 71.20 percent.
Time to run: 0.96 minutes.

Loading sample with 10000 observations...
The accuracy of the logit model is 73.95 percent.
Time to run: 1.94 minutes.

Loading sample with 15000 observations...
The accuracy of the logit model is 71.67 percent.
Time to run: 2.9 minutes.

Loading full sample...
The accuracy of the logit model is 73.69 percent.
Time to run: 3.81 minutes.