Lecture 1: The Bias-Variance Tradeoff [SUGGESTED SOLUTIONS]
This notebook begins our foray into machine learning. Our goals are modest:
- learn a bit about how machine learning is similar to, and different from, econometrics.
- introduce set-validation -- an important part of the machine learning toolkit.
- introduce the scikit-learn package which is chock full of 'machine learning' tools.
The agenda for today's lecture is as follows:
1. The Bias-Variance Trade-off
Our objective in ML is to fit a model
$$ y = f(X) + \epsilon. $$
such that when we get new data ($X$) we can make good predictions $(\hat y)$. Doing this requires specification of $f()$.
Any such choice of $f()$ will generate prediction errors, of which there are three types:
- bias
- variance
- irreducible error
Irreducible error is also known as "noise" and reflects our belief that there's generally randomness in our world and we will never be able to predict it. "Noise" can't be reduced by our choice of $f()$, or equivalently our choice of algorithm.
The other two types of errors, however, can be reduced because they stem from our choice of $f()$ and algorithm.
Error from Bias
"Bias" refers to the difference between the model's expected predictions and the true values.
That might sound strange because shouldn't you "expect" your predictions to be close to the true values? Well, it's not always that easy because an algorithm may be too rigid to learn complex signals from the dataset.
Let's consider an example:
# Initialize packages
import numpy as np
import numpy.random as npr
import matplotlib.pyplot as plt
import pandas as pd
# Set values
N = 100
# Generate N evenly spaced X values between 0 and 10
# X = npr.uniform(1,size=N)
X = np.arange(0, 10, 0.1)
# Generate N random draws from a normal distribution
eps = npr.normal(0,1,size=N)
# Determine Y
Y = 2*np.sin(X) + X*.5 + eps
# Create Figure
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(X,Y, alpha = 0.4, color = 'red')
ax.set_xlabel('Independent Variable (X)')
ax.set_ylabel('Dependent Variable (Y)')
ax.set_title('Nonlinear Raw Data',size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
The data clearly has a non-linear pattern. Imagine we attempt to fit a linear regression to these data:
import statsmodels.formula.api as smf # provides a way to specify models directly from formulas
df = pd.DataFrame({'X': X, 'Y': Y}) # Convert to a DataFrame with named columns
res = smf.ols('Y ~ X', data=df).fit() # OLS
Yhat = res.predict()
bhat = res.params['X']
se = res.bse['X']
print(res.summary()) # print the results
                            OLS Regression Results
==============================================================================
Dep. Variable:                      Y   R-squared:                       0.379
Model:                            OLS   Adj. R-squared:                  0.373
Method:                 Least Squares   F-statistic:                     59.92
Date:                Wed, 12 Oct 2022   Prob (F-statistic):           9.03e-12
Time:                        08:18:05   Log-Likelihood:                -199.12
No. Observations:                 100   AIC:                             402.2
Df Residuals:                      98   BIC:                             407.5
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.4564      0.355      1.284      0.202      -0.249       1.162
X              0.4801      0.062      7.741      0.000       0.357       0.603
==============================================================================
Omnibus:                        4.574   Durbin-Watson:                   0.746
Prob(Omnibus):                  0.102   Jarque-Bera (JB):                3.167
Skew:                          -0.278   Prob(JB):                        0.205
Kurtosis:                       2.328   Cond. No.                         11.6
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Now plot the predicted values on top of the data:
# Create Figure
fig, ax = plt.subplots(figsize=(15,10))
ax.scatter(X,Y, alpha = 0.4, color = 'red')
ax.plot(X,Yhat, color = 'blue',linestyle='--')
ax.set_xlabel('Independent Variable (X)',size=14)
ax.set_ylabel('Dependent Variable (Y)',size=14)
ax.set_title('Nonlinear Data with Linear Predictive Model',size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.annotate(fr'$\hat\beta=${bhat:.2f} ({se:.4f})', (4,4.0),size=16, color='blue')
plt.show()
The picture makes it pretty clear that no amount of data will make a straight line fit these data well! We call this under-fitting and call the resulting error "bias."
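To see how a more flexible specification reduces this bias, here is a minimal sketch (my own illustration, assuming the X, Y, and Yhat objects defined above) that fits a cubic polynomial with numpy's polyfit and overlays it on the linear fit:
# A more flexible model: a cubic polynomial fit via np.polyfit (illustrative sketch)
coeffs = np.polyfit(X, Y, deg=3)      # fit y ~ b0 + b1*x + b2*x^2 + b3*x^3
Yhat_cubic = np.polyval(coeffs, X)    # predicted values from the cubic

fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(X, Y, alpha=0.4, color='red')
ax.plot(X, Yhat, color='blue', linestyle='--', label='Linear fit (high bias)')
ax.plot(X, Yhat_cubic, color='green', label='Cubic fit (lower bias)')
ax.legend(frameon=False)
plt.show()
The cubic can bend with the data, so its predictions sit much closer to the underlying pattern than the straight line does.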
Error from Variance
We call the other form of error from an algorithm "variance." This kind of error stems from our specified model's sensitivity to the specific set of training data. At this point I haven't actually introduced what training data are, but for now just consider it the data we use to fit the model -- for example, the red dots in the figure above. When a model exhibits "high variance" we mean that it changes dramatically given small changes in the training data.
Consider an extreme (and perhaps silly) example where we perfectly fit a very non-linear model to our data (blue, dashed line). This model takes values of the data (X) and maps them perfectly to the outcome variable (Y). Suppose now we get new data (black dots).
# New data
Y2 = 2*np.sin(X) -2 + eps
# Create Figure
fig, ax = plt.subplots(figsize=(15,10))
ax.scatter(X,Y, alpha = 0.4, color = 'red',label='Training Data')
ax.plot(X,Y, color = 'blue',linestyle='--',label='Nonlinear Model Fit')
ax.scatter(X,Y2, color = 'black',label='New Data')
ax.set_xlabel('Independent Variable (X)',size=14)
ax.set_ylabel('Dependent Variable (Y)',size=14)
ax.set_title('Nonlinear Data with Nonlinear Predictive Model',size=20)
ax.legend(frameon=False)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
How does our nonlinear model do when presented with new data?
It's terrible! For every x-value our model predicts the y-value of the red dot, but the actual outcome is the black dot. If the X data were market indicators and the Y data were stock prices, making trades based on the blue-dashed nonlinear model would land you in serious debt!
Our unconstrained nonlinear model has basically memorized the training data, including the noise. This is known as over-fitting.
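A quick way to put numbers on this (a sketch of my own, assuming the Y and Y2 arrays from above): the memorizing model has a training error of exactly zero, yet a large error on the new data -- the signature of over-fitting.
# The 'memorized' model predicts the training Y for every X.
mse_train = np.mean((Y - Y)**2)    # its error on the training data is exactly zero
mse_new   = np.mean((Y2 - Y)**2)   # its error on the new data is large
print(f'Memorized model: training MSE = {mse_train:.2f}, new-data MSE = {mse_new:.2f}')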
Bias versus Variance
Let's reframe the problem slightly. Think of there existing one true data generating process (DGP). We can get access to examples of that DGP. For each example, we can estimate (econometrics) or fit (machine learning) a model.
- Imagine you've collected 5 different training sets for the same problem.
- Now imagine using one algorithm (OLS above) to train 5 models -- one for each of your training sets.
- Bias vs. Variance refers to the accuracy vs. consistency of the models.
We can diagnose them with the classic dartboard picture: bias is how far the darts land, on average, from the bullseye; variance is how scattered the darts are around their own average.
What do we learn?
Low variance (high bias) algorithms tend to be less complex, with simple or rigid underlying structure.
- They train models that are consistent, but inaccurate on average.
- Examples: linear or parametric algorithms such as OLS regression and naive Bayes.
On the other hand, low bias (high variance) algorithms tend to be more complex and have a flexible underlying structure.
- They train models that are accurate on average, but inconsistent.
- These include non-linear or non-parametric algorithms such as decision trees and nearest neighbors.
The Big Idea: An algorithm cannot be simultaneously more complex and less complex, so there is a tradeoff between bias and variance. Our task will be to incorporate tools and methods that strike a good balance.
A More Mathematical Approach
The above is a bit hand-wavy so let's cover our bases (asses?) and do a proper treatment using math. Any fitting algorithm (e.g., OLS) will choose a model to best predict the training data. A simple and often used metric is mean-squared error which we can split into our three error terms using a little bit of arithmetic.
We fit a model to go through the data, so the expected squared error at a point $x$ is
$$ Err(x) = E\big[\big(y-\hat f(x)\big)^2\big] \equiv \frac{1}{N}\sum_{i=1}^N \big(y_i-\hat f(x_i)\big)^2 $$
where $\hat f$ reflects our fitted model. We can further decompose these errors into the following:
$$ Err(x) = \underbrace{\bigg(E\big[\hat f(x)\big]-f(x)\bigg)^2}_{\text{Bias}^2} + \underbrace{E\bigg[\big(\hat f(x)-E\big[\hat f(x)\big]\big)^2\bigg]}_{\text{Variance}} + \sigma_\epsilon^2 $$
The $\sigma_\epsilon^2$ term reflects some fundamental, irreducible error (what I called "noise" above). The "hats" above a variable indicate a fitted value and $N$ is the number of observations in the training data. The $E[\cdot]$ terms correspond to the average, where $E(X)=\frac{1}{N}\sum_n x_n$; I use the $E$ notation only to save space (and minimize Greek letters). An examination of the three terms demonstrates the connections and heuristics we described above:
The bias$^2$ term tells us, on average, how close we get to the true model. Are we systematically over- or under-predicting $y$?
The variance term tells us how much our estimates vary as we use different training datasets to estimate the model.
To build a good predictive model, you will need to find a balance between bias and variance that minimizes the total error (i.e., mean-squared error). In this course we'll develop machine learning processes to find that optimal balance.
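To make the decomposition concrete, here is a small simulation sketch (my own, not required for the lecture): draw many training sets from the DGP we simulated in Section 1, fit the linear model to each one, and compute the bias-squared, variance, and (known) noise terms.
# Simulate the bias-variance decomposition for the linear model and the sine DGP above.
npr.seed(0)
S = 500                                # number of simulated training sets
f_true = 2*np.sin(X) + X*0.5           # the true f(x) from the DGP above
preds = np.zeros((S, len(X)))

for s in range(S):
    Y_s = f_true + npr.normal(0, 1, size=len(X))   # a fresh training set
    b1, b0 = np.polyfit(X, Y_s, 1)                 # fit the linear model (slope, intercept)
    preds[s, :] = b0 + b1*X                        # store its predictions

bias2 = np.mean((preds.mean(axis=0) - f_true)**2)  # average squared bias across x
variance = np.mean(preds.var(axis=0))              # average variance of the fitted values
print(f'Bias^2 = {bias2:.2f}, Variance = {variance:.3f}, Noise = 1.00 (sigma_eps^2 by construction)')
The bias-squared term is large (a straight line cannot bend with the sine wave) while the variance term is tiny -- exactly the "low variance, high bias" pattern we attributed to linear algorithms above.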
Summary
If our model is too simple and has very few parameters, it will suffer from high bias and low variance. On the other hand, if our model has a large number of parameters, it will have high variance and low bias. We need to find the right balance without overfitting or underfitting the data.
Okay, so you've seen some math and hopefully absorbed some intuition.
At the very least, you need to understand what the "bias-variance trade-off" is.
Seriously. These are words that any competent ML practitioner should know -- they're fundamental. This will almost certainly come up in an interview should you list ML as one of your skills, either on your resume or in conversation. And if you do and the interviewer does not ask you about it, that's a red flag: they have no idea what ML really is.
2. What to Do? The Set Validation Approach
Recall the heuristic I used for the dart board above: Imagine one true DGP and we have multiple shots at fitting a model to it. Well, we can use this idea to evaluate how good (or not good) our model is.
Our solution? Randomly break-up our data to identify over-fitting. This would be like our first simulated example from above.
- Break up the original dataset into two subsets. Call one the training set and the other the testing set.
- Fit model on the training data.
- Using the X testing data and the estimated model $\hat f(\;)$ from step 2, generate predictions for y: $$\hat{y} = \hat{f}(X_{test}).$$
- We then compare the predicted ($\hat{y}$) values with the actual ($y$) values in the testing data. We measure how well our model fits the data by looking at some kind of summary statistic; e.g., in continuous models this would be the mean squared error (MSE) of the regression or, alternatively, the $R^2$, which normalizes the MSE by the total squared error.
Comments
- The Big Idea is that we're imagining we only get to fit the model to the training data set and then later on get the testing data set. Our underlying assumption is that the same DGP generated both.
- Note the subtle but very deep point: In inference, the $R^2$ is nearly a meaningless statistic but in prediction it's probably the most important statistic. This clearly demonstrates the difference in philosophy between inference (econometrics) and prediction (machine learning).
- Another important conceptual point is that improving the training score does not always improve the test score. This is the overfitting problem in a nutshell. By making the model more flexible and adding more variables to $X$ we can improve the training MSE (a lot). The problem is that we will start to capture more and more of the $\epsilon$ values in our estimate of $f(\;)$, which makes it a worse estimate of the true $f(\;)$. (A small sketch after this list illustrates the point.)
- Using the test MSE/$R^2$ to validate the model is a general idea for continuous OLS-type models, but this approach (and modifications of it) is applicable to many different approaches to estimating $f(\;)$. Frankly, we just need some kind of measuring stick (a "metric" in math lingo) and there are plenty to choose from. We'll see an example of a different measuring stick below with a "classification" problem for discrete data.
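Here is that sketch -- a hand-rolled version of set validation on the simulated sine data from Section 1 (my own illustration, assuming X, Y, and the imports above). We split the observations in half, fit polynomials of increasing degree on the training half only, and compare training and testing MSE; the training MSE keeps falling as the model gets more flexible, while the testing MSE stops improving and typically starts to rise once the fit begins chasing noise.
# Hand-rolled set validation: train on half the simulated data, test on the other half.
npr.seed(1)
idx = npr.permutation(len(X))              # shuffle the observation indices
train_idx, test_idx = idx[:50], idx[50:]   # 50 training and 50 testing observations
X_tr, Y_tr = X[train_idx], Y[train_idx]
X_te, Y_te = X[test_idx], Y[test_idx]

for deg in [1, 3, 5, 9]:
    coeffs = np.polyfit(X_tr, Y_tr, deg)                      # fit on the training data only
    mse_tr = np.mean((Y_tr - np.polyval(coeffs, X_tr))**2)    # training MSE
    mse_te = np.mean((Y_te - np.polyval(coeffs, X_te))**2)    # testing MSE
    print(f'degree {deg}: train MSE = {mse_tr:.2f}, test MSE = {mse_te:.2f}')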
Enough talk: **How do we implement this algorithm?**
3. An Example: Classification via Scikit Learn
When we are trying to predict something that can only take a discrete number of values, we call it a classification problem. Classification is a nice place to start because it's straightforward to imagine and is highly-relevant to machine learning; eg,
- Streaming history used to predict the next show or movie.
- Internet purchase histories to recommend products to buy.
- Medical images to determine whether a mammogram has a tumor.
- Self-driving cars to determine whether an object is a person, dog, sign, tree, etc.
We are going to use the scikit learn package, which is a popular package for machine learning analysis.
pip install scikit-learn
We are going to do things piece-by-piece so we can understand what we are doing. Once we know what we are doing we can set up and estimate models with very little coding.
import pandas as pd # data wrangling
import seaborn as sns # useful for exploratory data analysis (EDA)
import sklearn # machine learning
from sklearn.linear_model import LogisticRegression
Predicting Iris Types
When we are trying to predict something that can only take a discrete number of values, we call it a classification problem. The most famous data set and classifier example is the iris data set. Yes, it's so famous it has its own Wikipedia page! The iris we're interested in is a flower, not part of your eye. The flower is made up of sepals and petals. I did not know what a sepal was until I started thinking about machine learning.
The goal is to predict an iris's type using only the length and width of the sepal and the petal. There are three types of irises in the data set: Iris-setosa, Iris-versicolor, and Iris-virginica.
Let's load the data and get to work.
iris = pd.read_csv('./Data/iris.csv')
iris.sample(10)
|     | sepal length | sepal width | petal length | petal width | species         |
|-----|--------------|-------------|--------------|-------------|-----------------|
| 65  | 6.7          | 3.1         | 4.4          | 1.4         | Iris-versicolor |
| 149 | 5.9          | 3.0         | 5.1          | 1.8         | Iris-virginica  |
| 29  | 4.7          | 3.2         | 1.6          | 0.2         | Iris-setosa     |
| 101 | 5.8          | 2.7         | 5.1          | 1.9         | Iris-virginica  |
| 139 | 6.9          | 3.1         | 5.4          | 2.1         | Iris-virginica  |
| 88  | 5.6          | 3.0         | 4.1          | 1.3         | Iris-versicolor |
| 53  | 5.5          | 2.3         | 4.0          | 1.3         | Iris-versicolor |
| 16  | 5.4          | 3.9         | 1.3          | 0.4         | Iris-setosa     |
| 146 | 6.3          | 2.5         | 5.0          | 1.9         | Iris-virginica  |
| 66  | 5.6          | 3.0         | 4.5          | 1.5         | Iris-versicolor |
Each row is an observation. We can see our four characteristics and the class. Let's look at the average characteristic per type and look for patterns.
iris.groupby('species').mean()
| species         | sepal length | sepal width | petal length | petal width |
|-----------------|--------------|-------------|--------------|-------------|
| Iris-setosa     | 5.006        | 3.418       | 1.464        | 0.244       |
| Iris-versicolor | 5.936        | 2.770       | 4.260        | 1.326       |
| Iris-virginica  | 6.588        | 2.974       | 5.552        | 2.026       |
- setosa has small petals
- versicolor and virginica have larger petals, but versicolor's are narrower
- I have a hard time finding a pattern in the sepal data
Facet plots
Staring at numbers may not be the easiest way to see patterns. I find staring at plots to be easier. seaborn gives us the .pairplot() method to help visualize patterns in the data.
g = sns.pairplot(iris, hue='species')
Oh, that's nice.
We only need to look at the bottom triangle of the matrix. The top triangle tells us the same thing.
It jumps out at me that the setosa is pretty different from the other two. The versicolor and the virginica are more similar, but there is a clear division between the two (except in the sepal length by sepal width facet). That we can see these patterns with our eyes suggests a model should work well. Such pictures are also great for providing intuition to a reader who might view machine learning as a 'black box.'
Scikit learn
We begin by preparing our data. We want a testing and a training dataset. We also convert our data from pandas DataFrames to numpy arrays; scikit-learn ultimately works with arrays under the hood, so doing the conversion ourselves makes clear exactly what the model sees.
We'll learn best-practices for constructing training and testing data sets in future lectures. For now, let's wing it.
iris.shape
(150, 5)
# Keep 10 random observations to test. Why 10? No real good reason, really. We'll address how best to train/ test later.
# The random_state variable ensures that we get the same 10 random observations each time.
test_data = iris.sample(10, random_state=0)
# The species data
y_test_df = test_data['species']
# The characteristics data
X_test_df = test_data.drop('species', axis = 1)
X_test_df
|     | sepal length | sepal width | petal length | petal width |
|-----|--------------|-------------|--------------|-------------|
| 114 | 5.8          | 2.8         | 5.1          | 2.4         |
| 62  | 6.0          | 2.2         | 4.0          | 1.0         |
| 33  | 5.5          | 4.2         | 1.4          | 0.2         |
| 107 | 7.3          | 2.9         | 6.3          | 1.8         |
| 7   | 5.0          | 3.4         | 1.5          | 0.2         |
| 100 | 6.3          | 3.3         | 6.0          | 2.5         |
| 40  | 5.0          | 3.5         | 1.3          | 0.3         |
| 86  | 6.7          | 3.1         | 4.7          | 1.5         |
| 76  | 6.8          | 2.8         | 4.8          | 1.4         |
| 71  | 6.1          | 2.8         | 4.0          | 1.3         |
# Convert the DataFrames to numpy arrays
y_test = y_test_df.to_numpy()
X_test = X_test_df.to_numpy()
X_test
array([[5.8, 2.8, 5.1, 2.4],
       [6. , 2.2, 4. , 1. ],
       [5.5, 4.2, 1.4, 0.2],
       [7.3, 2.9, 6.3, 1.8],
       [5. , 3.4, 1.5, 0.2],
       [6.3, 3.3, 6. , 2.5],
       [5. , 3.5, 1.3, 0.3],
       [6.7, 3.1, 4.7, 1.5],
       [6.8, 2.8, 4.8, 1.4],
       [6.1, 2.8, 4. , 1.3]])
# Do the same thing, but to the rest of the data --- the training data
# First, drop the testing observations.
train_data = iris.drop(test_data.index)
y_train = train_data['species'].to_numpy()
X_train = train_data.drop('species', axis=1).to_numpy()
train_data.shape
(140, 5)
Construct and fit the model
We are going to use a logistic regression model, which implements a multinomial logistic regression. We'll dive into the details of this model in a future lecture. For now, we're using it as an example to get some experience working with data and scikit-learn.
There are many different types of models. In a machine learning class you would try several models and choose the one that works best. You can even code an algorithm to develop and test different models. Here, we are going to just use one model and focus on the training-testing paradigm.
Our logistic regression model has a parameter $\beta$ for each characteristic and each species; i.e., $\beta_{pw,set}$ is the coefficient on petal width for the setosa. We have 4 characteristics and 3 species, so we have 12 parameters. The parameters and the characteristics determine a score for each species, which the model maps into the probability that an observation is that species.
$$pr(set, i) = \beta_{pw, set}pw_i + \beta_{pl, set}pl_i + \beta_{sw, set}sw_i+\beta_{sl,set}sl_i$$
$$pr(ver, i) = \beta_{pw, ver}pw_i + \beta_{pl, ver}pl_i + \beta_{sw,ver}sw_i+\beta_{sl,ver}sl_i$$
$$pr(vir, i) = \beta_{pw, vir}pw_i + \beta_{pl, vir}pl_i + \beta_{sw, vir}sw_i+\beta_{sl,vir}sl_i$$
For an observation $i$, the species with the highest probability score is chosen.
We use the training data to find the $\beta$s that make the model's species predictions the best. This is a technical process you would learn about in an advanced econometrics or machine learning course; we will let scikit-learn handle it for us. Think of it as an OLS-style "draw a line through the data" exercise.
For such a complicated set up, the code is very short!
clf = LogisticRegression(random_state=0,solver='liblinear').fit(X_train, y_train)
# type(clf)
The LogisticRegression object uses the .fit() method and the training data to find the $\beta$s. The random_state option sets the random number generator seed so that we all get the same answer; the randomness is just a feature of how the fitting algorithm searches for a solution.
clf is an object with many attributes and methods. Let's see the $\beta$s.
clf.coef_
array([[ 0.42219371,  1.43295517, -2.23959505, -1.02350399],
       [ 0.2501995 , -1.40363835,  0.66059795, -1.36077578],
       [-1.61898762, -1.53188541,  2.37772222,  2.50630614]])
Each row is a set of $\beta$s for one species.
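To connect these coefficients back to the score equations above, here is a quick sketch (my own check, assuming the clf and X_test objects already defined) that computes each species' linear score by hand -- adding the fitted intercepts, which the simplified equations above omit -- and confirms that picking the species with the highest score reproduces the model's predictions:
# Each species' score: characteristics times betas, plus the fitted intercept.
scores = X_test @ clf.coef_.T + clf.intercept_        # one row per observation, one column per species
manual_preds = clf.classes_[scores.argmax(axis=1)]    # choose the species with the highest score

print(manual_preds)
print((manual_preds == clf.predict(X_test)).all())    # should print True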
We use the .predict() method and the test data to see how well the model performs on data that were not used to estimate the $\beta$s.
clf.predict(X_test)
array(['Iris-virginica', 'Iris-versicolor', 'Iris-setosa', 'Iris-virginica',
       'Iris-setosa', 'Iris-virginica', 'Iris-setosa', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor'], dtype=object)
How do the predictions compare to the test data?
y_test
array(['Iris-virginica', 'Iris-versicolor', 'Iris-setosa', 'Iris-virginica',
       'Iris-setosa', 'Iris-virginica', 'Iris-setosa', 'Iris-versicolor',
       'Iris-versicolor', 'Iris-versicolor'], dtype=object)
Eyeballing the results, we observe that the predictions match the test data exactly; i.e., we were correct on 10 of 10 (100%). Of course, "eyeballing" isn't a great strategy with larger data sets, so we can use the .score() method, and the test data, to find the fraction of observations in the test data that the model predicted correctly.
clf.score(X_test, y_test)
1.0
and we observe that we were correct on 100% of the test data. Are we done? With this example, yes, but in general we would loop through the train-test stages several more times, each time selecting a different random split of the data, and then evaluate our model specification via the average 'score'.
The careful reader will note that this algorithm addresses over-fitting via the random selection of training-test data but doesn't really address how good our specification is since the model specification is fixed. We're therefore not really addressing 'bias' yet. A task for another day.
To recap, here's the algorithm to address over-fitting:
- Preliminary checks of data. Look for patterns.
- Split data into test and train datasets
- Fit the model to the train dataset
- Evaluate the model using the test dataset
If the model does well predicting the test data, the model is ready to use. If the model does not do well, you would try different models or modify the model.
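As a preview of the streamlined workflow we'll use going forward, here is a compact sketch of those four steps on the iris data using scikit-learn's train_test_split helper (introduced properly in the practice below); the variable names here are my own:
from sklearn.model_selection import train_test_split

# 1. We already looked for patterns above with the pairplot.

# 2. Split the data into training and testing sets (10 test observations, as before).
X_tr, X_te, y_tr, y_te = train_test_split(
    iris.drop('species', axis=1), iris['species'],
    test_size=10, random_state=0)

# 3. Fit the model on the training data.
clf2 = LogisticRegression(random_state=0, solver='liblinear').fit(X_tr, y_tr)

# 4. Evaluate the model on the testing data.
print(clf2.score(X_te, y_te))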
Practice
Let's look at characteristics and quality rankings of Portuguese wine from this data set. I've also posted the data on our course website. The UCI Machine Learning Repository has lots of (pre-cleaned) data sets. Many (all?) of them have been used in research and have been donated by their authors.
Our goal is to use the characteristics of the wines to predict the quality.
- Load the 'winequality-red.csv' file. Note, the separator is a semi-colon, not a comma in the underlying data file.
wine = pd.read_csv('./Data/winequality-red.csv', sep=';')
wine.sample(5)
|     | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH   | sulphates | alcohol | quality |
|-----|---------------|------------------|-------------|----------------|-----------|---------------------|----------------------|---------|------|-----------|---------|---------|
| 543 | 11.2          | 0.660            | 0.24        | 2.5            | 0.085     | 16.0                | 53.0                 | 0.9993  | 3.06 | 0.72      | 11.0    | 6       |
| 376 | 11.5          | 0.450            | 0.50        | 3.0            | 0.078     | 19.0                | 47.0                 | 1.0003  | 3.26 | 1.11      | 11.0    | 6       |
| 263 | 7.9           | 0.370            | 0.23        | 1.8            | 0.077     | 23.0                | 49.0                 | 0.9963  | 3.28 | 0.67      | 9.3     | 5       |
| 271 | 11.5          | 0.180            | 0.51        | 4.0            | 0.104     | 4.0                 | 23.0                 | 0.9996  | 3.28 | 0.97      | 10.1    | 6       |
| 345 | 7.0           | 0.685            | 0.00        | 1.9            | 0.067     | 40.0                | 63.0                 | 0.9979  | 3.60 | 0.81      | 9.9     | 5       |
- How many characteristics ("features") are there? Lots of chemistry...
# The features are all the columns except quality.
# There are 11 features.
print(wine.columns.to_list(), '\n')
print('Number of features:', len(wine.columns)-1)
['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar', 'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density', 'pH', 'sulphates', 'alcohol', 'quality']

Number of features: 11
- The variable quality is discrete. What values does it take?
print('The values that quality can take are:', sorted(wine['quality'].unique()) )
The values that quality can take are: [3, 4, 5, 6, 7, 8]
- Try a facet plot, setting hue to be quality. There are enough characteristics in this dataset that this image is kind of absurdly large. We can see how things get harder the more characteristics we have. Do you see any patterns? Note: The first time I run the command I get an error. The second time, it runs fine. I have no idea why.
g = sns.pairplot(wine, hue='quality')
- Break up the wine data into testing and training datasets. We did it manually before. Now, let's do it the easy way. scikit gives us the train_test_split() method which takes a dataset and breaks it up for us. Import the method using
from sklearn.model_selection import train_test_split
The syntax for the method is below. Replace X_data with a DataFrame that only contains the features. Replace y_data with a Series (one column) that only holds the quality data.
test_size is the size of the testing dataset: 20 percent of the observations are used for testing and 80 percent are used for training. Leave it set to 0.20 for now.
random_state sets the seed for the random number generator. We are randomly assigning 20 percent of the observations to the test dataset. Setting the state ensures that we all split our dataset the same way. Leave it set to 12 for now.
X_train, X_test, y_train, y_test = train_test_split(
X_data,
y_data,
test_size=0.20,
random_state=12)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
wine.drop('quality',axis=1),
wine['quality'],
test_size=0.20,
random_state=12)
- Fit the model to your training data.
res = LogisticRegression(random_state=0,solver='liblinear').fit(X_train, y_train)
Leave random_state=0 so we all get the same answer.
wine_clf = LogisticRegression(random_state=0,solver='liblinear').fit(X_train, y_train)
You most likely received a pink box warning. The problem is that the complicated algorithm for finding the $\beta$s did not complete successfully.
The issue is a technical one beyond the scope of this notebook, but the idea is that our characteristics have very different means and variances (check this). Such variation in units can cause problems for minimization routines that search numerically for the solution (unlike OLS and 2SLS, where there are analytic solutions).
We fix this by rescaling our characteristics so that they all have mean=0 and variance=1. We'll address later why/when this will be important. For now, just run the following code.
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
Now our X variables have mean=0 and variance=1.
from sklearn import preprocessing
scaler = preprocessing.StandardScaler().fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
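As a quick check on the claim that the characteristics have very different scales, and to confirm the rescaling did what we wanted, here is a small sketch (assuming the objects from the cells above):
# The raw wine features are on very different scales...
print(wine.drop('quality', axis=1).std().round(2))

# ...while each column of the scaled X_train now has mean ~0 and standard deviation ~1.
print(X_train.mean(axis=0).round(2))
print(X_train.std(axis=0).round(2))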
- Fit the model again -- as you did in Step 6.
wine_clf = LogisticRegression(random_state=0,solver='liblinear').fit(X_train, y_train)
- How well does the model predict the test data? Use the .score() method and the test data to check.
score = wine_clf.score(X_test, y_test)
print(f'RESULTS: Our logistic model was correct {score*100:.2f} percent of the time.')
RESULTS: Our logistic model was correct 60.31 percent of the time.
Is this a good score? It certainly is not as good as the iris model. If we really wanted to figure this out, we would try different models, experiment with different sets of characteristics and dig deeper into the data.
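One quick way to judge whether 60 percent is any good is to compare it with a naive benchmark that always predicts the most common quality rating in the test data (a small sketch, assuming the y_test Series from the split above):
# Accuracy of a naive 'always guess the most common rating' benchmark.
baseline = y_test.value_counts(normalize=True).max()
print(f'Naive benchmark: {baseline*100:.2f} percent correct')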
Our goal today was just to get a sense of what all this machine learning talk is about. In future lectures we'll introduce methods to improve our set validation techniques as well as introduce better models.
- Go back to the step where you split the data into training and testing data. Change the random_state from 12 to a different number (or just exclude the option entirely). Do your results change?
# See code below.
- Loop through different training and testing random splits. Record the prediction for each loop. Plot the results in a scatter plot. For example, if you loop through 100 different values of training/testing data, you will have 100 prediction scores. Comment on the figure.
N = 100 # Number of random splits
record = []
for i in range(N):
X_train, X_test, y_train, y_test = train_test_split(
wine.drop('quality',axis=1),
wine['quality'],
test_size=0.20)
# Fit model based on training data
wine_clf = LogisticRegression(random_state=0,solver='liblinear').fit(X_train, y_train)
# Record performance on testing data
score = wine_clf.score(X_test, y_test)
record.append(score*100) # Note: I'm converting to percent.
avg = np.mean(record)
sd = np.std(record)
maxs = np.max(record)
mins = np.min(record)
iqr = np.quantile(record, .75) - np.quantile(record, .25)
# Create Figure
fig, ax = plt.subplots(figsize=(15,10))
ax.scatter(range(N),record, alpha = 0.5, color = 'red')
ax.axhline(xmin=0, xmax=N-1, y=avg, color='blue', linewidth=2.0, dashes=[3,3],label=f'Average ({avg:.2f})\nStd ({sd:.2f})\nMax ({maxs:.2f})\nMin ({mins:.2f})')
ax.legend(frameon=False,loc='upper left',fontsize=16)
ax.set_xlim(-1, N+1)
ax.set_ylim(45, 70)
ax.set_xlabel('\nIteration',size=18)
ax.set_ylabel('Model Performance (% Correct)\n',size=18)
ax.set_title('Model Performance with Repeated\n Random Training/Test Splits',size=20)
ax.tick_params(axis='x', length=0)
ax.tick_params(axis='y', length=0)
ax.set_yticks(ax.get_yticks().tolist())  # pin the tick locations before relabeling (avoids a FixedFormatter warning)
ax.set_yticklabels(['{:.0f}%'.format(int(x)) for x in ax.get_yticks().tolist()],size=18)
ax.set_xticks(ax.get_xticks().tolist())
ax.set_xticklabels(['{:.0f}'.format(int(x)) for x in ax.get_xticks().tolist()],size=18)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Comments: The particular random split has little effect on model performance; i.e., the standard deviation of the score across splits is small.