Lecture 20: Linear Regression via Ordinary Least Squares (OLS)
[MY SOLUTIONS]
This semester we've spent our time adding tools to your toolbox. While we have spent a lot of time looking at economic data, we haven't spent much time doing "economics." All of that work was a big wind-up to start using economic theory to make sense of data. There are two ways in which we'll do this:
Econometrics
Machine Learning
This notebook introduces us to econometrics and the statsmodels package (docs), which provides functions for formulating and estimating statistical models. Econometrics is not a prerequisite for this course, so we will spend some time motivating/proving the tools. If you have had econometrics, we will apply what you learned in that class here in Python.
The agenda for today's lecture is as follows:
Class Announcements
Exam 2 is due to eLC by end of day Thursday.
No class on Thursday.
Inference (Econometrics) vs. Prediction (Machine Learning)¶
Suppose we have data on a variable of interest $y$ and variables we think are important for determining $y$, $x_1, x_2,...x_p$. We assume that $y$ is related to the $x$ variables according to
$$y = f(X)+\epsilon$$where $X$ is the matrix of $x$ variables. The $f()$ function represents the systematic or structural relationship between $y$ and $X$ and $\epsilon$ is the noise or error term.
An important part of both machine learning and econometrics is estimating $f()$. Why we care about $f()$ is where machine learning and econometrics often (but not always) diverge.
- **Econometrics** prioritizes inference: The estimate of $f()$ is important in that it tells us about how $y$ and $X$ are related. An economist may be interested in understanding the (causal) relationship between interest rates (influenced directly by the FOMC) and the S&P 500 index (as a proxy for the economy). Inference is often associated with some kind of theoretical model in mind that lays out relationships between the $y$ and $X$ variables. The estimate of $f()$ helps us quantify and better understand these models.
- **Machine learning** prioritizes prediction: The estimate of $f()$ is only important in that it is useful for out-of-sample prediction. For example, a 'trained' ML model will be capable of predicting the value of the S&P 500 index tomorrow but the user is not inherently interested in why the S&P 500 index will change.
Recent work has focused on leveraging machine learning tools to better infer causality (econometrics) so the distinction between the two is becoming blurry.
Specifying f( )¶
How do we know which variables to include in $X$, or what form $f(\;)$ takes? In econometrics, we use a model to guide our choice of $f()$. This reflects the researcher's goal of inference. For example, if we're interested in understanding the causal relationship between interest rates and the S&P 500, interest rates had better be on the right-hand side of the regression. Plus, we'll want our estimates to reflect some deeper economic idea such as an elasticity. Doing so will enable us to port our results to other settings, including the large and more complicated equilibrium models of the economy used by the Federal Reserve to inform policy.
For this class, we will assume that $f(x)$ is linear so our assumed "data generating process" takes the following form:
$$y = X\beta+\epsilon$$where $X$ is our data and $\beta$ is unknown to us; we want to use the data to estimate it.
Identifying $\beta$: A Motivating Simple Example¶
Suppose we have $j=1,...,N$ observations of some outcome variable we're interested in ($y$) and $N$ corresponding observations of some variable $X$ which we think influences $y$. We want to know exactly how; i.e., we want to find $\beta$. The world is a complicated place, so we also allow for some random noise by including an error term ($\epsilon$). Our specification is therefore
$$y_j = X_j\beta+\epsilon_j$$How to find $\beta$?
Monte Carlo Simulation¶
Normally, we are presented with data, we assume a functional form for $f()$, and we solve for the $\beta$ such that our specification best represents the observed data. Our results could therefore be wrong if we specified the wrong $f()$. Let's ignore that possibility and generate our own data so we know the true Data Generating Process (DGP) and therefore the true value of $\beta$. This kind of approach is called Monte Carlo simulation. There is an optional lecture on the course website about how to properly construct random draws.
# Initialize packages
import numpy as np
import numpy.random as npr
import matplotlib.pyplot as plt
import seaborn as sea
import pandas as pd
# Set seed for random number generator to ensure the same "random" numbers are generated each time we run this code.
np.random.seed(10)
# Set values
beta = 1
N = 100
# Generate N random draws from an exponential distribution
X = npr.exponential(1,size=N)
# Generate N random draws from a normal distribution
eps = npr.normal(0,1,size=N)
# Determine Y
Y = X*beta + eps
# Create Figure
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(X, Y, alpha = 0.4, color = 'red')
ax.set_xlabel('Independent Variable (X)',size=18)
ax.set_ylabel('Dependent Variable (Y)',size=18)
ax.set_title(r'$Y_j = \beta X_j + \epsilon_j,\;\beta$='+str(beta),size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Let's pretend that we don't know the true $\beta$ is equal to one.
We want to find the $\beta$ which best fits the data. Since our assumed $f()$ is linear, this amounts to drawing a line through the above dots, beginning at the origin. We want this line to pass through as many of the dots as possible. In elementary school you would have just positioned your ruler at (0,0) and rotated it until you found a good-enough line. We're going to do the same thing here but with math.
Finding the 'best' line amounts to finding the line which minimizes the (squared) distance between the line and the red dots. Mathematically, this amounts to solving
$$\text{min}_{\beta}\sum_j \bigg[Y_j - \underbrace{\big(X_j\beta+\epsilon_j\big)}_{\text{Model}}\bigg]^2$$We square the distance because we're indifferent as to whether the line misses a red dot from above or from below. Also note that the above specification lends itself to the name of this regression: "Least Squares". The "Ordinary" refers to this approach being the starting point for regression models.
We can solve for $\beta$ using calculus; i.e., by taking the derivative with respect to $\beta$ and setting it equal to zero:
$$ \sum_j \bigg[Y_j - \big(X_j\beta+\epsilon_j\big)\bigg]\times X_j=0$$which is equivalent to
$$ \sum_j \big[Y_jX_j - X_jX_j\beta -X_j\epsilon_j\big]=0$$$$ \hookrightarrow \sum_j Y_jX_j -\sum_jX_j\epsilon_j=\beta\sum_jX_jX_j$$But we can do better yet. Remember, we've assumed in our specification that $\epsilon$ is random noise. Since it's random noise, it cannot be correlated with $X$, so the interaction between the two ($\sum_jX_j\epsilon_j$) is equal to zero in expectation (or, equivalently, when the number of observations is sufficiently large).
We therefore can derive the value of $\beta$ directly from the First-Order Condition:
$$\hat\beta=\frac{\sum_j Y_jX_j}{\sum_jX_jX_j}$$We can check this with our graph:
# Compute bhat = sum(YX)/sum(XX) and the implied fitted values
numerator = np.sum(Y*X)
denominator = np.sum(X*X)
bhat = numerator/denominator
Yhat = X*bhat
# Create Figure
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(X, Y, alpha = 0.4, color = 'red')
ax.plot(X, Yhat, color = 'blue')
ax.set_xlabel('Independent Variable (X)')
ax.set_ylabel('Dependent Variable (Y)')
ax.set_title(r'Proposed DGP with $\beta$='+str(beta),size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.annotate(fr'$\hat\beta=${bhat:.2f}', (2.2,3.5),size=16, color='blue')
plt.show()
Well, we did pretty well at recovering the true $\beta$, but we didn't get it exactly. Why not? Note that above I said the interaction between $X$ and $\epsilon$ is zero "in expectation" and then set it to zero. If that interaction isn't exactly zero in our sample, the calculated $\hat\beta$ will be off. Hence, what we have is just an estimate of the coefficient, and we would therefore like to understand the variance of this estimate to quantify our uncertainty and possibly to perform significance tests.
We can actually derive an explicit expression for the variance of our estimate -- our uncertainty stems from how noisy the errors are and how much variation in $X$ we have. From the first-order condition, $\hat\beta - \beta = \frac{\sum_j X_j\epsilon_j}{\sum_j X_j^2}$, so the variance is
$$V(\hat\beta)= V\bigg(\frac{\sum_j X_j\epsilon_j}{\sum_j X_j^2}\bigg) \;\hookrightarrow\; V(\hat\beta)= \frac{\sigma^2}{\sum_j X_j^2}$$where $\sigma^2$ is the variance of the error term. We estimate it with $\hat\sigma^2 = \frac{\sum_j\hat\epsilon_j^2}{N-p}$, where $p$ is the number of parameters estimated and $N-p$ is called the 'degrees of freedom.' We don't observe the true epsilons, but our estimated equation gives us residuals ($\hat\epsilon_j = Y_j - X_j\hat\beta$), so we use those. Hence, our estimated variance of the estimate is
$$\hat V(\hat\beta)= \frac{\sum_j\hat\epsilon_j^2/(N-p)}{\sum_j X_j^2} $$We define the standard error as the square root of $\hat V(\hat\beta)$, and we can easily compute it given our estimates. Moreover, we can test whether our estimate of $\beta$ is significantly different from a 'null hypothesis' value $\beta^0$ (typically zero, meaning no relationship exists) using a t-test:
$$t = \frac{\hat\beta-\beta^0}{\text{se}(\hat\beta)}\sim T_{N-p}$$where $T_{N-p}$ is the t-distribution with $N-p$ degrees of freedom. (BTW, the story behind 'Student's t-test' is a worthwhile read.) A good rule of thumb is that a t-statistic greater than 2 in absolute value lets us reject the null hypothesis at the 95% confidence level. In practice, we report the estimated standard errors, since that enables the reader to compute t-statistics for a variety of null hypotheses.
# Residuals from the fitted line
resid = Y-X*bhat
# Estimate of the error variance, using N-1 degrees of freedom (one parameter estimated)
s = np.sum(resid**2)/(N-1)
# Variance and standard error of bhat
var = s/(np.sum(X**2))
se = np.sqrt(var)
# Create Figure
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(X, Y, alpha = 0.4, color = 'red')
ax.plot(X, Yhat, color = 'blue')
ax.set_xlabel('Independent Variable (X)')
ax.set_ylabel('Dependent Variable (Y)')
ax.set_title(r'Proposed DGP with $\beta$='+str(beta),size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.annotate(fr'$\hat\beta=${bhat:.2f} ({se:.4f})', (1.6,3.5),size=16, color='blue')
plt.show()
From the graph, we can see that our simple linear model replicates the data well and -- since we know the true DGP -- also recovers the true value of $\beta$.
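As a quick check, we can also compute the t-statistic described above for the null hypothesis $\beta^0=0$, reusing the bhat and se we just computed (a small sketch, not part of the original derivation):
# t-statistic for the null hypothesis that beta equals zero
t_stat = (bhat - 0)/se
print(f't-statistic against the null beta0 = 0: {t_stat:.2f}')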
In our simple example we only included one variable and therefore only solved for one coefficient. Going forward, our data will be much more complicated (e.g., more independent X variables) and we'll want to account for the possibility that our dependent variable attains some value independent of the observed X variables in our data -- that is, we'll want to include a constant in our $f()$ specification. It turns out that doing all of this is pretty straightforward, and the intuition gained from our simple example extends easily.
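To preview the general case: stacking the regressors (plus a column of ones for the constant) into a matrix $X$, the least-squares estimate is $\hat\beta=(X'X)^{-1}X'y$. Here is a minimal numpy sketch of that formula using the simulated X and Y from above (the constant column is illustrative; its true value is zero in our DGP):
# Build a design matrix with a constant column plus our simulated regressor
X_mat = np.column_stack([np.ones(N), X])
# Solve the normal equations (X'X) beta = X'y for beta
beta_hat = np.linalg.solve(X_mat.T @ X_mat, X_mat.T @ Y)
print('Estimated [constant, slope]:', beta_hat)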
Practice ¶
- Increase N to $\{200,500,1000\}$. What happens to the estimated coefficient $(\hat\beta)$ and the estimated standard error $(\text{se}(\hat\beta))$?
# Set values
beta = 1
Nval = [100, 200, 500, 1000, 100000, 100000]
for n in Nval:
    # Generate n random draws from an exponential distribution
    X = npr.exponential(1,size=n)
    # Generate n random draws from a normal distribution
    eps = npr.normal(0,1,size=n)
    # Determine Y
    Y = X*beta + eps
    numerator = np.sum(Y*X)
    denominator = np.sum(X*X)
    print(f'Given N of {n:d}, the value of bhat is: {numerator/denominator:.4f}')
Given N of 100, the value of bhat is: 1.0222
Given N of 200, the value of bhat is: 0.9870
Given N of 500, the value of bhat is: 0.9854
Given N of 1000, the value of bhat is: 0.9786
Given N of 100000, the value of bhat is: 0.9995
Given N of 100000, the value of bhat is: 1.0013
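The question also asks about the standard errors; here is a sketch of the same loop that reports the estimated standard error as well, so you can see it shrink as $N$ grows:
for n in Nval:
    X = npr.exponential(1, size=n)
    eps = npr.normal(0, 1, size=n)
    Y = X*beta + eps
    bhat = np.sum(Y*X)/np.sum(X*X)
    resid = Y - X*bhat
    # Estimated standard error of bhat, with n-1 degrees of freedom
    se = np.sqrt((np.sum(resid**2)/(n-1))/np.sum(X**2))
    print(f'N = {n:d}: bhat = {bhat:.4f}, se = {se:.4f}')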
- Change the specification by changing $\beta$ to a value of your choice and re-run the code to find $\hat\beta$.
Nval = [100, 200, 500, 1000, 100000, 100000]
for n in Nval:
    # Generate n random draws from an exponential distribution
    X = npr.exponential(1,size=n)
    # Generate n random draws from a normal distribution
    eps = npr.normal(0,1,size=n)
    # Determine Y (true beta set to 1.5 this time)
    Y = X*1.5 + eps
    numerator = np.sum(Y*X)
    denominator = np.sum(X*X)
    print(f'Given N of {n:d}, the value of bhat is: {numerator/denominator:.4f}')
Given N of 100, the value of bhat is: 1.4653
Given N of 200, the value of bhat is: 1.4602
Given N of 500, the value of bhat is: 1.4470
Given N of 1000, the value of bhat is: 1.4435
Given N of 100000, the value of bhat is: 1.4994
Given N of 100000, the value of bhat is: 1.4965
Reading Stata data files¶
If you took econometrics, you probably used one of Wooldridge's textbooks. I figure you would like to relive those happy times, so we'll use some examples from his textbook.
On the plus side, the data that correspond to the Wooldridge Introductory Econometrics: A Modern Approach problems are available to download and they are ALREADY CLEANED. [I contemplated adding some junk to the files to make it more interesting...]
On the minus side, the files are in Stata's .dta format.
Lucky for us, pandas has a method that reads Stata files. It also has methods for csv, excel, json, SQL, SAS, ...
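For example, a minimal sketch with hypothetical file names (each reader returns a DataFrame):
# Hypothetical file names -- uncomment and point at real files to use these readers
# df = pd.read_csv('my_data.csv')        # comma-separated text
# df = pd.read_excel('my_data.xlsx')     # Excel workbook
# df = pd.read_json('my_data.json')      # JSON records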
We'll be looking at the connection between labor wages and sleep using a 1990 paper by Biddle and Hamermesh (link). The authors use time-use data from 12 countries to assess whether an increase in wage causes individuals to sacrifice sleep.
Take note of a couple things:
The "economic mechanism" is that people are time-constrained so when their wages go up, they sacrifice sleep to earn more. If we were to write down a consumer optimization problem, utility would be increasing in sleep and consumption; ie, $$ max_{s,c} \,U(\underbrace{T-n}_{\text{sleep}},c)\;\;\text{s.t.}\; T \geq n + s, \underbrace{w\times n}_{\text{income}}\geq \underbrace{p\times c}_{\substack{\text{Money spent}\\ \text{on cons.}}} $$
They want to show "causation" but all we'll be able to show with OLS is correlation.
Let's load the underlying data:
# Use pandas read_stata method to get the stata formatted data file into a DataFrame.
sleep = pd.read_stata('./Data/sleep75.dta')
# Take a look...so clean!
sleep.head()
age | black | case | clerical | construc | educ | earns74 | gdhlth | inlf | leis1 | ... | spwrk75 | totwrk | union | worknrm | workscnd | exper | yngkid | yrsmarr | hrwage | agesq | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 32.0 | 0.0 | 1.0 | 0.0 | 0.0 | 12.0 | 0.0 | 0.0 | 1.0 | 3529.0 | ... | 0.0 | 3438.0 | 0.0 | 3438.0 | 0.0 | 14.0 | 0.0 | 13.0 | 7.070004 | 1024.0 |
1 | 31.0 | 0.0 | 2.0 | 0.0 | 0.0 | 14.0 | 9500.0 | 1.0 | 1.0 | 2140.0 | ... | 0.0 | 5020.0 | 0.0 | 5020.0 | 0.0 | 11.0 | 0.0 | 0.0 | 1.429999 | 961.0 |
2 | 44.0 | 0.0 | 3.0 | 0.0 | 0.0 | 17.0 | 42500.0 | 1.0 | 1.0 | 4595.0 | ... | 1.0 | 2815.0 | 0.0 | 2815.0 | 0.0 | 21.0 | 0.0 | 0.0 | 20.530001 | 1936.0 |
3 | 30.0 | 0.0 | 4.0 | 0.0 | 0.0 | 12.0 | 42500.0 | 1.0 | 1.0 | 3211.0 | ... | 1.0 | 3786.0 | 0.0 | 3786.0 | 0.0 | 12.0 | 0.0 | 12.0 | 9.619998 | 900.0 |
4 | 64.0 | 0.0 | 5.0 | 0.0 | 0.0 | 14.0 | 2500.0 | 1.0 | 1.0 | 4052.0 | ... | 1.0 | 2580.0 | 0.0 | 2580.0 | 0.0 | 44.0 | 0.0 | 33.0 | 2.750000 | 4096.0 |
5 rows × 34 columns
Here is some additional information about the variables:
Variable | Description | Variable | Description | Variable | Description | Variable | Description |
---|---|---|---|---|---|---|---|
age | in years | leis1 | sleep - totwrk | rlxall | slpnaps + personal activs | worknrm | mins work main job |
black | =1 if black | leis2 | slpnaps - totwrk | selfe | =1 if self employed | workscnd | mins work second job |
case | identifier | leis3 | rlxall - totwrk | sleep | mins sleep at night, per wk | exper | age - educ - 6 |
clerical | =1 if clerical worker | smsa | =1 if live in smsa | slpnaps | minutes sleep, inc. naps | yngkid | =1 if children < 3 present |
construc | =1 if construction worker | lhrwage | log hourly wage | south | =1 if live in south | yrsmarr | years married |
educ | years of schooling | lothinc | log othinc, unless othinc < 0 | spsepay | spousal wage income | hrwage | hourly wage |
earns74 | total earnings, 1974 | male | =1 if male | spwrk75 | =1 if spouse works | agesq | age^2 |
gdhlth | =1 if in good or excel. health | marr | =1 if married | totwrk | mins worked per week | ||
inlf | =1 if in labor force | prot | =1 if Protestant | union | =1 if belong to union |
Describing models with patsy¶
The patsy package provides us with a formula syntax for defining models using strings. The basic syntax is
y ~ x1 + x2
which describes the model
$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon$.
Notice that I did not specify the constant. Patsy takes care of that automatically. Let's start slow and build up the regression, then we will see how to do it all in one shot. You'll have to install patsy with pip:
pip install patsy
We start by passing our model, specified as a patsy string, to patsy.dmatrices() to create the design matrices. Our model is
$$ sleep = \beta_0 + \beta_1 totwrk + \beta_2 educ + \beta_3 age + \epsilon. $$
NB: This is Wooldridge problem 3, chapter 3.
import patsy # provides a syntax for specifying models
# Pass the model formula and the associated data to create design matrices
y, X = patsy.dmatrices('sleep ~ totwrk + educ + age', sleep)
# What do we have?
print('X and y are of type:' , type(X), type(y))
X and y are of type: <class 'patsy.design_info.DesignMatrix'> <class 'patsy.design_info.DesignMatrix'>
X
DesignMatrix with shape (706, 4)
  Intercept  totwrk  educ  age
          1    3438    12   32
          1    5020    14   31
          1    2815    17   44
          1    3786    12   30
          1    2580    14   64
          1    1205    12   41
          1    2113    12   35
          1    3608    13   47
          1    2353    17   32
          1    2851    15   30
          1    6415     8   43
          1     370    16   23
          1    2438    16   24
          1    2693     5   48
          1    2526    12   33
          1    2950    12   23
          1    3003    17   46
          1    4011    14   37
          1    2300    12   53
          1    1543    17   45
          1    3473    17   46
          1    3276    13   40
          1    2506    12   53
          1    2651    13   29
          1    4580    12   29
          1    3588    12   53
          1    3418    13   28
          1    2250    12   35
          1    2638    12   36
          1    3173    12   59
  [676 rows omitted]
  Terms:
    'Intercept' (column 0)
    'totwrk' (column 1)
    'educ' (column 2)
    'age' (column 3)
  (to view full data, use np.asarray(this_obj))
y
DesignMatrix with shape (706, 1)
  sleep
   3113
   2920
   2670
   3083
   3448
   4063
   3180
   2928
   3368
   3018
   1575
   3295
   3798
   3008
   3248
   3683
   3201
   2580
   3420
   3090
   2760
   2880
   3470
   2673
   2820
   2873
   1905
   2926
   2603
   3238
  [676 rows omitted]
  Terms:
    'sleep' (column 0)
  (to view full data, use np.asarray(this_obj))
So X holds the independent variables and y holds the dependent variable. Note the addition of the intercept (the column of ones) to the X matrix.
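If you want to work with the design matrices directly, they convert cleanly to plain numpy arrays (as the printout above suggests):
# Convert the patsy design matrices to numpy arrays for inspection
X_arr = np.asarray(X)
y_arr = np.asarray(y)
print(X_arr[:5, :])   # first five rows: Intercept, totwrk, educ, age
print(y_arr[:5, 0])   # first five values of sleep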
Building and estimating the model in statsmodels¶
With the design matrices in hand, we can build an ordinary least squares (OLS) model in the Python package statsmodels, which you can install via
pip install statsmodels
import statsmodels.api as sm # provides statistical models like ols, gmm, anova, etc...
# Pass the design matrices to OLS to specify an OLS model
sleep_model = sm.OLS(y, X)
type(sleep_model)
statsmodels.regression.linear_model.OLS
We can now estimate the model using the .fit() method of the statsmodels OLS object.
res = sleep_model.fit() # Estimate the model and store the results in res
# To see the summary report
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sleep   R-squared:                       0.113
Model:                            OLS   Adj. R-squared:                  0.110
Method:                 Least Squares   F-statistic:                     29.92
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           3.28e-18
Time:                        15:13:19   Log-Likelihood:                -5263.1
No. Observations:                 706   AIC:                         1.053e+04
Df Residuals:                     702   BIC:                         1.055e+04
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   3638.2453    112.275     32.405      0.000    3417.810    3858.681
totwrk        -0.1484      0.017     -8.888      0.000      -0.181      -0.116
educ         -11.1338      5.885     -1.892      0.059     -22.687       0.420
age            2.1999      1.446      1.522      0.129      -0.639       5.038
==============================================================================
Omnibus:                       68.731   Durbin-Watson:                   1.943
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              185.551
Skew:                          -0.496   Prob(JB):                     5.11e-41
Kurtosis:                       5.308   Cond. No.                     1.66e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.66e+04. This might indicate that there are strong multicollinearity or other numerical problems.
The more you work, the less you sleep. Or is it the other way around? I feel better already.
We can retrieve other results too.
print('The estimated coefficients are:', res.params, '\n')
print('The confidence intervals are:\n', res.conf_int(), '\n')
print(f'The r-squared is: {res.rsquared:.4f}')
The estimated coefficients are: [ 3.63824531e+03 -1.48373438e-01 -1.11338131e+01  2.19988481e+00]

The confidence intervals are:
 [[ 3.41781008e+03  3.85868055e+03]
  [-1.81148684e-01 -1.15598192e-01]
  [-2.26872877e+01  4.19661463e-01]
  [-6.38561308e-01  5.03833093e+00]]

The r-squared is: 0.1134
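The results object exposes many other quantities as well; a few examples (attribute names from the statsmodels results API):
print('Standard errors:', res.bse, '\n')         # standard errors of the coefficients
print('t-statistics:', res.tvalues, '\n')        # coefficients divided by their standard errors
print('p-values:', res.pvalues, '\n')            # two-sided p-values against a null of zero
print('Number of observations:', res.nobs)       # sample size used in the fit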
Directly specifying and estimating models with the formula.api¶
We have built our model from the ground up:
- Create the design matrices with patsy
- Create the model with statsmodels
- Fit the model and obtain results
That was helpful to get a sense of what is going on, but we can do all those steps in one line of code. We just pass the patsy string and the data directly to statsmodels and call fit.
To do this, we use the statsmodels.formula.api module. The formula API interprets the patsy formula for us and creates the design matrices.
import statsmodels.formula.api as smf # provides a way to directly spec models from formulas
res = smf.ols('sleep ~ totwrk + educ + age', data=sleep).fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sleep   R-squared:                       0.113
Model:                            OLS   Adj. R-squared:                  0.110
Method:                 Least Squares   F-statistic:                     29.92
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           3.28e-18
Time:                        15:13:19   Log-Likelihood:                -5263.1
No. Observations:                 706   AIC:                         1.053e+04
Df Residuals:                     702   BIC:                         1.055e+04
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   3638.2453    112.275     32.405      0.000    3417.810    3858.681
totwrk        -0.1484      0.017     -8.888      0.000      -0.181      -0.116
educ         -11.1338      5.885     -1.892      0.059     -22.687       0.420
age            2.1999      1.446      1.522      0.129      -0.639       5.038
==============================================================================
Omnibus:                       68.731   Durbin-Watson:                   1.943
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              185.551
Skew:                          -0.496   Prob(JB):                     5.11e-41
Kurtosis:                       5.308   Cond. No.                     1.66e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.66e+04. This might indicate that there are strong multicollinearity or other numerical problems.
Data transformations¶
Patsy can handle common (and less common) regression tasks. Here are a few
Interacting variables¶
Use '*' to interact two variables. Patsy will include the two variables individually as well. Below, we interact education and age:
$$ sleep = \beta_0 + \beta_1 totwrk + \beta_2 educ + \beta_3 age + \beta_4 age\times educ + \epsilon $$.
res = smf.ols('sleep ~ totwrk + educ*age', data=sleep).fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sleep   R-squared:                       0.119
Model:                            OLS   Adj. R-squared:                  0.114
Method:                 Least Squares   F-statistic:                     23.67
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           2.24e-18
Time:                        15:13:19   Log-Likelihood:                -5260.9
No. Observations:                 706   AIC:                         1.053e+04
Df Residuals:                     701   BIC:                         1.055e+04
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   3081.4242    286.426     10.758      0.000    2519.069    3643.780
totwrk        -0.1444      0.017     -8.618      0.000      -0.177      -0.112
educ          31.5246     21.032      1.499      0.134      -9.769      72.818
age           14.9293      6.197      2.409      0.016       2.763      27.096
educ:age      -1.0065      0.477     -2.112      0.035      -1.942      -0.071
==============================================================================
Omnibus:                       66.086   Durbin-Watson:                   1.933
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              178.298
Skew:                          -0.475   Prob(JB):                     1.92e-39
Kurtosis:                       5.271   Cond. No.                     4.32e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.32e+04. This might indicate that there are strong multicollinearity or other numerical problems.
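A related patsy detail: '*' expands to both main effects plus the interaction, while ':' includes only the interaction term itself. A sketch (this changes the specification, so the coefficients are not directly comparable to the ones above):
# 'educ:age' adds only the interaction term, without educ and age on their own
res_int = smf.ols('sleep ~ totwrk + educ:age', data=sleep).fit()
print(res_int.params)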
Using logs¶
We use the np.log() method directly in the patsy syntax. Note that we loaded the numpy package above as np. Below, we use the logarithm of age in the model
$$ sleep = \beta_0 + \beta_1 totwrk + \beta_2 educ + \beta_3 \log(age) + \epsilon. $$
res = smf.ols('sleep ~ totwrk + educ + np.log(age)', data=sleep).fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sleep   R-squared:                       0.113
Model:                            OLS   Adj. R-squared:                  0.109
Method:                 Least Squares   F-statistic:                     29.79
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           3.91e-18
Time:                        15:13:19   Log-Likelihood:                -5263.3
No. Observations:                 706   AIC:                         1.053e+04
Df Residuals:                     702   BIC:                         1.055e+04
Df Model:                           3
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept    3440.9308    239.448     14.370      0.000    2970.811    3911.050
totwrk         -0.1489      0.017     -8.924      0.000      -0.182      -0.116
educ          -11.3493      5.881     -1.930      0.054     -22.896       0.197
np.log(age)    79.2414     56.612      1.400      0.162     -31.908     190.391
==============================================================================
Omnibus:                       68.734   Durbin-Watson:                   1.944
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              184.854
Skew:                          -0.497   Prob(JB):                     7.24e-41
Kurtosis:                       5.301   Cond. No.                     3.61e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.61e+04. This might indicate that there are strong multicollinearity or other numerical problems.
Fixed effects¶
When I was a kid, we called these dummy variables. Gender is coded {0,1} in the variable male. Let's suppose it was coded as a string.
sleep['gender'] = sleep['male'].replace({1.0:'male', 0.0:'female'})
res = smf.ols('sleep ~ totwrk + educ + age + C(gender)', data=sleep).fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sleep   R-squared:                       0.122
Model:                            OLS   Adj. R-squared:                  0.117
Method:                 Least Squares   F-statistic:                     24.26
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           8.02e-19
Time:                        15:13:19   Log-Likelihood:                -5259.8
No. Observations:                 706   AIC:                         1.053e+04
Df Residuals:                     701   BIC:                         1.055e+04
Df Model:                           4
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept          3642.4666    111.844     32.567      0.000    3422.877    3862.056
C(gender)[T.male]    87.9933     34.323      2.564      0.011      20.604     155.382
totwrk               -0.1658      0.018     -9.230      0.000      -0.201      -0.131
educ                -11.7561      5.866     -2.004      0.045     -23.274      -0.238
age                   1.9643      1.443      1.361      0.174      -0.869       4.797
==============================================================================
Omnibus:                       65.308   Durbin-Watson:                   1.942
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              174.107
Skew:                          -0.473   Prob(JB):                     1.56e-38
Kurtosis:                       5.241   Cond. No.                     1.66e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.66e+04. This might indicate that there are strong multicollinearity or other numerical problems.
Including a fixed effect for gender means that we interpret the coefficient on the dummy as relative to the reference group (here, women); ie, since the coefficient is positive, men sleep more than women, ceteris paribus. Equivalently, the intercept is the value of the female fixed effect.
No intercept¶
Use -1 to kill the automatic intercept and force the regression to go through the origin (as in our simple example).
res = smf.ols('sleep ~ totwrk + educ + age + C(gender) -1', data=sleep).fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sleep   R-squared:                       0.122
Model:                            OLS   Adj. R-squared:                  0.117
Method:                 Least Squares   F-statistic:                     24.26
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           8.02e-19
Time:                        15:13:19   Log-Likelihood:                -5259.8
No. Observations:                 706   AIC:                         1.053e+04
Df Residuals:                     701   BIC:                         1.055e+04
Df Model:                           4
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
C(gender)[female]  3642.4666    111.844     32.567      0.000    3422.877    3862.056
C(gender)[male]    3730.4598    117.475     31.755      0.000    3499.816    3961.104
totwrk               -0.1658      0.018     -9.230      0.000      -0.201      -0.131
educ                -11.7561      5.866     -2.004      0.045     -23.274      -0.238
age                   1.9643      1.443      1.361      0.174      -0.869       4.797
==============================================================================
Omnibus:                       65.308   Durbin-Watson:                   1.942
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              174.107
Skew:                          -0.473   Prob(JB):                     1.56e-38
Kurtosis:                       5.241   Cond. No.                     2.37e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.37e+04. This might indicate that there are strong multicollinearity or other numerical problems.
Practice ¶
Wooldridge problem C2 in chapter 6.
- Load wage1.dta
- Generate a scatter plot of log(wage) on education.
wage1 = pd.read_stata('./Data/wage1.dta')
# There is already a log(wage) variable in the data set (lwage) but we can always use some practice with transforms
wage1['lnwage'] = wage1['wage'].apply(np.log)
wage1.head()
wage | educ | exper | tenure | nonwhite | female | married | numdep | smsa | northcen | ... | trade | services | profserv | profocc | clerocc | servocc | lwage | expersq | tenursq | lnwage | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3.10 | 11.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.131402 | 4.0 | 0.0 | 1.131402 |
1 | 3.24 | 12.0 | 22.0 | 2.0 | 0.0 | 1.0 | 1.0 | 3.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.175573 | 484.0 | 4.0 | 1.175573 |
2 | 3.00 | 11.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.098612 | 4.0 | 0.0 | 1.098612 |
3 | 6.00 | 8.0 | 44.0 | 28.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.791759 | 1936.0 | 784.0 | 1.791759 |
4 | 5.30 | 12.0 | 7.0 | 2.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.667707 | 49.0 | 4.0 | 1.667707 |
5 rows × 25 columns
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(wage1['educ'], wage1['lnwage'], marker='o', alpha = 0.5 )
ax.set_xlabel('education')
ax.set_ylabel('log wage')
sea.despine(ax=ax, trim=True)
- What's the relationship between education and log-wage? Estimate
$$ \log(wage) = \beta_0 + \beta_1 educ + \epsilon$$
Print your estimated coefficient on education. Is it statistically different from zero? What's the interpretation?
res = smf.ols('np.log(wage) ~ educ', data=wage1).fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:           np.log(wage)   R-squared:                       0.186
Model:                            OLS   Adj. R-squared:                  0.184
Method:                 Least Squares   F-statistic:                     119.6
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           3.27e-25
Time:                        15:13:19   Log-Likelihood:                -359.38
No. Observations:                 526   AIC:                             722.8
Df Residuals:                     524   BIC:                             731.3
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.5838      0.097      5.998      0.000       0.393       0.775
educ           0.0827      0.008     10.935      0.000       0.068       0.098
==============================================================================
Omnibus:                       11.804   Durbin-Watson:                   1.801
Prob(Omnibus):                  0.003   Jarque-Bera (JB):               13.811
Skew:                           0.268   Prob(JB):                      0.00100
Kurtosis:                       3.586   Cond. No.                         60.2
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
- Scatter plot the residuals against education. The residuals are in the results object returned by the fit method.
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(wage1['educ'], res.resid, marker='o', alpha = 0.5 )
ax.set_xlabel('education')
ax.set_ylabel('residual')
sea.despine(ax=ax, trim=True)
Looks heteroskedastic. That's a word to impress your friends. It means the errors don't look like they're all drawn from a single distribution with the same variance. They're still mean-zero noise, so our math and our $\beta$ point estimates are valid -- the errors may just be coming from a different distribution depending on whether someone has high or low education. The consequence of heteroskedasticity is that our standard errors and the corresponding hypothesis tests are wrong.
This is actually an easy fix, both mathematically and programmatically: we can change the covariance matrix type (which will correct the standard error calculation) using the cov_type parameter (docs). The available types of covariance matrices are described in the docs.
- Redo the estimation allowing for a more flexible variance-covariance matrix. Try 'HC3' for your covariance matrix type.
res = smf.ols('np.log(wage) ~ educ', data=wage1).fit(cov_type='HC3')
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:           np.log(wage)   R-squared:                       0.186
Model:                            OLS   Adj. R-squared:                  0.184
Method:                 Least Squares   F-statistic:                     111.7
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           8.48e-24
Time:                        15:13:19   Log-Likelihood:                -359.38
No. Observations:                 526   AIC:                             722.8
Df Residuals:                     524   BIC:                             731.3
Df Model:                           1
Covariance Type:                  HC3
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.5838      0.099      5.872      0.000       0.389       0.779
educ           0.0827      0.008     10.569      0.000       0.067       0.098
==============================================================================
Omnibus:                       11.804   Durbin-Watson:                   1.801
Prob(Omnibus):                  0.003   Jarque-Bera (JB):               13.811
Skew:                           0.268   Prob(JB):                      0.00100
Kurtosis:                       3.586   Cond. No.                         60.2
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
- Scatter plot the data and add the regression line.
To plot the regression line you will need to create some x data and then apply the parameters. I used something like this
y = [p.Intercept + p.educ*i for i in x]
where p holds the parameters from my results and x holds the relevant values of the independent variable from the specification. If you've had linear algebra, this is matrix multiplication between a coefficient vector and the independent X matrix.
fig, ax = plt.subplots(figsize=(15,10))
# Plot the data
ax.scatter(wage1['educ'], wage1['lnwage'], marker='o', alpha = 0.5 )
# Create the line of best fit to plot
p = res.params # params from the model fit
x = range(0,20) # some x data
y = [p.Intercept + p.educ*i for i in x] # apply the coefficients
ax.plot(x,y, color='black')
# build the string
text = 'regression line: \nlog(w) = {0:.2} + {1:.2} educ'.format(p.Intercept, p.educ)
# place the annotation
ax.text(0.5, 2.5, text, fontsize=14)
ax.set_xlabel('education', fontsize=14)
ax.set_ylabel('log wage', fontsize=14)
sea.despine(ax=ax, trim=True)
plt.show()
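An alternative to building the fitted line by hand is the results object's predict method, which applies the estimated formula to new data. A sketch (new_data is a name I made up):
# Predicted log wages at each education level via the fitted formula
new_data = pd.DataFrame({'educ': range(0, 20)})
yhat = res.predict(new_data)
print(yhat.head())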
- Go back and add the text 'log(w) = a + b*educ' to the northwest corner of your plot. Replace a and b with the estimated parameter values.
# See above.
- Let's modify the regression to include education in logs: $$ \log(wage) = \beta_0 + \beta_1 \log(educ) + \epsilon$$
Estimate the model and interpret your results. You may have to drop the few observations where education is zero (or add one to all education values in the data) in order to generate nice output.
Hint: Totally differentiate our specification to get $\frac{dw}{w}=\beta_1\times\frac{de}{e}$. Rearrange such that $\beta_1 = \frac{dw}{de}\times \frac{e}{w}.$ What is the economic term for the RHS of that expression?
wage1= wage1[wage1['educ']>0]
res = smf.ols('np.log(wage) ~ np.log(educ)', data=wage1,missing='drop').fit(cov_type='HC3')
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:           np.log(wage)   R-squared:                       0.149
Model:                            OLS   Adj. R-squared:                  0.147
Method:                 Least Squares   F-statistic:                     53.97
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           7.89e-13
Time:                        15:13:20   Log-Likelihood:                -370.08
No. Observations:                 524   AIC:                             744.2
Df Residuals:                     522   BIC:                             752.7
Df Model:                           1
Covariance Type:                  HC3
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -0.4447      0.283     -1.572      0.116      -0.999       0.110
np.log(educ)     0.8252      0.112      7.346      0.000       0.605       1.045
==============================================================================
Omnibus:                       11.725   Durbin-Watson:                   1.796
Prob(Omnibus):                  0.003   Jarque-Bera (JB):               13.046
Skew:                           0.287   Prob(JB):                      0.00147
Kurtosis:                       3.517   Cond. No.                         29.6
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
The education **elasticity** of wage is 0.8252.
- Let's add some additional regressors (aka RHS variables, independent variables, covariates). Estimate $$ \log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + \beta_4I_m + \epsilon$$
where $I_m$ is a variable equal to 1 if the worker is married. Interpret your results.
res = smf.ols('np.log(wage) ~ educ + exper + np.power(exper,2) + C(married)', data=wage1).fit(cov_type='HC3')
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:           np.log(wage)   R-squared:                       0.314
Model:                            OLS   Adj. R-squared:                  0.309
Method:                 Least Squares   F-statistic:                     57.67
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           2.92e-40
Time:                        15:13:20   Log-Likelihood:                -313.47
No. Observations:                 524   AIC:                             636.9
Df Residuals:                     519   BIC:                             658.2
Df Model:                           4
Covariance Type:                  HC3
======================================================================================
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              0.0869      0.108      0.807      0.420      -0.124       0.298
C(married)[T.1.0]      0.1217      0.044      2.789      0.005       0.036       0.207
educ                   0.0921      0.008     11.693      0.000       0.077       0.108
exper                  0.0343      0.005      6.354      0.000       0.024       0.045
np.power(exper, 2)    -0.0006      0.000     -5.166      0.000      -0.001      -0.000
==============================================================================
Omnibus:                        5.733   Durbin-Watson:                   1.793
Prob(Omnibus):                  0.057   Jarque-Bera (JB):                7.698
Skew:                           0.046   Prob(JB):                       0.0213
Kurtosis:                       3.587   Cond. No.                     4.36e+03
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
[2] The condition number is large, 4.36e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Wages are higher for married workers and increasing in education and experience. The marginal benefit of experience is decreasing, however.
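As a quick check on that last point, the experience profile peaks where the marginal effect of experience is zero, i.e. at $exper^* = -\hat\beta_2/(2\hat\beta_3) \approx 0.0343/(2\times 0.0006) \approx 29$ years. A sketch computing it from the unrounded estimates (indexing res.params by the labels statsmodels reports):
# Experience level at which the estimated marginal return to experience hits zero
b = res.params
exper_star = -b['exper'] / (2*b['np.power(exper, 2)'])
print(f'Estimated peak of the experience profile: {exper_star:.1f} years')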