Lecture 20: Linear Regression via Ordinary Least Squares (OLS)
[MY SOLUTIONS]
This semester we've spent our time adding tools to your toolbox. While we have spent a lot of time looking at economic data, we haven't spent much time doing "economics." All of that work was a big wind-up to start using economic theory to make sense of data. There are two ways in which we'll do this:
Econometrics
Machine Learning
This notebook introduces us to econometrics and the statsmodels package (docs), which provides functions for formulating and estimating statistical models. Econometrics is not a prerequisite for this course, so we will spend some time motivating/proving the tools. If you have had econometrics, we will apply what you learned in that class here in Python.
The agenda for today's lecture is as follows:
Class Announcements
Exam 2 is due to eLC by end of day Thursday.
No class on Thursday.
Inference (Econometrics) vs. Prediction (Machine Learning)¶
Suppose we have data on a variable of interest $y$ and variables we think are important for determining $y$, $x_1, x_2,...x_p$. We assume that $y$ is related to the $x$ variables according to
$$y = f(X)+\epsilon$$where $X$ is the matrix of $x$ variables. The $f()$ function represents the systematic or structural relationship between $y$ and $X$ and $\epsilon$ is the noise or error term.
An important part of both machine learning and econometrics is estimating $f()$. Why we care about $f()$ is where machine learning and econometrics often (but not always) diverge.
- **Econometrics** prioritizes inference: The estimate of $f()$ is important in that it tells us about how $y$ and $X$ are related. An economist may be interested in understanding the (causal) relationship between interest rates (influenced directly by the FOMC) and the S&P 500 index (as a proxy for the economy). Inference is often associated with some kind of theoretical model in mind that lays out relationships between the $y$ and $X$ variables. The estimate of $f()$ helps us quantify and better understand these models.
- **Machine learning** prioritizes prediction: The estimate of $f()$ is only important in that it is useful for out-of-sample prediction. For example, a 'trained' ML model will be capable of predicting the value of the S&P 500 index tomorrow but the user is not inherently interested in why the S&P 500 index will change.
Recent work has focused on leveraging machine learning tools to better infer causality (econometrics) so the distinction between the two is becoming blurry.
Specifying f( )¶
How do we know which variables to include in $X$, or what form $f(\;)$ takes? In econometrics, we use a model to guide our choice of $f()$. This reflects the researcher's goal of inference. For example, if we're interested in understanding the causal relationship between interest rates and the S&P 500, interest rates had better be on the right-hand side of the regression. Plus, we'll want our estimates to reflect some deeper economic idea such as an elasticity. Doing so will enable us to port our results to other settings, including the large and more complicated equilibrium models of the economy used by the Federal Reserve to inform policy.
For this class, we will assume that $f(x)$ is linear so our assumed "data generating process" takes the following form:
$$y = X\beta+\epsilon$$where $X$ is our data and $\beta$ is unknown to us; we want to use the data to estimate it.
Identifying $\beta$: A Motivating Simple Example¶
Suppose we have $j=1,...,N$ observations of some outcome variable we're interested in ($y$) and $N$ corresponding observations of some variable $X$ which we think influences $y$. We want to know exactly how; i.e., we want to find $\beta$. The world is a complicated place, so we also allow for some random noise by including an error term ($\epsilon$). Our specification is therefore
$$y_j = X_j\beta+\epsilon_j$$How to find $\beta$?
Monte Carlo Simulation¶
Normally, we are presented with data, we assume a functional form for $f()$, and we solve for the $\beta$ such that our specification best represents the observed data. Our results could therefore be wrong if we specified the wrong $f()$. Let's ignore that possibility and generate our own data so we know the true Data Generating Process (DGP) and therefore the true value of $\beta$. This kind of approach is called Monte Carlo simulation. There is an optional lecture on the course website about how to properly construct random draws.
# Initialize packages
import numpy as np
import numpy.random as npr
import matplotlib.pyplot as plt
import seaborn as sea
import pandas as pd
# Set seed for random number generator to ensure the same "random" numbers are generated each time we run this code.
np.random.seed(10)
# Set values
beta = 1
N = 100
# Generate N random draws from an exponential distribution
X = npr.exponential(1,size=N)
# Generate N random draws from a normal distribution
eps = npr.normal(0,1,size=N)
# Determine Y
Y = X*beta + eps
# Create Figure
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(X, Y, alpha = 0.4, color = 'red')
ax.set_xlabel('Independent Variable (X)',size=18)
ax.set_ylabel('Dependent Variable (Y)',size=18)
ax.set_title(r'$Y_j = \beta X_j + \epsilon_j,\;\beta$='+str(beta),size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Let's pretend that we don't know the true $\beta$ is equal to one.
We want to find the $\beta$ which best fits the data. Since our assumed $f()$ is linear, this amounts to drawing a line through the above dots, beginning at the origin. We want this line to pass through as many of the dots as possible. In elementary school you would have just positioned your ruler at (0,0) and rotated it until you found a good-enough line. We're going to do the same thing here but with math.
Finding the 'best' line amounts to finding the line which minimizes the (squared) distance between the line and the red dots. Mathematically, this amounts to solving
$$\text{min}_{\beta}\sum_j \bigg[Y_j - \underbrace{\big(X_j\beta+\epsilon_j\big)}_{\text{Model}}\bigg]^2$$We square the distance because we're indifferent as to whether the line misses a red dot from above or from below. Also note that the above specification lends itself to the name of this regression: "Least Squares". The "Ordinary" refers to this approach being the starting point for regression models.
We can solve for $\beta$ using calculus; i.e., by taking the derivative with respect to $\beta$ and setting it equal to zero:
$$ \sum_j \bigg[Y_j - \big(X_j\beta+\epsilon_j\big)\bigg]\times X_j=0$$which is equivalent to
$$ \sum_j \big[Y_jX_j - X_jX_j\beta -X_j\epsilon_j\big]=0$$$$ \hookrightarrow \sum_j Y_jX_j -\sum_jX_j\epsilon_j=\beta\sum_jX_jX_j$$But we can do better yet. Remember, we've assumed in our specification that $\epsilon$ is random noise. Since it's random noise, it cannot be correlated with $X$, so the interaction between the two ($\sum_jX_j\epsilon_j$) is equal to zero in expectation (or, equivalently, when the number of observations is sufficiently large).
We therefore can derive the value of $\beta$ directly from the First-Order Condition:
$$\hat\beta=\frac{\sum_j Y_jX_j}{\sum_jX_jX_j}$$We can check this with our graph:
# Compute bhat = sum(YX)/sum(XX) and the implied fitted values
numerator = np.sum(Y*X)
denominator = np.sum(X*X)
bhat = numerator/denominator
Yhat = X*bhat
# Create Figure
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(X, Y, alpha = 0.4, color = 'red')
ax.plot(X, Yhat, color = 'blue')
ax.set_xlabel('Independent Variable (X)')
ax.set_ylabel('Dependent Variable (Y)')
ax.set_title(r'Proposed DGP with $\beta$='+str(beta),size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.annotate(fr'$\hat\beta=${bhat:.2f}', (2.2,3.5),size=16, color='blue')
plt.show()
Well, we did pretty well at recovering the true $\beta$, but we didn't get it exactly. Why not? Note that above I said the interaction between $X$ and $\epsilon$ is zero "in expectation" and then set it to zero. If that interaction isn't exactly zero in our sample, the calculated $\hat\beta$ will be off. Hence, what we have is just an estimate of the coefficient, and we would therefore like to understand the variance of this estimate to quantify our uncertainty and possibly to perform significance tests.
We can actually derive an explicit expression for the variance of our estimate -- our uncertainty stems from how noisy the errors are and how much variation in $X$ we have. From the first-order condition, $\hat\beta - \beta = \frac{\sum_j X_j\epsilon_j}{\sum_j X_j^2}$, so the variance is
$$V(\hat\beta)= V\bigg(\frac{\sum_j X_j\epsilon_j}{\sum_j X_j^2}\bigg) \;\hookrightarrow\; V(\hat\beta)= \frac{\sigma^2}{\sum_j X_j^2}$$where $\sigma^2$ is the variance of the error term. We estimate it with $\hat\sigma^2 = \frac{\sum_j\hat\epsilon_j^2}{N-p}$, where $p$ is the number of parameters estimated and $N-p$ is called the 'degrees of freedom.' We don't observe the true epsilons, but our estimated equation gives us residuals ($\hat\epsilon_j = Y_j - X_j\hat\beta$), so we use those. Hence, our estimated variance of the estimate is
$$\hat V(\hat\beta)= \frac{\sum_j\hat\epsilon_j^2/(N-p)}{\sum_j X_j^2} $$We define the standard error as the square root of $\hat V(\hat\beta)$, and we can easily compute it given our estimates. Moreover, we can test whether our estimate of $\beta$ is significantly different from a 'null hypothesis' value $\beta^0$ (typically zero, meaning no relationship exists) using a t-test:
$$t = \frac{\hat\beta-\beta^0}{\text{se}(\hat\beta)}\sim T_{N-p}$$where $T_{N-p}$ is the t-distribution with $N-p$ degrees of freedom. (BTW, the story behind 'Student's t-test' is a worthwhile read.) A good rule of thumb is that a t-statistic greater than 2 in absolute value lets us reject the null hypothesis at the 95% confidence level. In practice, we report the estimated standard errors, since that enables the reader to compute t-statistics for a variety of null hypotheses.
# Residuals from the fitted line
resid = Y-X*bhat
# Estimate of the error variance, using N-1 degrees of freedom (one parameter estimated)
s = np.sum(resid**2)/(N-1)
# Variance and standard error of bhat
var = s/(np.sum(X**2))
se = np.sqrt(var)
# Create Figure
fig, ax = plt.subplots(figsize=(10,5))
ax.scatter(X, Y, alpha = 0.4, color = 'red')
ax.plot(X, Yhat, color = 'blue')
ax.set_xlabel('Independent Variable (X)')
ax.set_ylabel('Dependent Variable (Y)')
ax.set_title(r'Proposed DGP with $\beta$='+str(beta),size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.annotate(fr'$\hat\beta=${bhat:.2f} ({se:.4f})', (1.6,3.5),size=16, color='blue')
plt.show()
From the graph, we can see that our simple linear model replicates the data well and -- since we know the true DGP -- also recovers the true value of $\beta$.
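As a quick check, we can also compute the t-statistic described above for the null hypothesis $\beta^0=0$, reusing the bhat and se we just computed (a small sketch, not part of the original derivation):
# t-statistic for the null hypothesis that beta equals zero
t_stat = (bhat - 0)/se
print(f't-statistic against the null beta0 = 0: {t_stat:.2f}')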
In our simple example we only included one variable and therefore only solved for one coefficient. Going forward, our data will be much more complicated (e.g., more independent X variables) and we'll want to account for the possibility that our dependent variable attains some value independent of the observed X variables in our data -- that is, we'll want to include a constant in our $f()$ specification. It turns out that doing all of this is pretty straightforward, and the intuition gained from our simple example extends easily.
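To preview the general case: stacking the regressors (plus a column of ones for the constant) into a matrix $X$, the least-squares estimate is $\hat\beta=(X'X)^{-1}X'y$. Here is a minimal numpy sketch of that formula using the simulated X and Y from above (the constant column is illustrative; its true value is zero in our DGP):
# Build a design matrix with a constant column plus our simulated regressor
X_mat = np.column_stack([np.ones(N), X])
# Solve the normal equations (X'X) beta = X'y for beta
beta_hat = np.linalg.solve(X_mat.T @ X_mat, X_mat.T @ Y)
print('Estimated [constant, slope]:', beta_hat)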
Practice ¶
- Increase N to $\{200,500,1000\}$. What happens to the estimated coefficient $(\hat\beta)$ and the estimated standard error $(\text{se}(\hat\beta))$?
# Set values
beta = 1
Nval = [100, 200, 500, 1000, 100000, 100000]
for n in Nval:
    # Generate n random draws from an exponential distribution
    X = npr.exponential(1,size=n)
    # Generate n random draws from a normal distribution
    eps = npr.normal(0,1,size=n)
    # Determine Y
    Y = X*beta + eps
    numerator = np.sum(Y*X)
    denominator = np.sum(X*X)
    print(f'Given N of {n:d}, the value of bhat is: {numerator/denominator:.4f}')
Given N of 100, the value of bhat is: 1.0222
Given N of 200, the value of bhat is: 0.9870
Given N of 500, the value of bhat is: 0.9854
Given N of 1000, the value of bhat is: 0.9786
Given N of 100000, the value of bhat is: 0.9995
Given N of 100000, the value of bhat is: 1.0013
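The question also asks about the standard errors; here is a sketch of the same loop that reports the estimated standard error as well, so you can see it shrink as $N$ grows:
for n in Nval:
    X = npr.exponential(1, size=n)
    eps = npr.normal(0, 1, size=n)
    Y = X*beta + eps
    bhat = np.sum(Y*X)/np.sum(X*X)
    resid = Y - X*bhat
    # Estimated standard error of bhat, with n-1 degrees of freedom
    se = np.sqrt((np.sum(resid**2)/(n-1))/np.sum(X**2))
    print(f'N = {n:d}: bhat = {bhat:.4f}, se = {se:.4f}')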
- Change the specification by changing $\beta$ to a value of your choice and re-run the code to find $\hat\beta$.
Nval = [100, 200, 500, 1000, 100000, 100000]
for n in Nval:
    # Generate n random draws from an exponential distribution
    X = npr.exponential(1,size=n)
    # Generate n random draws from a normal distribution
    eps = npr.normal(0,1,size=n)
    # Determine Y (true beta set to 1.5 this time)
    Y = X*1.5 + eps
    numerator = np.sum(Y*X)
    denominator = np.sum(X*X)
    print(f'Given N of {n:d}, the value of bhat is: {numerator/denominator:.4f}')
Given N of 100, the value of bhat is: 1.4653
Given N of 200, the value of bhat is: 1.4602
Given N of 500, the value of bhat is: 1.4470
Given N of 1000, the value of bhat is: 1.4435
Given N of 100000, the value of bhat is: 1.4994
Given N of 100000, the value of bhat is: 1.4965
Reading Stata data files¶
If you took econometrics, you probably used one of Wooldridge's textbooks. I figure you would like to relive those happy times, so we'll use some examples from his textbook.
On the plus side, the data that correspond to the Wooldridge Introductory Econometrics: A Modern Approach problems are available to download and they are ALREADY CLEANED. [I contemplated adding some junk to the files to make it more interesting...]
On the minus side, the files are in Stata's .dta format.
Lucky for us, pandas has a method that reads Stata files. It also has methods for csv, excel, json, SQL, SAS, ...
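For example, a minimal sketch with hypothetical file names (each reader returns a DataFrame):
# Hypothetical file names -- uncomment and point at real files to use these readers
# df = pd.read_csv('my_data.csv')        # comma-separated text
# df = pd.read_excel('my_data.xlsx')     # Excel workbook
# df = pd.read_json('my_data.json')      # JSON records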
We'll be looking at the connection between labor wages and sleep using a 1990 paper by Biddle and Hamermesh (link). The authors use time-use data from 12 countries to assess whether an increase in wage causes individuals to sacrifice sleep.
Take note of a couple things:
The "economic mechanism" is that people are time-constrained so when their wages go up, they sacrifice sleep to earn more. If we were to write down a consumer optimization problem, utility would be increasing in sleep and consumption; ie, $$ max_{s,c} \,U(\underbrace{T-n}_{\text{sleep}},c)\;\;\text{s.t.}\; T \geq n + s, \underbrace{w\times n}_{\text{income}}\geq \underbrace{p\times c}_{\substack{\text{Money spent}\\ \text{on cons.}}} $$
They want to show "causation" but all we'll be able to show with OLS is correlation.
Let's load the underlying data:
# Use pandas read_stata method to get the stata formatted data file into a DataFrame.
sleep = pd.read_stata('./Data/sleep75.dta')
# Take a look...so clean!
sleep.head()
age | black | case | clerical | construc | educ | earns74 | gdhlth | inlf | leis1 | ... | spwrk75 | totwrk | union | worknrm | workscnd | exper | yngkid | yrsmarr | hrwage | agesq | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 32.0 | 0.0 | 1.0 | 0.0 | 0.0 | 12.0 | 0.0 | 0.0 | 1.0 | 3529.0 | ... | 0.0 | 3438.0 | 0.0 | 3438.0 | 0.0 | 14.0 | 0.0 | 13.0 | 7.070004 | 1024.0 |
1 | 31.0 | 0.0 | 2.0 | 0.0 | 0.0 | 14.0 | 9500.0 | 1.0 | 1.0 | 2140.0 | ... | 0.0 | 5020.0 | 0.0 | 5020.0 | 0.0 | 11.0 | 0.0 | 0.0 | 1.429999 | 961.0 |
2 | 44.0 | 0.0 | 3.0 | 0.0 | 0.0 | 17.0 | 42500.0 | 1.0 | 1.0 | 4595.0 | ... | 1.0 | 2815.0 | 0.0 | 2815.0 | 0.0 | 21.0 | 0.0 | 0.0 | 20.530001 | 1936.0 |
3 | 30.0 | 0.0 | 4.0 | 0.0 | 0.0 | 12.0 | 42500.0 | 1.0 | 1.0 | 3211.0 | ... | 1.0 | 3786.0 | 0.0 | 3786.0 | 0.0 | 12.0 | 0.0 | 12.0 | 9.619998 | 900.0 |
4 | 64.0 | 0.0 | 5.0 | 0.0 | 0.0 | 14.0 | 2500.0 | 1.0 | 1.0 | 4052.0 | ... | 1.0 | 2580.0 | 0.0 | 2580.0 | 0.0 | 44.0 | 0.0 | 33.0 | 2.750000 | 4096.0 |
5 rows × 34 columns
Here is some additional information about the variables:
Variable | Description | Variable | Description | Variable | Description | Variable | Description |
---|---|---|---|---|---|---|---|
age | in years | leis1 | sleep - totwrk | rlxall | slpnaps + personal activs | worknrm | mins work main job |
black | =1 if black | leis2 | slpnaps - totwrk | selfe | =1 if self employed | workscnd | mins work second job |
case | identifier | leis3 | rlxall - totwrk | sleep | mins sleep at night, per wk | exper | age - educ - 6 |
clerical | =1 if clerical worker | smsa | =1 if live in smsa | slpnaps | minutes sleep, inc. naps | yngkid | =1 if children < 3 present |
construc | =1 if construction worker | lhrwage | log hourly wage | south | =1 if live in south | yrsmarr | years married |
educ | years of schooling | lothinc | log othinc, unless othinc < 0 | spsepay | spousal wage income | hrwage | hourly wage |
earns74 | total earnings, 1974 | male | =1 if male | spwrk75 | =1 if spouse works | agesq | age^2 |
gdhlth | =1 if in good or excel. health | marr | =1 if married | totwrk | mins worked per week | ||
inlf | =1 if in labor force | prot | =1 if Protestant | union | =1 if belong to union |
Describing models with patsy¶
The patsy package provides us with a formula syntax for defining models using strings. The basic syntax is
y ~ x1 + x2
which describes the model
$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \epsilon$.
Notice that I did not specify the constant. Patsy takes care of that automatically. Let's start slow and build up the regression, then we will see how to do it all in one shot. You'll have to install patsy with pip:
pip install patsy
We start by passing our model, specified as a patsy string, to patsy.dmatrices() to create the design matrices. Our model is
$$ sleep = \beta_0 + \beta_1 totwrk + \beta_2 educ + \beta_3 age + \epsilon. $$
NB: This is Wooldridge problem 3, chapter 3.
import patsy # provides a syntax for specifying models
# Pass the model formula and the associated data to create design matrices
y, X = patsy.dmatrices('sleep ~ totwrk + educ + age', sleep)
# What do we have?
print('X and y are of type:' , type(X), type(y))
X and y are of type: <class 'patsy.design_info.DesignMatrix'> <class 'patsy.design_info.DesignMatrix'>
X
DesignMatrix with shape (706, 4)
  Intercept  totwrk  educ  age
          1    3438    12   32
          1    5020    14   31
          1    2815    17   44
          1    3786    12   30
          1    2580    14   64
          1    1205    12   41
          1    2113    12   35
          1    3608    13   47
          1    2353    17   32
          1    2851    15   30
          1    6415     8   43
          1     370    16   23
          1    2438    16   24
          1    2693     5   48
          1    2526    12   33
          1    2950    12   23
          1    3003    17   46
          1    4011    14   37
          1    2300    12   53
          1    1543    17   45
          1    3473    17   46
          1    3276    13   40
          1    2506    12   53
          1    2651    13   29
          1    4580    12   29
          1    3588    12   53
          1    3418    13   28
          1    2250    12   35
          1    2638    12   36
          1    3173    12   59
  [676 rows omitted]
  Terms:
    'Intercept' (column 0)
    'totwrk' (column 1)
    'educ' (column 2)
    'age' (column 3)
  (to view full data, use np.asarray(this_obj))
y
DesignMatrix with shape (706, 1)
  sleep
   3113
   2920
   2670
   3083
   3448
   4063
   3180
   2928
   3368
   3018
   1575
   3295
   3798
   3008
   3248
   3683
   3201
   2580
   3420
   3090
   2760
   2880
   3470
   2673
   2820
   2873
   1905
   2926
   2603
   3238
  [676 rows omitted]
  Terms:
    'sleep' (column 0)
  (to view full data, use np.asarray(this_obj))
So X holds the independent variables and y holds the dependent variable. Note the addition of the intercept (the column of ones) to the X matrix.
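If you want to work with the design matrices directly, they convert cleanly to plain numpy arrays (as the printout above suggests):
# Convert the patsy design matrices to numpy arrays for inspection
X_arr = np.asarray(X)
y_arr = np.asarray(y)
print(X_arr[:5, :])   # first five rows: Intercept, totwrk, educ, age
print(y_arr[:5, 0])   # first five values of sleep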
Building and estimating the model in statsmodels¶
With the design matrices in hand, we can build an ordinary least squares (OLS) model in the Python package statsmodels, which you can install via
pip install statsmodels
import statsmodels.api as sm # provides statistical models like ols, gmm, anova, etc...
# Pass the design matrices to OLS to specify an OLS model
sleep_model = sm.OLS(y, X)
type(sleep_model)
statsmodels.regression.linear_model.OLS
We can now estimate the model using the .fit() method of the statsmodels OLS object.
res = sleep_model.fit() # Estimate the model and store the results in res
# To see the summary report
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sleep   R-squared:                       0.113
Model:                            OLS   Adj. R-squared:                  0.110
Method:                 Least Squares   F-statistic:                     29.92
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           3.28e-18
Time:                        15:13:19   Log-Likelihood:                -5263.1
No. Observations:                 706   AIC:                         1.053e+04
Df Residuals:                     702   BIC:                         1.055e+04
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   3638.2453    112.275     32.405      0.000    3417.810    3858.681
totwrk        -0.1484      0.017     -8.888      0.000      -0.181      -0.116
educ         -11.1338      5.885     -1.892      0.059     -22.687       0.420
age            2.1999      1.446      1.522      0.129      -0.639       5.038
==============================================================================
Omnibus:                       68.731   Durbin-Watson:                   1.943
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              185.551
Skew:                          -0.496   Prob(JB):                     5.11e-41
Kurtosis:                       5.308   Cond. No.                     1.66e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.66e+04. This might indicate that there are strong multicollinearity or other numerical problems.
The more you work, the less you sleep. Or is it the other way around? I feel better already.
We can retrieve other results too.
print('The estimated coefficients are:', res.params, '\n')
print('The confidence intervals are:\n', res.conf_int(), '\n')
print(f'The r-squared is: {res.rsquared:.4f}')
The estimated coefficients are: [ 3.63824531e+03 -1.48373438e-01 -1.11338131e+01  2.19988481e+00]

The confidence intervals are:
 [[ 3.41781008e+03  3.85868055e+03]
  [-1.81148684e-01 -1.15598192e-01]
  [-2.26872877e+01  4.19661463e-01]
  [-6.38561308e-01  5.03833093e+00]]

The r-squared is: 0.1134
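The results object exposes many other quantities as well; a few examples (attribute names from the statsmodels results API):
print('Standard errors:', res.bse, '\n')         # standard errors of the coefficients
print('t-statistics:', res.tvalues, '\n')        # coefficients divided by their standard errors
print('p-values:', res.pvalues, '\n')            # two-sided p-values against a null of zero
print('Number of observations:', res.nobs)       # sample size used in the fit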
Directly specifying and estimating models with the formula.api¶
We have built our model from the ground up:
- Create the design matrices with patsy
- Create the model with statsmodels
- Fit the model and obtain results
That was helpful to get a sense of what is going on, but we can do all those steps in one line of code. We just pass the patsy string and the data directly to statsmodels and call fit.
To do this, we use the statsmodels.formula.api module. The formula API interprets the patsy formula for us and creates the design matrices.
import statsmodels.formula.api as smf # provides a way to directly spec models from formulas
res = smf.ols('sleep ~ totwrk + educ + age', data=sleep).fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sleep   R-squared:                       0.113
Model:                            OLS   Adj. R-squared:                  0.110
Method:                 Least Squares   F-statistic:                     29.92
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           3.28e-18
Time:                        15:13:19   Log-Likelihood:                -5263.1
No. Observations:                 706   AIC:                         1.053e+04
Df Residuals:                     702   BIC:                         1.055e+04
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   3638.2453    112.275     32.405      0.000    3417.810    3858.681
totwrk        -0.1484      0.017     -8.888      0.000      -0.181      -0.116
educ         -11.1338      5.885     -1.892      0.059     -22.687       0.420
age            2.1999      1.446      1.522      0.129      -0.639       5.038
==============================================================================
Omnibus:                       68.731   Durbin-Watson:                   1.943
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              185.551
Skew:                          -0.496   Prob(JB):                     5.11e-41
Kurtosis:                       5.308   Cond. No.                     1.66e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.66e+04. This might indicate that there are strong multicollinearity or other numerical problems.
Data transformations¶
Patsy can handle common (and less common) regression tasks. Here are a few
Interacting variables¶
Use '*' to interact two variables. Patsy will include the two variables individually as well. Below, we interact education and age:
$$ sleep = \beta_0 + \beta_1 totwrk + \beta_2 educ + \beta_3 age + \beta_4 age\times educ + \epsilon $$.
res = smf.ols('sleep ~ totwrk + educ*age', data=sleep).fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sleep   R-squared:                       0.119
Model:                            OLS   Adj. R-squared:                  0.114
Method:                 Least Squares   F-statistic:                     23.67
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           2.24e-18
Time:                        15:13:19   Log-Likelihood:                -5260.9
No. Observations:                 706   AIC:                         1.053e+04
Df Residuals:                     701   BIC:                         1.055e+04
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   3081.4242    286.426     10.758      0.000    2519.069    3643.780
totwrk        -0.1444      0.017     -8.618      0.000      -0.177      -0.112
educ          31.5246     21.032      1.499      0.134      -9.769      72.818
age           14.9293      6.197      2.409      0.016       2.763      27.096
educ:age      -1.0065      0.477     -2.112      0.035      -1.942      -0.071
==============================================================================
Omnibus:                       66.086   Durbin-Watson:                   1.933
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              178.298
Skew:                          -0.475   Prob(JB):                     1.92e-39
Kurtosis:                       5.271   Cond. No.                     4.32e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 4.32e+04. This might indicate that there are strong multicollinearity or other numerical problems.
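A related patsy detail: '*' expands to both main effects plus the interaction, while ':' includes only the interaction term itself. A sketch (this changes the specification, so the coefficients are not directly comparable to the ones above):
# 'educ:age' adds only the interaction term, without educ and age on their own
res_int = smf.ols('sleep ~ totwrk + educ:age', data=sleep).fit()
print(res_int.params)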
Using logs¶
We use the np.log() method directly in the patsy syntax. Note that we loaded the numpy package above as np. Below, we use the logarithm of age in the model
$$ sleep = \beta_0 + \beta_1 totwrk + \beta_2 educ + \beta_3 \log(age) + \epsilon. $$
res = smf.ols('sleep ~ totwrk + educ + np.log(age)', data=sleep).fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sleep   R-squared:                       0.113
Model:                            OLS   Adj. R-squared:                  0.109
Method:                 Least Squares   F-statistic:                     29.79
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           3.91e-18
Time:                        15:13:19   Log-Likelihood:                -5263.3
No. Observations:                 706   AIC:                         1.053e+04
Df Residuals:                     702   BIC:                         1.055e+04
Df Model:                           3
Covariance Type:            nonrobust
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept    3440.9308    239.448     14.370      0.000    2970.811    3911.050
totwrk         -0.1489      0.017     -8.924      0.000      -0.182      -0.116
educ          -11.3493      5.881     -1.930      0.054     -22.896       0.197
np.log(age)    79.2414     56.612      1.400      0.162     -31.908     190.391
==============================================================================
Omnibus:                       68.734   Durbin-Watson:                   1.944
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              184.854
Skew:                          -0.497   Prob(JB):                     7.24e-41
Kurtosis:                       5.301   Cond. No.                     3.61e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 3.61e+04. This might indicate that there are strong multicollinearity or other numerical problems.
Fixed effects¶
When I was a kid, we called these dummy variables. Gender is coded {0,1} in the variable male. Let's suppose it was coded as a string.
sleep['gender'] = sleep['male'].replace({1.0:'male', 0.0:'female'})
res = smf.ols('sleep ~ totwrk + educ + age + C(gender)', data=sleep).fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sleep   R-squared:                       0.122
Model:                            OLS   Adj. R-squared:                  0.117
Method:                 Least Squares   F-statistic:                     24.26
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           8.02e-19
Time:                        15:13:19   Log-Likelihood:                -5259.8
No. Observations:                 706   AIC:                         1.053e+04
Df Residuals:                     701   BIC:                         1.055e+04
Df Model:                           4
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
Intercept          3642.4666    111.844     32.567      0.000    3422.877    3862.056
C(gender)[T.male]    87.9933     34.323      2.564      0.011      20.604     155.382
totwrk               -0.1658      0.018     -9.230      0.000      -0.201      -0.131
educ                -11.7561      5.866     -2.004      0.045     -23.274      -0.238
age                   1.9643      1.443      1.361      0.174      -0.869       4.797
==============================================================================
Omnibus:                       65.308   Durbin-Watson:                   1.942
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              174.107
Skew:                          -0.473   Prob(JB):                     1.56e-38
Kurtosis:                       5.241   Cond. No.                     1.66e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.66e+04. This might indicate that there are strong multicollinearity or other numerical problems.
Including a fixed effect for gender means that we interpret the coefficient on the dummy as relative to the reference group (here, women); ie, since the coefficient is positive, men sleep more than women, ceteris paribus. Equivalently, the intercept is the value of the female fixed effect.
No intercept¶
Use -1 to kill the automatic intercept and force the regression to go through the origin (as in our simple example).
res = smf.ols('sleep ~ totwrk + educ + age + C(gender) -1', data=sleep).fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                  sleep   R-squared:                       0.122
Model:                            OLS   Adj. R-squared:                  0.117
Method:                 Least Squares   F-statistic:                     24.26
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           8.02e-19
Time:                        15:13:19   Log-Likelihood:                -5259.8
No. Observations:                 706   AIC:                         1.053e+04
Df Residuals:                     701   BIC:                         1.055e+04
Df Model:                           4
Covariance Type:            nonrobust
=====================================================================================
                        coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------------
C(gender)[female]  3642.4666    111.844     32.567      0.000    3422.877    3862.056
C(gender)[male]    3730.4598    117.475     31.755      0.000    3499.816    3961.104
totwrk               -0.1658      0.018     -9.230      0.000      -0.201      -0.131
educ                -11.7561      5.866     -2.004      0.045     -23.274      -0.238
age                   1.9643      1.443      1.361      0.174      -0.869       4.797
==============================================================================
Omnibus:                       65.308   Durbin-Watson:                   1.942
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              174.107
Skew:                          -0.473   Prob(JB):                     1.56e-38
Kurtosis:                       5.241   Cond. No.                     2.37e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.37e+04. This might indicate that there are strong multicollinearity or other numerical problems.
Practice ¶
Wooldridge problem C2 in chapter 6.
- Load wage1.dta
- Generate a scatter plot of log(wage) on education.
wage1 = pd.read_stata('./Data/wage1.dta')
# There is already a log(wage) variable in the data set (lwage) but we can always use some practice with transforms
wage1['lnwage'] = wage1['wage'].apply(np.log)
wage1.head()
wage | educ | exper | tenure | nonwhite | female | married | numdep | smsa | northcen | ... | trade | services | profserv | profocc | clerocc | servocc | lwage | expersq | tenursq | lnwage | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 3.10 | 11.0 | 2.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.131402 | 4.0 | 0.0 | 1.131402 |
1 | 3.24 | 12.0 | 22.0 | 2.0 | 0.0 | 1.0 | 1.0 | 3.0 | 1.0 | 0.0 | ... | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.175573 | 484.0 | 4.0 | 1.175573 |
2 | 3.00 | 11.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.098612 | 4.0 | 0.0 | 1.098612 |
3 | 6.00 | 8.0 | 44.0 | 28.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.791759 | 1936.0 | 784.0 | 1.791759 |
4 | 5.30 | 12.0 | 7.0 | 2.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.667707 | 49.0 | 4.0 | 1.667707 |
5 rows × 25 columns
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(wage1['educ'], wage1['lnwage'], marker='o', alpha = 0.5 )
ax.set_xlabel('education')
ax.set_ylabel('log wage')
sea.despine(ax=ax, trim=True)
- What's the relationship between education and log-wage? Estimate
$$ \log(wage) = \beta_0 + \beta_1 educ + \epsilon$$
Print your estimated coefficient on education. Is it statistically different from zero? What's the interpretation?
res = smf.ols('np.log(wage) ~ educ', data=wage1).fit()
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:           np.log(wage)   R-squared:                       0.186
Model:                            OLS   Adj. R-squared:                  0.184
Method:                 Least Squares   F-statistic:                     119.6
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           3.27e-25
Time:                        15:13:19   Log-Likelihood:                -359.38
No. Observations:                 526   AIC:                             722.8
Df Residuals:                     524   BIC:                             731.3
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.5838      0.097      5.998      0.000       0.393       0.775
educ           0.0827      0.008     10.935      0.000       0.068       0.098
==============================================================================
Omnibus:                       11.804   Durbin-Watson:                   1.801
Prob(Omnibus):                  0.003   Jarque-Bera (JB):               13.811
Skew:                           0.268   Prob(JB):                      0.00100
Kurtosis:                       3.586   Cond. No.                         60.2
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
- Scatter plot the residuals against education. The residuals are in the results object returned by the fit method.
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(wage1['educ'], res.resid, marker='o', alpha = 0.5 )
ax.set_xlabel('education')
ax.set_ylabel('residual')
sea.despine(ax=ax, trim=True)
Looks heteroskedastic. That's a word to impress your friends. It means the errors don't look like they're all drawn from a single distribution with the same variance. They're still mean-zero noise, so our math and our $\beta$ point estimates are valid -- the errors may just be coming from a different distribution depending on whether someone has high or low education. The consequence of heteroskedasticity is that our standard errors and the corresponding hypothesis tests are wrong.
This is actually an easy fix, both mathematically and programmatically: we can change the covariance matrix type (which will correct the standard error calculation) using the cov_type parameter (docs). The available types of covariance matrices are described in the docs.
- Redo the estimation allowing for a more flexible variance-covariance matrix. Try 'HC3' for your covariance matrix type.
res = smf.ols('np.log(wage) ~ educ', data=wage1).fit(cov_type='HC3')
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:           np.log(wage)   R-squared:                       0.186
Model:                            OLS   Adj. R-squared:                  0.184
Method:                 Least Squares   F-statistic:                     111.7
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           8.48e-24
Time:                        15:13:19   Log-Likelihood:                -359.38
No. Observations:                 526   AIC:                             722.8
Df Residuals:                     524   BIC:                             731.3
Df Model:                           1
Covariance Type:                  HC3
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.5838      0.099      5.872      0.000       0.389       0.779
educ           0.0827      0.008     10.569      0.000       0.067       0.098
==============================================================================
Omnibus:                       11.804   Durbin-Watson:                   1.801
Prob(Omnibus):                  0.003   Jarque-Bera (JB):               13.811
Skew:                           0.268   Prob(JB):                      0.00100
Kurtosis:                       3.586   Cond. No.                         60.2
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
- Scatter plot the data and add the regression line.
To plot the regression line you will need to create some x data and then apply the parameters. I used something like this
y = [p.Intercept + p.educ*i for i in x]
where p holds the parameters from my results and x holds the relevant values of the independent variable from the specification. If you've had linear algebra, this is matrix multiplication between a coefficient vector and the independent X matrix.
fig, ax = plt.subplots(figsize=(15,10))
# Plot the data
ax.scatter(wage1['educ'], wage1['lnwage'], marker='o', alpha = 0.5 )
# Create the line of best fit to plot
p = res.params # params from the model fit
x = range(0,20) # some x data
y = [p.Intercept + p.educ*i for i in x] # apply the coefficients
ax.plot(x,y, color='black')
# build the string
text = 'regression line: \nlog(w) = {0:.2} + {1:.2} educ'.format(p.Intercept, p.educ)
# place the annotation
ax.text(0.5, 2.5, text, fontsize=14)
ax.set_xlabel('education', fontsize=14)
ax.set_ylabel('log wage', fontsize=14)
sea.despine(ax=ax, trim=True)
plt.show()
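An alternative to building the fitted line by hand is the results object's predict method, which applies the estimated formula to new data. A sketch (new_data is a name I made up):
# Predicted log wages at each education level via the fitted formula
new_data = pd.DataFrame({'educ': range(0, 20)})
yhat = res.predict(new_data)
print(yhat.head())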
- Go back and add the text 'log(w) = a + b*educ' to the northwest corner of your plot. Replace a and b with the estimated parameter values.
# See above.
- Let's modify the regression to include education in logs: $$ \log(wage) = \beta_0 + \beta_1 \log(educ) + \epsilon$$
Estimate the model and interpret your results. You may have to drop the few observations where education is zero (or add one to all education values in the data) in order to generate nice output.
Hint: Totally differentiate our specification to get $\frac{dw}{w}=\beta_1\times\frac{de}{e}$. Rearrange such that $\beta_1 = \frac{dw}{de}\times \frac{e}{w}.$ What is the economic term for the RHS of that expression?
wage1= wage1[wage1['educ']>0]
res = smf.ols('np.log(wage) ~ np.log(educ)', data=wage1,missing='drop').fit(cov_type='HC3')
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:           np.log(wage)   R-squared:                       0.149
Model:                            OLS   Adj. R-squared:                  0.147
Method:                 Least Squares   F-statistic:                     53.97
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           7.89e-13
Time:                        15:13:20   Log-Likelihood:                -370.08
No. Observations:                 524   AIC:                             744.2
Df Residuals:                     522   BIC:                             752.7
Df Model:                           1
Covariance Type:                  HC3
================================================================================
                   coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------
Intercept       -0.4447      0.283     -1.572      0.116      -0.999       0.110
np.log(educ)     0.8252      0.112      7.346      0.000       0.605       1.045
==============================================================================
Omnibus:                       11.725   Durbin-Watson:                   1.796
Prob(Omnibus):                  0.003   Jarque-Bera (JB):               13.046
Skew:                           0.287   Prob(JB):                      0.00147
Kurtosis:                       3.517   Cond. No.                         29.6
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
The education **elasticity** of wage is 0.8252.
- Let's add some additional regressors (aka RHS variables, independent variables, covariates). Estimate $$ \log(wage) = \beta_0 + \beta_1 educ + \beta_2 exper + \beta_3 exper^2 + \beta_4I_m + \epsilon$$
where $I_m$ is a variable equal to 1 if the worker is married. Interpret your results.
res = smf.ols('np.log(wage) ~ educ + exper + np.power(exper,2) + C(married)', data=wage1).fit(cov_type='HC3')
print(res.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:           np.log(wage)   R-squared:                       0.314
Model:                            OLS   Adj. R-squared:                  0.309
Method:                 Least Squares   F-statistic:                     57.67
Date:                Tue, 01 Nov 2022   Prob (F-statistic):           2.92e-40
Time:                        15:13:20   Log-Likelihood:                -313.47
No. Observations:                 524   AIC:                             636.9
Df Residuals:                     519   BIC:                             658.2
Df Model:                           4
Covariance Type:                  HC3
======================================================================================
                         coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept              0.0869      0.108      0.807      0.420      -0.124       0.298
C(married)[T.1.0]      0.1217      0.044      2.789      0.005       0.036       0.207
educ                   0.0921      0.008     11.693      0.000       0.077       0.108
exper                  0.0343      0.005      6.354      0.000       0.024       0.045
np.power(exper, 2)    -0.0006      0.000     -5.166      0.000      -0.001      -0.000
==============================================================================
Omnibus:                        5.733   Durbin-Watson:                   1.793
Prob(Omnibus):                  0.057   Jarque-Bera (JB):                7.698
Skew:                           0.046   Prob(JB):                       0.0213
Kurtosis:                       3.587   Cond. No.                     4.36e+03
==============================================================================

Notes:
[1] Standard Errors are heteroscedasticity robust (HC3)
[2] The condition number is large, 4.36e+03. This might indicate that there are strong multicollinearity or other numerical problems.
Wages are higher for married workers and increasing in education and experience. The marginal benefit of experience is decreasing, however.
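As a quick check on that last point, the experience profile peaks where the marginal effect of experience is zero, i.e. at $exper^* = -\hat\beta_2/(2\hat\beta_3) \approx 0.0343/(2\times 0.0006) \approx 29$ years. A sketch computing it from the unrounded estimates (indexing res.params by the labels statsmodels reports):
# Experience level at which the estimated marginal return to experience hits zero
b = res.params
exper_star = -b['exper'] / (2*b['np.power(exper, 2)'])
print(f'Estimated peak of the experience profile: {exper_star:.1f} years')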