Lecture 21: OLS Regression with Discrete Dependent Variables
[MY SOLUTIONS]
We continue to learn about the statsmodels package (docs), which provides functions for formulating and estimating statistical models. In this notebook we take on models in which the dependent variable is discrete. In the examples below, the dependent variable is binary (which makes it easier to visualize). At the end of the lecture, we extend the analysis to dependent variables with many discrete values.
Here is a nice overview of the discrete choice models in statsmodels.
The agenda for today's lecture is as follows:
Class Announcements
None.
1. Math Primer (top)
So far we've been dealing with continuous dependent (i.e., LHS) variables such as hours worked. Many of the outcomes we observe and are interested in are not continuous, however. For example, labor force participation in the United States is roughly 65%, so the choice of whether or not to work appears to be a significant one.
Suppose our dependent variable Y is binary (i.e., zero or one). For example, Y may represent the presence/absence of a certain condition, the success/failure of some device, a yes/no answer on a survey, etc. We also have a vector of regressors X which we think influence Y. As before, suppose we also have an error term $\epsilon$ drawn from some distribution. Define $Y^*$ as a latent (i.e., unobservable) variable where
$$ Y^* = X\beta + \epsilon $$
and we think $Y=1$ whenever $Y^*>0$, or equivalently whenever $X\beta + \epsilon>0$. Define $P(Y=1|X)$ as the 'probability Y is equal to one conditional on the variables X.' It follows then that
$$ P(Y=1|X) = P(Y^* > 0) = P(X\beta + \epsilon > 0) = P(\epsilon > -X\beta) = P(\epsilon < X\beta), $$
where the last equality uses the symmetry of the error distribution around zero.
2. Probit Regression (top)
Note that $P(\epsilon < X\beta)$ is just the CDF of $\epsilon$ evaluated at $X\beta$. Suppose we specify that $\epsilon$ is drawn iid from a standard Normal distribution. With this added assumption, we can do a lot more:
$$ P(Y=1|X) = \Phi(X\beta), $$
where $\Phi(\cdot)$ is the CDF of the standard Normal distribution. The likelihood of a single observation ($Y_j=1$ or $Y_j=0$) is therefore
$$ \mathcal{L}(\beta; y_j, x_j) = \Phi(x_j\beta)^{y_j}\times(1-\Phi(x_j\beta))^{1-y_j}. $$
The first part ($\Phi(x_j\beta)^{y_j}$) gets turned on when $y_j=1$, while the second part gets turned on when $y_j=0$. We can therefore solve for the $\beta$ vector that best matches the data by maximizing the 'likelihood' of the whole sample; i.e.,
$$ \mathcal{L}(\beta; Y, X) = \prod_{j=1}^J \Phi(x_j\beta)^{y_j}\times(1-\Phi(x_j\beta))^{1-y_j}. $$
This looks complicated, but it's really just an optimization problem, in the same spirit as the least-squares minimization behind OLS.
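To see the mechanics, here is a minimal sketch of maximizing this likelihood numerically with scipy. The arrays y and X are hypothetical stand-ins for data (with X including a column of ones for the intercept); statsmodels does all of this for us below.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_loglike(beta, y, X):
    # negative log-likelihood: -sum of y*log(Phi(Xb)) + (1-y)*log(1-Phi(Xb))
    xb = X @ beta
    return -np.sum(y*norm.logcdf(xb) + (1-y)*norm.logcdf(-xb))  # logcdf(-xb) = log(1-Phi(xb))

# hypothetical usage once y and X are numpy arrays:
# beta_hat = minimize(neg_loglike, x0=np.zeros(X.shape[1]), args=(y, X)).x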
An Example: Gambling
When we're talking probability, there is no better example than gambling. Actually, gambling is the source (inspiration?) of a lot of the probability theory we have today. Since relativity and quantum mechanics use probability heavily, let's attribute those to gambling too.
The file 'pntsprd.dta' contains data about Las Vegas sports betting. The complete variable list is here. We will use favwin, which is equal to 1 if the favored team won and zero otherwise, and spread, which holds the betting spread. In this context, the spread is the number of points by which the favored team must beat the underdog for a bet on the favorite to count as a win (i.e., to 'cover the spread').
import pandas as pd # for data handling
import numpy as np # for numerical methods and data structures
import matplotlib.pyplot as plt # for plotting
import seaborn as sea # advanced plotting
import statsmodels.formula.api as smf # provides a way to directly spec models from formulas
# Use pandas read_stata method to get the stata formatted data file into a DataFrame.
vegas = pd.read_stata('./Data/pntsprd.dta')
# Take a look...so clean!
vegas.head()
 | favscr | undscr | spread | favhome | neutral | fav25 | und25 | fregion | uregion | scrdiff | sprdcvr | favwin |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 72.0 | 61.0 | 7.0 | 0.0 | 0.0 | 1.0 | 0.0 | 3.0 | 4.0 | 11.0 | 1.0 | 1.0 |
1 | 82.0 | 74.0 | 7.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1.0 | 8.0 | 1.0 | 1.0 |
2 | 87.0 | 57.0 | 17.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 | 3.0 | 30.0 | 1.0 | 1.0 |
3 | 69.0 | 70.0 | 9.0 | 1.0 | 0.0 | 0.0 | 0.0 | 3.0 | 3.0 | -1.0 | 0.0 | 0.0 |
4 | 77.0 | 79.0 | 2.5 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 3.0 | -2.0 | 0.0 | 0.0 |
vegas.info()
<class 'pandas.core.frame.DataFrame'> Int64Index: 553 entries, 0 to 552 Data columns (total 12 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 favscr 553 non-null float32 1 undscr 553 non-null float32 2 spread 553 non-null float32 3 favhome 553 non-null float32 4 neutral 553 non-null float32 5 fav25 553 non-null float32 6 und25 553 non-null float32 7 fregion 553 non-null float32 8 uregion 553 non-null float32 9 scrdiff 553 non-null float32 10 sprdcvr 553 non-null float32 11 favwin 553 non-null float32 dtypes: float32(12) memory usage: 30.2 KB
fig, ax = plt.subplots(figsize=(15,6))
ax.scatter( vegas['spread'], vegas['favwin'], facecolors='none', edgecolors='red')
ax.set_ylabel('favored team outcome (win = 1, loss = 0)')
ax.set_xlabel('point spread')
ax.set_title('The data from the point spread dataset')
sea.despine(ax=ax)
Estimation
We begin with the linear probability model. The model is
$$ favwin = \beta_0 + \beta_1 spread + \epsilon, \qquad \text{so that} \qquad \text{Pr}(favwin=1 \mid spread) = \beta_0 + \beta_1 spread. $$
There is nothing new here technique-wise. Let's start with OLS, which amounts to pretending the Y variable is continuous.
# statsmodels adds a constant for us...
res_ols = smf.ols('favwin ~ spread', data=vegas).fit()
print(res_ols.summary())
OLS Regression Results ============================================================================== Dep. Variable: favwin R-squared: 0.111 Model: OLS Adj. R-squared: 0.109 Method: Least Squares F-statistic: 68.57 Date: Tue, 08 Nov 2022 Prob (F-statistic): 9.32e-16 Time: 09:07:16 Log-Likelihood: -279.29 No. Observations: 553 AIC: 562.6 Df Residuals: 551 BIC: 571.2 Df Model: 1 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 0.5769 0.028 20.434 0.000 0.521 0.632 spread 0.0194 0.002 8.281 0.000 0.015 0.024 ============================================================================== Omnibus: 86.055 Durbin-Watson: 2.112 Prob(Omnibus): 0.000 Jarque-Bera (JB): 94.402 Skew: -0.956 Prob(JB): 3.17e-21 Kurtosis: 2.336 Cond. No. 20.0 ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
Hypothesis testing with t-test
If bookies were all-knowing, the spread would exactly account for the predictable part of the winning probability and all that would be left over is noise --- in particular, the intercept should be one-half. Is this true in the data? We can use the t_test() method of the results object to perform t-tests.
The null hypothesis is $H_0: \beta_0 = 0.5$ and the alternative hypothesis is $H_1: \beta_0 \neq 0.5$.
t_test = res_ols.t_test('Intercept = 0.5')
print(t_test)
Test for Constraints ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ c0 0.5769 0.028 2.725 0.007 0.521 0.632 ==============================================================================
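We can also compute the statistic by hand from the stored coefficients and standard errors (a quick check of what t_test() is doing):
tval = (res_ols.params['Intercept'] - 0.5) / res_ols.bse['Intercept']  # (estimate - hypothesized value) / std err
print(f't-stat by hand: {tval:.3f}')  # should match the value reported by t_test()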
Linear probability models have some problems. Perhaps the biggest one is that there is no guarantee that the predicted probability lies between zero and one!
We can use the fittedvalues attribute of the results object to recover the fitted values of the y variable. Let's plot them and take a look.
fig, ax = plt.subplots(figsize=(15,6))
ax.scatter(vegas['spread'], res_ols.fittedvalues, facecolors='none', edgecolors='red')
ax.axhline(y=1.0, color='grey', linestyle='--')
ax.set_ylabel('predicted probability of winning')
ax.set_xlabel('point spread')
ax.set_title('Predicted winning probabilities from an OLS model')
sea.despine(ax=ax, trim=True)
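Before moving on, here is a quick count of how many of the OLS predictions actually fall outside the unit interval (a small sketch using the fitted values we just plotted):
outside = (res_ols.fittedvalues < 0) | (res_ols.fittedvalues > 1)
print(f'{outside.sum()} of {len(outside)} predicted probabilities fall outside [0, 1]')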
Now, let's account for the discreteness and estimate with probit.
res_probit = smf.probit('favwin ~ spread', data=vegas).fit()
print(res_probit.summary())
Optimization terminated successfully. Current function value: 0.476604 Iterations 6 Probit Regression Results ============================================================================== Dep. Variable: favwin No. Observations: 553 Model: Probit Df Residuals: 551 Method: MLE Df Model: 1 Date: Tue, 08 Nov 2022 Pseudo R-squ.: 0.1294 Time: 09:07:16 Log-Likelihood: -263.56 converged: True LL-Null: -302.75 Covariance Type: nonrobust LLR p-value: 8.521e-19 ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ Intercept -0.0106 0.104 -0.102 0.919 -0.214 0.193 spread 0.0925 0.012 7.591 0.000 0.069 0.116 ==============================================================================
Notice the top: "Optimization terminated successfully..." That's because with probit there is no analytical solution like there is with OLS. Instead, the computer has to maximize the likelihood function by taking a guess for an initial $\beta$ and then iterating using calculus to make smart choices.
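If you want to peek at the solver's bookkeeping (iterations, convergence flag, etc.), my understanding is that the results object keeps it in the mle_retvals attribute --- treat the attribute name as an assumption if your statsmodels version differs:
print(res_probit.mle_retvals)  # optimizer diagnostics (attribute name assumed)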
The coefficients are very different. Just look at the intercept! That's in large part b/c the coefficients have a different meaning in a probabilistic model. In order to determine the effect on Y, we have to run the coefficients through the distributional assumption, here the standard Normal CDF. When we do this, we call the results 'marginal effects.' The math is pretty straightforward -- but then again, recovering marginal effects is standard stuff, so there's a method for that:
margeff = res_probit.get_margeff('mean')
print(margeff.summary())
Probit Marginal Effects ===================================== Dep. Variable: favwin Method: dydx At: mean ============================================================================== dy/dx std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ spread 0.0251 0.003 8.661 0.000 0.019 0.031 ==============================================================================
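To see where that number comes from, here is the same calculation done by hand: evaluate the standard Normal pdf at the linear index at the average spread, then scale by the spread coefficient. A sketch, but it should agree with the get_margeff('mean') output up to rounding.
from scipy.stats import norm  # normal pdf/cdf (also imported again below)
xb_bar = res_probit.params['Intercept'] + res_probit.params['spread']*vegas['spread'].mean()
print(norm.pdf(xb_bar)*res_probit.params['spread'])  # compare with the dy/dx reported above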
Okay, so a one-point increase in the spread is associated with a (statistically significant) 2.5 percentage point increase in the probability that the favored team wins. Makes sense -- otherwise those bright, shiny Vegas lights wouldn't be so shiny.
Note that the marginal effect calculation required us to take a stand on where we evaluate the derivative. In a linear model like OLS the derivative is just the coefficient, which is constant. Here the model is non-linear (b/c of the Normal CDF), so the derivative changes depending on where we evaluate it. Evaluating at the average is the standard choice, though with skewed data the median might be more reasonable.
Let's take a look at the marginal effects at different points in the data. Note that the reported marginal effect above is located at the intersection of the marginal effects plot and the vertical dashed line indicating the average spread.
from scipy.stats import norm # import functions related to the normal distribution
y = norm.pdf(res_probit.fittedvalues,0,1)*res_probit.params.spread
fig, ax = plt.subplots(figsize=(15,6))
avg_spread = np.mean(vegas['spread'])
# Create the marginal effects
ax.scatter(vegas['spread'],y, color='black', label = 'marg. effects')
ax.set_ylabel('estimated marginal effect')
ax.set_xlabel('point spread')
ax.set_title('plotting marginal effects')
ax.legend(frameon=False,loc='upper right', bbox_to_anchor=(0.9, 0.7), fontsize=14)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.axvline(x=avg_spread, color='red', linestyle='--')
ax.text(avg_spread+.5,0.035,'Average Spread',fontsize=14)
ax.set_ylim([-1e-3,0.04])
plt.show()
Let's look at the predicted values. In OLS this was easy. Here, things are a bit more involved -- for these models, fittedvalues returns the linear index $X\hat\beta$, so we have to run it through the standard Normal CDF ourselves.
pred_probit = norm.cdf(res_probit.fittedvalues,0,1) # Standard Normal (ie, mean = 0, stdev = 1)
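For what it's worth, the predict() method of the results object should return these same probabilities directly, so the 'by hand' step is optional; a quick check, assuming the default behavior I remember:
print(np.allclose(pred_probit, res_probit.predict()))  # predict() should return Phi(X*beta_hat) on the estimation sample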
Plot the estimated probability of the favored team winning against the actual data.
fig, ax = plt.subplots(figsize=(15,6))
ax.scatter(vegas['spread'], pred_probit, facecolors='none', edgecolors='red', label='predicted')
ax.scatter(vegas['spread'], vegas['favwin'], facecolors='none', edgecolors='blue', label = 'data')
ax.axhline(y=1.0, color='grey', linestyle='--')
# Create the line of best fit to plot
p = res_ols.params # params from the OLS linear probability model
x = range(0,35) # some x data
y = [p.Intercept + p.spread*i for i in x] # apply the coefficients
ax.plot(x,y, color='black', label = 'linear prob.')
ax.set_ylabel('predicted probability of winning')
ax.set_xlabel('point spread')
ax.set_title('Predicted winning probabilities from a probit model')
ax.legend(frameon=False,loc='upper right', bbox_to_anchor=(0.9, 0.7), fontsize=14)
sea.despine(ax=ax, trim=True)
3. Logistic Regression (aka Logit) (top)
Our framework is actually pretty flexible so we can use different distributions. The other popular distributional assumption is to assume the $\epsilon$ errors come from a Logistic distribution. Why Logistic? Because the result is a nice simple function for the probability:
$$\text{prob} = \frac{\exp \left({\beta_0+\beta_1 spread}\right)}{1+\exp \left({\beta_0+\beta_1 spread}\right)},$$
and we predict that a team wins whenever $\text{prob} \ge 0.5$. We estimate the logit model with the logit() method from smf, in a way similar to probit.
res_logit = smf.logit('favwin ~ spread', data=vegas).fit()
print(res_logit.summary())
Optimization terminated successfully. Current function value: 0.477218 Iterations 7 Logit Regression Results ============================================================================== Dep. Variable: favwin No. Observations: 553 Model: Logit Df Residuals: 551 Method: MLE Df Model: 1 Date: Tue, 08 Nov 2022 Pseudo R-squ.: 0.1283 Time: 09:07:17 Log-Likelihood: -263.90 converged: True LL-Null: -302.75 Covariance Type: nonrobust LLR p-value: 1.201e-18 ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ Intercept -0.0712 0.173 -0.411 0.681 -0.411 0.268 spread 0.1632 0.023 7.236 0.000 0.119 0.207 ==============================================================================
Again, interpreting logit coefficients is a bit more complicated. The probability that a team wins is given by the expression
$$\text{prob} = \frac{\exp \left({\beta_0+\beta_1 spread}\right)}{1+\exp \left({\beta_0+\beta_1 spread}\right)},$$
so computing marginal effects means pushing $X\hat\beta$ through this non-linear function. Let's take a look:
margeff = res_logit.get_margeff('mean')
print(margeff.summary())
Logit Marginal Effects ===================================== Dep. Variable: favwin Method: dydx At: mean ============================================================================== dy/dx std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ spread 0.0244 0.003 9.059 0.000 0.019 0.030 ==============================================================================
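Again, we can reproduce the reported effect by hand: the logistic pdf evaluated at the linear index at the average spread, times the spread coefficient. A sketch, but it should match the dy/dx above up to rounding.
xb_bar = res_logit.params['Intercept'] + res_logit.params['spread']*vegas['spread'].mean()
logistic_pdf = np.exp(xb_bar)/(1 + np.exp(xb_bar))**2   # pdf of the Logistic distribution at xb_bar
print(logistic_pdf*res_logit.params['spread'])          # compare with the dy/dx reported above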
Let's again plot the estimated probability of the favored team winning against the actual data, but now let's compare the implications of our distributional assumptions. First, generate predicted values using numpy and the above expression for the probability.
pred_logit = np.exp(res_logit.fittedvalues) /( 1+np.exp(res_logit.fittedvalues) )
Now, plot probit vs logit:
fig, ax = plt.subplots(figsize=(15,6))
ax.scatter(vegas['spread'], pred_logit, facecolors='none', edgecolors='red', label='predicted-logit')
ax.scatter(vegas['spread'], pred_probit, facecolors='none', edgecolors='black', label='predicted-probit')
ax.scatter(vegas['spread'], vegas['favwin'], facecolors='none', edgecolors='blue', label = 'data')
ax.axhline(y=1.0, color='grey', linestyle='--')
# Create the line of best fit to plot
p = res_ols.params # params from the OLS linear probability model
x = range(0,35) # some x data
y = [p.Intercept + p.spread*i for i in x] # apply the coefficients
ax.plot(x,y, color='black', label = 'linear prob.')
ax.set_ylabel('predicted probability of winning')
ax.set_xlabel('point spread')
ax.set_title('Predicted winning probabilities from logit and probit models')
ax.legend(frameon=False,loc='upper right', bbox_to_anchor=(0.9, 0.7), fontsize=14)
sea.despine(ax=ax, trim=True)
We observe that the probit and logit models lie nearly on top of each other. That's a common occurrence. In practice, the models are often interchangeable and the practitioner will choose one over the other because, in their setting, one may have slightly better properties (e.g., a more intuitive interpretation of the marginal effects).
Let's get some practice with another dataset. The file 'apple.dta' contains household-level data on purchases of eco-labeled apples (ecolbs) and regular apples (reglbs), along with prices and household characteristics.
apples = pd.read_stata('./Data/apple.dta')
apples.head()
 | id | educ | date | state | regprc | ecoprc | inseason | hhsize | male | faminc | age | reglbs | ecolbs | numlt5 | num5_17 | num18_64 | numgt64 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 10002 | 16 | 111597 | SD | 1.19 | 1.19 | 1 | 4 | 0 | 45 | 43 | 2.0 | 2.000000 | 0 | 1 | 3 | 0 |
1 | 10004 | 16 | 121897 | KS | 0.59 | 0.79 | 0 | 1 | 0 | 65 | 37 | 0.0 | 2.000000 | 0 | 0 | 1 | 0 |
2 | 10034 | 18 | 111097 | MI | 0.59 | 0.99 | 1 | 3 | 0 | 65 | 44 | 0.0 | 2.666667 | 0 | 2 | 1 | 0 |
3 | 10035 | 12 | 111597 | TN | 0.89 | 1.09 | 1 | 2 | 1 | 55 | 55 | 3.0 | 0.000000 | 0 | 0 | 2 | 0 |
4 | 10039 | 15 | 122997 | NY | 0.89 | 1.09 | 0 | 1 | 1 | 25 | 22 | 0.0 | 3.000000 | 0 | 0 | 1 | 0 |
- Create a variable named ecobuy that is equal to 1 if the observation has a positive purchase of eco-apples (i.e., ecolbs > 0).
# this is only one way to do this...
apples['ecobuy'] = 0 # create the variable and default it to zero
apples.loc[apples['ecolbs']>0, 'ecobuy'] = 1 # set the variable = 1 when positive ecolbs
apples['ecobuy'].describe()
count 660.000000 mean 0.624242 std 0.484685 min 0.000000 25% 0.000000 50% 1.000000 75% 1.000000 max 1.000000 Name: ecobuy, dtype: float64
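As the comment says, that's only one way to do it. An equivalent one-liner casts the boolean comparison straight to 0/1; a quick check that the two constructions agree:
alt_ecobuy = (apples['ecolbs'] > 0).astype(int)   # True/False becomes 1/0
print((alt_ecobuy == apples['ecobuy']).all())     # should print True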
- Estimate a linear probability model relating the probability of purchasing eco-apples to household characteristics.
apple_res = smf.ols('ecobuy ~ ecoprc + regprc + faminc + hhsize + educ + age', data=apples).fit()
print(apple_res.summary())
OLS Regression Results ============================================================================== Dep. Variable: ecobuy R-squared: 0.110 Model: OLS Adj. R-squared: 0.102 Method: Least Squares F-statistic: 13.43 Date: Tue, 08 Nov 2022 Prob (F-statistic): 2.18e-14 Time: 09:07:17 Log-Likelihood: -419.60 No. Observations: 660 AIC: 853.2 Df Residuals: 653 BIC: 884.6 Df Model: 6 Covariance Type: nonrobust ============================================================================== coef std err t P>|t| [0.025 0.975] ------------------------------------------------------------------------------ Intercept 0.4237 0.165 2.568 0.010 0.100 0.748 ecoprc -0.8026 0.109 -7.336 0.000 -1.017 -0.588 regprc 0.7193 0.132 5.464 0.000 0.461 0.978 faminc 0.0006 0.001 1.042 0.298 -0.000 0.002 hhsize 0.0238 0.013 1.902 0.058 -0.001 0.048 educ 0.0248 0.008 2.960 0.003 0.008 0.041 age -0.0005 0.001 -0.401 0.689 -0.003 0.002 ============================================================================== Omnibus: 4015.360 Durbin-Watson: 2.084 Prob(Omnibus): 0.000 Jarque-Bera (JB): 69.344 Skew: -0.411 Prob(JB): 8.75e-16 Kurtosis: 1.641 Cond. No. 724. ============================================================================== Notes: [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
- How many estimated probabilities are negative? How many are greater than one?
fitted = apple_res.fittedvalues # store the fitted values
fitted[(fitted>1) | (fitted<0)] # greater than 1 or less than zero
167 1.070860 493 1.054372 dtype: float64
val = ((fitted>1) | (fitted<0)).astype(float).mean()*100
print(f'Answer: {val:4.2f} percent of predicted probabilities are less than 0 or greater than 1.')
Answer: 0.30 percent of predicted probabilities are less than 0 or greater than 1.
- Now estimate the model as a probit; i.e.,
$$ P(ecobuy = 1 \mid X) = \Phi(\beta_0 + \beta_1 ecoprc + \beta_2 regprc + \beta_3 faminc + \beta_4 hhsize + \beta_5 educ + \beta_6 age), $$
where $\Phi( )$ is the CDF of the standard Normal distribution.
apple_pres = smf.probit('ecobuy ~ ecoprc + regprc + faminc + hhsize + educ + age', data=apples).fit()
print(apple_pres.summary())
Optimization terminated successfully. Current function value: 0.604599 Iterations 5 Probit Regression Results ============================================================================== Dep. Variable: ecobuy No. Observations: 660 Model: Probit Df Residuals: 653 Method: MLE Df Model: 6 Date: Tue, 08 Nov 2022 Pseudo R-squ.: 0.08664 Time: 09:11:43 Log-Likelihood: -399.04 converged: True LL-Null: -436.89 Covariance Type: nonrobust LLR p-value: 2.751e-14 ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ Intercept -0.2438 0.474 -0.514 0.607 -1.173 0.685 ecoprc -2.2669 0.321 -7.052 0.000 -2.897 -1.637 regprc 2.0302 0.382 5.318 0.000 1.282 2.778 faminc 0.0014 0.002 0.932 0.351 -0.002 0.004 hhsize 0.0691 0.037 1.893 0.058 -0.002 0.141 educ 0.0714 0.024 2.939 0.003 0.024 0.119 age -0.0012 0.004 -0.340 0.734 -0.008 0.006 ==============================================================================
apple_pres.get_margeff?
- Compute the marginal effects of the coefficients at the means and print them out using summary(). Interpret the results.
probit_marg = apple_pres.get_margeff(at='mean')
print(probit_marg.summary())
Probit Marginal Effects ===================================== Dep. Variable: ecobuy Method: dydx At: mean ============================================================================== dy/dx std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ ecoprc -0.8508 0.120 -7.087 0.000 -1.086 -0.615 regprc 0.7619 0.143 5.334 0.000 0.482 1.042 faminc 0.0005 0.001 0.932 0.351 -0.001 0.002 hhsize 0.0259 0.014 1.894 0.058 -0.001 0.053 educ 0.0268 0.009 2.941 0.003 0.009 0.045 age -0.0005 0.001 -0.340 0.734 -0.003 0.002 ==============================================================================
Based on the marginal effects above: at the means, a 10-cent increase in the price of eco-apples lowers the probability of buying them by roughly 8.5 percentage points, while a 10-cent increase in the price of regular apples raises it by roughly 7.6 percentage points. An extra year of education raises the probability by about 2.7 percentage points, while the income and age effects are small and statistically insignificant.
- Re-estimate the model as a logit model.
apple_lres = smf.logit('ecobuy ~ ecoprc + regprc + faminc + hhsize + educ + age', data=apples).fit()
print(apple_lres.summary())
Optimization terminated successfully. Current function value: 0.604746 Iterations 5 Logit Regression Results ============================================================================== Dep. Variable: ecobuy No. Observations: 660 Model: Logit Df Residuals: 653 Method: MLE Df Model: 6 Date: Sun, 06 Nov 2022 Pseudo R-squ.: 0.08642 Time: 09:34:06 Log-Likelihood: -399.13 converged: True LL-Null: -436.89 Covariance Type: nonrobust LLR p-value: 3.017e-14 ============================================================================== coef std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ Intercept -0.4278 0.786 -0.544 0.586 -1.968 1.112 ecoprc -3.6773 0.533 -6.898 0.000 -4.722 -2.632 regprc 3.2742 0.630 5.196 0.000 2.039 4.509 faminc 0.0026 0.003 1.012 0.311 -0.002 0.008 hhsize 0.1145 0.061 1.878 0.060 -0.005 0.234 educ 0.1186 0.041 2.925 0.003 0.039 0.198 age -0.0022 0.006 -0.372 0.710 -0.014 0.009 ==============================================================================
- Compute the marginal effects of the logit coefficients at the averages in the data.
logit_marg = apple_lres.get_margeff(at='mean')
print(logit_marg.summary())
Logit Marginal Effects ===================================== Dep. Variable: ecobuy Method: dydx At: mean ============================================================================== dy/dx std err z P>|z| [0.025 0.975] ------------------------------------------------------------------------------ ecoprc -0.8480 0.122 -6.972 0.000 -1.086 -0.610 regprc 0.7551 0.144 5.227 0.000 0.472 1.038 faminc 0.0006 0.001 1.012 0.311 -0.001 0.002 hhsize 0.0264 0.014 1.880 0.060 -0.001 0.054 educ 0.0273 0.009 2.931 0.003 0.009 0.046 age -0.0005 0.001 -0.372 0.710 -0.003 0.002 ==============================================================================
We haven't done much data wrangling lately. I'm feeling a bit sad; I miss shaping data.
- Create a pandas DataFrame with row index 'ecoprc', 'regprc', 'faminc', 'hhsize', 'educ', and 'age' and columns labeled 'logit', 'probit', and 'ols' that hold, respectively, the marginal effects from the logit and probit models and the coefficients from the OLS model.
params = pd.DataFrame({'logit':logit_marg.margeff, 'probit':probit_marg.margeff, 'ols':apple_res.params.drop('Intercept')},
index = ['ecoprc', 'regprc', 'faminc', 'hhsize', 'educ', 'age'],
)
params
 | logit | probit | ols |
---|---|---|---|
ecoprc | -0.848039 | -0.850754 | -0.802622 |
regprc | 0.755077 | 0.761893 | 0.719268 |
faminc | 0.000610 | 0.000543 | 0.000552 |
hhsize | 0.026416 | 0.025947 | 0.023823 |
educ | 0.027349 | 0.026784 | 0.024785 |
age | -0.000503 | -0.000455 | -0.000501 |