Lecture 12: Collapsing Data [SUGGESTED SOLUTIONS]
We often want to know how groups differ. Do workers with econ degrees make more than workers with history degrees? Do men live longer than women? Does it matter how much education you have? How much debt will students at different universities carry after graduation, how much will they earn, and does this vary by ownership type (public vs. private)?
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.precision', 3) # this tells pandas to print out 3 decimal places when we print a DataFrame
1. Load college scorecard data
Let's take this opportunity to learn about a new dataset: The College Scorecard. The data are compiled by the Department of Education to help students evaluate higher education institutions. The data are very well documented, updated annually, and include such juicy variables as prices, post-program debt levels, earnings, completion rates, and information about student outcomes by family income and other demographics.
Let's load the data.
colscd = pd.read_csv('./Data/college_scorecard/Most-Recent-Cohorts-Scorecard-Elements.csv')
colscd.head()
UNITID | OPEID | OPEID6 | INSTNM | CITY | STABBR | INSTURL | NPCURL | HCM2 | PREDDEG | ... | OMAWDP8_NOTFIRSTTIME_POOLED_SUPP | OMENRUP_NOTFIRSTTIME_POOLED_SUPP | OMENRYP_FULLTIME_POOLED_SUPP | OMENRAP_FULLTIME_POOLED_SUPP | OMAWDP8_FULLTIME_POOLED_SUPP | OMENRUP_FULLTIME_POOLED_SUPP | OMENRYP_PARTTIME_POOLED_SUPP | OMENRAP_PARTTIME_POOLED_SUPP | OMAWDP8_PARTTIME_POOLED_SUPP | OMENRUP_PARTTIME_POOLED_SUPP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100654 | 100200 | 1002 | Alabama A & M University | Normal | AL | www.aamu.edu/ | https://galileo.aamu.edu/NetPriceCalculator/np... | 0 | 3 | ... | 0.4048 | 0.2645 | 0.0137 | 0.315 | 0.3755 | 0.2958 | 0.0289 | 0.4224 | 0.2635 | 0.2852 |
1 | 100663 | 105200 | 1052 | University of Alabama at Birmingham | Birmingham | AL | www.uab.edu | uab.studentaidcalculator.com/survey.aspx | 0 | 3 | ... | 0.5448 | 0.12 | 0.0728 | 0.2583 | 0.5872 | 0.0817 | 0.1398 | 0.2618 | 0.3977 | 0.2008 |
2 | 100690 | 2503400 | 25034 | Amridge University | Montgomery | AL | www.amridgeuniversity.edu | www2.amridgeuniversity.edu:9091/ | 0 | 3 | ... | 0.4647 | 0.1912 | 0.0148 | 0.3284 | 0.4391 | 0.2177 | 0.028 | 0.3645 | 0.4486 | 0.1589 |
3 | 100706 | 105500 | 1055 | University of Alabama in Huntsville | Huntsville | AL | www.uah.edu | finaid.uah.edu/ | 0 | 3 | ... | 0.5121 | 0.1593 | 0.0165 | 0.323 | 0.5361 | 0.1243 | 0.0245 | 0.3947 | 0.3476 | 0.2331 |
4 | 100724 | 100500 | 1005 | Alabama State University | Montgomery | AL | www.alasu.edu | www.alasu.edu/cost-aid/forms/calculator/index.... | 0 | 3 | ... | 0.4226 | 0.0754 | 0.0254 | 0.5327 | 0.3133 | 0.1286 | 0.021 | 0.5804 | 0.1678 | 0.2308 |
5 rows × 190 columns
colscd.columns
Index(['UNITID', 'OPEID', 'OPEID6', 'INSTNM', 'CITY', 'STABBR', 'INSTURL', 'NPCURL', 'HCM2', 'PREDDEG', ... 'OMAWDP8_NOTFIRSTTIME_POOLED_SUPP', 'OMENRUP_NOTFIRSTTIME_POOLED_SUPP', 'OMENRYP_FULLTIME_POOLED_SUPP', 'OMENRAP_FULLTIME_POOLED_SUPP', 'OMAWDP8_FULLTIME_POOLED_SUPP', 'OMENRUP_FULLTIME_POOLED_SUPP', 'OMENRYP_PARTTIME_POOLED_SUPP', 'OMENRAP_PARTTIME_POOLED_SUPP', 'OMAWDP8_PARTTIME_POOLED_SUPP', 'OMENRUP_PARTTIME_POOLED_SUPP'], dtype='object', length=190)
This dataset is too big for our needs. I went to the website to figure out what the variables mean. Let's rename the variables to something easier to understand and keep just a few variables that look interesting.
colscd = colscd.rename(columns = {'CONTROL':'ownership',
'INSTNM':'name', 'STABBR':'state',
'PREDDEG':'type', 'SATVRMID':'sat_read_med',
'SATMTMID':'sat_math_med',
'SATWRMID':'sat_write_med',
'PCIP24':'sh_las',
'PCIP51':'sh_bus',
'PCIP11':'sh_cs',
'MD_EARN_WNE_P10':'earn_10',
'GRAD_DEBT_MDN_SUPP':'debt_at_grad'})
cols_to_keep = ['name', 'state', 'ownership', 'type','sat_read_med', 'sat_math_med', 'sat_write_med',
'sh_las', 'sh_bus', 'sh_cs', 'earn_10', 'debt_at_grad']
colscd = colscd[cols_to_keep]
colscd.head()
name | state | ownership | type | sat_read_med | sat_math_med | sat_write_med | sh_las | sh_bus | sh_cs | earn_10 | debt_at_grad | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama A & M University | AL | 1 | 3 | 425.0 | 428.0 | 414.0 | 0.062 | 0.000 | 0.035 | 31000 | 32750 |
1 | University of Alabama at Birmingham | AL | 1 | 3 | 535.0 | 645.0 | NaN | 0.019 | 0.249 | 0.014 | 41200 | 21833 |
2 | Amridge University | AL | 2 | 3 | NaN | NaN | NaN | 0.127 | 0.000 | 0.000 | 39600 | 22890 |
3 | University of Alabama in Huntsville | AL | 1 | 3 | 605.0 | 600.0 | NaN | 0.000 | 0.102 | 0.075 | 46700 | 22647 |
4 | Alabama State University | AL | 1 | 3 | 433.0 | 428.0 | NaN | 0.000 | 0.183 | 0.057 | 27700 | 31500 |
The ownership and type variables are coded as integers. Let's make them easier to understand. I found the definitions on the Dept of Education website.
type_codes = {0:'na', 1:'cert', 2:'asc', 3:'bach', 4:'grad_only'}
colscd['type'] = colscd['type'].replace(type_codes)
own_codes = {1:'Public', 2:'Private nonprofit', 3:'Private profit'}
colscd['ownership'] = colscd['ownership'].replace(own_codes)
colscd.head()
name | state | ownership | type | sat_read_med | sat_math_med | sat_write_med | sh_las | sh_bus | sh_cs | earn_10 | debt_at_grad | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama A & M University | AL | Public | bach | 425.0 | 428.0 | 414.0 | 0.062 | 0.000 | 0.035 | 31000 | 32750 |
1 | University of Alabama at Birmingham | AL | Public | bach | 535.0 | 645.0 | NaN | 0.019 | 0.249 | 0.014 | 41200 | 21833 |
2 | Amridge University | AL | Private nonprofit | bach | NaN | NaN | NaN | 0.127 | 0.000 | 0.000 | 39600 | 22890 |
3 | University of Alabama in Huntsville | AL | Public | bach | 605.0 | 600.0 | NaN | 0.000 | 0.102 | 0.075 | 46700 | 22647 |
4 | Alabama State University | AL | Public | bach | 433.0 | 428.0 | NaN | 0.000 | 0.183 | 0.057 | 27700 | 31500 |
Set the index to the university name.
colscd.set_index('name', inplace=True)
How do we look?
colscd.loc['University of Georgia']
state                GA
ownership        Public
type               bach
sat_read_med      650.0
sat_math_med      635.0
sat_write_med     610.0
sh_las            0.002
sh_bus            0.041
sh_cs              0.02
earn_10           50500
debt_at_grad      18750
Name: University of Georgia, dtype: object
Hmm... it's hard to understand these variables without some context. The goal of this lecture is to add context.
Let's look at the different data types in the DataFrame:
colscd.dtypes
state             object
ownership         object
type              object
sat_read_med     float64
sat_math_med     float64
sat_write_med    float64
sh_las           float64
sh_bus           float64
sh_cs            float64
earn_10           object
debt_at_grad      object
dtype: object
Doh! Looks like the earnings (earn_10) and debt (debt_at_grad) columns came in as objects instead of floats.
The culprit is the 'PrivacySuppressed' flag. We could have told read_csv about this if we knew in advance (a sketch of that follows below). Instead, let's practice pd.to_numeric(), which tries to convert a column to numeric values. I pass the argument errors='coerce' to tell the function to set anything it cannot convert to NaN.
colscd['earn_10'] = pd.to_numeric(colscd['earn_10'], errors='coerce')
colscd['debt_at_grad'] = pd.to_numeric(colscd['debt_at_grad'], errors='coerce')
colscd.dtypes
state             object
ownership         object
type              object
sat_read_med     float64
sat_math_med     float64
sat_write_med    float64
sh_las           float64
sh_bus           float64
sh_cs            float64
earn_10          float64
debt_at_grad     float64
dtype: object
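As an aside, here is how we could have handled the flag at load time, as mentioned above. A minimal sketch reusing the same file path; the na_values argument of read_csv marks the flag as missing while parsing (the name colscd_raw is just for illustration):
colscd_raw = pd.read_csv('./Data/college_scorecard/Most-Recent-Cohorts-Scorecard-Elements.csv',
                         na_values='PrivacySuppressed')   # treat the flag as NaN on the way in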
2. The "groupby" method (top)¶
Pandas provides the groupby() method to ease computing statistics by group (see the docs). This kind of method shows up in many data-oriented computing languages and packages. The idea is summed up as split-apply-combine. The big idea, illustrated with a toy example below, is to:
- Split the data up into groups. The groups are defined by key variables.
- Apply some method or function to each group: mean, std, max, etc. This returns a smaller bit of data, often just one number.
- Combine the results of the 'apply' from each group into a new data structure.
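To make the three steps concrete before we hit the real data, here is a minimal sketch on a made-up toy DataFrame (the numbers are hypothetical):
toy = pd.DataFrame({'group': ['a', 'a', 'b', 'b'], 'x': [1.0, 2.0, 3.0, 5.0]})
toy.groupby('group')['x'].mean()   # split on 'group', apply mean() to each piece, combine into a Series
# group
# a    1.5
# b    4.0
# Name: x, dtype: float64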
We pass groupby a 'key' which tells the method which variable to, well, group by. This is the split step.
colscd_grouped = colscd.groupby('state')
What is colscd_grouped?
print(type(colscd_grouped))
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
A DataFrameGroupBy object. This is basically a DataFrame + the grouping information.
What does it look like?
colscd_grouped.head()
state | ownership | type | sat_read_med | sat_math_med | sat_write_med | sh_las | sh_bus | sh_cs | earn_10 | debt_at_grad | |
---|---|---|---|---|---|---|---|---|---|---|---|
name | |||||||||||
Alabama A & M University | AL | Public | bach | 425.0 | 428.0 | 414.0 | 0.062 | 0.000 | 0.035 | 31000.0 | 32750.0 |
University of Alabama at Birmingham | AL | Public | bach | 535.0 | 645.0 | NaN | 0.019 | 0.249 | 0.014 | 41200.0 | 21833.0 |
Amridge University | AL | Private nonprofit | bach | NaN | NaN | NaN | 0.127 | 0.000 | 0.000 | 39600.0 | 22890.0 |
University of Alabama in Huntsville | AL | Public | bach | 605.0 | 600.0 | NaN | 0.000 | 0.102 | 0.075 | 46700.0 | 22647.0 |
Alabama State University | AL | Public | bach | 433.0 | 428.0 | NaN | 0.000 | 0.183 | 0.057 | 27700.0 | 31500.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Palau Community College | PW | Public | asc | NaN | NaN | NaN | 0.059 | 0.108 | 0.029 | NaN | NaN |
University of the Virgin Islands | VI | Public | bach | 483.0 | 435.0 | 410.0 | 0.004 | 0.091 | 0.049 | 32000.0 | 13761.5 |
College of the Marshall Islands | MH | Public | asc | NaN | NaN | NaN | 0.482 | 0.128 | 0.000 | NaN | NaN |
Pacific Islands University | GU | Private nonprofit | bach | NaN | NaN | NaN | 0.529 | 0.000 | 0.000 | NaN | NaN |
University of the Virgin Islands-Albert A. Sheen | VI | Public | na | NaN | NaN | NaN | NaN | NaN | NaN | 32000.0 | 13761.5 |
270 rows × 11 columns
Looks just like a regular DataFrame. (One subtlety: head() on a grouped object returns the first five rows of each group, which is why we see 270 rows rather than 5.)
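Two more things we can do with a grouped object, sketched below: get_group() pulls out one group's DataFrame, and iterating over the object yields (key, DataFrame) pairs. (The state code 'WI' is just an example.)
wi = colscd_grouped.get_group('WI')   # just the Wisconsin institutions, as a regular DataFrame
for state, group in colscd_grouped:   # iterating yields (key, DataFrame) pairs
    print(state, group.shape)         # the state code and that group's dimensions
    break                             # stop after the first group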
Aggregating by group
A major use of groupby is to perform some kind of aggregation. This is the apply and combine step. Let's take the grouped data and compute some means.
all_means = colscd_grouped.mean() # apply the mean operator to the grouped data (newer pandas versions need numeric_only=True here to skip non-numeric columns)
print(type(all_means)) # what do we get back?
<class 'pandas.core.frame.DataFrame'>
Ah, a DataFrame. We know what to do with that.
all_means.head(10)
sat_read_med | sat_math_med | sat_write_med | sh_las | sh_bus | sh_cs | earn_10 | debt_at_grad | |
---|---|---|---|---|---|---|---|---|
state | ||||||||
AK | 575.000 | 575.000 | 482.500 | 0.076 | 0.169 | 0.017 | 37800.000 | 13711.500 |
AL | 534.773 | 528.727 | 498.000 | 0.104 | 0.243 | 0.025 | 31548.000 | 19370.080 |
AR | 554.500 | 551.929 | 464.200 | 0.115 | 0.195 | 0.017 | 28575.000 | 15111.977 |
AS | NaN | NaN | NaN | 0.373 | 0.066 | 0.000 | 22300.000 | NaN |
AZ | 581.143 | 570.714 | 505.000 | 0.064 | 0.284 | 0.028 | 33579.310 | 13884.727 |
CA | 584.542 | 582.958 | 570.540 | 0.066 | 0.270 | 0.024 | 37350.787 | 14356.604 |
CO | 578.133 | 574.733 | 570.000 | 0.048 | 0.293 | 0.023 | 33572.000 | 16889.060 |
CT | 570.000 | 560.412 | 518.167 | 0.067 | 0.264 | 0.017 | 42060.317 | 15397.246 |
DC | 624.600 | 617.000 | 605.000 | 0.012 | 0.144 | 0.064 | 47568.750 | 23521.088 |
DE | 501.667 | 510.000 | 472.333 | 0.016 | 0.320 | 0.024 | 36263.636 | 14817.941 |
When we call mean() on the grouped data, it applies the mean method to each group, which creates one number per group (for each column). It then combines the means into a DataFrame, one number per group per column. Nice.
Notice that the non-numeric columns (ownership, type) have been dropped, and the grouping variable state is now the index.
Here we can also see the result of pd.set_option('display.precision', 3). The output is limited to 3 decimal places.
Computing the grouped data first helped us understand what was happening, but we can do the whole split-apply-combine in one simple line of code.
all_means = colscd.groupby('state').mean()
all_means.head(10)
sat_read_med | sat_math_med | sat_write_med | sh_las | sh_bus | sh_cs | earn_10 | debt_at_grad | |
---|---|---|---|---|---|---|---|---|
state | ||||||||
AK | 575.000 | 575.000 | 482.500 | 0.076 | 0.169 | 0.017 | 37800.000 | 13711.500 |
AL | 534.773 | 528.727 | 498.000 | 0.104 | 0.243 | 0.025 | 31548.000 | 19370.080 |
AR | 554.500 | 551.929 | 464.200 | 0.115 | 0.195 | 0.017 | 28575.000 | 15111.977 |
AS | NaN | NaN | NaN | 0.373 | 0.066 | 0.000 | 22300.000 | NaN |
AZ | 581.143 | 570.714 | 505.000 | 0.064 | 0.284 | 0.028 | 33579.310 | 13884.727 |
CA | 584.542 | 582.958 | 570.540 | 0.066 | 0.270 | 0.024 | 37350.787 | 14356.604 |
CO | 578.133 | 574.733 | 570.000 | 0.048 | 0.293 | 0.023 | 33572.000 | 16889.060 |
CT | 570.000 | 560.412 | 518.167 | 0.067 | 0.264 | 0.017 | 42060.317 | 15397.246 |
DC | 624.600 | 617.000 | 605.000 | 0.012 | 0.144 | 0.064 | 47568.750 | 23521.088 |
DE | 501.667 | 510.000 | 472.333 | 0.016 | 0.320 | 0.024 | 36263.636 | 14817.941 |
groupby() with many keys
Can we group by several keys? You know we can. Let's compute the medians this time.
sat_medians = colscd.groupby(['ownership','type']).median()
sat_medians
sat_read_med | sat_math_med | sat_write_med | sh_las | sh_bus | sh_cs | earn_10 | debt_at_grad | ||
---|---|---|---|---|---|---|---|---|---|
ownership | type | ||||||||
Private nonprofit | asc | 475.0 | 475.0 | 438.0 | 0.000 | 0.200 | 0.000 | 32700.0 | 18750.00 |
bach | 560.0 | 545.0 | 508.0 | 0.000 | 0.032 | 0.006 | 42700.0 | 25000.00 | |
cert | 535.0 | 517.0 | 480.0 | 0.000 | 0.686 | 0.000 | 26800.0 | 10556.00 | |
grad_only | NaN | NaN | NaN | 0.000 | 0.000 | 0.000 | 83100.0 | 17361.00 | |
na | NaN | NaN | NaN | NaN | NaN | NaN | 32600.0 | 23340.00 | |
Private profit | asc | NaN | NaN | NaN | 0.000 | 0.525 | 0.000 | 28450.0 | 20635.75 |
bach | 550.0 | 523.0 | 488.5 | 0.000 | 0.000 | 0.037 | 41400.0 | 30481.00 | |
cert | NaN | NaN | NaN | 0.000 | 0.000 | 0.000 | 22350.0 | 9500.00 | |
grad_only | NaN | NaN | NaN | 0.000 | 0.000 | 0.000 | 34500.0 | 30352.00 | |
na | NaN | NaN | NaN | NaN | NaN | NaN | 34600.0 | 18254.00 | |
Public | asc | 512.5 | 495.0 | 443.0 | 0.315 | 0.181 | 0.027 | 31500.0 | 9655.50 |
bach | 550.0 | 548.0 | 490.0 | 0.019 | 0.093 | 0.025 | 43000.0 | 21500.00 | |
cert | NaN | NaN | 391.0 | 0.000 | 0.345 | 0.008 | 28900.0 | 9342.50 | |
grad_only | NaN | NaN | NaN | NaN | NaN | NaN | 102100.0 | 21750.00 | |
na | NaN | NaN | NaN | NaN | NaN | NaN | 34600.0 | 11848.50 |
Now we have a MultiIndexed DataFrame with the summary statistics, this time the medians.
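Rows of a MultiIndexed DataFrame can be retrieved by passing .loc a tuple of keys, one per level; a quick sketch:
sat_medians.loc[('Public', 'bach')]   # the row of medians for public, predominantly-bachelor's institutions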
groupby() on a subset of columns
We may not care about all the columns in our dataset for a particular groupby. Taking a subset of columns is straightforward.
Let's focus on the median SAT scores.
sat_medians = colscd.groupby(['ownership','type'])[['sat_read_med', 'sat_math_med', 'sat_write_med']].median()
sat_medians
sat_read_med | sat_math_med | sat_write_med | ||
---|---|---|---|---|
ownership | type | |||
Private nonprofit | asc | 475.0 | 475.0 | 438.0 |
bach | 560.0 | 545.0 | 508.0 | |
cert | 535.0 | 517.0 | 480.0 | |
grad_only | NaN | NaN | NaN | |
na | NaN | NaN | NaN | |
Private profit | asc | NaN | NaN | NaN |
bach | 550.0 | 523.0 | 488.5 | |
cert | NaN | NaN | NaN | |
grad_only | NaN | NaN | NaN | |
na | NaN | NaN | NaN | |
Public | asc | 512.5 | 495.0 | 443.0 |
bach | 550.0 | 548.0 | 490.0 | |
cert | NaN | NaN | 391.0 | |
grad_only | NaN | NaN | NaN | |
na | NaN | NaN | NaN |
All three ownership types have institutions that predominantly offer bachelor's degrees. Let's grab that set of statistics.
bach_sat_med = sat_medians.xs('bach', level='type') # xs() indexes data from a MultiIndex
print(bach_sat_med)
                   sat_read_med  sat_math_med  sat_write_med
ownership
Private nonprofit         560.0         545.0          508.0
Private profit            550.0         523.0          488.5
Public                    550.0         548.0          490.0
How do the median SAT scores compare across public and private institutions?
There are a few new plotting tricks here...
fig, ax = plt.subplots(1, 3, figsize=(21,6))
# Set up the color scheme. This makes it easier to fiddle with.
bar_color = 'red'
bar_alpha = 0.35
# Plot one SAT variable on each axes
ax[0].bar(bach_sat_med.index, bach_sat_med['sat_read_med'], color=bar_color, alpha=bar_alpha) # reading first, to match the title order below
ax[1].bar(bach_sat_med.index, bach_sat_med['sat_math_med'], color=bar_color, alpha=bar_alpha)
ax[2].bar(bach_sat_med.index, bach_sat_med['sat_write_med'], color=bar_color, alpha=bar_alpha)
# Titles!
ax[0].set_title('SAT reading')
ax[1].set_title('SAT math')
ax[2].set_title('SAT writing')
# I am only setting the ylabel on the left-most. Save some non-data ink.
ax[0].set_ylabel('Median score')
# Set these common parameters by looping over the axes.
for a in ax:
a.spines['top'].set_visible(False)
a.spines['right'].set_visible(False)
a.grid(axis='y', color='gray',linestyle='dashed') # Is this too much clutter? Maybe.
a.xaxis.set_tick_params(length=0) # Kill the xaxis ticks marks
a.yaxis.set_tick_params(length=0) # Kill the yaxis tick marks
a.set_ylim(0,600) # Make sure all plots have the same y-axis to make comparisons easier.
plt.show()
Interesting. Private for-profit institutions look a lot like public ones on reading and writing, but they trail both other ownership types on math, and private nonprofits post the highest reading and writing scores. Once we fire up some stats model packages, we can do formal tests to see if the differences are significant.
Grouping variables as index
In all of the code above, groupby placed the grouping variables in the index, but we don't have to do that. We can instead use the as_index argument to keep the grouping variables as regular columns.
colscd.groupby(['ownership','type'],as_index=False)[['sat_read_med', 'sat_math_med', 'sat_write_med']].median()
ownership | type | sat_read_med | sat_math_med | sat_write_med | |
---|---|---|---|---|---|
0 | Private nonprofit | asc | 475.0 | 475.0 | 438.0 |
1 | Private nonprofit | bach | 560.0 | 545.0 | 508.0 |
2 | Private nonprofit | cert | 535.0 | 517.0 | 480.0 |
3 | Private nonprofit | grad_only | NaN | NaN | NaN |
4 | Private nonprofit | na | NaN | NaN | NaN |
5 | Private profit | asc | NaN | NaN | NaN |
6 | Private profit | bach | 550.0 | 523.0 | 488.5 |
7 | Private profit | cert | NaN | NaN | NaN |
8 | Private profit | grad_only | NaN | NaN | NaN |
9 | Private profit | na | NaN | NaN | NaN |
10 | Public | asc | 512.5 | 495.0 | 443.0 |
11 | Public | bach | 550.0 | 548.0 | 490.0 |
12 | Public | cert | NaN | NaN | 391.0 |
13 | Public | grad_only | NaN | NaN | NaN |
14 | Public | na | NaN | NaN | NaN |
Practice
- Create a dataset with only public institutions. Name it pub.
pub = colscd[colscd['ownership']=='Public']
The quantile() method computes quantiles from the data (e.g., quantile(0.5) computes the median, the 50th percentile).
Let's look at a measure of the earnings spread for different institution types.
a. Compute the 75th quantile for 'earn_10' for each 'type'.
b. Compute the 50th quantile for 'earn_10' for each 'type'.
c. Compute the 25th quantile for 'earn_10' for each 'type'.
q75 = pub.groupby('type')[['earn_10']].quantile(0.75)
q50 = pub.groupby('type')[['earn_10']].quantile(0.50)
q25 = pub.groupby('type')[['earn_10']].quantile(0.25)
d. For each type, compute the difference between the 75th percentile and the 25th percentile. This is known as the inter-quartile range (IQR). Divide the IQR by the median. This gives us a notion of spread, standardized by the median.
spread = (q75 - q25)/ q50
spread
earn_10 | |
---|---|
type | |
asc | 0.196 |
bach | 0.266 |
cert | 0.266 |
grad_only | 0.917 |
na | 0.199 |
Wow, a lot of dispersion in the "grad_only" group. Let's practice some more.
- How do reading and writing scores correlate? Compute the median SAT reading and writing scores by state.
sat_rw_med = pub.groupby('state')[['sat_read_med', 'sat_write_med']].median()
sat_rw_med.head()
sat_read_med | sat_write_med | |
---|---|---|
state | ||
AK | 575.0 | 510.0 |
AL | 537.5 | 514.0 |
AR | 517.0 | 450.0 |
AS | NaN | NaN |
AZ | 590.0 | NaN |
- Create a scatter plot with the median reading score on the x axis and writing score on the y axis.
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(sat_rw_med['sat_read_med'], sat_rw_med['sat_write_med'], color='black', marker='x')
ax.plot([400, 600], [400, 600], color='black', linewidth=0.75) # 45-degree line
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_xlabel('median SAT reading score')
ax.set_ylabel('median SAT writing score')
plt.show()
Working with quantiles
We can group the data into bins using the pd.cut() function.
bus_quart = pd.cut(colscd['sh_bus'], 3, right=False)
bus_quart[:10]
name
Alabama A & M University               [0.0, 0.333)
University of Alabama at Birmingham    [0.0, 0.333)
Amridge University                     [0.0, 0.333)
University of Alabama in Huntsville    [0.0, 0.333)
Alabama State University               [0.0, 0.333)
The University of Alabama              [0.0, 0.333)
Central Alabama Community College      [0.0, 0.333)
Athens State University                [0.0, 0.333)
Auburn University at Montgomery        [0.0, 0.333)
Auburn University                      [0.0, 0.333)
Name: sh_bus, dtype: category
Categories (3, interval[float64, left]): [[0.0, 0.333) < [0.333, 0.667) < [0.667, 1.001)]
Remember, cut returns a Categorical object. We can use this object as our key variable in a groupby.
earn_bus = colscd.groupby(bus_quart)['earn_10']
Notice that we have used a key variable that was not a column of the DataFrame colscd. The key variable can be very general (there is a sketch after the next output); McKinney covers some other kinds of key variables.
We now have a DataFrameGroupBy object. What is the distribution of institutions across the bins?
earn_bus.count()
sh_bus
[0.0, 0.333)      3435
[0.333, 0.667)     631
[0.667, 1.001)     775
Name: earn_10, dtype: int64
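Since the key can be so general, we can even group by a function, which pandas applies to each index value. A toy sketch, grouping institutions by the first letter of their name (remember the index is the institution name):
colscd.groupby(lambda name: name[0])['earn_10'].median().head()   # median earnings by first letter of the name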
Comment: There is also a function called qcut. The difference between the two amounts to how the bins are constructed:
- cut chooses the bins to be evenly spaced according to the values themselves, not the frequency of those values. Think of this as a tabular form of a histogram.
- qcut chooses the bins so that the same number of records falls in each bin. For example, if you have 30 records and ask for five bins (quintiles), there will be 6 observations in each bin.
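A quick sketch of the difference, using three bins on earn_10 (chosen here because its values are close to continuous, so the quantile bin edges are unique):
even_width = pd.cut(colscd['earn_10'], 3)    # three bins of equal width
even_count = pd.qcut(colscd['earn_10'], 3)   # three bins with roughly equal numbers of schools
print(even_width.value_counts())             # lopsided counts
print(even_count.value_counts())             # roughly equal counts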
Several statistics at once
Once we have grouped our data, we have been hitting it with methods to compute statistics: mean(), count(), ...
We now introduce the agg() method, which lets us compute several statistics at once; you can even pass it a user-defined function.
# This is the same as earn_bus.count()
earn_bus.agg('count')
sh_bus
[0.0, 0.333)      3435
[0.333, 0.667)     631
[0.667, 1.001)     775
Name: earn_10, dtype: int64
# But agg() lets us compute many stats at once
earn_bus.agg(['count', 'mean', 'median', 'std'])
count | mean | median | std | |
---|---|---|---|---|
sh_bus | ||||
[0.0, 0.333) | 3435 | 35197.147 | 34300.0 | 12988.513 |
[0.333, 0.667) | 631 | 31037.876 | 29400.0 | 8657.804 |
[0.667, 1.001) | 775 | 33277.419 | 27900.0 | 15759.991 |
Schools that focus on business degrees don't seem to offer greater earnings opportunities.
Practice
- Write a function that returns the average of the 5 largest elements of a Series (a column of a DataFrame). Name the function 'avg5'. Go back to the lecture on user-defined functions to refresh your memory on how to create a function and read it into memory.
def avg5(x):
top5 = x.sort_values(ascending=False)[:5] # sort x, then grab the five largest. ".sort_values" defaults to ascending
return top5.mean() # return the mean
- Test your function on column 'a' of the DataFrame defined below. The answer should be 8.
test = pd.DataFrame({'a':[1, 4, 6, 9, 10, 3, 7, 8], 'b':[2, 3, 4, 5, 6, 7, 8, 10]})
avg5(test['a'])
8.0
- Back to the College Scorecard. Use the .dropna() method to drop any observation where 'debt_at_grad' is NaN. Use the subset argument so that only observations with a missing debt_at_grad are dropped.
colscd.dropna(subset=['debt_at_grad'], inplace=True) # Drop observations with missing debt_at_grad
- Compute the mean, median, and avg5 of 'debt_at_grad' by 'ownership'. Compute them all at once using agg().
colscd.groupby('ownership')['debt_at_grad'].agg(['mean', 'median', avg5]) # Compute the summary stats
mean | median | avg5 | |
---|---|---|---|
ownership | |||
Private nonprofit | 22446.198 | 24750.0 | 43574.2 |
Private profit | 14073.736 | 10556.0 | 45643.7 |
Public | 14203.894 | 12000.0 | 34669.2 |
- Create a function that calculates the average for the top ten schools. Redo (4), but now include this new statistic.
def avg10(x):
top10 = x.sort_values(ascending=False)[:10] # sort x, then grab the ten largest. ".sort_values" defaults to ascending
return top10.mean() # return the mean
colscd.groupby('ownership')['debt_at_grad'].agg(['mean', 'median', avg5, avg10]) # Compute the summary stats
mean | median | avg5 | avg10 | |
---|---|---|---|---|
ownership | ||||
Private nonprofit | 22446.198 | 24750.0 | 43574.2 | 42062.10 |
Private profit | 14073.736 | 10556.0 | 45643.7 | 40886.65 |
Public | 14203.894 | 12000.0 | 34669.2 | 33170.80 |
Different statistics for different variables
We may want to use different statistics for different variables. We can do that too, by passing a dict to .agg(): the keys name the columns to evaluate and the values tell pandas which statistic to calculate.
colscd.groupby('ownership').agg({'earn_10':'median',
'debt_at_grad':'median',
'sh_las':'mean',
'sh_bus':'mean'})
earn_10 | debt_at_grad | sh_las | sh_bus | |
---|---|---|---|---|
ownership | ||||
Private nonprofit | 41700.0 | 24750.0 | 3.880e-02 | 0.212 |
Private profit | 24800.0 | 10556.0 | 3.629e-04 | 0.312 |
Public | 33400.0 | 12000.0 | 1.642e-01 | 0.250 |
And we can generate multiple statistics for a given variable, too, by passing a list of statistics to .agg() for the corresponding variable.
colscd.groupby('ownership').agg({'earn_10':['mean','median'],
'debt_at_grad':['mean','median'],
'sh_las':'mean',
'sh_bus':'mean'})
earn_10 | debt_at_grad | sh_las | sh_bus | |||
---|---|---|---|---|---|---|
mean | median | mean | median | mean | mean | |
ownership | ||||||
Private nonprofit | 43837.040 | 41700.0 | 22446.198 | 24750.0 | 3.880e-02 | 0.212 |
Private profit | 27616.907 | 24800.0 | 14073.736 | 10556.0 | 3.629e-04 | 0.312 |
Public | 36071.245 | 33400.0 | 14203.894 | 12000.0 | 1.642e-01 | 0.250 |
The columns are a MultiIndex, which I think makes the table hard to read. We can combine the MultiIndex levels (in pythonese, "flatten" them) to create new column names that are easier to read (and informative).
df = colscd.groupby('ownership').agg({'earn_10':['mean','median'],
'debt_at_grad':['mean','median'],
'sh_las':'mean',
'sh_bus':'mean'})
df.columns = [': '.join(col) for col in df.columns.values]
df.head()
earn_10: mean | earn_10: median | debt_at_grad: mean | debt_at_grad: median | sh_las: mean | sh_bus: mean | |
---|---|---|---|---|---|---|
ownership | ||||||
Private nonprofit | 43837.040 | 41700.0 | 22446.198 | 24750.0 | 3.880e-02 | 0.212 |
Private profit | 27616.907 | 24800.0 | 14073.736 | 10556.0 | 3.629e-04 | 0.312 |
Public | 36071.245 | 33400.0 | 14203.894 | 12000.0 | 1.642e-01 | 0.250 |
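An alternative that avoids flattening altogether is pandas' "named aggregation" syntax (available since pandas 0.25), where each keyword argument names an output column and maps it to a (column, statistic) pair. A sketch of the same computation (the output names here are my own choices):
df2 = colscd.groupby('ownership').agg(
    earn_10_mean=('earn_10', 'mean'),
    earn_10_median=('earn_10', 'median'),
    debt_at_grad_mean=('debt_at_grad', 'mean'),
    debt_at_grad_median=('debt_at_grad', 'median'),
    sh_las_mean=('sh_las', 'mean'),
    sh_bus_mean=('sh_bus', 'mean'),
)
df2.head()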
Adding summary statistics to the original DataFrame
We can also generate statistics and add them back to the original DataFrame using the transform method. For instance, maybe we want to add the median debt by ownership type to the original DataFrame.
colscd['median by owner'] = colscd.groupby('ownership')['debt_at_grad'].transform('median')
Doing this enables us to compare observations to the relevant group statistic.
colscd['relative debt'] = colscd['debt_at_grad']/colscd['median by owner']
Let's see how we compare to similar institutions:
val = colscd.loc['University of Georgia', 'relative debt']
print(f'The relative debt of the median UGA student is {val:4.2f} times the median value of public institutions.')
The relative debt of the median UGA student is 1.56 times the median value of public institutions.
Wow. That's quite the difference. Compare student debt with earnings to see the relative value of a college education. You could plot the results and look for patterns.
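For instance, here is a minimal sketch of such a plot, in the same style as the figures above (one possible cut of the data):
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(colscd['debt_at_grad'], colscd['earn_10'], s=10, alpha=0.3, color='black') # one point per institution
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_xlabel('median debt at graduation')
ax.set_ylabel('median earnings (earn_10)')
plt.show()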