Lecture 12: Collapsing Data [SUGGESTED SOLUTIONS]
We often want to know how groups differ. Do workers with econ degrees make more than workers with history degrees? Do men live longer than women? Does it matter how much education you have? How much debt will students at different universities carry after graduation, how much will they earn, and does this vary by ownership type (public vs. private)?
import pandas as pd
import matplotlib.pyplot as plt
pd.set_option('display.precision', 3) # this tells pandas to print out 3 decimal places when we print a DataFrame
1. Load college scorecard data
Let's take this opportunity to learn about a new dataset: The College Scorecard. The data are compiled by the Department of Education to help students evaluate higher education institutions. The data are very well documented, updated annually, and include such juicy variables as prices, post-program debt levels, earnings, completion rates, and information about student outcomes by family income and other demographics.
Let's load the data.
colscd = pd.read_csv('./Data/college_scorecard/Most-Recent-Cohorts-Scorecard-Elements.csv')
colscd.head()
UNITID | OPEID | OPEID6 | INSTNM | CITY | STABBR | INSTURL | NPCURL | HCM2 | PREDDEG | ... | OMAWDP8_NOTFIRSTTIME_POOLED_SUPP | OMENRUP_NOTFIRSTTIME_POOLED_SUPP | OMENRYP_FULLTIME_POOLED_SUPP | OMENRAP_FULLTIME_POOLED_SUPP | OMAWDP8_FULLTIME_POOLED_SUPP | OMENRUP_FULLTIME_POOLED_SUPP | OMENRYP_PARTTIME_POOLED_SUPP | OMENRAP_PARTTIME_POOLED_SUPP | OMAWDP8_PARTTIME_POOLED_SUPP | OMENRUP_PARTTIME_POOLED_SUPP | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 100654 | 100200 | 1002 | Alabama A & M University | Normal | AL | www.aamu.edu/ | https://galileo.aamu.edu/NetPriceCalculator/np... | 0 | 3 | ... | 0.4048 | 0.2645 | 0.0137 | 0.315 | 0.3755 | 0.2958 | 0.0289 | 0.4224 | 0.2635 | 0.2852 |
1 | 100663 | 105200 | 1052 | University of Alabama at Birmingham | Birmingham | AL | www.uab.edu | uab.studentaidcalculator.com/survey.aspx | 0 | 3 | ... | 0.5448 | 0.12 | 0.0728 | 0.2583 | 0.5872 | 0.0817 | 0.1398 | 0.2618 | 0.3977 | 0.2008 |
2 | 100690 | 2503400 | 25034 | Amridge University | Montgomery | AL | www.amridgeuniversity.edu | www2.amridgeuniversity.edu:9091/ | 0 | 3 | ... | 0.4647 | 0.1912 | 0.0148 | 0.3284 | 0.4391 | 0.2177 | 0.028 | 0.3645 | 0.4486 | 0.1589 |
3 | 100706 | 105500 | 1055 | University of Alabama in Huntsville | Huntsville | AL | www.uah.edu | finaid.uah.edu/ | 0 | 3 | ... | 0.5121 | 0.1593 | 0.0165 | 0.323 | 0.5361 | 0.1243 | 0.0245 | 0.3947 | 0.3476 | 0.2331 |
4 | 100724 | 100500 | 1005 | Alabama State University | Montgomery | AL | www.alasu.edu | www.alasu.edu/cost-aid/forms/calculator/index.... | 0 | 3 | ... | 0.4226 | 0.0754 | 0.0254 | 0.5327 | 0.3133 | 0.1286 | 0.021 | 0.5804 | 0.1678 | 0.2308 |
5 rows × 190 columns
colscd.columns
Index(['UNITID', 'OPEID', 'OPEID6', 'INSTNM', 'CITY', 'STABBR', 'INSTURL', 'NPCURL', 'HCM2', 'PREDDEG', ... 'OMAWDP8_NOTFIRSTTIME_POOLED_SUPP', 'OMENRUP_NOTFIRSTTIME_POOLED_SUPP', 'OMENRYP_FULLTIME_POOLED_SUPP', 'OMENRAP_FULLTIME_POOLED_SUPP', 'OMAWDP8_FULLTIME_POOLED_SUPP', 'OMENRUP_FULLTIME_POOLED_SUPP', 'OMENRYP_PARTTIME_POOLED_SUPP', 'OMENRAP_PARTTIME_POOLED_SUPP', 'OMAWDP8_PARTTIME_POOLED_SUPP', 'OMENRUP_PARTTIME_POOLED_SUPP'], dtype='object', length=190)
This dataset is too big for our needs. I went to the website to figure out what the variables mean. Let's rename the variables to something easier to understand and keep just a few variables that look interesting.
colscd = colscd.rename(columns = {'CONTROL':'ownership',
'INSTNM':'name', 'STABBR':'state',
'PREDDEG':'type', 'SATVRMID':'sat_read_med',
'SATMTMID':'sat_math_med',
'SATWRMID':'sat_write_med',
'PCIP24':'sh_las',
'PCIP51':'sh_bus',
'PCIP11':'sh_cs',
'MD_EARN_WNE_P10':'earn_10',
'GRAD_DEBT_MDN_SUPP':'debt_at_grad'})
cols_to_keep = ['name', 'state', 'ownership', 'type','sat_read_med', 'sat_math_med', 'sat_write_med',
'sh_las', 'sh_bus', 'sh_cs', 'earn_10', 'debt_at_grad']
colscd = colscd[cols_to_keep]
colscd.head()
name | state | ownership | type | sat_read_med | sat_math_med | sat_write_med | sh_las | sh_bus | sh_cs | earn_10 | debt_at_grad | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama A & M University | AL | 1 | 3 | 425.0 | 428.0 | 414.0 | 0.062 | 0.000 | 0.035 | 31000 | 32750 |
1 | University of Alabama at Birmingham | AL | 1 | 3 | 535.0 | 645.0 | NaN | 0.019 | 0.249 | 0.014 | 41200 | 21833 |
2 | Amridge University | AL | 2 | 3 | NaN | NaN | NaN | 0.127 | 0.000 | 0.000 | 39600 | 22890 |
3 | University of Alabama in Huntsville | AL | 1 | 3 | 605.0 | 600.0 | NaN | 0.000 | 0.102 | 0.075 | 46700 | 22647 |
4 | Alabama State University | AL | 1 | 3 | 433.0 | 428.0 | NaN | 0.000 | 0.183 | 0.057 | 27700 | 31500 |
The ownership and type variables are coded as integers. Let's make them easier to understand. I found the definitions on the Dept of Education website.
type_codes = {0:'na', 1:'cert', 2:'asc', 3:'bach', 4:'grad_only'}
colscd['type'] = colscd['type'].replace(type_codes)
own_codes = {1:'Public', 2:'Private nonprofit', 3:'Private profit'}
colscd['ownership'] = colscd['ownership'].replace(own_codes)
colscd.head()
name | state | ownership | type | sat_read_med | sat_math_med | sat_write_med | sh_las | sh_bus | sh_cs | earn_10 | debt_at_grad | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Alabama A & M University | AL | Public | bach | 425.0 | 428.0 | 414.0 | 0.062 | 0.000 | 0.035 | 31000 | 32750 |
1 | University of Alabama at Birmingham | AL | Public | bach | 535.0 | 645.0 | NaN | 0.019 | 0.249 | 0.014 | 41200 | 21833 |
2 | Amridge University | AL | Private nonprofit | bach | NaN | NaN | NaN | 0.127 | 0.000 | 0.000 | 39600 | 22890 |
3 | University of Alabama in Huntsville | AL | Public | bach | 605.0 | 600.0 | NaN | 0.000 | 0.102 | 0.075 | 46700 | 22647 |
4 | Alabama State University | AL | Public | bach | 433.0 | 428.0 | NaN | 0.000 | 0.183 | 0.057 | 27700 | 31500 |
Set the index to the university name.
colscd.set_index('name', inplace=True)
How do we look?
colscd.loc['University of Georgia']
state                GA
ownership        Public
type               bach
sat_read_med      650.0
sat_math_med      635.0
sat_write_med     610.0
sh_las            0.002
sh_bus            0.041
sh_cs              0.02
earn_10           50500
debt_at_grad      18750
Name: University of Georgia, dtype: object
Hmm... it's hard to understand these variables without some context. The goal of this lecture is to add context.
Let's look at the different data types in the DataFrame:
colscd.dtypes
state             object
ownership         object
type              object
sat_read_med     float64
sat_math_med     float64
sat_write_med    float64
sh_las           float64
sh_bus           float64
sh_cs            float64
earn_10           object
debt_at_grad      object
dtype: object
Doh! Looks like the earnings (earn_10) and debt (debt_at_grad) columns came in as objects instead of floats.
The culprit is the 'PrivacySuppressed' flag. We could have told read_csv about this if we knew in advance (a sketch of that follows below). Instead, let's practice pd.to_numeric(), which tries to convert a column to numeric values. I pass the argument errors='coerce' to tell the function to set anything it cannot convert to NaN.
colscd['earn_10'] = pd.to_numeric(colscd['earn_10'], errors='coerce')
colscd['debt_at_grad'] = pd.to_numeric(colscd['debt_at_grad'], errors='coerce')
colscd.dtypes
state             object
ownership         object
type              object
sat_read_med     float64
sat_math_med     float64
sat_write_med    float64
sh_las           float64
sh_bus           float64
sh_cs            float64
earn_10          float64
debt_at_grad     float64
dtype: object
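As an aside, here is how we could have handled the flag at load time, as mentioned above. A minimal sketch reusing the same file path; the na_values argument of read_csv marks the flag as missing while parsing (the name colscd_raw is just for illustration):
colscd_raw = pd.read_csv('./Data/college_scorecard/Most-Recent-Cohorts-Scorecard-Elements.csv',
                         na_values='PrivacySuppressed')   # treat the flag as NaN on the way in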
2. The "groupby" method (top)¶
Pandas provides the groupby() method to ease computing statistics by group (see the docs). This kind of method shows up in many data-oriented computing languages and packages. The idea is summed up as split-apply-combine. The big idea, illustrated with a toy example below, is to:
- Split the data up into groups. The groups are defined by key variables.
- Apply some method or function to each group: mean, std, max, etc. This returns a smaller bit of data, often just one number.
- Combine the results of the 'apply' from each group into a new data structure.
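To make the three steps concrete before we hit the real data, here is a minimal sketch on a made-up toy DataFrame (the numbers are hypothetical):
toy = pd.DataFrame({'group': ['a', 'a', 'b', 'b'], 'x': [1.0, 2.0, 3.0, 5.0]})
toy.groupby('group')['x'].mean()   # split on 'group', apply mean() to each piece, combine into a Series
# group
# a    1.5
# b    4.0
# Name: x, dtype: float64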
We pass groupby a 'key' which tells the method which variable to, well, group by. This is the split step.
colscd_grouped = colscd.groupby('state')
What is colscd_grouped?
print(type(colscd_grouped))
<class 'pandas.core.groupby.generic.DataFrameGroupBy'>
A DataFrameGroupBy object. This is basically a DataFrame + the grouping information.
What does it look like?
colscd_grouped.head()
state | ownership | type | sat_read_med | sat_math_med | sat_write_med | sh_las | sh_bus | sh_cs | earn_10 | debt_at_grad | |
---|---|---|---|---|---|---|---|---|---|---|---|
name | |||||||||||
Alabama A & M University | AL | Public | bach | 425.0 | 428.0 | 414.0 | 0.062 | 0.000 | 0.035 | 31000.0 | 32750.0 |
University of Alabama at Birmingham | AL | Public | bach | 535.0 | 645.0 | NaN | 0.019 | 0.249 | 0.014 | 41200.0 | 21833.0 |
Amridge University | AL | Private nonprofit | bach | NaN | NaN | NaN | 0.127 | 0.000 | 0.000 | 39600.0 | 22890.0 |
University of Alabama in Huntsville | AL | Public | bach | 605.0 | 600.0 | NaN | 0.000 | 0.102 | 0.075 | 46700.0 | 22647.0 |
Alabama State University | AL | Public | bach | 433.0 | 428.0 | NaN | 0.000 | 0.183 | 0.057 | 27700.0 | 31500.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Palau Community College | PW | Public | asc | NaN | NaN | NaN | 0.059 | 0.108 | 0.029 | NaN | NaN |
University of the Virgin Islands | VI | Public | bach | 483.0 | 435.0 | 410.0 | 0.004 | 0.091 | 0.049 | 32000.0 | 13761.5 |
College of the Marshall Islands | MH | Public | asc | NaN | NaN | NaN | 0.482 | 0.128 | 0.000 | NaN | NaN |
Pacific Islands University | GU | Private nonprofit | bach | NaN | NaN | NaN | 0.529 | 0.000 | 0.000 | NaN | NaN |
University of the Virgin Islands-Albert A. Sheen | VI | Public | na | NaN | NaN | NaN | NaN | NaN | NaN | 32000.0 | 13761.5 |
270 rows × 11 columns
Looks just like a regular DataFrame. (One subtlety: head() on a grouped object returns the first five rows of each group, which is why we see 270 rows rather than 5.)
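Two more things we can do with a grouped object, sketched below: get_group() pulls out one group's DataFrame, and iterating over the object yields (key, DataFrame) pairs. (The state code 'WI' is just an example.)
wi = colscd_grouped.get_group('WI')   # just the Wisconsin institutions, as a regular DataFrame
for state, group in colscd_grouped:   # iterating yields (key, DataFrame) pairs
    print(state, group.shape)         # the state code and that group's dimensions
    break                             # stop after the first group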
Aggregating by group
A major use of groupby is to perform some kind of aggregation. This is the apply and combine step. Let's take the grouped data and compute some means.
all_means = colscd_grouped.mean() # apply the mean operator to the grouped data (newer pandas versions need numeric_only=True here to skip non-numeric columns)
print(type(all_means)) # what do we get back?
<class 'pandas.core.frame.DataFrame'>
Ah, a DataFrame. We know what to do with that.
all_means.head(10)
sat_read_med | sat_math_med | sat_write_med | sh_las | sh_bus | sh_cs | earn_10 | debt_at_grad | |
---|---|---|---|---|---|---|---|---|
state | ||||||||
AK | 575.000 | 575.000 | 482.500 | 0.076 | 0.169 | 0.017 | 37800.000 | 13711.500 |
AL | 534.773 | 528.727 | 498.000 | 0.104 | 0.243 | 0.025 | 31548.000 | 19370.080 |
AR | 554.500 | 551.929 | 464.200 | 0.115 | 0.195 | 0.017 | 28575.000 | 15111.977 |
AS | NaN | NaN | NaN | 0.373 | 0.066 | 0.000 | 22300.000 | NaN |
AZ | 581.143 | 570.714 | 505.000 | 0.064 | 0.284 | 0.028 | 33579.310 | 13884.727 |
CA | 584.542 | 582.958 | 570.540 | 0.066 | 0.270 | 0.024 | 37350.787 | 14356.604 |
CO | 578.133 | 574.733 | 570.000 | 0.048 | 0.293 | 0.023 | 33572.000 | 16889.060 |
CT | 570.000 | 560.412 | 518.167 | 0.067 | 0.264 | 0.017 | 42060.317 | 15397.246 |
DC | 624.600 | 617.000 | 605.000 | 0.012 | 0.144 | 0.064 | 47568.750 | 23521.088 |
DE | 501.667 | 510.000 | 472.333 | 0.016 | 0.320 | 0.024 | 36263.636 | 14817.941 |
When we call mean() on the grouped data, it applies the mean method to each group, which creates one number per group (for each column). It then combines the means into a DataFrame, one number per group per column. Nice.
Notice that the non-numeric columns (ownership, type) have been dropped, and the grouping variable state is now the index.
Here we can also see the result of pd.set_option('display.precision', 3). The output is limited to 3 decimal places.
Computing the grouped data first helped us understand what was happening, but we can do the whole split-apply-combine in one simple line of code.
all_means = colscd.groupby('state').mean()
all_means.head(10)
sat_read_med | sat_math_med | sat_write_med | sh_las | sh_bus | sh_cs | earn_10 | debt_at_grad | |
---|---|---|---|---|---|---|---|---|
state | ||||||||
AK | 575.000 | 575.000 | 482.500 | 0.076 | 0.169 | 0.017 | 37800.000 | 13711.500 |
AL | 534.773 | 528.727 | 498.000 | 0.104 | 0.243 | 0.025 | 31548.000 | 19370.080 |
AR | 554.500 | 551.929 | 464.200 | 0.115 | 0.195 | 0.017 | 28575.000 | 15111.977 |
AS | NaN | NaN | NaN | 0.373 | 0.066 | 0.000 | 22300.000 | NaN |
AZ | 581.143 | 570.714 | 505.000 | 0.064 | 0.284 | 0.028 | 33579.310 | 13884.727 |
CA | 584.542 | 582.958 | 570.540 | 0.066 | 0.270 | 0.024 | 37350.787 | 14356.604 |
CO | 578.133 | 574.733 | 570.000 | 0.048 | 0.293 | 0.023 | 33572.000 | 16889.060 |
CT | 570.000 | 560.412 | 518.167 | 0.067 | 0.264 | 0.017 | 42060.317 | 15397.246 |
DC | 624.600 | 617.000 | 605.000 | 0.012 | 0.144 | 0.064 | 47568.750 | 23521.088 |
DE | 501.667 | 510.000 | 472.333 | 0.016 | 0.320 | 0.024 | 36263.636 | 14817.941 |
groupby() with many keys
Can we group by several keys? You know we can. Let's compute the medians this time.
sat_medians = colscd.groupby(['ownership','type']).median()
sat_medians
sat_read_med | sat_math_med | sat_write_med | sh_las | sh_bus | sh_cs | earn_10 | debt_at_grad | ||
---|---|---|---|---|---|---|---|---|---|
ownership | type | ||||||||
Private nonprofit | asc | 475.0 | 475.0 | 438.0 | 0.000 | 0.200 | 0.000 | 32700.0 | 18750.00 |
bach | 560.0 | 545.0 | 508.0 | 0.000 | 0.032 | 0.006 | 42700.0 | 25000.00 | |
cert | 535.0 | 517.0 | 480.0 | 0.000 | 0.686 | 0.000 | 26800.0 | 10556.00 | |
grad_only | NaN | NaN | NaN | 0.000 | 0.000 | 0.000 | 83100.0 | 17361.00 | |
na | NaN | NaN | NaN | NaN | NaN | NaN | 32600.0 | 23340.00 | |
Private profit | asc | NaN | NaN | NaN | 0.000 | 0.525 | 0.000 | 28450.0 | 20635.75 |
bach | 550.0 | 523.0 | 488.5 | 0.000 | 0.000 | 0.037 | 41400.0 | 30481.00 | |
cert | NaN | NaN | NaN | 0.000 | 0.000 | 0.000 | 22350.0 | 9500.00 | |
grad_only | NaN | NaN | NaN | 0.000 | 0.000 | 0.000 | 34500.0 | 30352.00 | |
na | NaN | NaN | NaN | NaN | NaN | NaN | 34600.0 | 18254.00 | |
Public | asc | 512.5 | 495.0 | 443.0 | 0.315 | 0.181 | 0.027 | 31500.0 | 9655.50 |
bach | 550.0 | 548.0 | 490.0 | 0.019 | 0.093 | 0.025 | 43000.0 | 21500.00 | |
cert | NaN | NaN | 391.0 | 0.000 | 0.345 | 0.008 | 28900.0 | 9342.50 | |
grad_only | NaN | NaN | NaN | NaN | NaN | NaN | 102100.0 | 21750.00 | |
na | NaN | NaN | NaN | NaN | NaN | NaN | 34600.0 | 11848.50 |
Now we have a MultiIndexed DataFrame with the summary statistics, this time the medians.
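Rows of a MultiIndexed DataFrame can be retrieved by passing .loc a tuple of keys, one per level; a quick sketch:
sat_medians.loc[('Public', 'bach')]   # the row of medians for public, predominantly-bachelor's institutions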
groupby() on a subset of columns
We may not care about all the columns in our dataset for a particular groupby. Taking a subset of columns is straightforward.
Let's focus on the median SAT scores.
sat_medians = colscd.groupby(['ownership','type'])[['sat_read_med', 'sat_math_med', 'sat_write_med']].median()
sat_medians
sat_read_med | sat_math_med | sat_write_med | ||
---|---|---|---|---|
ownership | type | |||
Private nonprofit | asc | 475.0 | 475.0 | 438.0 |
bach | 560.0 | 545.0 | 508.0 | |
cert | 535.0 | 517.0 | 480.0 | |
grad_only | NaN | NaN | NaN | |
na | NaN | NaN | NaN | |
Private profit | asc | NaN | NaN | NaN |
bach | 550.0 | 523.0 | 488.5 | |
cert | NaN | NaN | NaN | |
grad_only | NaN | NaN | NaN | |
na | NaN | NaN | NaN | |
Public | asc | 512.5 | 495.0 | 443.0 |
bach | 550.0 | 548.0 | 490.0 | |
cert | NaN | NaN | 391.0 | |
grad_only | NaN | NaN | NaN | |
na | NaN | NaN | NaN |
All three ownership types have institutions that predominantly offer bachelor's degrees. Let's grab that set of statistics.
bach_sat_med = sat_medians.xs('bach', level='type') # xs() indexes data from a MultiIndex
print(bach_sat_med)
                   sat_read_med  sat_math_med  sat_write_med
ownership
Private nonprofit         560.0         545.0          508.0
Private profit            550.0         523.0          488.5
Public                    550.0         548.0          490.0
How do the median SAT scores compare across public and private institutions?
There are a few new plotting tricks here...
fig, ax = plt.subplots(1, 3, figsize=(21,6))
# Set up the color scheme. This makes it easier to fiddle with.
bar_color = 'red'
bar_alpha = 0.35
# Plot one SAT variable on each axes
ax[0].bar(bach_sat_med.index, bach_sat_med['sat_read_med'], color=bar_color, alpha=bar_alpha) # reading first, to match the title order below
ax[1].bar(bach_sat_med.index, bach_sat_med['sat_math_med'], color=bar_color, alpha=bar_alpha)
ax[2].bar(bach_sat_med.index, bach_sat_med['sat_write_med'], color=bar_color, alpha=bar_alpha)
# Titles!
ax[0].set_title('SAT reading')
ax[1].set_title('SAT math')
ax[2].set_title('SAT writing')
# I am only setting the ylabel on the left-most. Save some non-data ink.
ax[0].set_ylabel('Median score')
# Set these common parameters by looping over the axes.
for a in ax:
a.spines['top'].set_visible(False)
a.spines['right'].set_visible(False)
a.grid(axis='y', color='gray',linestyle='dashed') # Is this too much clutter? Maybe.
a.xaxis.set_tick_params(length=0) # Kill the xaxis ticks marks
a.yaxis.set_tick_params(length=0) # Kill the yaxis tick marks
a.set_ylim(0,600) # Make sure all plots have the same y-axis to make comparisons easier.
plt.show()
Interesting. Private for-profit institutions look a lot like public ones on reading and writing, but they trail both other ownership types on math, and private nonprofits post the highest reading and writing scores. Once we fire up some stats model packages, we can do formal tests to see if the differences are significant.
Grouping variables as index
In all of the code above, groupby placed the grouping variables in the index, but we don't have to do that. We can instead use the as_index argument to keep the grouping variables as regular columns.
colscd.groupby(['ownership','type'],as_index=False)[['sat_read_med', 'sat_math_med', 'sat_write_med']].median()
ownership | type | sat_read_med | sat_math_med | sat_write_med | |
---|---|---|---|---|---|
0 | Private nonprofit | asc | 475.0 | 475.0 | 438.0 |
1 | Private nonprofit | bach | 560.0 | 545.0 | 508.0 |
2 | Private nonprofit | cert | 535.0 | 517.0 | 480.0 |
3 | Private nonprofit | grad_only | NaN | NaN | NaN |
4 | Private nonprofit | na | NaN | NaN | NaN |
5 | Private profit | asc | NaN | NaN | NaN |
6 | Private profit | bach | 550.0 | 523.0 | 488.5 |
7 | Private profit | cert | NaN | NaN | NaN |
8 | Private profit | grad_only | NaN | NaN | NaN |
9 | Private profit | na | NaN | NaN | NaN |
10 | Public | asc | 512.5 | 495.0 | 443.0 |
11 | Public | bach | 550.0 | 548.0 | 490.0 |
12 | Public | cert | NaN | NaN | 391.0 |
13 | Public | grad_only | NaN | NaN | NaN |
14 | Public | na | NaN | NaN | NaN |
Practice
- Create a dataset with only public institutions. Name it pub.
pub = colscd[colscd['ownership']=='Public']
The quantile() method computes quantiles from the data (e.g., quantile(0.5) computes the median, the 50th percentile).
Let's look at a measure of the earnings spread for different institution types.
a. Compute the 75th quantile for 'earn_10' for each 'type'.
b. Compute the 50th quantile for 'earn_10' for each 'type'.
c. Compute the 25th quantile for 'earn_10' for each 'type'.
q75 = pub.groupby('type')[['earn_10']].quantile(0.75)
q50 = pub.groupby('type')[['earn_10']].quantile(0.50)
q25 = pub.groupby('type')[['earn_10']].quantile(0.25)
d. For each type, compute the difference between the 75th percentile and the 25th percentile. This is known as the inter-quartile range (IQR). Divide the IQR by the median. This gives us a notion of spread, standardized by the median.
spread = (q75 - q25)/ q50
spread
earn_10 | |
---|---|
type | |
asc | 0.196 |
bach | 0.266 |
cert | 0.266 |
grad_only | 0.917 |
na | 0.199 |
Wow, a lot of dispersion in the "grad_only" group. Let's practice some more.
- How do reading and writing scores correlate? Compute the median SAT reading and writing scores by state.
sat_rw_med = pub.groupby('state')[['sat_read_med', 'sat_write_med']].median()
sat_rw_med.head()
sat_read_med | sat_write_med | |
---|---|---|
state | ||
AK | 575.0 | 510.0 |
AL | 537.5 | 514.0 |
AR | 517.0 | 450.0 |
AS | NaN | NaN |
AZ | 590.0 | NaN |
- Create a scatter plot with the median reading score on the x axis and writing score on the y axis.
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(sat_rw_med['sat_read_med'], sat_rw_med['sat_write_med'], color='black', marker='x')
ax.plot([400, 600], [400, 600], color='black', linewidth=0.75) # 45-degree line
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_xlabel('median SAT reading score')
ax.set_ylabel('median SAT writing score')
plt.show()
Working with quantiles
We can group the data into bins using the pd.cut() function.
bus_quart = pd.cut(colscd['sh_bus'], 3, right=False)
bus_quart[:10]
name
Alabama A & M University               [0.0, 0.333)
University of Alabama at Birmingham    [0.0, 0.333)
Amridge University                     [0.0, 0.333)
University of Alabama in Huntsville    [0.0, 0.333)
Alabama State University               [0.0, 0.333)
The University of Alabama              [0.0, 0.333)
Central Alabama Community College      [0.0, 0.333)
Athens State University                [0.0, 0.333)
Auburn University at Montgomery        [0.0, 0.333)
Auburn University                      [0.0, 0.333)
Name: sh_bus, dtype: category
Categories (3, interval[float64, left]): [[0.0, 0.333) < [0.333, 0.667) < [0.667, 1.001)]
Remember, cut returns a Categorical object. We can use this object as our key variable in a groupby.
earn_bus = colscd.groupby(bus_quart)['earn_10']
Notice that we have used a key variable that was not a column of the DataFrame colscd. The key variable can be very general (there is a sketch after the next output); McKinney covers some other kinds of key variables.
We now have a DataFrameGroupBy object. What is the distribution of institutions across the bins?
earn_bus.count()
sh_bus
[0.0, 0.333)      3435
[0.333, 0.667)     631
[0.667, 1.001)     775
Name: earn_10, dtype: int64
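Since the key can be so general, we can even group by a function, which pandas applies to each index value. A toy sketch, grouping institutions by the first letter of their name (remember the index is the institution name):
colscd.groupby(lambda name: name[0])['earn_10'].median().head()   # median earnings by first letter of the name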
Comment: There is also a function called qcut. The difference between the two amounts to how the bins are constructed:
- cut chooses the bins to be evenly spaced according to the values themselves, not the frequency of those values. Think of this as a tabular form of a histogram.
- qcut chooses the bins so that the same number of records falls in each bin. For example, if you have 30 records and ask for five bins (quintiles), there will be 6 observations in each bin.
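A quick sketch of the difference, using three bins on earn_10 (chosen here because its values are close to continuous, so the quantile bin edges are unique):
even_width = pd.cut(colscd['earn_10'], 3)    # three bins of equal width
even_count = pd.qcut(colscd['earn_10'], 3)   # three bins with roughly equal numbers of schools
print(even_width.value_counts())             # lopsided counts
print(even_count.value_counts())             # roughly equal counts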
Several statistics at once
Once we have grouped our data, we have been hitting it with methods to compute statistics: mean(), count(), ...
We now introduce the agg() method, which lets us compute several statistics at once; you can even pass it a user-defined function.
# This is the same as earn_bus.count()
earn_bus.agg('count')
sh_bus
[0.0, 0.333)      3435
[0.333, 0.667)     631
[0.667, 1.001)     775
Name: earn_10, dtype: int64
# But agg() lets us compute many stats at once
earn_bus.agg(['count', 'mean', 'median', 'std'])
count | mean | median | std | |
---|---|---|---|---|
sh_bus | ||||
[0.0, 0.333) | 3435 | 35197.147 | 34300.0 | 12988.513 |
[0.333, 0.667) | 631 | 31037.876 | 29400.0 | 8657.804 |
[0.667, 1.001) | 775 | 33277.419 | 27900.0 | 15759.991 |
Schools that focus on business degrees don't seem to offer greater earnings opportunities.
Practice
- Write a function that returns the average of the 5 largest elements of a Series (a column of a DataFrame). Name the function 'avg5'. Go back to the lecture on user-defined functions to refresh your memory on how to create a function and read it into memory.
def avg5(x):
top5 = x.sort_values(ascending=False)[:5] # sort x, then grab the five largest. ".sort_values" defaults to ascending
return top5.mean() # return the mean
- Test your function on column 'a' of the DataFrame defined below. The answer should be 8.
test = pd.DataFrame({'a':[1, 4, 6, 9, 10, 3, 7, 8], 'b':[2, 3, 4, 5, 6, 7, 8, 10]})
avg5(test['a'])
8.0
- Back to the College Scorecard. Use the .dropna() method to drop any observation where 'debt_at_grad' is NaN. Use the subset argument so that only observations with a missing debt_at_grad are dropped.
colscd.dropna(subset=['debt_at_grad'], inplace=True) # Drop observations with missing debt_at_grad
- Compute the mean, median, and avg5 of 'debt_at_grad' by 'ownership'. Compute them all at once using agg().
colscd.groupby('ownership')['debt_at_grad'].agg(['mean', 'median', avg5]) # Compute the summary stats
mean | median | avg5 | |
---|---|---|---|
ownership | |||
Private nonprofit | 22446.198 | 24750.0 | 43574.2 |
Private profit | 14073.736 | 10556.0 | 45643.7 |
Public | 14203.894 | 12000.0 | 34669.2 |
- Create a function that calculates the average for the top ten schools. Redo (4), but now include this new statistic.
def avg10(x):
top10 = x.sort_values(ascending=False)[:10] # sort x, then grab the ten largest. ".sort_values" defaults to ascending
return top10.mean() # return the mean
colscd.groupby('ownership')['debt_at_grad'].agg(['mean', 'median', avg5, avg10]) # Compute the summary stats
mean | median | avg5 | avg10 | |
---|---|---|---|---|
ownership | ||||
Private nonprofit | 22446.198 | 24750.0 | 43574.2 | 42062.10 |
Private profit | 14073.736 | 10556.0 | 45643.7 | 40886.65 |
Public | 14203.894 | 12000.0 | 34669.2 | 33170.80 |
Different statistics for different variables
We may want to use different statistics for different variables. We can do that too, by passing a dict to .agg(): the keys name the columns to evaluate and the values tell pandas which statistic to calculate.
colscd.groupby('ownership').agg({'earn_10':'median',
'debt_at_grad':'median',
'sh_las':'mean',
'sh_bus':'mean'})
earn_10 | debt_at_grad | sh_las | sh_bus | |
---|---|---|---|---|
ownership | ||||
Private nonprofit | 41700.0 | 24750.0 | 3.880e-02 | 0.212 |
Private profit | 24800.0 | 10556.0 | 3.629e-04 | 0.312 |
Public | 33400.0 | 12000.0 | 1.642e-01 | 0.250 |
And we can generate multiple statistics for a given variable, too, by passing a list of statistics to .agg() for the corresponding variable.
colscd.groupby('ownership').agg({'earn_10':['mean','median'],
'debt_at_grad':['mean','median'],
'sh_las':'mean',
'sh_bus':'mean'})
earn_10 | debt_at_grad | sh_las | sh_bus | |||
---|---|---|---|---|---|---|
mean | median | mean | median | mean | mean | |
ownership | ||||||
Private nonprofit | 43837.040 | 41700.0 | 22446.198 | 24750.0 | 3.880e-02 | 0.212 |
Private profit | 27616.907 | 24800.0 | 14073.736 | 10556.0 | 3.629e-04 | 0.312 |
Public | 36071.245 | 33400.0 | 14203.894 | 12000.0 | 1.642e-01 | 0.250 |
The columns are a MultiIndex, which I think makes the table hard to read. We can combine the MultiIndex levels (in pythonese, "flatten" them) to create new column names that are easier to read (and informative).
df = colscd.groupby('ownership').agg({'earn_10':['mean','median'],
'debt_at_grad':['mean','median'],
'sh_las':'mean',
'sh_bus':'mean'})
df.columns = [': '.join(col) for col in df.columns.values]
df.head()
earn_10: mean | earn_10: median | debt_at_grad: mean | debt_at_grad: median | sh_las: mean | sh_bus: mean | |
---|---|---|---|---|---|---|
ownership | ||||||
Private nonprofit | 43837.040 | 41700.0 | 22446.198 | 24750.0 | 3.880e-02 | 0.212 |
Private profit | 27616.907 | 24800.0 | 14073.736 | 10556.0 | 3.629e-04 | 0.312 |
Public | 36071.245 | 33400.0 | 14203.894 | 12000.0 | 1.642e-01 | 0.250 |
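An alternative that avoids flattening altogether is pandas' "named aggregation" syntax (available since pandas 0.25), where each keyword argument names an output column and maps it to a (column, statistic) pair. A sketch of the same computation (the output names here are my own choices):
df2 = colscd.groupby('ownership').agg(
    earn_10_mean=('earn_10', 'mean'),
    earn_10_median=('earn_10', 'median'),
    debt_at_grad_mean=('debt_at_grad', 'mean'),
    debt_at_grad_median=('debt_at_grad', 'median'),
    sh_las_mean=('sh_las', 'mean'),
    sh_bus_mean=('sh_bus', 'mean'),
)
df2.head()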
Adding summary statistics to the original DataFrame
We can also generate statistics and add them back to the original DataFrame using the transform method. For instance, maybe we want to add the median debt by ownership type to the original DataFrame.
colscd['median by owner'] = colscd.groupby('ownership')['debt_at_grad'].transform('median')
Doing this enables us to compare observations to the relevant group statistic.
colscd['relative debt'] = colscd['debt_at_grad']/colscd['median by owner']
Let's see how we compare to similar institutions:
val = colscd.loc['University of Georgia', 'relative debt']
print(f'The relative debt of the median UGA student is {val:4.2f} times the median value of public institutions.')
The relative debt of the median UGA student is 1.56 times the median value of public institutions.
Wow. That's quite the difference. Compare student debt with earnings to see the relative value of a college education. You could plot the results and look for patterns.
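For instance, here is a minimal sketch of such a plot, in the same style as the figures above (one possible cut of the data):
fig, ax = plt.subplots(figsize=(10,6))
ax.scatter(colscd['debt_at_grad'], colscd['earn_10'], s=10, alpha=0.3, color='black') # one point per institution
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_xlabel('median debt at graduation')
ax.set_ylabel('median earnings (earn_10)')
plt.show()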