Lecture 11: Reshaping Data [SUGGESTED SOLUTIONS]
In this notebook, we learn about multiple indexes, or 'MultiIndex', and reshaping data. The MultiIndex is a great way to organize our data, and it provides an easy way to reshape our data: moving some variables from rows to columns and vice versa.
We can use a MultiIndex to reshape data, or we can use two related methods, pivot_table()
and melt()
, which let us reshape data without setting up a MultiIndex first. Sometimes, we want to reshape the data before we deal with indexing, and pivot and melt can save us some work.
1. Multiple Indexing
We have covered some pandas basics. Now let's sort out how to deal with more complex data. We will often find ourselves with data in which the unit of observation is complex. Pandas helps us deal with this by allowing for many index variables. So far, we have only used single indexing, but that is about to change.
Some examples that could use a MultiIndex:
- State and country
- Team and player
- Industry and firm
- Country (or person, firm,...) and time
That last one is important, and one that shows up a lot in economics. We call this panel data, or sometimes 'longitudinal' data. Our data are a panel when we have observations of the same firm/person/country over time.
The key to working with more complex datasets is getting the index right. So far, we have considered a single index, but pandas allows for multiple indexes that nest each other.
Key concept: Hierarchical indexing takes multiple levels of indexes.
Let's set up the DataFrame to take team and position as the indexes.
import pandas as pd # load pandas and shorten it to pd
import datetime as dt # load datetime and shorten it to dt
import matplotlib.pyplot as plt # for making figures
soccer = {'team' : ['Man City', 'Man City', 'Man City', 'Man City', 'Chelsea', 'Chelsea'],
'player' : ['Walker', 'Stones', 'Foden', 'Jesus', 'Cahill', 'Pedro'],
'pos' : ['D', 'D', 'M', 'F', 'D', 'F'],
'goals' : [1, 0, 0, 1, 0, 3],
'assists': [0,0,0,0,0,0]
}
prem = pd.DataFrame(soccer)
prem.set_index(['team', 'pos'], inplace=True)
prem.head()
player | goals | assists | ||
---|---|---|---|---|
team | pos | |||
Man City | D | Walker | 1 | 0 |
D | Stones | 0 | 0 | |
M | Foden | 0 | 0 | |
F | Jesus | 1 | 0 | |
Chelsea | D | Cahill | 0 | 0 |
Wow.
Notice that the set_index() method is the same one we used earlier with single indexes. In this case, we passed it a list of variables to make the indexes:
prem.set_index(['team', 'pos'], inplace=True)
In the output, the highest level of the index is team (we passed 'team' first in the list) and the second level is position. The output does not repeat the team name for each observation. The 'missing' team name just means that the team is the same as above. [A very Tufte-esque removal of unnecessary ink.]
Let's take a look under the hood. What's our index? A new kind of object: the MultiIndex
print(prem.index)
MultiIndex([('Man City', 'D'), ('Man City', 'D'), ('Man City', 'M'), ('Man City', 'F'), ( 'Chelsea', 'D'), ( 'Chelsea', 'F')], names=['team', 'pos'])
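As an aside, we can also build this kind of object directly from tuples. Here is a quick sketch (the tuples mirror the printout above):

```python
import pandas as pd

# Construct a MultiIndex directly, mirroring the printed index above
idx = pd.MultiIndex.from_tuples(
    [('Man City', 'D'), ('Man City', 'D'), ('Man City', 'M'),
     ('Man City', 'F'), ('Chelsea', 'D'), ('Chelsea', 'F')],
    names=['team', 'pos'])

print(idx.nlevels)       # 2 levels: team and pos
print(list(idx.names))   # ['team', 'pos']
```

In practice we rarely build the index by hand; set_index() does it for us.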
Subsetting with multiple indexes
With a MultiIndex, we need two index values to reference observations:
# All the defenders on Man City
prem.loc[('Man City', 'D'),:]
PerformanceWarning: indexing past lexsort depth may impact performance.
player | goals | assists | ||
---|---|---|---|---|
team | pos | |||
Man City | D | Walker | 1 | 0 |
D | Stones | 0 | 0 |
It's always a good idea to pay attention to warnings, particularly 'PerformanceWarning'. Pandas is telling us that we are asking for something in the second index, but the second index is not ordered. If the index were big, this could slow down our program. Let's fix that with sort_index().
Important: Sort your MultiIndex.
prem = prem.sort_index(axis=0) # tell pandas which axis to sort. Could sort the columns, too...
# returns a DataFrame unless we use inplace=True
prem.head()
player | goals | assists | ||
---|---|---|---|---|
team | pos | |||
Chelsea | D | Cahill | 0 | 0 |
F | Pedro | 3 | 0 | |
Man City | D | Walker | 1 | 0 |
D | Stones | 0 | 0 | |
F | Jesus | 1 | 0 |
# Now let's ask for all the defenders on Man City
prem.loc[('Man City', 'D'), :]
player | goals | assists | ||
---|---|---|---|---|
team | pos | |||
Man City | D | Walker | 1 | 0 |
D | Stones | 0 | 0 |
No warnings. Yeah.
Partial indexing
With the indexes set, we can easily subset the data using only one of the indexes. In pandas, this is called partial indexing because we are only using part of the index to identify the data we want.
We can use loc[]
like we do with a single index if we want to index on the top level index.
print(prem.loc['Chelsea']) # All the 'Chelsea' observations
print('\n')
print(prem.loc['Man City']) # All the 'Man City' observations
       player  goals  assists
pos
D      Cahill      0        0
F       Pedro      3        0

       player  goals  assists
pos
D      Walker      1        0
D      Stones      0        0
F       Jesus      1        0
M       Foden      0        0
The xs() method
We can also use the xs()
method to return a "cross-section" of the DataFrame (i.e., all observations with the index value we're interested in). Here we specify which level we are looking into. Note that I can reference the levels either by an integer or by name.
print(prem.xs('Chelsea', level = 0) ) # All the 'Chelsea' observations
print('\n')
print(prem.xs('Man City', level = 'team')) # All the 'Man City' observations
       player  goals  assists
pos
D      Cahill      0        0
F       Pedro      3        0

       player  goals  assists
pos
D      Walker      1        0
D      Stones      0        0
F       Jesus      1        0
M       Foden      0        0
We can partially index on the 'inner index' as well. Suppose we want all the defenders, regardless of team.
prem.xs('D', level=1) # pos (ie, position) is the second index
player | goals | assists | |
---|---|---|---|
team | |||
Chelsea | Cahill | 0 | 0 |
Man City | Walker | 1 | 0 |
Man City | Stones | 0 | 0 |
As before, we can get rid of the index and replace it with a generic list of integers...
prem.reset_index(inplace=True) # this moves the indexes back to columns
prem
team | pos | player | goals | assists | |
---|---|---|---|---|---|
0 | Chelsea | D | Cahill | 0 | 0 |
1 | Chelsea | F | Pedro | 3 | 0 |
2 | Man City | D | Walker | 1 | 0 |
3 | Man City | D | Stones | 0 | 0 |
4 | Man City | F | Jesus | 1 | 0 |
5 | Man City | M | Foden | 0 | 0 |
...and reset it with three levels of indexes!
prem.set_index(['team', 'player', 'pos'], inplace=True)
prem
goals | assists | |||
---|---|---|---|---|
team | player | pos | ||
Chelsea | Cahill | D | 0 | 0 |
Pedro | F | 3 | 0 | |
Man City | Walker | D | 1 | 0 |
Stones | D | 0 | 0 | |
Jesus | F | 1 | 0 | |
Foden | M | 0 | 0 |
A MultiIndex in columns
There is nothing that says you can't have multiple indexes in the axis=1
dimension. Here is a quick way to see this: transpose (math talk for flip) the DataFrame.
prem = prem.transpose() # this swaps the rows for columns
print(prem) # print() lines up the columns well
team     Chelsea         Man City
player    Cahill  Pedro    Walker  Stones  Jesus  Foden
pos            D      F         D       D      F      M
goals          0      3         1       0      1      0
assists        0      0         0       0      0      0
Now the rows are named 'goals' and 'assists' and the columns are (team, player, pos). I'm not sure this is a very useful way to look at this particular dataset, but MultiIndex columns can come in handy. Let's change it back.
prem = prem.transpose()
prem.head()
goals | assists | |||
---|---|---|---|---|
team | player | pos | ||
Chelsea | Cahill | D | 0 | 0 |
Pedro | F | 3 | 0 | |
Man City | Walker | D | 1 | 0 |
Stones | D | 0 | 0 | |
Jesus | F | 1 | 0 |
Summary statistics by level
MultiIndexes provide a quick way to summarize data. We will see many different ways to do this --- getting statistics by group --- and not all will involve a MultiIndex.
# When subsetting by the uppermost level, I can use xs or loc
print('Chelsea avg. goals', prem.xs('Chelsea', level='team')['goals'].mean()) # average goals for Chelsea players
print('Chelsea avg. goals', prem.loc['Chelsea','goals'].mean()) # average goals for Chelsea players
# When subsetting on the inner levels, I use xs
print('Defender avg. goals {0:.2f}.'.format( prem.xs('D', level='pos')['goals'].mean() ) ) # average goals for defenders
Chelsea avg. goals 1.5
Chelsea avg. goals 1.5
Defender avg. goals 0.33.
PerformanceWarning: indexing past lexsort depth may impact performance.
Notice the syntax with xs.
prem.xs('Chelsea', level='team')['goals']
The prem.xs('Chelsea', level='team')
is returning a DataFrame with all the columns.
We then use the usual square-bracket syntax to pick off just the column 'goals' and then hit it with mean().
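As a preview of those other ways, here is a sketch using groupby(), which we have not covered yet. It computes the same by-level statistics without any xs() calls (the frame is rebuilt here so the sketch stands on its own):

```python
import pandas as pd

# Rebuild the example frame with a (team, player, pos) index, as above
prem = pd.DataFrame({
    'team':    ['Man City', 'Man City', 'Man City', 'Man City', 'Chelsea', 'Chelsea'],
    'player':  ['Walker', 'Stones', 'Foden', 'Jesus', 'Cahill', 'Pedro'],
    'pos':     ['D', 'D', 'M', 'F', 'D', 'F'],
    'goals':   [1, 0, 0, 1, 0, 3],
    'assists': [0, 0, 0, 0, 0, 0],
}).set_index(['team', 'player', 'pos'])

# groupby(level=...) averages within an index level -- no xs() needed
print(prem.groupby(level='team')['goals'].mean())  # Chelsea 1.5, Man City 0.5
print(prem.groupby(level='pos')['goals'].mean())   # D 0.33, F 2.0, M 0.0
```

More on groupby() in a later lecture.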
Saving MultiIndexed DataFrames
Saving a MultiIndexed DataFrame works like before. Pandas fills in all the repeated labels, so the output is ready to go. Run the following code and then open the csv files.
# Multiple indexes on rows
prem.to_csv('prem.csv')
# Multiple indexes on columns
prem = prem.transpose()
prem.to_csv('prem_transposed.csv')
Practice: MultiIndex
Take a few minutes and try the following. Feel free to chat with those around if you get stuck.
- Load
CPS_March_2016.csv
which contains wage data from the March 2016 Current Population Survey (CPS). There are some missing values, so load the data with na_values='.'
cps = pd.read_csv('./Data/CPS_March_2016.csv',na_values = '.')
cps.head(20)
hrwage | educ | female | fulltimely | |
---|---|---|---|---|
0 | 20.961538 | Some college | 0 | 1.0 |
1 | 20.192308 | HS diploma/GED | 1 | 1.0 |
2 | 6.410256 | Some college | 0 | 0.0 |
3 | NaN | Less than HS | 0 | NaN |
4 | NaN | Some college | 0 | NaN |
5 | NaN | HS diploma/GED | 1 | NaN |
6 | 14.285714 | HS diploma/GED | 1 | 1.0 |
7 | 0.000000 | Some college | 0 | 0.0 |
8 | NaN | HS diploma/GED | 1 | NaN |
9 | NaN | College degree | 0 | NaN |
10 | 18.269230 | Some college | 1 | 1.0 |
11 | NaN | HS diploma/GED | 1 | NaN |
12 | 59.523810 | Graduate degree | 0 | 1.0 |
13 | 18.367348 | College degree | 1 | 1.0 |
14 | 8.653846 | HS diploma/GED | 0 | 1.0 |
15 | 59.136210 | Graduate degree | 0 | 1.0 |
16 | NaN | College degree | 1 | NaN |
17 | NaN | Less than HS | 1 | NaN |
18 | NaN | Less than HS | 0 | NaN |
19 | NaN | NaN | 1 | NaN |
- Keep only observations where the respondent worked full-time (
fulltimely == 1
).
cps = cps[cps['fulltimely'] == 1]
- Keep individuals with wages between $5 and $200.
cps = cps[cps['hrwage'] <= 200]
cps = cps[cps['hrwage'] >= 5]
# Alternatively,
# cps = cps[(cps['hrwage'] >= 5) & (cps['hrwage'] <= 200)]
- Rename 'female' to 'sex'
cps.rename(columns={'female':'sex'}, inplace=True)
- In column 'sex' replace 0 with 'male' and 1 with 'female' using
.loc[]
cps.loc[cps['sex']==0, 'sex'] = 'male'
cps.loc[cps['sex']==1, 'sex'] = 'female'
cps.head(20)
hrwage | educ | sex | fulltimely | |
---|---|---|---|---|
0 | 20.961538 | Some college | male | 1.0 |
1 | 20.192308 | HS diploma/GED | female | 1.0 |
6 | 14.285714 | HS diploma/GED | female | 1.0 |
10 | 18.269230 | Some college | female | 1.0 |
12 | 59.523810 | Graduate degree | male | 1.0 |
13 | 18.367348 | College degree | female | 1.0 |
14 | 8.653846 | HS diploma/GED | male | 1.0 |
15 | 59.136210 | Graduate degree | male | 1.0 |
21 | 19.711538 | College degree | male | 1.0 |
22 | 22.349272 | College degree | female | 1.0 |
27 | 7.692308 | Some college | female | 1.0 |
28 | 20.054945 | College degree | male | 1.0 |
35 | 18.333334 | HS diploma/GED | male | 1.0 |
36 | 12.980769 | Some college | female | 1.0 |
37 | 21.634615 | HS diploma/GED | male | 1.0 |
40 | 18.452381 | Less than HS | male | 1.0 |
42 | 18.269230 | College degree | female | 1.0 |
45 | 20.879122 | HS diploma/GED | male | 1.0 |
49 | 11.057693 | HS diploma/GED | male | 1.0 |
53 | 8.653846 | Some college | male | 1.0 |
- Set the index to 'sex' and 'educ', in that order.
cps.set_index(['sex', 'educ'], inplace=True)
- Sort the index using
.sort_index()
cps.sort_index(axis=0, inplace=True)
cps.head(20)
hrwage | fulltimely | ||
---|---|---|---|
sex | educ | ||
female | College degree | 18.367348 | 1.0 |
College degree | 22.349272 | 1.0 | |
College degree | 18.269230 | 1.0 | |
College degree | 43.269230 | 1.0 | |
College degree | 19.230770 | 1.0 | |
College degree | 22.596153 | 1.0 | |
College degree | 29.807692 | 1.0 | |
College degree | 22.596153 | 1.0 | |
College degree | 15.277778 | 1.0 | |
College degree | 18.571428 | 1.0 | |
College degree | 17.211538 | 1.0 | |
College degree | 24.038462 | 1.0 | |
College degree | 21.634615 | 1.0 | |
College degree | 18.589743 | 1.0 | |
College degree | 16.239317 | 1.0 | |
College degree | 5.769231 | 1.0 | |
College degree | 25.641026 | 1.0 | |
College degree | 27.777779 | 1.0 | |
College degree | 15.384615 | 1.0 | |
College degree | 16.025640 | 1.0 |
- Report the average wage for males and females. Try it with the
loc[]
method.
avg_wage_f = cps.loc['female', 'hrwage'].mean()
avg_wage_m = cps.loc['male', 'hrwage'].mean()
print('Average wage of females is ${0:.2f} and males is ${1:.2f}.'.format(avg_wage_f, avg_wage_m) )
Average wage of females is $22.75 and males is $28.31.
- Report the average wage for
HS diploma/GED
and for College degree
, regardless of sex. Use the xs()
method.
avg_wage_high = cps.xs('HS diploma/GED', level='educ')['hrwage'].mean()
avg_wage_coll = cps.xs('College degree', level='educ')['hrwage'].mean()
print('Average wage of those with high school degrees is ${0:.2f} and those with a college degree is ${1:.2f}.'.format(avg_wage_high, avg_wage_coll) )
Average wage of those with high school degrees is $19.11 and those with a college degree is $31.96.
# load a data file with the number of walks and snacks my dogs have had
dogs = pd.read_csv('./Data/dogs.csv') # data on the habits of my dogs
dogs
dog | var | time | value | |
---|---|---|---|---|
0 | Buster | walk | 10/1/2018 | 2 |
1 | Buster | walk | 10/2/2018 | 2 |
2 | Buster | walk | 10/3/2018 | 1 |
3 | Buster | snacks | 10/1/2018 | 4 |
4 | Buster | snacks | 10/2/2018 | 3 |
5 | Buster | snacks | 10/3/2018 | 4 |
6 | Buster | snacks | 10/4/2018 | 5 |
7 | Su | walk | 10/1/2018 | 2 |
8 | Su | walk | 10/2/2018 | 2 |
9 | Su | walk | 10/3/2018 | 2 |
10 | Su | snacks | 10/1/2018 | 3 |
11 | Su | snacks | 10/2/2018 | 4 |
12 | Su | snacks | 10/3/2018 | 4 |
13 | Su | snacks | 10/4/2018 | 3 |
Cheat Sheet
This data format is called long because there are lots of rows and not many columns. Moving between long and wide (lots of columns, fewer rows) is a common task in setting up panel data sets.
Pandas calls long data stacked and wide data unstacked.
With a MultiIndex:
We use the stack()
and unstack()
methods for moving between long and wide with MultiIndexed data. Stack and unstack do not work in place. They always return a copy, so we need to assign the result to a variable.
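A minimal sketch of that round trip, with toy data rather than the dogs file:

```python
import pandas as pd

# A tiny long ('stacked') frame with a two-level row index
long_df = pd.DataFrame(
    {'value': [1, 2, 3, 4]},
    index=pd.MultiIndex.from_product([['a', 'b'], ['x', 'y']],
                                     names=['grp', 'item']))

# unstack() moves the inner row level ('item') into the columns: long -> wide
wide_df = long_df.unstack('item')
print(wide_df.shape)  # (2, 2): one row per grp, one column per item

# stack() moves it back: wide -> long. Neither call works in place,
# so we assign the result to a new variable.
back = wide_df.stack('item')
print(back['value'].tolist())
```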
Without a MultiIndex:
We use the pivot_table()
and melt()
methods to move between long and wide data.
3. Reshaping From Long to Wide
We start by pivoting the data from long to wide with pivot_table(). It moves rows into columns.
Instead of referencing indexes, we reference columns in the long data. We tell pivot_table() which column will be the index of the wide data, which columns will be the columns of the wide data, and which column holds the values.
# Make time variables datetime objects. Always a good practice.
dogs['time'] = pd.to_datetime(dogs['time'])
# Pivot to wide: time on the rows, (dog, var) on the columns
dogs_wide = dogs.pivot_table(index='time', columns=['dog', 'var'], values='value')
dogs_wide
dog | Buster | Su | ||
---|---|---|---|---|
var | snacks | walk | snacks | walk |
time | ||||
2018-10-01 | 4.0 | 2.0 | 3.0 | 2.0 |
2018-10-02 | 3.0 | 2.0 | 4.0 | 2.0 |
2018-10-03 | 4.0 | 1.0 | 4.0 | 2.0 |
2018-10-04 | 5.0 | NaN | 3.0 | NaN |
A nice result. We basically jumped right to a panel. Notice that the order in which we pass columns determines the level of the MultiIndex.
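To see the ordering point concretely, here is a sketch with a toy version of the dogs data; swapping the list puts var on top:

```python
import pandas as pd

# Toy version of the dogs data (one date, two dogs, two variables)
dogs = pd.DataFrame({
    'dog':   ['Buster', 'Buster', 'Su', 'Su'],
    'var':   ['walk', 'snacks', 'walk', 'snacks'],
    'time':  ['10/1/2018', '10/1/2018', '10/1/2018', '10/1/2018'],
    'value': [2, 4, 2, 3],
})

# 'dog' first in the list: dog is the top column level
w1 = dogs.pivot_table(index='time', columns=['dog', 'var'], values='value')
print(list(w1.columns.names))  # ['dog', 'var']

# 'var' first in the list: var is the top column level
w2 = dogs.pivot_table(index='time', columns=['var', 'dog'], values='value')
print(list(w2.columns.names))  # ['var', 'dog']
```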
Now that I have a MultiIndexed 'wide' DataFrame, I can turn it into a 'long' DataFrame using stack()
, since stack()
knows what to do with multiple indexes.
dogs_long = dogs_wide.stack(['dog', 'var'])
dogs_long
time        dog     var
2018-10-01  Buster  snacks    4.0
                    walk      2.0
            Su      snacks    3.0
                    walk      2.0
2018-10-02  Buster  snacks    3.0
                    walk      2.0
            Su      snacks    4.0
                    walk      2.0
2018-10-03  Buster  snacks    4.0
                    walk      1.0
            Su      snacks    4.0
                    walk      2.0
2018-10-04  Buster  snacks    5.0
            Su      snacks    3.0
dtype: float64
onebeds = pd.read_csv('./Data/Metro_MedianRentalPrice_1Bedroom.csv')
onebeds.head()
RegionName | SizeRank | 2010-02 | 2010-03 | 2010-04 | 2010-05 | 2010-06 | 2010-07 | 2010-08 | 2010-09 | ... | 2017-11 | 2017-12 | 2018-01 | 2018-02 | 2018-03 | 2018-04 | 2018-05 | 2018-06 | 2018-07 | 2018-08 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | United States | 0 | NaN | 1200.0 | 1245.0 | 1200.0 | 1250.0 | 1200.0 | 1200.0 | 1044.0 | ... | 1400.0 | 1400.0 | 1400.0 | 1405.0 | 1425.0 | 1434.0 | 1450.0 | 1450.0 | 1460.0 | 1460.0 |
1 | New York, NY | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2095.0 | 2050.0 | 2050.0 | 2085.0 | 2100.0 | 2135.0 | 2150.0 | 2192.0 | 2155.0 | 2153.0 |
2 | Los Angeles, CA | 2 | NaN | 1380.0 | 1425.0 | 1420.0 | 1437.5 | 1550.0 | 1600.0 | 1388.0 | ... | 1961.0 | 1947.0 | 1930.0 | 1950.0 | 1960.0 | 1950.0 | 1995.0 | 2026.0 | 2050.0 | 2095.0 |
3 | Chicago, IL | 3 | NaN | 1050.0 | 1100.0 | 1050.0 | 1075.0 | 995.0 | 950.0 | 1090.0 | ... | 1525.0 | 1495.0 | 1477.0 | 1495.0 | 1495.0 | 1495.0 | 1495.0 | 1495.0 | 1500.0 | 1500.0 |
4 | Dallas, TX | 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.0 | ... | 1217.0 | 1205.0 | 1200.0 | 1242.0 | 1235.0 | 1235.0 | 1252.0 | 1255.0 | 1267.0 | 1266.0 |
5 rows × 105 columns
These data are wide. There is only one variable here: the median rental price of a one bedroom apartment.
Use melt()
to create a long DataFrame. We provide melt with the id_vars, the identifier columns that stay fixed; every other column gets melted into rows.
onebeds_long = onebeds.melt(id_vars=['RegionName', 'SizeRank'])
onebeds_long.head()
RegionName | SizeRank | variable | value | |
---|---|---|---|---|
0 | United States | 0 | 2010-02 | NaN |
1 | New York, NY | 1 | 2010-02 | NaN |
2 | Los Angeles, CA | 2 | 2010-02 | NaN |
3 | Chicago, IL | 3 | 2010-02 | NaN |
4 | Dallas, TX | 4 | 2010-02 | NaN |
Let's clean this up a bit.
onebeds_long.rename(columns={'variable':'date'},inplace=True)
onebeds_long['date'] = pd.to_datetime(onebeds_long['date'])
onebeds_long.head()
RegionName | SizeRank | date | value | |
---|---|---|---|---|
0 | United States | 0 | 2010-02-01 | NaN |
1 | New York, NY | 1 | 2010-02-01 | NaN |
2 | Los Angeles, CA | 2 | 2010-02-01 | NaN |
3 | Chicago, IL | 3 | 2010-02-01 | NaN |
4 | Dallas, TX | 4 | 2010-02-01 | NaN |
I'm a little surprised this didn't create a MultiIndex on the row axis, like pivot_table() did on the column axis above. We can certainly turn it into a MultiIndex...
onebeds_long.set_index(['RegionName', 'date'], inplace=True)
onebeds_long.head()
SizeRank | value | ||
---|---|---|---|
RegionName | date | ||
United States | 2010-02-01 | 0 | NaN |
New York, NY | 2010-02-01 | 1 | NaN |
Los Angeles, CA | 2010-02-01 | 2 | NaN |
Chicago, IL | 2010-02-01 | 3 | NaN |
Dallas, TX | 2010-02-01 | 4 | NaN |
# might as well clean up the column names
onebeds_long.rename(columns={'value':'onebed_price', 'SizeRank':'size_rank'}, inplace=True)
onebeds_long.head()
size_rank | onebed_price | ||
---|---|---|---|
RegionName | date | ||
United States | 2010-02-01 | 0 | NaN |
New York, NY | 2010-02-01 | 1 | NaN |
Los Angeles, CA | 2010-02-01 | 2 | NaN |
Chicago, IL | 2010-02-01 | 3 | NaN |
Dallas, TX | 2010-02-01 | 4 | NaN |
# and the index names
onebeds_long.index.set_names('region', level=0, inplace=True)
onebeds_long.index.set_names('date', level=1, inplace=True)
onebeds_long.head()
size_rank | onebed_price | ||
---|---|---|---|
region | date | ||
United States | 2010-02-01 | 0 | NaN |
New York, NY | 2010-02-01 | 1 | NaN |
Los Angeles, CA | 2010-02-01 | 2 | NaN |
Chicago, IL | 2010-02-01 | 3 | NaN |
Dallas, TX | 2010-02-01 | 4 | NaN |
Let's visualize one-bedroom prices in Atlanta using our time-series and matplotlib skills from previous lectures.
# Peel off the data for Atlanta
ga = onebeds_long.loc['Atlanta, GA']
avg_onebed = ga['onebed_price'].mean()
fig, ax = plt.subplots(figsize=(15,8))
# add the plot
ax.plot(ga.index, ga['onebed_price'], color='red')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_title('Median Rental Prices: Atlanta, GA',fontsize=24)
ax.set_ylabel('Current Dollars',fontsize=18)
ax.set_ylim(700,1400)
ax.text(dt.datetime(2017,10,10), 1110, 'One Bedroom\n($'+ f'{round(avg_onebed,2):,}' +')' ,ha='center',fontsize=18)
# Let's add some commas to the y-axis using a list comprehension.
# The argument "ax.get_yticks().tolist()" is getting the yticks and converting them to a list.
# The rest is a list comprehension which converts each element to an integer with commas
# as the thousands separator. Note that '{:,}'.format(int(x)) is the same as when you
# want to print out a number in a sentence.
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()],size=16)
plt.xticks(fontsize=16)
plt.show()
UserWarning: FixedFormatter should only be used together with FixedLocator
Practice: Reshaping
Let's try some more reshaping.
- Load 'broadband.csv'. The file contains GDP per capita and broadband penetration data for several countries. Are the data long or wide?
bb = pd.read_csv('./Data/broadband.csv')
bb.head(10)
country | series | value | |
---|---|---|---|
0 | Switzerland | broadband | 46.77000 |
1 | Switzerland | gdppc | 64834.80053 |
2 | Denmark | broadband | 42.92300 |
3 | Denmark | gdppc | 51495.56007 |
4 | France | broadband | 42.84300 |
5 | France | gdppc | 42858.45494 |
6 | Netherlands | broadband | 42.12600 |
7 | Netherlands | gdppc | 52799.14951 |
8 | Korea | broadband | 41.20000 |
9 | Korea | gdppc | 38350.32339 |
- Set country and series as the index.
bb = bb.set_index(['country', 'series'])
bb.head()
value | ||
---|---|---|
country | series | |
Switzerland | broadband | 46.77000 |
gdppc | 64834.80053 | |
Denmark | broadband | 42.92300 |
gdppc | 51495.56007 | |
France | broadband | 42.84300 |
- Create a DataFrame with broadband and gdppc as columns, country as rows.
bb = bb.unstack()
bb.head()
value | ||
---|---|---|
series | broadband | gdppc |
country | ||
Australia | 31.796 | 50588.14949 |
Austria | 28.543 | 52467.52700 |
Belgium | 38.588 | 47941.66117 |
Canada | 37.847 | 46704.89203 |
Chile | 16.515 | 24012.91487 |
- Reload the broadband data file to a new DataFrame.
bb2 = pd.read_csv('./Data/broadband.csv')
- Create the DataFrame you have from part 3, but do it using melt or pivot_table.
bb2 = bb2.pivot_table(index=['country'], columns=['series'], values = 'value') # pivot_table is used for long to wide
bb2.head(10)
series | broadband | gdppc |
---|---|---|
country | ||
Australia | 31.796 | 50588.14949 |
Austria | 28.543 | 52467.52700 |
Belgium | 38.588 | 47941.66117 |
Canada | 37.847 | 46704.89203 |
Chile | 16.515 | 24012.91487 |
Colombia | 12.821 | 14485.30586 |
Czech Republic | 29.636 | 36350.10463 |
Denmark | 42.923 | 51495.56007 |
Estonia | 29.824 | 31739.11999 |
Finland | 31.036 | 44955.58312 |
- Make a scatter plot of gdp per capita against broadband penetration.
fig, ax = plt.subplots(figsize=(15,10))
ax.scatter(bb2['broadband'], bb2['gdppc'], color='red')
ax.set_ylabel('GDP per capita',size=18)
ax.set_xlabel('Broadband penetration',size=18)
ax.set_title('Income and broadband usage in OECD countries',size=24)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_ylim(0,120000)
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()],size=16)
plt.xticks(fontsize=16)
plt.show()
UserWarning: FixedFormatter should only be used together with FixedLocator
- Save your DataFrame as 'broadband_wide.csv'.
bb2.to_csv('broadband_wide.csv')
- Read 'broadband_wide.csv' into a DataFrame named
bb_wide
.
bb_wide = pd.read_csv('broadband_wide.csv')
bb_wide.head()
country | broadband | gdppc | |
---|---|---|---|
0 | Australia | 31.796 | 50588.14949 |
1 | Austria | 28.543 | 52467.52700 |
2 | Belgium | 38.588 | 47941.66117 |
3 | Canada | 37.847 | 46704.89203 |
4 | Chile | 16.515 | 24012.91487 |
- Turn 'bb_wide' into a long dataset using pivot_table or melt.
bb_long = bb_wide.melt(id_vars=['country'])
bb_long.head()
country | variable | value | |
---|---|---|---|
0 | Australia | broadband | 31.796 |
1 | Austria | broadband | 28.543 |
2 | Belgium | broadband | 38.588 |
3 | Canada | broadband | 37.847 |
4 | Chile | broadband | 16.515 |