Lecture 11: Reshaping Data [SUGGESTED SOLUTIONS]
In this notebook, we learn about multiple indexes, or 'MultiIndex', and reshaping data. The MultiIndex is a great way to organize our data, and it provides an easy way to reshape our data: moving some variables from rows to columns and vice versa.
We can use a MultiIndex to reshape data, or we can use two related methods, pivot_table()
and melt()
, which let us reshape data without setting up a MultiIndex first. Sometimes, we want to reshape the data before we deal with indexing, and pivot and melt can save us some work.
1. Multiple Indexing
We have covered some pandas basics. Now let's sort out how to deal with more complex data. We will often find ourselves with data in which the unit of observation is complex. Pandas helps us deal with this by allowing for many index variables. So far, we have only used single indexing, but that is about to change.
Some examples that could use a MultiIndex:
- State and country
- Team and player
- Industry and firm
- Country (or person, firm,...) and time
That last one is important, and one that shows up a lot in economics. We call this panel data, or sometimes 'longitudinal' data. Our data are a panel when we have observations of the same firm/person/country over time.
The key to working with more complex datasets is getting the index right. So far, we have considered a single index, but pandas allows for multiple indexes that nest each other.
Key concept: Hierarchical indexing takes multiple levels of indexes.
Let's set up the DataFrame to take team and position as the indexes.
import pandas as pd # load pandas and shorten it to pd
import datetime as dt # load datetime and shorten it to dt
import matplotlib.pyplot as plt # for making figures
soccer = {'team' : ['Man City', 'Man City', 'Man City', 'Man City', 'Chelsea', 'Chelsea'],
'player' : ['Walker', 'Stones', 'Foden', 'Jesus', 'Cahill', 'Pedro'],
'pos' : ['D', 'D', 'M', 'F', 'D', 'F'],
'goals' : [1, 0, 0, 1, 0, 3],
'assists': [0,0,0,0,0,0]
}
prem = pd.DataFrame(soccer)
prem.set_index(['team', 'pos'], inplace=True)
prem.head()
player | goals | assists | ||
---|---|---|---|---|
team | pos | |||
Man City | D | Walker | 1 | 0 |
D | Stones | 0 | 0 | |
M | Foden | 0 | 0 | |
F | Jesus | 1 | 0 | |
Chelsea | D | Cahill | 0 | 0 |
Wow.
Notice that the set_index() method is the same one we used earlier with single indexes. In this case, we passed it a list of variables to make the indexes:
prem.set_index(['team', 'pos'], inplace=True)
In the output, the highest level of the index is team (we passed 'team' first in the list) and the second level is position. The output does not repeat the team name for each observation. The 'missing' team name just means that the team is the same as above. [A very Tufte-esque removal of unnecessary ink.]
Let's take a look under the hood. What's our index? A new kind of object: the MultiIndex
print(prem.index)
MultiIndex([('Man City', 'D'), ('Man City', 'D'), ('Man City', 'M'), ('Man City', 'F'), ( 'Chelsea', 'D'), ( 'Chelsea', 'F')], names=['team', 'pos'])
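As an aside, we can also build this kind of object directly from tuples. Here is a quick sketch (the tuples mirror the printout above):

```python
import pandas as pd

# Construct a MultiIndex directly, mirroring the printed index above
idx = pd.MultiIndex.from_tuples(
    [('Man City', 'D'), ('Man City', 'D'), ('Man City', 'M'),
     ('Man City', 'F'), ('Chelsea', 'D'), ('Chelsea', 'F')],
    names=['team', 'pos'])

print(idx.nlevels)       # 2 levels: team and pos
print(list(idx.names))   # ['team', 'pos']
```

In practice we rarely build the index by hand; set_index() does it for us.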
Subsetting with multiple indexes
With a MultiIndex, we need two index values to reference observations:
# All the defenders on Man City
prem.loc[('Man City', 'D'),:]
PerformanceWarning: indexing past lexsort depth may impact performance.
player | goals | assists | ||
---|---|---|---|---|
team | pos | |||
Man City | D | Walker | 1 | 0 |
D | Stones | 0 | 0 |
It's always a good idea to pay attention to warnings, particularly 'PerformanceWarning'. Pandas is telling us that we are asking for something in the second index, but the second index is not ordered. If the index were big, this could slow down our program. Let's fix that with sort_index().
Important: Sort your MultiIndex.
prem = prem.sort_index(axis=0) # tell pandas which axis to sort. Could sort the columns, too...
# returns a DataFrame unless we use inplace=True
prem.head()
player | goals | assists | ||
---|---|---|---|---|
team | pos | |||
Chelsea | D | Cahill | 0 | 0 |
F | Pedro | 3 | 0 | |
Man City | D | Walker | 1 | 0 |
D | Stones | 0 | 0 | |
F | Jesus | 1 | 0 |
# Now let's ask for all the defenders on Man City
prem.loc[('Man City', 'D'), :]
player | goals | assists | ||
---|---|---|---|---|
team | pos | |||
Man City | D | Walker | 1 | 0 |
D | Stones | 0 | 0 |
No warnings. Yeah.
Partial indexing
With the indexes set, we can easily subset the data using only one of the indexes. In pandas, this is called partial indexing because we are only using part of the index to identify the data we want.
We can use loc[]
like we do with a single index if we want to index on the top level index.
print(prem.loc['Chelsea']) # All the 'Chelsea' observations
print('\n')
print(prem.loc['Man City']) # All the 'Man City' observations
       player  goals  assists
pos
D      Cahill      0        0
F       Pedro      3        0

       player  goals  assists
pos
D      Walker      1        0
D      Stones      0        0
F       Jesus      1        0
M       Foden      0        0
The xs() method
We can also use the xs()
method to return a "cross-section" of the DataFrame (i.e., all observations with the index value we're interested in). Here we specify which level we are looking into. Note that I can reference the levels either by an integer or by name.
print(prem.xs('Chelsea', level = 0) ) # All the 'Chelsea' observations
print('\n')
print(prem.xs('Man City', level = 'team')) # All the 'Man City' observations
       player  goals  assists
pos
D      Cahill      0        0
F       Pedro      3        0

       player  goals  assists
pos
D      Walker      1        0
D      Stones      0        0
F       Jesus      1        0
M       Foden      0        0
We can partially index on the 'inner index' as well. Suppose we want all the defenders, regardless of team.
prem.xs('D', level=1) # pos (ie, position) is the second index
player | goals | assists | |
---|---|---|---|
team | |||
Chelsea | Cahill | 0 | 0 |
Man City | Walker | 1 | 0 |
Man City | Stones | 0 | 0 |
As before, we can get rid of the index and replace it with a generic list of integers...
prem.reset_index(inplace=True) # this moves the indexes back to columns
prem
team | pos | player | goals | assists | |
---|---|---|---|---|---|
0 | Chelsea | D | Cahill | 0 | 0 |
1 | Chelsea | F | Pedro | 3 | 0 |
2 | Man City | D | Walker | 1 | 0 |
3 | Man City | D | Stones | 0 | 0 |
4 | Man City | F | Jesus | 1 | 0 |
5 | Man City | M | Foden | 0 | 0 |
...and reset it with three levels of indexes!
prem.set_index(['team', 'player', 'pos'], inplace=True)
prem
goals | assists | |||
---|---|---|---|---|
team | player | pos | ||
Chelsea | Cahill | D | 0 | 0 |
Pedro | F | 3 | 0 | |
Man City | Walker | D | 1 | 0 |
Stones | D | 0 | 0 | |
Jesus | F | 1 | 0 | |
Foden | M | 0 | 0 |
A MultiIndex in columns
There is nothing that says you can't have multiple indexes in the axis=1
dimension. Here is a quick way to see this: transpose (math talk for flip) the DataFrame.
prem = prem.transpose() # this swaps the rows for columns
print(prem) # print() lines up the columns well
team     Chelsea         Man City
player    Cahill  Pedro    Walker  Stones  Jesus  Foden
pos            D      F         D       D      F      M
goals          0      3         1       0      1      0
assists        0      0         0       0      0      0
Now the rows are named 'goals' and 'assists' and the columns are (team, player, pos). I'm not sure this is a very useful way to look at this particular dataset, but MultiIndex columns can come in handy. Let's change it back.
prem = prem.transpose()
prem.head()
goals | assists | |||
---|---|---|---|---|
team | player | pos | ||
Chelsea | Cahill | D | 0 | 0 |
Pedro | F | 3 | 0 | |
Man City | Walker | D | 1 | 0 |
Stones | D | 0 | 0 | |
Jesus | F | 1 | 0 |
Summary statistics by level
MultiIndexes provide a quick way to summarize data. We will see many different ways to do this --- getting statistics by group --- and not all will involve a MultiIndex.
# When subsetting by the uppermost level, I can use xs or loc
print('Chelsea avg. goals', prem.xs('Chelsea', level='team')['goals'].mean()) # average goals for Chelsea players
print('Chelsea avg. goals', prem.loc['Chelsea','goals'].mean()) # average goals for Chelsea players
# When subsetting on the inner levels, I use xs
print('Defender avg. goals {0:.2f}.'.format( prem.xs('D', level='pos')['goals'].mean() ) ) # average goals for defenders
Chelsea avg. goals 1.5
Chelsea avg. goals 1.5
Defender avg. goals 0.33.
PerformanceWarning: indexing past lexsort depth may impact performance.
Notice the syntax with xs.
prem.xs('Chelsea', level='team')['goals']
The prem.xs('Chelsea', level='team')
is returning a DataFrame with all the columns.
We then use the usual square-bracket syntax to pick off just the column 'goals' and then hit it with mean().
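As a preview of those other ways, here is a sketch using groupby(), which we have not covered yet. It computes the same by-level statistics without any xs() calls (the frame is rebuilt here so the sketch stands on its own):

```python
import pandas as pd

# Rebuild the example frame with a (team, player, pos) index, as above
prem = pd.DataFrame({
    'team':    ['Man City', 'Man City', 'Man City', 'Man City', 'Chelsea', 'Chelsea'],
    'player':  ['Walker', 'Stones', 'Foden', 'Jesus', 'Cahill', 'Pedro'],
    'pos':     ['D', 'D', 'M', 'F', 'D', 'F'],
    'goals':   [1, 0, 0, 1, 0, 3],
    'assists': [0, 0, 0, 0, 0, 0],
}).set_index(['team', 'player', 'pos'])

# groupby(level=...) averages within an index level -- no xs() needed
print(prem.groupby(level='team')['goals'].mean())  # Chelsea 1.5, Man City 0.5
print(prem.groupby(level='pos')['goals'].mean())   # D 0.33, F 2.0, M 0.0
```

More on groupby() in a later lecture.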
Saving MultiIndexed DataFrames
Saving a MultiIndexed DataFrame works like before. Pandas fills in all the repeated labels, so the output is ready to go. Run the following code and then open the csv files.
# Multiple indexes on rows
prem.to_csv('prem.csv')
# Multiple indexes on columns
prem = prem.transpose()
prem.to_csv('prem_transposed.csv')
Practice: MultiIndex
Take a few minutes and try the following. Feel free to chat with those around if you get stuck.
- Load
CPS_March_2016.csv
which contains wage data from the March 2016 Current Population Survey (CPS). There are some missing values, so load the data with na_values='.'
cps = pd.read_csv('./Data/CPS_March_2016.csv',na_values = '.')
cps.head(20)
hrwage | educ | female | fulltimely | |
---|---|---|---|---|
0 | 20.961538 | Some college | 0 | 1.0 |
1 | 20.192308 | HS diploma/GED | 1 | 1.0 |
2 | 6.410256 | Some college | 0 | 0.0 |
3 | NaN | Less than HS | 0 | NaN |
4 | NaN | Some college | 0 | NaN |
5 | NaN | HS diploma/GED | 1 | NaN |
6 | 14.285714 | HS diploma/GED | 1 | 1.0 |
7 | 0.000000 | Some college | 0 | 0.0 |
8 | NaN | HS diploma/GED | 1 | NaN |
9 | NaN | College degree | 0 | NaN |
10 | 18.269230 | Some college | 1 | 1.0 |
11 | NaN | HS diploma/GED | 1 | NaN |
12 | 59.523810 | Graduate degree | 0 | 1.0 |
13 | 18.367348 | College degree | 1 | 1.0 |
14 | 8.653846 | HS diploma/GED | 0 | 1.0 |
15 | 59.136210 | Graduate degree | 0 | 1.0 |
16 | NaN | College degree | 1 | NaN |
17 | NaN | Less than HS | 1 | NaN |
18 | NaN | Less than HS | 0 | NaN |
19 | NaN | NaN | 1 | NaN |
- Keep only observations where the respondent worked full-time (
fulltimely == 1
).
cps = cps[cps['fulltimely'] == 1]
- Keep individuals with wages between $5 and $200.
cps = cps[cps['hrwage'] <= 200]
cps = cps[cps['hrwage'] >= 5]
# Alternatively,
# cps = cps[(cps['hrwage'] >= 5) & (cps['hrwage'] <= 200)]
- Rename 'female' to 'sex'
cps.rename(columns={'female':'sex'}, inplace=True)
- In column 'sex' replace 0 with 'male' and 1 with 'female' using
.loc[]
cps.loc[cps['sex']==0, 'sex'] = 'male'
cps.loc[cps['sex']==1, 'sex'] = 'female'
cps.head(20)
hrwage | educ | sex | fulltimely | |
---|---|---|---|---|
0 | 20.961538 | Some college | male | 1.0 |
1 | 20.192308 | HS diploma/GED | female | 1.0 |
6 | 14.285714 | HS diploma/GED | female | 1.0 |
10 | 18.269230 | Some college | female | 1.0 |
12 | 59.523810 | Graduate degree | male | 1.0 |
13 | 18.367348 | College degree | female | 1.0 |
14 | 8.653846 | HS diploma/GED | male | 1.0 |
15 | 59.136210 | Graduate degree | male | 1.0 |
21 | 19.711538 | College degree | male | 1.0 |
22 | 22.349272 | College degree | female | 1.0 |
27 | 7.692308 | Some college | female | 1.0 |
28 | 20.054945 | College degree | male | 1.0 |
35 | 18.333334 | HS diploma/GED | male | 1.0 |
36 | 12.980769 | Some college | female | 1.0 |
37 | 21.634615 | HS diploma/GED | male | 1.0 |
40 | 18.452381 | Less than HS | male | 1.0 |
42 | 18.269230 | College degree | female | 1.0 |
45 | 20.879122 | HS diploma/GED | male | 1.0 |
49 | 11.057693 | HS diploma/GED | male | 1.0 |
53 | 8.653846 | Some college | male | 1.0 |
- Set the index to 'sex' and 'educ', in that order.
cps.set_index(['sex', 'educ'], inplace=True)
- Sort the index using
.sort_index()
cps.sort_index(axis=0, inplace=True)
cps.head(20)
hrwage | fulltimely | ||
---|---|---|---|
sex | educ | ||
female | College degree | 18.367348 | 1.0 |
College degree | 22.349272 | 1.0 | |
College degree | 18.269230 | 1.0 | |
College degree | 43.269230 | 1.0 | |
College degree | 19.230770 | 1.0 | |
College degree | 22.596153 | 1.0 | |
College degree | 29.807692 | 1.0 | |
College degree | 22.596153 | 1.0 | |
College degree | 15.277778 | 1.0 | |
College degree | 18.571428 | 1.0 | |
College degree | 17.211538 | 1.0 | |
College degree | 24.038462 | 1.0 | |
College degree | 21.634615 | 1.0 | |
College degree | 18.589743 | 1.0 | |
College degree | 16.239317 | 1.0 | |
College degree | 5.769231 | 1.0 | |
College degree | 25.641026 | 1.0 | |
College degree | 27.777779 | 1.0 | |
College degree | 15.384615 | 1.0 | |
College degree | 16.025640 | 1.0 |
- Report the average wage for males and females. Try it with the
loc[]
method.
avg_wage_f = cps.loc['female', 'hrwage'].mean()
avg_wage_m = cps.loc['male', 'hrwage'].mean()
print('Average wage of females is ${0:.2f} and males is ${1:.2f}.'.format(avg_wage_f, avg_wage_m) )
Average wage of females is $22.75 and males is $28.31.
- Report the average wage for
HS diploma/GED
and for College degree
, regardless of sex. Use the xs()
method.
avg_wage_high = cps.xs('HS diploma/GED', level='educ')['hrwage'].mean()
avg_wage_coll = cps.xs('College degree', level='educ')['hrwage'].mean()
print('Average wage of those with high school degrees is ${0:.2f} and those with a college degree is ${1:.2f}.'.format(avg_wage_high, avg_wage_coll) )
Average wage of those with high school degrees is $19.11 and those with a college degree is $31.96.
# load a data file with the number of walks and snacks my dogs have had
dogs = pd.read_csv('./Data/dogs.csv') # data on the habits of my dogs
dogs
dog | var | time | value | |
---|---|---|---|---|
0 | Buster | walk | 10/1/2018 | 2 |
1 | Buster | walk | 10/2/2018 | 2 |
2 | Buster | walk | 10/3/2018 | 1 |
3 | Buster | snacks | 10/1/2018 | 4 |
4 | Buster | snacks | 10/2/2018 | 3 |
5 | Buster | snacks | 10/3/2018 | 4 |
6 | Buster | snacks | 10/4/2018 | 5 |
7 | Su | walk | 10/1/2018 | 2 |
8 | Su | walk | 10/2/2018 | 2 |
9 | Su | walk | 10/3/2018 | 2 |
10 | Su | snacks | 10/1/2018 | 3 |
11 | Su | snacks | 10/2/2018 | 4 |
12 | Su | snacks | 10/3/2018 | 4 |
13 | Su | snacks | 10/4/2018 | 3 |
Cheat Sheet
This data format is called long because there are lots of rows and not many columns. Moving between long and wide (lots of columns, fewer rows) is a common task in setting up panel data sets.
Pandas calls long data stacked and wide data unstacked.
With a MultiIndex:
We use the stack()
and unstack()
methods for moving between long and wide with MultiIndexed data. Stack and unstack do not work in place. They always return a copy, so we need to assign the result to a variable.
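A minimal sketch of that round trip, with toy data rather than the dogs file:

```python
import pandas as pd

# A tiny long ('stacked') frame with a two-level row index
long_df = pd.DataFrame(
    {'value': [1, 2, 3, 4]},
    index=pd.MultiIndex.from_product([['a', 'b'], ['x', 'y']],
                                     names=['grp', 'item']))

# unstack() moves the inner row level ('item') into the columns: long -> wide
wide_df = long_df.unstack('item')
print(wide_df.shape)  # (2, 2): one row per grp, one column per item

# stack() moves it back: wide -> long. Neither call works in place,
# so we assign the result to a new variable.
back = wide_df.stack('item')
print(back['value'].tolist())
```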
Without a MultiIndex:
We use the pivot_table()
and melt()
methods to move between long and wide data.
3. Reshaping From Long to Wide
We start by pivoting the data from long to wide with pivot_table(). It moves rows into columns.
Instead of referencing indexes, we reference columns in the long data. We tell pivot_table() which column will be the index of the wide data, which columns will be the columns of the wide data, and which column holds the values.
# Make time variables datetime objects. Always a good practice.
dogs['time'] = pd.to_datetime(dogs['time'])
# Pivot to wide: time on the rows, (dog, var) on the columns
dogs_wide = dogs.pivot_table(index='time', columns=['dog', 'var'], values='value')
dogs_wide
dog | Buster | Su | ||
---|---|---|---|---|
var | snacks | walk | snacks | walk |
time | ||||
2018-10-01 | 4.0 | 2.0 | 3.0 | 2.0 |
2018-10-02 | 3.0 | 2.0 | 4.0 | 2.0 |
2018-10-03 | 4.0 | 1.0 | 4.0 | 2.0 |
2018-10-04 | 5.0 | NaN | 3.0 | NaN |
A nice result. We basically jumped right to a panel. Notice that the order in which we pass columns determines the level of the MultiIndex.
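To see the ordering point concretely, here is a sketch with a toy version of the dogs data; swapping the list puts var on top:

```python
import pandas as pd

# Toy version of the dogs data (one date, two dogs, two variables)
dogs = pd.DataFrame({
    'dog':   ['Buster', 'Buster', 'Su', 'Su'],
    'var':   ['walk', 'snacks', 'walk', 'snacks'],
    'time':  ['10/1/2018', '10/1/2018', '10/1/2018', '10/1/2018'],
    'value': [2, 4, 2, 3],
})

# 'dog' first in the list: dog is the top column level
w1 = dogs.pivot_table(index='time', columns=['dog', 'var'], values='value')
print(list(w1.columns.names))  # ['dog', 'var']

# 'var' first in the list: var is the top column level
w2 = dogs.pivot_table(index='time', columns=['var', 'dog'], values='value')
print(list(w2.columns.names))  # ['var', 'dog']
```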
Now that I have a MultiIndexed 'wide' DataFrame, I can turn it into a 'long' DataFrame using stack()
, since stack()
knows what to do with multiple indexes.
dogs_long = dogs_wide.stack(['dog', 'var'])
dogs_long
time        dog     var
2018-10-01  Buster  snacks    4.0
                    walk      2.0
            Su      snacks    3.0
                    walk      2.0
2018-10-02  Buster  snacks    3.0
                    walk      2.0
            Su      snacks    4.0
                    walk      2.0
2018-10-03  Buster  snacks    4.0
                    walk      1.0
            Su      snacks    4.0
                    walk      2.0
2018-10-04  Buster  snacks    5.0
            Su      snacks    3.0
dtype: float64
onebeds = pd.read_csv('./Data/Metro_MedianRentalPrice_1Bedroom.csv')
onebeds.head()
RegionName | SizeRank | 2010-02 | 2010-03 | 2010-04 | 2010-05 | 2010-06 | 2010-07 | 2010-08 | 2010-09 | ... | 2017-11 | 2017-12 | 2018-01 | 2018-02 | 2018-03 | 2018-04 | 2018-05 | 2018-06 | 2018-07 | 2018-08 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | United States | 0 | NaN | 1200.0 | 1245.0 | 1200.0 | 1250.0 | 1200.0 | 1200.0 | 1044.0 | ... | 1400.0 | 1400.0 | 1400.0 | 1405.0 | 1425.0 | 1434.0 | 1450.0 | 1450.0 | 1460.0 | 1460.0 |
1 | New York, NY | 1 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | ... | 2095.0 | 2050.0 | 2050.0 | 2085.0 | 2100.0 | 2135.0 | 2150.0 | 2192.0 | 2155.0 | 2153.0 |
2 | Los Angeles, CA | 2 | NaN | 1380.0 | 1425.0 | 1420.0 | 1437.5 | 1550.0 | 1600.0 | 1388.0 | ... | 1961.0 | 1947.0 | 1930.0 | 1950.0 | 1960.0 | 1950.0 | 1995.0 | 2026.0 | 2050.0 | 2095.0 |
3 | Chicago, IL | 3 | NaN | 1050.0 | 1100.0 | 1050.0 | 1075.0 | 995.0 | 950.0 | 1090.0 | ... | 1525.0 | 1495.0 | 1477.0 | 1495.0 | 1495.0 | 1495.0 | 1495.0 | 1495.0 | 1500.0 | 1500.0 |
4 | Dallas, TX | 4 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1000.0 | ... | 1217.0 | 1205.0 | 1200.0 | 1242.0 | 1235.0 | 1235.0 | 1252.0 | 1255.0 | 1267.0 | 1266.0 |
5 rows × 105 columns
These data are wide. There is only one variable here: the median rental price of a one bedroom apartment.
Use melt()
to create a long DataFrame. We provide melt with the id_vars, the identifier columns that stay fixed; every other column gets melted into rows.
onebeds_long = onebeds.melt(id_vars=['RegionName', 'SizeRank'])
onebeds_long.head()
RegionName | SizeRank | variable | value | |
---|---|---|---|---|
0 | United States | 0 | 2010-02 | NaN |
1 | New York, NY | 1 | 2010-02 | NaN |
2 | Los Angeles, CA | 2 | 2010-02 | NaN |
3 | Chicago, IL | 3 | 2010-02 | NaN |
4 | Dallas, TX | 4 | 2010-02 | NaN |
Let's clean this up a bit.
onebeds_long.rename(columns={'variable':'date'},inplace=True)
onebeds_long['date'] = pd.to_datetime(onebeds_long['date'])
onebeds_long.head()
RegionName | SizeRank | date | value | |
---|---|---|---|---|
0 | United States | 0 | 2010-02-01 | NaN |
1 | New York, NY | 1 | 2010-02-01 | NaN |
2 | Los Angeles, CA | 2 | 2010-02-01 | NaN |
3 | Chicago, IL | 3 | 2010-02-01 | NaN |
4 | Dallas, TX | 4 | 2010-02-01 | NaN |
I'm a little surprised this didn't create a MultiIndex on the row axis, like pivot_table() did on the column axis above. We can certainly turn it into a MultiIndex...
onebeds_long.set_index(['RegionName', 'date'], inplace=True)
onebeds_long.head()
SizeRank | value | ||
---|---|---|---|
RegionName | date | ||
United States | 2010-02-01 | 0 | NaN |
New York, NY | 2010-02-01 | 1 | NaN |
Los Angeles, CA | 2010-02-01 | 2 | NaN |
Chicago, IL | 2010-02-01 | 3 | NaN |
Dallas, TX | 2010-02-01 | 4 | NaN |
# might as well clean up the column names
onebeds_long.rename(columns={'value':'onebed_price', 'SizeRank':'size_rank'}, inplace=True)
onebeds_long.head()
size_rank | onebed_price | ||
---|---|---|---|
RegionName | date | ||
United States | 2010-02-01 | 0 | NaN |
New York, NY | 2010-02-01 | 1 | NaN |
Los Angeles, CA | 2010-02-01 | 2 | NaN |
Chicago, IL | 2010-02-01 | 3 | NaN |
Dallas, TX | 2010-02-01 | 4 | NaN |
# and the index names
onebeds_long.index.set_names('region', level=0, inplace=True)
onebeds_long.index.set_names('date', level=1, inplace=True)
onebeds_long.head()
size_rank | onebed_price | ||
---|---|---|---|
region | date | ||
United States | 2010-02-01 | 0 | NaN |
New York, NY | 2010-02-01 | 1 | NaN |
Los Angeles, CA | 2010-02-01 | 2 | NaN |
Chicago, IL | 2010-02-01 | 3 | NaN |
Dallas, TX | 2010-02-01 | 4 | NaN |
Let's visualize one-bedroom prices in Atlanta using our time-series and matplotlib skills from previous lectures.
# Peel off the data for Atlanta
ga = onebeds_long.loc['Atlanta, GA']
avg_onebed = ga['onebed_price'].mean()
fig, ax = plt.subplots(figsize=(15,8))
# add the plot
ax.plot(ga.index, ga['onebed_price'], color='red')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_title('Median Rental Prices: Atlanta, GA',fontsize=24)
ax.set_ylabel('Current Dollars',fontsize=18)
ax.set_ylim(700,1400)
ax.text(dt.datetime(2017,10,10), 1110, 'One Bedroom\n($'+ f'{round(avg_onebed,2):,}' +')' ,ha='center',fontsize=18)
# Let's add some commas to the y-axis using a list comprehension.
# The argument "ax.get_yticks().tolist()" is getting the yticks and converting them to a list.
# The rest is a list comprehension which converts each element to an integer with commas
# as the thousands separator. Note that '{:,}'.format(int(x)) is the same as when you
# want to print out a number in a sentence.
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()],size=16)
plt.xticks(fontsize=16)
plt.show()
UserWarning: FixedFormatter should only be used together with FixedLocator
Practice: Reshaping
Let's try some more reshaping.
- Load 'broadband.csv'. The file contains GDP per capita and broadband penetration data for several countries. Are the data long or wide?
bb = pd.read_csv('./Data/broadband.csv')
bb.head(10)
country | series | value | |
---|---|---|---|
0 | Switzerland | broadband | 46.77000 |
1 | Switzerland | gdppc | 64834.80053 |
2 | Denmark | broadband | 42.92300 |
3 | Denmark | gdppc | 51495.56007 |
4 | France | broadband | 42.84300 |
5 | France | gdppc | 42858.45494 |
6 | Netherlands | broadband | 42.12600 |
7 | Netherlands | gdppc | 52799.14951 |
8 | Korea | broadband | 41.20000 |
9 | Korea | gdppc | 38350.32339 |
- Set country and series as the index.
bb = bb.set_index(['country', 'series'])
bb.head()
value | ||
---|---|---|
country | series | |
Switzerland | broadband | 46.77000 |
gdppc | 64834.80053 | |
Denmark | broadband | 42.92300 |
gdppc | 51495.56007 | |
France | broadband | 42.84300 |
- Create a DataFrame with broadband and gdppc as columns, country as rows.
bb = bb.unstack()
bb.head()
value | ||
---|---|---|
series | broadband | gdppc |
country | ||
Australia | 31.796 | 50588.14949 |
Austria | 28.543 | 52467.52700 |
Belgium | 38.588 | 47941.66117 |
Canada | 37.847 | 46704.89203 |
Chile | 16.515 | 24012.91487 |
- Reload the broadband data file to a new DataFrame.
bb2 = pd.read_csv('./Data/broadband.csv')
- Create the DataFrame you have from part 3, but do it using melt or pivot_table.
bb2 = bb2.pivot_table(index=['country'], columns=['series'], values = 'value') # pivot_table is used for long to wide
bb2.head(10)
series | broadband | gdppc |
---|---|---|
country | ||
Australia | 31.796 | 50588.14949 |
Austria | 28.543 | 52467.52700 |
Belgium | 38.588 | 47941.66117 |
Canada | 37.847 | 46704.89203 |
Chile | 16.515 | 24012.91487 |
Colombia | 12.821 | 14485.30586 |
Czech Republic | 29.636 | 36350.10463 |
Denmark | 42.923 | 51495.56007 |
Estonia | 29.824 | 31739.11999 |
Finland | 31.036 | 44955.58312 |
- Make a scatter plot of gdp per capita against broadband penetration.
fig, ax = plt.subplots(figsize=(15,10))
ax.scatter(bb2['broadband'], bb2['gdppc'], color='red')
ax.set_ylabel('GDP per capita',size=18)
ax.set_xlabel('Broadband penetration',size=18)
ax.set_title('Income and broadband usage in OECD countries',size=24)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_ylim(0,120000)
ax.set_yticklabels(['{:,}'.format(int(x)) for x in ax.get_yticks().tolist()],size=16)
plt.xticks(fontsize=16)
plt.show()
UserWarning: FixedFormatter should only be used together with FixedLocator
- Save your DataFrame as 'broadband_wide.csv'.
bb2.to_csv('broadband_wide.csv')
- Read 'broadband_wide.csv' into a DataFrame named
bb_wide
.
bb_wide = pd.read_csv('broadband_wide.csv')
bb_wide.head()
country | broadband | gdppc | |
---|---|---|---|
0 | Australia | 31.796 | 50588.14949 |
1 | Austria | 28.543 | 52467.52700 |
2 | Belgium | 38.588 | 47941.66117 |
3 | Canada | 37.847 | 46704.89203 |
4 | Chile | 16.515 | 24012.91487 |
- Turn 'bb_wide' into a long dataset using pivot_table or melt.
bb_long = bb_wide.melt(id_vars=['country'])
bb_long.head()
country | variable | value | |
---|---|---|---|
0 | Australia | broadband | 31.796 |
1 | Austria | broadband | 28.543 |
2 | Belgium | broadband | 38.588 |
3 | Canada | broadband | 37.847 |
4 | Chile | broadband | 16.515 |