Lecture 13: Merging DataFrames [SUGGESTED SOLUTIONS]
We will often find ourselves with variables spread across different datasets and files. We merge datasets together by matching up the two datasets on one or more variables. For example, I might have GDP data by country from the Penn World Tables, and demographic data by country from the World Bank. We would merge these two datasets and match up the observations by country.
Here's an outline of today's lecture:
1. Apartment data from Zillow
2. Merging DataFrames into one DataFrame
1. Apartment Data from Zillow
We are working with data from Zillow on rental listing prices. I would like to see how studio and one-bedroom prices differ from each other across cities. Unfortunately, Zillow only lets me download one series at a time. (Hey, the data are free, so I'm not complaining...maybe.) I have downloaded the studio and one-bedroom rental listing prices, and now I would like to create one DataFrame with all the data.
Time to get to work.
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt
# The data come as csv files. Load them into separate DataFrames.
studios = pd.read_csv('./Data/Metro_MedianRentalPrice_Studio.csv')
onebeds = pd.read_csv('./Data/Metro_MedianRentalPrice_1Bedroom.csv')
# Always check out new data! You never know what could be lurking.
print(studios.head())
print(studios.tail())
        RegionName  SizeRank  2010-01  2010-02  2010-03  2010-04  2010-05  \
0    United States         0   1300.0   1500.0   1500.0   1500.0   1500.0
1     New York, NY         1      NaN   2000.0   2000.0   2100.0   2000.0
2  Los Angeles, CA         2      NaN      NaN      NaN      NaN      NaN
3      Chicago, IL         3      NaN   1500.0   1500.0   1500.0   1500.0
4       Dallas, TX         4      NaN      NaN      NaN      NaN      NaN

   2010-06  2010-07  2010-08  ...  2017-11  2017-12  2018-01  2018-02  \
0   1500.0   1500.0   1500.0  ...   1495.0   1500.0   1495.0   1500.0
1   2125.0   2000.0   1950.0  ...   2300.0   2300.0   2300.0   2300.0
2      NaN      NaN      NaN  ...   1800.0   1815.0   1795.0   1807.0
3   1500.0   1500.0   1500.0  ...   1532.5   1500.0   1500.0   1500.0
4      NaN      NaN      NaN  ...   1300.0   1300.0   1342.5   1330.0

   2018-03  2018-04  2018-05  2018-06  2018-07  2018-08
0   1500.0   1520.0   1550.0   1550.0   1550.0   1550.0
1   2300.0   2350.0   2400.0   2400.0   2400.0   2400.0
2   1805.0   1829.0   1850.0   1810.0   1835.0   1877.0
3   1500.0   1492.0   1500.0   1500.0   1510.0   1525.0
4   1325.0   1300.0   1325.0   1350.0   1355.0   1350.0

[5 rows x 106 columns]

          RegionName  SizeRank  2010-01  2010-02  2010-03  2010-04  2010-05  \
162     Cheyenne, WY       162      NaN      NaN      NaN      NaN      NaN
163      Branson, MO       163      NaN      NaN      NaN      NaN      NaN
164   Hinesville, GA       164      NaN      NaN      NaN      NaN      NaN
165  Waynesville, MO       165      NaN      NaN      NaN      NaN      NaN
166   Enterprise, AL       166      NaN      NaN      NaN      NaN      NaN

     2010-06  2010-07  2010-08  ...  2017-11  2017-12  2018-01  2018-02  \
162      NaN      NaN      NaN  ...   1200.0   1250.0   1200.0   1225.0
163      NaN      NaN      NaN  ...    800.0    800.0    825.0    800.0
164      NaN      NaN      NaN  ...   1150.0   1150.0   1150.0   1200.0
165      NaN      NaN      NaN  ...   1050.0   1037.5    950.0   1025.0
166      NaN      NaN      NaN  ...    950.0    950.0    950.0    950.0

     2018-03  2018-04  2018-05  2018-06  2018-07  2018-08
162   1200.0   1250.0   1285.0   1295.0   1295.0   1217.5
163    795.0    850.0    885.0    850.0    862.5    850.0
164   1200.0   1200.0   1250.0   1225.0   1250.0   1150.0
165   1025.0   1050.0   1075.0   1095.0   1025.0   1025.0
166    925.0    900.0    900.0    925.0    925.0    950.0

[5 rows x 106 columns]
print(onebeds.head())
print(onebeds.tail())
        RegionName  SizeRank  2010-02  2010-03  2010-04  2010-05  2010-06  \
0    United States         0      NaN   1200.0   1245.0   1200.0   1250.0
1     New York, NY         1      NaN      NaN      NaN      NaN      NaN
2  Los Angeles, CA         2      NaN   1380.0   1425.0   1420.0   1437.5
3      Chicago, IL         3      NaN   1050.0   1100.0   1050.0   1075.0
4       Dallas, TX         4      NaN      NaN      NaN      NaN      NaN

   2010-07  2010-08  2010-09  ...  2017-11  2017-12  2018-01  2018-02  \
0   1200.0   1200.0   1044.0  ...   1400.0   1400.0   1400.0   1405.0
1      NaN      NaN      NaN  ...   2095.0   2050.0   2050.0   2085.0
2   1550.0   1600.0   1388.0  ...   1961.0   1947.0   1930.0   1950.0
3    995.0    950.0   1090.0  ...   1525.0   1495.0   1477.0   1495.0
4      NaN      NaN   1000.0  ...   1217.0   1205.0   1200.0   1242.0

   2018-03  2018-04  2018-05  2018-06  2018-07  2018-08
0   1425.0   1434.0   1450.0   1450.0   1460.0   1460.0
1   2100.0   2135.0   2150.0   2192.0   2155.0   2153.0
2   1960.0   1950.0   1995.0   2026.0   2050.0   2095.0
3   1495.0   1495.0   1495.0   1495.0   1500.0   1500.0
4   1235.0   1235.0   1252.0   1255.0   1267.0   1266.0

[5 rows x 105 columns]

          RegionName  SizeRank  2010-02  2010-03  2010-04  2010-05  2010-06  \
219       Casper, WY       219      NaN      NaN      NaN      NaN      NaN
220        Minot, ND       220      NaN      NaN      NaN      NaN      NaN
221  Waynesville, MO       221      NaN      NaN      NaN      NaN      NaN
222       Clovis, NM       222      NaN      NaN      NaN      NaN      NaN
223        Rolla, MO       223      NaN      NaN      NaN      NaN      NaN

     2010-07  2010-08  2010-09  ...  2017-11  2017-12  2018-01  2018-02  \
219      NaN      NaN      NaN  ...    662.5    699.5    650.0    650.0
220      NaN      NaN      NaN  ...    495.0    495.0    495.0    495.0
221      NaN      NaN      NaN  ...    535.0    535.0    535.0    530.0
222      NaN      NaN      NaN  ...    550.0    525.0    505.0    505.0
223      NaN      NaN      NaN  ...    422.5    425.0    425.0    442.5

     2018-03  2018-04  2018-05  2018-06  2018-07  2018-08
219    650.0    625.0    675.0    650.0    637.5    599.0
220    495.0    495.0    495.0    535.0    515.0    495.0
221    532.5    535.0    542.5    540.0    535.0    545.0
222    500.0    515.0    525.0    585.0    585.0    515.0
223    450.0    450.0    450.0    450.0    450.0    450.0

[5 rows x 105 columns]
A few things to note.
- There are more regions in the one-bedroom data (i.e., there are more rows).
- Dates are listed as columns.
What do we want?
- One DataFrame with one-bedroom and studio prices.
- A MultiIndex of region and date on the rows (or a column for region and a column for date).
- Columns of studio and one-bedroom prices.
# Drop columns we don't need.
studios = studios.drop('SizeRank', axis=1)
onebeds = onebeds.drop('SizeRank', axis=1)
Make the data long
We could assign indexes and reshape this using `stack()`, or we can use `melt()` and skip the indexing.
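For reference, here is roughly what the `stack()` route would look like, a sketch rather than something we run here (note that `stack()` drops missing values by default, unlike `melt()`):
studios_long = (studios.set_index('RegionName')   # leave only the date columns
                       .stack()                   # one row per region-date pair
                       .reset_index())            # back to a flat DataFrame
studios_long.columns = ['RegionName', 'date', 'price']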
studios = pd.melt(studios, ['RegionName'])   # RegionName is the id variable; the date columns melt into rows
studios.head()
| | RegionName | variable | value |
|---|---|---|---|
0 | United States | 2010-01 | 1300.0 |
1 | New York, NY | 2010-01 | NaN |
2 | Los Angeles, CA | 2010-01 | NaN |
3 | Chicago, IL | 2010-01 | NaN |
4 | Dallas, TX | 2010-01 | NaN |
studios.shape
(17368, 3)
There are 17,368 observations (i.e., region-date pairs) in the studio DataFrame: 167 regions × 104 months.
Rename variables
studios.rename(columns={'RegionName':'region', 'variable':'date', 'value':'studio_price'}, inplace=True)
studios.head()
| | region | date | studio_price |
|---|---|---|---|
0 | United States | 2010-01 | 1300.0 |
1 | New York, NY | 2010-01 | NaN |
2 | Los Angeles, CA | 2010-01 | NaN |
3 | Chicago, IL | 2010-01 | NaN |
4 | Dallas, TX | 2010-01 | NaN |
Repeat for the one-bedroom data
The studio data look great. Get the one-bedroom data set up.
onebeds = pd.melt(onebeds, ['RegionName'])
onebeds.rename(columns={'RegionName':'region', 'variable':'date', 'value':'onebed_price'}, inplace=True)
onebeds.head()
| | region | date | onebed_price |
|---|---|---|---|
0 | United States | 2010-02 | NaN |
1 | New York, NY | 2010-02 | NaN |
2 | Los Angeles, CA | 2010-02 | NaN |
3 | Chicago, IL | 2010-02 | NaN |
4 | Dallas, TX | 2010-02 | NaN |
onebeds.shape
(23072, 3)
There are 23,072 observations (i.e., region-date pairs) in the one-bedroom DataFrame (224 regions × 103 months), so there are more region-date pairs here than in the studio data.
2. Merging DataFrames into One DataFrame
Now that we're done with the wind-up, time to do what this notebook is about: Merging two sets of data.
We want to match the two DataFrames together according to the region-date pairs. In database-ese, the variables we match on are called keys, and the keys here are `region` and `date`. We could match these keys in potentially different ways. For example, there may be several observations for a given key in one set of data but not in the other. What to do then? We have to tell the computer how to join (more database-ese) the two DataFrames. There are four different options for the join:
- inner: keep the intersection of the keys
- left: keep all the keys from the left DataFrame
- right: keep all the keys from right DataFrame
- outer: keep all the keys from both DataFrames
We specify the join type with the `how` parameter. The default is inner, but for sanity's sake, be explicit about your join.
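To make the four options concrete, here is a toy example on two made-up DataFrames (`left_df` and `right_df` are hypothetical, not the Zillow data):
left_df  = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right_df = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

for join in ['inner', 'left', 'right', 'outer']:
    merged = pd.merge(left=left_df, right=right_df, on='key', how=join)
    print(join, '->', sorted(merged['key']))

# inner -> ['b', 'c']            the intersection of the keys
# left  -> ['a', 'b', 'c']       all keys from the left DataFrame
# right -> ['b', 'c', 'd']       all keys from the right DataFrame
# outer -> ['a', 'b', 'c', 'd']  all keys from both DataFrames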
Back to our example. We want to merge the one-bedroom and studio DataFrames. Both are organized by `region` and `date`, and these keys form a unique observation in each DataFrame. Let's try joining with `inner`, which implicitly drops any observations that appear in one DataFrame but not the other. Had we matched with `outer`, we would have kept these observations regardless of which DataFrame they reside in. What would `left` and `right` do?
# left and right specify the DataFrames to merge, on specifies the keys (we are using two keys)
rental_prices = pd.merge(left=studios, right=onebeds, on=['region', 'date'], how='inner')
rental_prices.head()
| | region | date | studio_price | onebed_price |
|---|---|---|---|---|
0 | United States | 2010-02 | 1500.0 | NaN |
1 | New York, NY | 2010-02 | 2000.0 | NaN |
2 | Los Angeles, CA | 2010-02 | NaN | NaN |
3 | Chicago, IL | 2010-02 | 1500.0 | NaN |
4 | Dallas, TX | 2010-02 | NaN | NaN |
rental_prices.shape
(12669, 4)
There are only 12,669 region-date pairs in the merged data. That's fewer than in either the studio or the one-bedroom data, because the inner join keeps only the pairs that appear in both.
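An aside: since we claim that region and date uniquely identify an observation in each DataFrame, we can have pandas verify that claim. A sketch using the `validate` parameter of `merge()` (available in reasonably recent pandas versions), which raises an error if the keys are not unique as promised:
# Raises a MergeError if region-date pairs are duplicated in either DataFrame.
rental_prices = pd.merge(left=studios, right=onebeds,
                         on=['region', 'date'], how='inner',
                         validate='one_to_one')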
An Alternative Approach
So far we've been merging DataFrames using the pandas package (i.e., `pd.merge()`), but since DataFrames are pandas objects, you can also merge them using the DataFrame method `.merge()`. For example, we could have done the same `inner` merge with the following code:
rental_prices = studios.merge(onebeds, on=['region', 'date'], how='inner')
rental_prices.head()
| | region | date | studio_price | onebed_price |
|---|---|---|---|---|
0 | United States | 2010-02 | 1500.0 | NaN |
1 | New York, NY | 2010-02 | 2000.0 | NaN |
2 | Los Angeles, CA | 2010-02 | NaN | NaN |
3 | Chicago, IL | 2010-02 | 1500.0 | NaN |
4 | Dallas, TX | 2010-02 | NaN | NaN |
rental_prices.shape
(12669, 4)
The difference is that in taking this approach we're starting with the `studios` DataFrame as our "left," and `onebeds` is implicitly our "right" DataFrame. All other logic regarding how the merge/join works is the same.
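If you want to convince yourself that the two spellings agree, here is a quick check using the `.equals()` method, which tests whether two DataFrames have the same shape and elements:
merged_fn = pd.merge(left=studios, right=onebeds, on=['region', 'date'], how='inner')
merged_method = studios.merge(onebeds, on=['region', 'date'], how='inner')
print(merged_fn.equals(merged_method))   # expect True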
Practice
Try joining by `left`, `right`, and `outer`. What happens to the number of observations?
for join in ['inner', 'left', 'right', 'outer']:
    df = pd.merge(left=studios, right=onebeds, on=['region', 'date'], how=join)
    print(f"'{join}' join yields {df.shape[0]:,d} observations.")
'inner' join yields 12,669 observations.
'left' join yields 17,368 observations.
'right' join yields 23,072 observations.
'outer' join yields 27,771 observations.
The counts hang together: the outer join keeps every key from both sides, so outer = left + right - inner (17,368 + 23,072 - 12,669 = 27,771).
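To see which side the extra observations come from, `merge()` also accepts an `indicator` parameter: with `indicator=True`, the result gains a `_merge` column whose values are 'left_only', 'right_only', or 'both'. A quick sketch:
# Tag each row of an outer join with its origin, then tally the sources.
check = pd.merge(left=studios, right=onebeds, on=['region', 'date'],
                 how='outer', indicator=True)
print(check['_merge'].value_counts())   # 'both' should match the inner-join count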
Set the index to region and date
Back to the case where we're using the 'inner' join. Let's set the index as region and date.
rental_prices = pd.merge(left=studios, right=onebeds, on=['region', 'date'], how='inner')
rental_prices['date'] = pd.to_datetime(rental_prices['date']) # set the date column to datetime objects
rental_prices.set_index(['region', 'date'], inplace=True) # set up the index and sort it!
rental_prices.sort_index(axis=0, inplace=True)
rental_prices.head()
| | | studio_price | onebed_price |
|---|---|---|---|
| region | date | | |
| Abilene, TX | 2010-02-01 | NaN | NaN |
| | 2010-03-01 | NaN | NaN |
| | 2010-04-01 | NaN | NaN |
| | 2010-05-01 | NaN | NaN |
| | 2010-06-01 | NaN | NaN |
rental_prices.loc['United States']
| | studio_price | onebed_price |
|---|---|---|
| date | | |
2010-02-01 | 1500.0 | NaN |
2010-03-01 | 1500.0 | 1200.0 |
2010-04-01 | 1500.0 | 1245.0 |
2010-05-01 | 1500.0 | 1200.0 |
2010-06-01 | 1500.0 | 1250.0 |
... | ... | ... |
2018-04-01 | 1520.0 | 1434.0 |
2018-05-01 | 1550.0 | 1450.0 |
2018-06-01 | 1550.0 | 1450.0 |
2018-07-01 | 1550.0 | 1460.0 |
2018-08-01 | 1550.0 | 1460.0 |
103 rows × 2 columns
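With a sorted MultiIndex we can also slice the other way. A sketch using the `.xs()` method, which pulls a cross-section at a given index level, here every region's prices for a single month:
# Cross-section: all regions for one month.
aug_2018 = rental_prices.xs(dt.datetime(2018, 8, 1), level='date')
print(aug_2018.head())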
Try an outer merge
Notice that 2010-01 is missing, even though the studio data have an entry for it. (The one-bedroom data don't start until 2010-02, so the inner join dropped it.) Let's try an outer join.
# The only difference from the previous merge is the switch to 'outer'
rental_prices = pd.merge(left=studios, right=onebeds, on=['region', 'date'], how='outer')
rental_prices['date'] = pd.to_datetime(rental_prices['date'])
rental_prices.set_index(['region', 'date'], inplace=True)
rental_prices.sort_index(axis=0, inplace=True)
rental_prices.loc['United States']
| | studio_price | onebed_price |
|---|---|---|
| date | | |
2010-01-01 | 1300.0 | NaN |
2010-02-01 | 1500.0 | NaN |
2010-03-01 | 1500.0 | 1200.0 |
2010-04-01 | 1500.0 | 1245.0 |
2010-05-01 | 1500.0 | 1200.0 |
... | ... | ... |
2018-04-01 | 1520.0 | 1434.0 |
2018-05-01 | 1550.0 | 1450.0 |
2018-06-01 | 1550.0 | 1450.0 |
2018-07-01 | 1550.0 | 1460.0 |
2018-08-01 | 1550.0 | 1460.0 |
104 rows × 2 columns
That makes more sense for us.
# Peel off the data for Atlanta
ga = rental_prices.loc['Atlanta, GA']
avg_studio = ga['studio_price'].mean()
avg_onebed = ga['onebed_price'].mean()
fig, ax = plt.subplots(figsize=(15,8))
ax.plot(ga.index, ga['studio_price'], color='blue')
ax.plot(ga.index, ga['onebed_price'], color='red')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_title('Median Rental Prices: Atlanta, GA',fontsize=24)
ax.text(dt.datetime(2016,7,10), 1320, 'Studio \n($'+ f'{round(avg_studio,2):,}' +')' ,ha='center',fontsize=18,color='blue')
ax.text(dt.datetime(2017,10,10), 1110, 'One Bedroom\n($'+ f'{round(avg_onebed,2):,}' +')' ,ha='center',fontsize=18,color='red')
plt.show()
It's interesting that studio rental prices exceed one-bedroom prices after 2015. I wonder why? Note that I put the average median rental price for each line in the label using a different-looking format syntax. BTW, I like putting summary statistics in labels. I think it's a nice way to convey information.
Let's also take a look at Florida. Maybe the trends are different.
miami = rental_prices.loc['Miami, FL']
tampa = rental_prices.loc['Tampa, FL']
avg_studio = miami['studio_price'].mean()
avg_onebed = miami['onebed_price'].mean()
fig, ax = plt.subplots(1, 2, figsize=(15,8))
ax[0].plot(miami.index, miami['studio_price'], color='blue')
ax[0].plot(miami.index, miami['onebed_price'], color='red')
ax[0].spines['top'].set_visible(False)
ax[0].spines['right'].set_visible(False)
ax[0].set_ylim(400,2200)
ax[0].set_title('Median Rental Prices: Miami, FL', fontsize=18)
ax[0].text(dt.datetime(2015,12,31), 2000, 'Studio \n($'+ f'{round(avg_studio,2):,}' +')' ,ha='center',fontsize=14,color='blue')
ax[0].text(dt.datetime(2016,12,31), 1350, 'One Bedroom\n($'+ f'{round(avg_onebed,2):,}' +')' ,ha='center',fontsize=14,color='red')
ax[1].plot(tampa.index, tampa['studio_price'], color='blue')
ax[1].plot(tampa.index, tampa['onebed_price'], color='red')
ax[1].spines['top'].set_visible(False)
ax[1].spines['right'].set_visible(False)
ax[1].set_ylim(400,2200)
avg_studio = tampa['studio_price'].mean()
avg_onebed = tampa['onebed_price'].mean()
ax[1].set_title('Median Rental Prices: Tampa, FL',fontsize=18)
ax[1].text(dt.datetime(2015,9,10), 1450, 'Studio \n($'+ f'{round(avg_studio,2):,}' +')' ,ha='center',fontsize=14,color='blue')
ax[1].text(dt.datetime(2016,3,1), 700, 'One Bedroom\n($'+ f'{round(avg_onebed,2):,}' +')' ,ha='center',fontsize=14,color='red')
plt.show()
Interesting...in Florida, one-bedroom apartments are always cheaper than studios, whereas in Atlanta the two crossed somewhere in 2015. Maybe this reflects studios being located in more desirable neighborhoods?
Practice
I downloaded data from MovieLens. There are five files in your data folder.
- Load the `movies.csv` and the `ratings.csv` files as DataFrames. The movies file lists movie names and genres, while the ratings file includes ratings (think Amazon "stars") by different viewers. NB, both files are in a sub-folder cleverly called "MovieLens."
movies = pd.read_csv('./Data/MovieLens/movies.csv')
movies.head()
| | movieId | title | genres |
|---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
ratings = pd.read_csv('./Data/MovieLens/ratings.csv')
ratings.head()
| | userId | movieId | rating | timestamp |
|---|---|---|---|---|
0 | 1 | 1 | 4.0 | 964982703 |
1 | 1 | 3 | 4.0 | 964981247 |
2 | 1 | 6 | 4.0 | 964982224 |
3 | 1 | 47 | 5.0 | 964983815 |
4 | 1 | 50 | 5.0 | 964982931 |
- What size are the two DataFrames?
print(movies.shape)
print(ratings.shape)
(9742, 3)
(100836, 4)
- There are many ratings per movie. Merge the two files on the movieId. What kind of join should you use?
# Each movie shows up several times in the ratings file (each movie is reviewed by many raters)
# We use a left join to keep all the observations in the ratings file (an inner would work as well)
data = pd.merge(left=ratings, right=movies, on='movieId', how='left')
data.head()
| | userId | movieId | rating | timestamp | title | genres |
|---|---|---|---|---|---|---|
0 | 1 | 1 | 4.0 | 964982703 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | 1 | 3 | 4.0 | 964981247 | Grumpier Old Men (1995) | Comedy|Romance |
2 | 1 | 6 | 4.0 | 964982224 | Heat (1995) | Action|Crime|Thriller |
3 | 1 | 47 | 5.0 | 964983815 | Seven (a.k.a. Se7en) (1995) | Mystery|Thriller |
4 | 1 | 50 | 5.0 | 964982931 | Usual Suspects, The (1995) | Crime|Mystery|Thriller |
- What size is the resulting DataFrame?
print(data.shape)
(100836, 6)
- Collapse the data so you have the median rating for each movie.
data = data.groupby(['movieId','title','genres'],as_index=False).agg({'rating':'median'})
data.head()
| | movieId | title | genres | rating |
|---|---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | 4.0 |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy | 3.5 |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance | 3.0 |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance | 3.0 |
4 | 5 | Father of the Bride Part II (1995) | Comedy | 3.0 |
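A side note: if you also want to know how many ratings stand behind each median, `agg()` accepts a list of statistics. A sketch using the original `ratings` DataFrame:
# Median rating and the number of ratings per movie, in one pass.
stats = ratings.groupby('movieId')['rating'].agg(['median', 'count'])
print(stats.head())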
- Create a histogram of the ratings (i.e., `ax.hist()`). Use 10 bins. Use the `density=True` parameter to plot the density rather than the counts.
fig, ax = plt.subplots(figsize=(15,5))
ax.hist(data['rating'], bins=10, color = 'blue', alpha = 0.25, density=True)
ax.set_title('MovieLens Ratings')
ax.set_xlabel('Rating')
ax.set_ylabel('Density')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
- Load the `tags.csv` file. It's another file from MovieLens. These are topics or themes covered in each movie, so each movie will contain many tags.
tags = pd.read_csv('./Data/MovieLens/tags.csv')
tags.drop(['userId','timestamp'],axis=1,inplace=True) # drop userid and timestamp
tags.drop_duplicates(subset=['movieId','tag'], keep='first', inplace=True) # we haven't covered this function.
# I'm keeping only the first occurrence
# of each tag for each movie.
tags.head()
| | movieId | tag |
|---|---|---|
0 | 60756 | funny |
1 | 60756 | Highly quotable |
2 | 60756 | will ferrell |
3 | 89774 | Boxing story |
4 | 89774 | MMA |
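Since `drop_duplicates()` is new to us, here is a minimal illustration on a made-up DataFrame (hypothetical data, just to show the behavior):
demo = pd.DataFrame({'movieId': [1, 1, 2],
                     'tag': ['funny', 'funny', 'scary']})
print(demo.drop_duplicates(subset=['movieId', 'tag'], keep='first'))
# The duplicated (1, 'funny') row is dropped; one row per unique movieId-tag pair remains.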
- Merge it into your MovieLens data. What keys should you use? Hint: Look at the column names. What type of join?
df = pd.merge(left=data, right=tags, on=['movieId'], how='left')
df.head()
| | movieId | title | genres | rating | tag |
|---|---|---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | 4.0 | pixar |
1 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | 4.0 | fun |
2 | 2 | Jumanji (1995) | Adventure|Children|Fantasy | 3.5 | fantasy |
3 | 2 | Jumanji (1995) | Adventure|Children|Fantasy | 3.5 | magic board game |
4 | 2 | Jumanji (1995) | Adventure|Children|Fantasy | 3.5 | Robin Williams |
Let's get a look at the kinds of tags we have. Try the code below, which introduces us to the `value_counts()` method.
print(df['tag'].value_counts())
In Netflix queue    119
atmospheric          32
superhero            22
religion             22
Disney               22
                   ...
free speech           1
south park            1
Trey Parker           1
best comedy           1
remaster              1
Name: tag, Length: 1584, dtype: int64
I don't know what 'atmospheric' means -- sounds fancy -- while 'Heroic Bloodshed' sounds terrifying. That tag appears only once, applied to John Wick. Makes sense, I guess. Let's take a look at the tag 'superhero'.
- Compute the average rating for all movies and for movies tagged 'superhero'.
superheros = df[df['tag']=='superhero']
super_mean = superheros['rating'].mean()
all_mean = df['rating'].mean()
print(f'The average rating for superhero movies is {super_mean:4.2f} compared to {all_mean:4.2f} for all movies.')
The average rating for superhero movies is 3.24 compared to 3.40 for all movies.
- Let's see how the ratings of superhero movies compare to all movies. We will plot two histograms on the same axes.
  - Plot the histogram of movies tagged 'superhero'. Use 10 bins. Make the plot blue. Set `alpha=0.25`.
  - Plot the histogram of all movies. Use 10 bins. Make the plot red. Set `alpha=0.25`.
  - Add a legend: `plt.legend()`. Put it in a spot that makes sense using the `loc=` parameter of `plt.legend()`.
  - Add the mean ratings for all movies and for superhero movies as text to the histogram: `ax.annotate()`.
  - Get rid of the top and right spines: `ax.spines['top'].set_visible(False)`.
  - Add axis labels and a chart title that make sense.
fig, ax = plt.subplots(figsize=(15,5))
# Two histograms, one atop the other
ax.hist(superheros['rating'], bins=10, color = 'blue', alpha = 0.25, density=True, label = 'Tagged superhero')
ax.hist(data['rating'], bins=10, color = 'red', alpha = 0.25, density=True, label = 'All movies')
ax.set_title('MovieLens Ratings',size=26,pad=30) # "pad" adds space under the title
ax.set_xlabel('Rating',size=20)
ax.set_ylabel('Density\n',size=20)
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.text( 0.5, 0.9, f'Mean rating for all movies: {all_mean:4.2f}',fontsize=18)
ax.text( 0.5, 0.8, f'Mean rating for superhero movies: {super_mean:4.2f}',fontsize=18)
plt.xticks(size=18)
plt.yticks(size=18)
plt.legend(frameon=False,loc='upper left',fontsize=20)
plt.show()