Lecture 14: Cleaning and transforming data [SUGGESTED SOLUTIONS]
We now know how to create some complicated datasets. In this notebook we will continue practicing techniques for manipulating data into forms that are amenable to analysis. We will also learn about the American Time Use Survey, which is a really cool dataset.
We begin, as usual, by importing the packages we need.
import pandas as pd # pandas for data handling
import matplotlib.pyplot as plt # matplotlib for plots
import numpy as np # numpy for numerical methods
1. American Time Use Survey (ATUS)
The Bureau of Labor Statistics oversees the American Time Use Survey, which asks a sample of Americans to complete detailed diaries keeping track of each minute of their day.
Follow this link https://www.bls.gov/tus/datafiles_2017.htm to the page for the 2017 survey. Download the 'ATUS 2017 Activity summary file (zip)' located in the 2017 Basic ATUS Data Files section of the page. Alternatively, download it directly: https://www.bls.gov/tus/special.requests/atussum_2017.zip.
Unzip the file. We are looking for atussum_2017.dat. It is a comma-separated file. Let's get it loaded.
atus = pd.read_csv('./Data/atussum_2017/atussum_2017.dat')
atus.head()
TUCASEID | TUFINLWGT | TRYHHCHILD | TEAGE | TESEX | PEEDUCA | PTDTRACE | PEHSPNON | GTMETSTA | TELFS | ... | t181601 | t181699 | t181801 | t181899 | t189999 | t500101 | t500103 | t500105 | t500106 | t500107 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 20170101170002 | 5.217551e+06 | 15 | 34 | 2 | 39 | 2 | 2 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
1 | 20170101170012 | 3.106847e+06 | 7 | 28 | 2 | 40 | 1 | 2 | 1 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | 20170101170027 | 1.653409e+07 | 15 | 15 | 1 | 35 | 3 | 2 | 1 | 5 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | 20170101170030 | 1.757071e+07 | -1 | 46 | 1 | 39 | 1 | 2 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
4 | 20170101170033 | 1.022275e+06 | -1 | 85 | 1 | 44 | 1 | 1 | 1 | 1 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 rows × 421 columns
421 variables! That's too many for us today. Let's just keep a few.
col_list = atus.columns.tolist() # .columns returns an Index object; convert it to a list for easier printing.
# Print the first and last 20 variables.
print(col_list[0:20],'\n')
print(col_list[-20:])
['TUCASEID', 'TUFINLWGT', 'TRYHHCHILD', 'TEAGE', 'TESEX', 'PEEDUCA', 'PTDTRACE', 'PEHSPNON', 'GTMETSTA', 'TELFS', 'TEMJOT', 'TRDPFTPT', 'TESCHENR', 'TESCHLVL', 'TRSPPRES', 'TESPEMPNOT', 'TRERNWA', 'TRCHILDNUM', 'TRSPFTPT', 'TEHRUSLT']

['t181204', 't181205', 't181299', 't181301', 't181302', 't181399', 't181401', 't181499', 't181501', 't181599', 't181601', 't181699', 't181801', 't181899', 't189999', 't500101', 't500103', 't500105', 't500106', 't500107']
The demographic variables are all uppercase letters. The time variables are of the form 'txxyyzz' where xx is the major category code, yy is the second-tier code, and zz is the third-tier code. (docs)
Let's keep some demo data and some data about working and sleeping.
vars_to_keep = ['TEAGE', 'TESEX', 'PTDTRACE', 'PEEDUCA', 'GTMETSTA', 'TELFS',
'TUDIARYDAY', 't050101', 't050102', 't050201', 't010101', 't010102']
atus_small = atus[vars_to_keep]
atus_small.head()
TEAGE | TESEX | PTDTRACE | PEEDUCA | GTMETSTA | TELFS | TUDIARYDAY | t050101 | t050102 | t050201 | t010101 | t010102 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | 2 | 2 | 39 | 1 | 1 | 1 | 450 | 0 | 0 | 728 | 0 |
1 | 28 | 2 | 1 | 40 | 1 | 5 | 7 | 0 | 0 | 0 | 385 | 0 |
2 | 15 | 1 | 3 | 35 | 1 | 5 | 4 | 0 | 0 | 0 | 570 | 0 |
3 | 46 | 1 | 1 | 39 | 1 | 1 | 2 | 480 | 0 | 0 | 525 | 0 |
4 | 85 | 1 | 1 | 44 | 1 | 1 | 7 | 0 | 0 | 0 | 756 | 0 |
Let's give the columns some more reasonable names.
# See https://www.bls.gov/tus/freqvariables.pdf for frequently used variables
atus_small = atus_small.rename(columns={'TEAGE':'age', 'TESEX':'gender', 'PTDTRACE':'race',
'PEEDUCA':'edu', 'GTMETSTA':'metro', 'TELFS':'employ', 'TUDIARYDAY':'day'})
atus_small.head()
age | gender | race | edu | metro | employ | day | t050101 | t050102 | t050201 | t010101 | t010102 | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | 2 | 2 | 39 | 1 | 1 | 1 | 450 | 0 | 0 | 728 | 0 |
1 | 28 | 2 | 1 | 40 | 1 | 5 | 7 | 0 | 0 | 0 | 385 | 0 |
2 | 15 | 1 | 3 | 35 | 1 | 5 | 4 | 0 | 0 | 0 | 570 | 0 |
3 | 46 | 1 | 1 | 39 | 1 | 1 | 2 | 480 | 0 | 0 | 525 | 0 |
4 | 85 | 1 | 1 | 44 | 1 | 1 | 7 | 0 | 0 | 0 | 756 | 0 |
# See https://www.bls.gov/tus/lexiconnoex2017.pdf for definitions
atus_small = atus_small.rename(columns={'t050101':'work_main', 't050102':'work_other', 't010101':'sleep',
't050201':'work_soc', 't010102':'no_sleep'})
Okay, we have variable names that we can live with. Let's start transforming the data to get it ready for analysis.
A word on loops
We have seen many places where a loop saved us time and effort. It might seem natural to use a loop to perform an operation on each element of a column or a DataFrame.
In general, however, we want to avoid this. Instead, we have used pandas vectorized operations such as
x['new_var'] = x['var_1'] / x['var_2']
to perform element-wise division. Using pandas' native operations is much (much much) faster than looping over the rows of a DataFrame. A lot of optimization has been written into the native functions that is not there when we loop ourselves.
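To get a feel for the difference, here is a quick timing sketch on made-up data (a toy example of our own, not part of the lecture data; the exact numbers will vary by machine, but the vectorized version is typically orders of magnitude faster).

import time

toy = pd.DataFrame({'var_1': range(100_000), 'var_2': range(1, 100_001)})

start = time.time()
toy['loop'] = [a / b for a, b in zip(toy['var_1'], toy['var_2'])] # explicit python loop
loop_time = time.time() - start

start = time.time()
toy['vec'] = toy['var_1'] / toy['var_2'] # vectorized pandas division
vec_time = time.time() - start

print(f'loop: {loop_time:.4f}s, vectorized: {vec_time:.4f}s')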
Fortunately, pandas provides methods that let us apply very complex and very general operations to a DataFrame without resorting to a loop. We consider a few of these below.
replace( )
The gender variable is coded 1 for male and 2 for female. I do not want to have to remember that!
The .replace() method replaces one value with another. One syntax is
atus_small['gender'] = atus_small['gender'].replace(1, 'male')
but a more powerful one passes a dict or a list.
atus_small['gender'] = atus_small['gender'].replace({1:'male', 2:'female'})
gender_codes = {1:'male', 2:'female'}
atus_small['gender'] = atus_small['gender'].replace(gender_codes)
atus_small.head()
age | gender | race | edu | metro | employ | day | work_main | work_other | work_soc | sleep | no_sleep | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | female | 2 | 39 | 1 | 1 | 1 | 450 | 0 | 0 | 728 | 0 |
1 | 28 | female | 1 | 40 | 1 | 5 | 7 | 0 | 0 | 0 | 385 | 0 |
2 | 15 | male | 3 | 35 | 1 | 5 | 4 | 0 | 0 | 0 | 570 | 0 |
3 | 46 | male | 1 | 39 | 1 | 1 | 2 | 480 | 0 | 0 | 525 | 0 |
4 | 85 | male | 1 | 44 | 1 | 1 | 7 | 0 | 0 | 0 | 756 | 0 |
unique( )
Let's code race, too. What codes are in our data? The method .unique() returns all the unique values.
races = atus_small['race'].unique()
races.sort()
print(races)
[ 1 2 3 4 5 6 7 8 9 10 11 13 16 18]
# I looked up the codes in the documentation...
# https://www.bls.gov/tus/atuscpscodebk0314.pdf
race_codes = {1:'white', 2:'black', 3:'native_am', 4:'asian', 5:'hawaiian', 6:'wh_bl',
7:'wh_na', 8:'wh_as', 9:'wh_ha', 10:'bl_na', 11:'bl_as', 13:'na_as', 16:'wh_bl_as', 18:'wh_as_ha'}
atus_small['race'] = atus_small['race'].replace(race_codes)
atus_small.head()
age | gender | race | edu | metro | employ | day | work_main | work_other | work_soc | sleep | no_sleep | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | female | black | 39 | 1 | 1 | 1 | 450 | 0 | 0 | 728 | 0 |
1 | 28 | female | white | 40 | 1 | 5 | 7 | 0 | 0 | 0 | 385 | 0 |
2 | 15 | male | native_am | 35 | 1 | 5 | 4 | 0 | 0 | 0 | 570 | 0 |
3 | 46 | male | white | 39 | 1 | 1 | 2 | 480 | 0 | 0 | 525 | 0 |
4 | 85 | male | white | 44 | 1 | 1 | 7 | 0 | 0 | 0 | 756 | 0 |
Apply a function to a column: map()
We can apply functions to columns, too. The .map() method handles this for us. It applies the given function to each element of the column. The function can be built-in or user-defined. This is super powerful: we can write a function that performs a complicated transformation and apply it to each element of a column in one simple line.
Let's define a function that converts minutes to hours.
def minutes_to_hours(x):
return x/60
Now apply .map() to the work_main column.

And yes, this example is a bit contrived (i.e., stupid) since we could have just as easily done atus_small['work_main']/60. But that wouldn't have given us practice with .map()! I guess the glass-half-full perspective is that we could do both and compare.
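Taking that suggestion, here is a minimal sanity check (our own addition) that the two approaches agree. We work on a copy so the real column is not converted twice.

# Compare map() against direct division on a copy of the column.
check = atus_small['work_main'].copy()
print(check.map(minutes_to_hours).equals(check / 60)) # should print True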
atus_small['work_main'] = atus_small['work_main'].map(minutes_to_hours)
atus_small.head()
age | gender | race | edu | metro | employ | day | work_main | work_other | work_soc | sleep | no_sleep | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | female | black | 39 | 1 | 1 | 1 | 7.5 | 0 | 0 | 728 | 0 |
1 | 28 | female | white | 40 | 1 | 5 | 7 | 0.0 | 0 | 0 | 385 | 0 |
2 | 15 | male | native_am | 35 | 1 | 5 | 4 | 0.0 | 0 | 0 | 570 | 0 |
3 | 46 | male | white | 39 | 1 | 1 | 2 | 8.0 | 0 | 0 | 525 | 0 |
4 | 85 | male | white | 44 | 1 | 1 | 7 | 0.0 | 0 | 0 | 756 | 0 |
Apply a function to a DataFrame: applymap( )
If we want to apply the same function to several columns of a DataFrame (rather than a Series, as we did above), we use .applymap(). It works the same way, applying the function to each element.
# We can map to several columns at once.
atus_small[['work_other', 'sleep', 'no_sleep']] = atus_small[['work_other', 'sleep', 'no_sleep']].applymap(minutes_to_hours)
atus_small.head()
age | gender | race | edu | metro | employ | day | work_main | work_other | work_soc | sleep | no_sleep | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | female | black | 39 | 1 | 1 | 1 | 7.5 | 0.0 | 0 | 12.133333 | 0.0 |
1 | 28 | female | white | 40 | 1 | 5 | 7 | 0.0 | 0.0 | 0 | 6.416667 | 0.0 |
2 | 15 | male | native_am | 35 | 1 | 5 | 4 | 0.0 | 0.0 | 0 | 9.500000 | 0.0 |
3 | 46 | male | white | 39 | 1 | 1 | 2 | 8.0 | 0.0 | 0 | 8.750000 | 0.0 |
4 | 85 | male | white | 44 | 1 | 1 | 7 | 0.0 | 0.0 | 0 | 12.600000 | 0.0 |
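One version note: in newer pandas releases (2.1 and later), applymap( ) is deprecated and the same element-wise operation is spelled DataFrame.map( ). If your pandas complains, the equivalent call would look something like this:

# pandas >= 2.1 spelling (left commented out so we do not convert the columns twice):
# cols = ['work_other', 'sleep', 'no_sleep']
# atus_small[cols] = atus_small[cols].map(minutes_to_hours)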
Practice
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck.
- 'edu' holds the highest level of education obtained. It can take values between 31 and 46. Convert the numeric values to labels.
  - codes < 39 are 'less than high'
  - 39 = 'high school'
  - 40 = 'some college'
  - 41 & 42 are 'associate'
  - 43 = 'bachelor'
  - 44 = 'master'
  - 45 = 'prof'
  - 46 = 'phd'
educ_codes = {39:'high school', 40:'some college', 41:'associate', 42:'associate',
43: 'bachelor', 44: 'master', 45:'prof', 46:'phd'}
atus_small.loc[ atus_small['edu']<39, 'edu'] = 'less than high'
atus_small['edu'] = atus_small['edu'].replace(educ_codes)
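Note that the order of those two lines matters: the .loc[] mask compares 'edu' against 39, which only works while the column is still numeric, so it has to run before replace( ) fills in strings. A one-pass alternative (just a sketch, not the official solution) would map every code through a small function:

# Sketch of an equivalent one-pass approach (commented out; 'edu' is already converted):
# def edu_label(code):
#     return 'less than high' if code < 39 else educ_codes[code]
# atus_small['edu'] = atus_small['edu'].map(edu_label)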
- Make sure you covered all your bases. Print out a list of the unique values from the column 'edu'.
atus_small['edu'].unique()
atus_small.loc[10218,'edu']
'associate'
- Apply the minutes_to_hours function to the work_soc variable. This variable is the time spent 'Socializing, relaxing, and leisure as part of job'. I have no idea what kind of jobs make you relax and take leisure.
atus_small['work_soc'] = atus_small['work_soc'].map(minutes_to_hours)
atus_small.head()
age | gender | race | edu | metro | employ | day | work_main | work_other | work_soc | sleep | no_sleep | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | female | black | high school | 1 | 1 | 1 | 7.5 | 0.0 | 0.0 | 12.133333 | 0.0 |
1 | 28 | female | white | some college | 1 | 5 | 7 | 0.0 | 0.0 | 0.0 | 6.416667 | 0.0 |
2 | 15 | male | native_am | less than high | 1 | 5 | 4 | 0.0 | 0.0 | 0.0 | 9.500000 | 0.0 |
3 | 46 | male | white | high school | 1 | 1 | 2 | 8.0 | 0.0 | 0.0 | 8.750000 | 0.0 |
4 | 85 | male | white | master | 1 | 1 | 7 | 0.0 | 0.0 | 0.0 | 12.600000 | 0.0 |
- Create a column named 'work' that is the sum of 'work_main' and 'work_other'.
atus_small['work'] = atus_small['work_main'] + atus_small['work_other']
- Create a histogram of 'work' for only those observations with work>0. Use 40 bins, make the bars red, and set alpha=0.4.
fig, ax = plt.subplots(figsize=(15,10))
ax.hist(atus_small.loc[atus_small['work']>0, 'work'], bins=40, alpha = 0.4, color = 'red')
ax.set_xlabel('hours of work')
ax.set_title('Distribution of hours worked, conditional on working at least one minute')
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
plt.show()
Binning
Often we would like to take a continuous variable and make it discrete. One way to do this is to place the data into 'bins' or 'buckets'. (This is implicitly being done when we create a histogram.) Pandas has a nice facility for this. In fact, we've seen these (cut and qcut) before.
# What range are the ages?
atus_small['age'].describe()
count    10223.000000
mean        49.876846
std         18.008395
min         15.000000
25%         35.000000
50%         50.000000
75%         64.000000
max         85.000000
Name: age, dtype: float64
# 15 to 85 years
bins = [15, 25, 35, 45, 55, 65, 75] # These are the endpoints for the bins
atus_small['age_bin'] = pd.cut(atus_small['age'], bins)
atus_small.head(10)
age | gender | race | edu | metro | employ | day | work_main | work_other | work_soc | sleep | no_sleep | work | age_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | female | black | high school | 1 | 1 | 1 | 7.50 | 0.000000 | 0.0 | 12.133333 | 0.0 | 7.500000 | (25.0, 35.0] |
1 | 28 | female | white | some college | 1 | 5 | 7 | 0.00 | 0.000000 | 0.0 | 6.416667 | 0.0 | 0.000000 | (25.0, 35.0] |
2 | 15 | male | native_am | less than high | 1 | 5 | 4 | 0.00 | 0.000000 | 0.0 | 9.500000 | 0.0 | 0.000000 | NaN |
3 | 46 | male | white | high school | 1 | 1 | 2 | 8.00 | 0.000000 | 0.0 | 8.750000 | 0.0 | 8.000000 | (45.0, 55.0] |
4 | 85 | male | white | master | 1 | 1 | 7 | 0.00 | 0.000000 | 0.0 | 12.600000 | 0.0 | 0.000000 | NaN |
5 | 60 | male | white | less than high | 1 | 1 | 7 | 0.00 | 0.000000 | 0.0 | 9.750000 | 0.0 | 0.000000 | (55.0, 65.0] |
6 | 44 | male | white | high school | 2 | 1 | 2 | 8.25 | 0.000000 | 0.0 | 5.750000 | 0.0 | 8.250000 | (35.0, 45.0] |
7 | 49 | female | black | bachelor | 1 | 1 | 3 | 0.00 | 7.433333 | 0.0 | 6.500000 | 0.0 | 7.433333 | (45.0, 55.0] |
8 | 63 | male | white | some college | 1 | 1 | 1 | 1.50 | 0.000000 | 0.0 | 7.666667 | 0.0 | 1.500000 | (55.0, 65.0] |
9 | 70 | female | white | master | 1 | 5 | 6 | 0.00 | 0.000000 | 0.0 | 8.000000 | 0.0 | 0.000000 | (65.0, 75.0] |
Notice the NaN values for observations 2 and 4. Observation 2 is 15 years old: our lowest bin starts at 15, but the interval is open on the left endpoint --- 15 is not included in the bin. We could set the bottom endpoint to 14 to catch this observation. Observation 4 is 85 years old, above our largest endpoint of 75, so it falls outside every bin.
bins = [14, 25, 35, 45, 55, 65, 75] # These are the endpoints for the bins
atus_small['age_bin'] = pd.cut(atus_small['age'], bins)
atus_small.head(10)
age | gender | race | edu | metro | employ | day | work_main | work_other | work_soc | sleep | no_sleep | work | age_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | female | black | high school | 1 | 1 | 1 | 7.50 | 0.000000 | 0.0 | 12.133333 | 0.0 | 7.500000 | (25.0, 35.0] |
1 | 28 | female | white | some college | 1 | 5 | 7 | 0.00 | 0.000000 | 0.0 | 6.416667 | 0.0 | 0.000000 | (25.0, 35.0] |
2 | 15 | male | native_am | less than high | 1 | 5 | 4 | 0.00 | 0.000000 | 0.0 | 9.500000 | 0.0 | 0.000000 | (14.0, 25.0] |
3 | 46 | male | white | high school | 1 | 1 | 2 | 8.00 | 0.000000 | 0.0 | 8.750000 | 0.0 | 8.000000 | (45.0, 55.0] |
4 | 85 | male | white | master | 1 | 1 | 7 | 0.00 | 0.000000 | 0.0 | 12.600000 | 0.0 | 0.000000 | NaN |
5 | 60 | male | white | less than high | 1 | 1 | 7 | 0.00 | 0.000000 | 0.0 | 9.750000 | 0.0 | 0.000000 | (55.0, 65.0] |
6 | 44 | male | white | high school | 2 | 1 | 2 | 8.25 | 0.000000 | 0.0 | 5.750000 | 0.0 | 8.250000 | (35.0, 45.0] |
7 | 49 | female | black | bachelor | 1 | 1 | 3 | 0.00 | 7.433333 | 0.0 | 6.500000 | 0.0 | 7.433333 | (45.0, 55.0] |
8 | 63 | male | white | some college | 1 | 1 | 1 | 1.50 | 0.000000 | 0.0 | 7.666667 | 0.0 | 1.500000 | (55.0, 65.0] |
9 | 70 | female | white | master | 1 | 5 | 6 | 0.00 | 0.000000 | 0.0 | 8.000000 | 0.0 | 0.000000 | (65.0, 75.0] |
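An alternative to moving the bottom endpoint (an aside of our own) is cut( )'s include_lowest=True flag, which closes the first interval on the left so that age 15 is captured:

# include_lowest=True makes the first interval include its left edge (15):
pd.cut(atus_small['age'], [15, 25, 35, 45, 55, 65, 75], include_lowest=True).head()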
We could also make the bins closed on the left and open on the right by passing the right=False parameter to cut( ).
bins = [15, 25, 35, 45, 55, 65, 75] # These are the endpoints for the bins
atus_small['age_bin'] = pd.cut(atus_small['age'], bins, right=False)
atus_small.head(10)
age | gender | race | edu | metro | employ | day | work_main | work_other | work_soc | sleep | no_sleep | work | age_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | female | black | high school | 1 | 1 | 1 | 7.50 | 0.000000 | 0.0 | 12.133333 | 0.0 | 7.500000 | [25.0, 35.0) |
1 | 28 | female | white | some college | 1 | 5 | 7 | 0.00 | 0.000000 | 0.0 | 6.416667 | 0.0 | 0.000000 | [25.0, 35.0) |
2 | 15 | male | native_am | less than high | 1 | 5 | 4 | 0.00 | 0.000000 | 0.0 | 9.500000 | 0.0 | 0.000000 | [15.0, 25.0) |
3 | 46 | male | white | high school | 1 | 1 | 2 | 8.00 | 0.000000 | 0.0 | 8.750000 | 0.0 | 8.000000 | [45.0, 55.0) |
4 | 85 | male | white | master | 1 | 1 | 7 | 0.00 | 0.000000 | 0.0 | 12.600000 | 0.0 | 0.000000 | NaN |
5 | 60 | male | white | less than high | 1 | 1 | 7 | 0.00 | 0.000000 | 0.0 | 9.750000 | 0.0 | 0.000000 | [55.0, 65.0) |
6 | 44 | male | white | high school | 2 | 1 | 2 | 8.25 | 0.000000 | 0.0 | 5.750000 | 0.0 | 8.250000 | [35.0, 45.0) |
7 | 49 | female | black | bachelor | 1 | 1 | 3 | 0.00 | 7.433333 | 0.0 | 6.500000 | 0.0 | 7.433333 | [45.0, 55.0) |
8 | 63 | male | white | some college | 1 | 1 | 1 | 1.50 | 0.000000 | 0.0 | 7.666667 | 0.0 | 1.500000 | [55.0, 65.0) |
9 | 70 | female | white | master | 1 | 5 | 6 | 0.00 | 0.000000 | 0.0 | 8.000000 | 0.0 | 0.000000 | [65.0, 75.0) |
We know from before that cut( ) can create the bins for us, too. Pass cut( ) an integer and get back that many equal-width bins. The precision parameter determines how many decimal places are used in the bin edges.
atus_small['age_bin'] = pd.cut(atus_small['age'], 3, precision = 0)
atus_small.head(10)
age | gender | race | edu | metro | employ | day | work_main | work_other | work_soc | sleep | no_sleep | work | age_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | female | black | high school | 1 | 1 | 1 | 7.50 | 0.000000 | 0.0 | 12.133333 | 0.0 | 7.500000 | (15.0, 38.0] |
1 | 28 | female | white | some college | 1 | 5 | 7 | 0.00 | 0.000000 | 0.0 | 6.416667 | 0.0 | 0.000000 | (15.0, 38.0] |
2 | 15 | male | native_am | less than high | 1 | 5 | 4 | 0.00 | 0.000000 | 0.0 | 9.500000 | 0.0 | 0.000000 | (15.0, 38.0] |
3 | 46 | male | white | high school | 1 | 1 | 2 | 8.00 | 0.000000 | 0.0 | 8.750000 | 0.0 | 8.000000 | (38.0, 62.0] |
4 | 85 | male | white | master | 1 | 1 | 7 | 0.00 | 0.000000 | 0.0 | 12.600000 | 0.0 | 0.000000 | (62.0, 85.0] |
5 | 60 | male | white | less than high | 1 | 1 | 7 | 0.00 | 0.000000 | 0.0 | 9.750000 | 0.0 | 0.000000 | (38.0, 62.0] |
6 | 44 | male | white | high school | 2 | 1 | 2 | 8.25 | 0.000000 | 0.0 | 5.750000 | 0.0 | 8.250000 | (38.0, 62.0] |
7 | 49 | female | black | bachelor | 1 | 1 | 3 | 0.00 | 7.433333 | 0.0 | 6.500000 | 0.0 | 7.433333 | (38.0, 62.0] |
8 | 63 | male | white | some college | 1 | 1 | 1 | 1.50 | 0.000000 | 0.0 | 7.666667 | 0.0 | 1.500000 | (62.0, 85.0] |
9 | 70 | female | white | master | 1 | 5 | 6 | 0.00 | 0.000000 | 0.0 | 8.000000 | 0.0 | 0.000000 | (62.0, 85.0] |
pd.value_counts(atus_small['age_bin']) # How many observations in each bin?
(38.0, 62.0]    4033
(15.0, 38.0]    3176
(62.0, 85.0]    3014
Name: age_bin, dtype: int64
Quantiles
Notice that the bins are equally spaced from the smallest to the largest values of age; the number of observations in each is not equal. The qcut( ) function instead splits the data according to the sample quantiles. This should get you bins with approximately equal numbers of observations.
atus_small['age_bin'] = pd.qcut(atus_small['age'], 4) # create quartiles based on age
atus_small.head(10)
age | gender | race | edu | metro | employ | day | work_main | work_other | work_soc | sleep | no_sleep | work | age_bin | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 34 | female | black | high school | 1 | 1 | 1 | 7.50 | 0.000000 | 0.0 | 12.133333 | 0.0 | 7.500000 | (14.999, 35.0] |
1 | 28 | female | white | some college | 1 | 5 | 7 | 0.00 | 0.000000 | 0.0 | 6.416667 | 0.0 | 0.000000 | (14.999, 35.0] |
2 | 15 | male | native_am | less than high | 1 | 5 | 4 | 0.00 | 0.000000 | 0.0 | 9.500000 | 0.0 | 0.000000 | (14.999, 35.0] |
3 | 46 | male | white | high school | 1 | 1 | 2 | 8.00 | 0.000000 | 0.0 | 8.750000 | 0.0 | 8.000000 | (35.0, 50.0] |
4 | 85 | male | white | master | 1 | 1 | 7 | 0.00 | 0.000000 | 0.0 | 12.600000 | 0.0 | 0.000000 | (64.0, 85.0] |
5 | 60 | male | white | less than high | 1 | 1 | 7 | 0.00 | 0.000000 | 0.0 | 9.750000 | 0.0 | 0.000000 | (50.0, 64.0] |
6 | 44 | male | white | high school | 2 | 1 | 2 | 8.25 | 0.000000 | 0.0 | 5.750000 | 0.0 | 8.250000 | (35.0, 50.0] |
7 | 49 | female | black | bachelor | 1 | 1 | 3 | 0.00 | 7.433333 | 0.0 | 6.500000 | 0.0 | 7.433333 | (35.0, 50.0] |
8 | 63 | male | white | some college | 1 | 1 | 1 | 1.50 | 0.000000 | 0.0 | 7.666667 | 0.0 | 1.500000 | (50.0, 64.0] |
9 | 70 | female | white | master | 1 | 5 | 6 | 0.00 | 0.000000 | 0.0 | 8.000000 | 0.0 | 0.000000 | (64.0, 85.0] |
pd.value_counts(atus_small['age_bin']) # How many observations in each bin?
(35.0, 50.0]      2644
(14.999, 35.0]    2560
(50.0, 64.0]      2517
(64.0, 85.0]      2502
Name: age_bin, dtype: int64
String methods
We have seen some of these before when we discussed strings. These are analogous to the string methods but set up for DataFrames. These vectorized string methods are optimized to work over an entire column. The method call looks like

data['var'].str.method()

where method( ) is the method we are applying. A list of vectorized string methods is on page 218 of McKinney. Below, we try a few out.
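Before turning to real data, here is a tiny illustration on a throwaway Series (an example of our own):

s = pd.Series([' Apple ', 'banana', 'CHERRY'])
print(s.str.strip()) # remove leading/trailing whitespace
print(s.str.lower()) # lowercase each element
print(s.str.len())   # length of each string (the spaces in ' Apple ' count)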
movies = pd.read_csv('./Data/MovieLens/movies.csv')
movies.head()
movieId | title | genres | |
---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
str.contains( )
The genres are mixed together. Let's get all the comedies. The .contains() method returns a bool Series with True for observations in which the string contains the search term.
movies['genres'].str.contains('Comedy')
0        True
1       False
2        True
3        True
4        True
        ...
9737     True
9738     True
9739    False
9740    False
9741     True
Name: genres, Length: 9742, dtype: bool
comedies = movies[movies['genres'].str.contains('Comedy')]
comedies.head()
movieId | title | genres | |
---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
4 | 5 | Father of the Bride Part II (1995) | Comedy |
6 | 7 | Sabrina (1995) | Comedy|Romance |
str.split( )
This method splits the string at the delimiter passed to split( ). It returns a list of the chunks that fall between the delimiters.
# The movie genres are separated with the '|' character.
movies['genres'].str.split('|')
0       [Adventure, Animation, Children, Comedy, Fantasy]
1       [Adventure, Children, Fantasy]
2       [Comedy, Romance]
3       [Comedy, Drama, Romance]
4       [Comedy]
        ...
9737    [Action, Animation, Comedy, Fantasy]
9738    [Animation, Comedy, Fantasy]
9739    [Drama]
9740    [Action, Animation]
9741    [Comedy]
Name: genres, Length: 9742, dtype: object
This could be useful for processing name data that come in the form last,first or city,state.
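For instance, with a made-up 'last,first' column (hypothetical data, not part of MovieLens), passing expand=True spreads the pieces into separate DataFrame columns:

# Hypothetical example: split 'last,first' strings into two columns.
names = pd.Series(['Doe,John', 'Roe,Jane'])
print(names.str.split(',', expand=True)) # column 0 = last, column 1 = first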
# Remember, DataFrames can have columns of lists...
movies['genre_split'] = movies['genres'].str.split('|')
movies.head(10)
movieId | title | genres | genre_split | |
---|---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | [Adventure, Animation, Children, Comedy, Fantasy] |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy | [Adventure, Children, Fantasy] |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance | [Comedy, Romance] |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance | [Comedy, Drama, Romance] |
4 | 5 | Father of the Bride Part II (1995) | Comedy | [Comedy] |
5 | 6 | Heat (1995) | Action|Crime|Thriller | [Action, Crime, Thriller] |
6 | 7 | Sabrina (1995) | Comedy|Romance | [Comedy, Romance] |
7 | 8 | Tom and Huck (1995) | Adventure|Children | [Adventure, Children] |
8 | 9 | Sudden Death (1995) | Action | [Action] |
9 | 10 | GoldenEye (1995) | Action|Adventure|Thriller | [Action, Adventure, Thriller] |
str.join( )
Put the pieces of each list back together, separated by a delimiter of your choosing. I end up using this method quite a bit.
movies['genre_split'].str.join('::')
0       Adventure::Animation::Children::Comedy::Fantasy
1       Adventure::Children::Fantasy
2       Comedy::Romance
3       Comedy::Drama::Romance
4       Comedy
        ...
9737    Action::Animation::Comedy::Fantasy
9738    Animation::Comedy::Fantasy
9739    Drama
9740    Action::Animation
9741    Comedy
Name: genre_split, Length: 9742, dtype: object
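As a quick sanity check (our own addition), splitting and rejoining with the original delimiter should reproduce the genres column exactly:

# Round trip: split on '|' then join with '|' recovers the original strings.
print(movies['genre_split'].str.join('|').equals(movies['genres'])) # should print True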
Practice
What era had the best movies? Take a few minutes and try the following. Feel free to chat with those around you if you get stuck.
- Create a copy of the 'movies' DataFrame named 'movies_prac'. Work with 'movies_prac' for these exercises.
movies_prac = movies.copy() # the .copy() method creates another object
- Get rid of any leading or trailing white space in the 'title' variable by using the .str.strip() method.
movies_prac['title'] = movies_prac['title'].str.strip()
- Extract the four-digit year from the titles and put them into a new column named 'year'. You might try str.slice().
movies_prac['year'] = movies_prac['title'].str.slice(-5, -1) # The negative notation counts backwards.
# I am relying on the fact that each title ends in '(year)'.
# Not the most robust method.
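A sturdier approach (optional, not the graded solution) is a regular expression with str.extract( ), which grabs a parenthesized four-digit year wherever it appears in the title:

# Regex alternative (left commented out to keep the slice-based solution flow intact):
# movies_prac['year'] = movies_prac['title'].str.extract(r'\((\d{4})\)', expand=False)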
- There are 12 movies that do not have a year. Assign "-1" to those. I suggest using .loc[] and creating a boolean vector that identifies these observations using ~movies_prac['year'].str.isdigit()
movies_prac.loc[~movies_prac['year'].str.isdigit(),'year'] = -1 # The ~ is the boolean for "not"
# movies_prac['year'].str.isdigit() creates a series of TRUE and FALSE.
# ~movies_prac['year'].str.isdigit() picks out the entries with FALSE.
- Drop the movies without years.
movies_prac = movies_prac[ movies_prac['year']!=-1 ]
- Convert the year data to int.
movies_prac['year'] = movies_prac['year'].astype(int)
- 'Bin' the years into decades using .cut() with pre-defined bin edges: decades = [1899, 1910, 1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020]. Create a column called 'decade' to store the binned values.
decades = [1899, 1910, 1920, 1930, 1940, 1950, 1960, 1970, 1980, 1990, 2000, 2010, 2020]
movies_prac['decade'] = pd.cut(movies_prac['year'], decades)
movies_prac.head()
movieId | title | genres | genre_split | year | decade | |
---|---|---|---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | [Adventure, Animation, Children, Comedy, Fantasy] | 1995 | (1990, 2000] |
1 | 2 | Jumanji (1995) | Adventure|Children|Fantasy | [Adventure, Children, Fantasy] | 1995 | (1990, 2000] |
2 | 3 | Grumpier Old Men (1995) | Comedy|Romance | [Comedy, Romance] | 1995 | (1990, 2000] |
3 | 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance | [Comedy, Drama, Romance] | 1995 | (1990, 2000] |
4 | 5 | Father of the Bride Part II (1995) | Comedy | [Comedy] | 1995 | (1990, 2000] |
- Load 'ratings.csv' from the MovieLens data and merge it to movies_prac to create a new DataFrame named movies_2. Use an inner merge with 'movieId' as the key.
rate = pd.read_csv('./Data/MovieLens/ratings.csv')
movies_2 = pd.merge(left=movies_prac, right=rate, on='movieId', how='inner')
movies_2.head()
movieId | title | genres | genre_split | year | decade | userId | rating | timestamp | |
---|---|---|---|---|---|---|---|---|---|
0 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | [Adventure, Animation, Children, Comedy, Fantasy] | 1995 | (1990, 2000] | 1 | 4.0 | 964982703 |
1 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | [Adventure, Animation, Children, Comedy, Fantasy] | 1995 | (1990, 2000] | 5 | 4.0 | 847434962 |
2 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | [Adventure, Animation, Children, Comedy, Fantasy] | 1995 | (1990, 2000] | 7 | 4.5 | 1106635946 |
3 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | [Adventure, Animation, Children, Comedy, Fantasy] | 1995 | (1990, 2000] | 15 | 2.5 | 1510577970 |
4 | 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy | [Adventure, Animation, Children, Comedy, Fantasy] | 1995 | (1990, 2000] | 17 | 4.5 | 1305696483 |
- Run the following code. What does it do?

means = movies_2.groupby('decade')['rating'].mean()

It groups the movies by decade and computes the mean rating within each decade, storing the result in a Series named means indexed by decade.
- Time to create a graphic to answer our question: which era had the best movies? Create a horizontal bar chart of the average movie ratings. The index of means is an interval object. To assign it to the y-axis labels, convert it to a string: ax.barh(means.index.astype(str), means)
fig, ax = plt.subplots(figsize=(10,5))
ax.barh(means.index.astype(str), means, color = 'red', alpha = 0.25 )
ax.spines['top'].set_visible(False)
ax.spines['right'].set_visible(False)
ax.set_title('Average movie rating by decade')
ax.grid(axis = 'x', color='white')
plt.show()