Lecture 8: Reading \& writing data with Pandas [SUGGESTED SOLUTIONS]
We have seen some of the basic things we can do with Pandas. In doing so, we created some simple DataFrames from dicts. That was simple, but it is almost never how we create DataFrames in the wild.
Most data live in files, often as plain text, comma-separated values (CSV), or MS Excel workbooks, either on our computers or in the cloud. In this notebook, we will learn how to get data into (and out of) Pandas.
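As a quick reminder, here is the dict approach from last time (a minimal sketch; the numbers are a few GDP figures we will meet again below):
import pandas as pd
gdp_dict = {'year': [2015, 2016, 2017], 'gdp': [18219.297, 18707.189, 19485.394]}
df_from_dict = pd.DataFrame(gdp_dict)   # each dict key becomes a column
print(df_from_dict)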
Today, we will cover:
- Reading and writing CSV files
- Navigating our file structure
- Reading Excel workbooks
- Reading files directly from the web
Class Announcements
PS2 (due 9/18) is on eLC.
Files Needed
The files below are located on the course website and should be downloaded to your working directory prior to class.
File | Description
---|---
gdp_components.csv | GDP components
debt.xlsx | GDP, Federal Debt, 10-year Treasury rates
import pandas as pd # load the pandas package and call it pd
import os # The package name is already short enough. No need to rename it.
# The following is a bit of extra code to allow you to download the PISA scores from the internet at the end
# of the lecture. It's telling the computer to ignore ssl certificate errors.
import ssl
if (not os.environ.get('PYTHONHTTPSVERIFY', '') and getattr(ssl, '_create_unverified_context', None)):
    ssl._create_default_https_context = ssl._create_unverified_context
If you have not already, move the gdp_components.csv file to your local drive and put it in the same folder that holds this notebook. We expect this file to contain U.S. GDP and its major components. Let's see.
gdp = pd.read_csv('gdp_components.csv') # read_csv is a part of Pandas, so we need the pd.
print(type(gdp)) # What have we got here?
<class 'pandas.core.frame.DataFrame'>
This looks successful. read_csv() takes a string with the file name and creates a DataFrame. Let's take a look at the data.
print(gdp)
          DATE       GDPA     GPDIA      GCEA    EXPGSA    IMPGSA
0   1929-01-01    104.556    17.170     9.622     5.939     5.556
1   1930-01-01     92.160    11.428    10.273     4.444     4.121
2   1931-01-01     77.391     6.549    10.169     2.906     2.905
3   1932-01-01     59.522     1.819     8.946     1.975     1.932
4   1933-01-01     57.154     2.276     8.875     1.987     1.929
..         ...        ...       ...       ...       ...       ...
84  2013-01-01  16784.851  2826.013  3132.409  2273.428  2764.210
85  2014-01-01  17521.747  3038.931  3167.041  2371.027  2879.284
86  2015-01-01  18219.297  3211.971  3234.210  2265.047  2786.461
87  2016-01-01  18707.189  3169.887  3290.979  2217.576  2738.146
88  2017-01-01  19485.394  3367.965  3374.444  2350.175  2928.596

[89 rows x 6 columns]
Even though Jupyter hid the middle rows, this is still a bit obnoxious. We can use the head() and tail() methods of DataFrame to peek at just the first or last few rows.
gdp.head(4) # Show the first 4 rows.
DATE | GDPA | GPDIA | GCEA | EXPGSA | IMPGSA | |
---|---|---|---|---|---|---|
0 | 1929-01-01 | 104.556 | 17.170 | 9.622 | 5.939 | 5.556 |
1 | 1930-01-01 | 92.160 | 11.428 | 10.273 | 4.444 | 4.121 |
2 | 1931-01-01 | 77.391 | 6.549 | 10.169 | 2.906 | 2.905 |
3 | 1932-01-01 | 59.522 | 1.819 | 8.946 | 1.975 | 1.932 |
If you do not pass head() or tail() an argument, it defaults to 5 rows.
gdp.tail()
DATE | GDPA | GPDIA | GCEA | EXPGSA | IMPGSA | |
---|---|---|---|---|---|---|
84 | 2013-01-01 | 16784.851 | 2826.013 | 3132.409 | 2273.428 | 2764.210 |
85 | 2014-01-01 | 17521.747 | 3038.931 | 3167.041 | 2371.027 | 2879.284 |
86 | 2015-01-01 | 18219.297 | 3211.971 | 3234.210 | 2265.047 | 2786.461 |
87 | 2016-01-01 | 18707.189 | 3169.887 | 3290.979 | 2217.576 | 2738.146 |
88 | 2017-01-01 | 19485.394 | 3367.965 | 3374.444 | 2350.175 | 2928.596 |
You can also ask for a random sample of the rows using .sample(), where the input argument is, you guessed it, the number of rows. Without an argument, sample() returns just a single row, so pass it something (e.g., 5).
gdp.sample(5)
DATE | GDPA | GPDIA | GCEA | EXPGSA | IMPGSA | |
---|---|---|---|---|---|---|
39 | 1968-01-01 | 940.651 | 156.923 | 225.526 | 47.906 | 46.556 |
43 | 1972-01-01 | 1279.110 | 228.140 | 286.189 | 70.843 | 74.216 |
80 | 2009-01-01 | 14448.932 | 1929.664 | 3073.512 | 1581.996 | 1978.447 |
46 | 1975-01-01 | 1684.904 | 257.253 | 381.128 | 138.706 | 122.730 |
7 | 1936-01-01 | 84.830 | 9.391 | 13.398 | 3.007 | 3.154 |
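Since sample() draws rows at random, your five rows will differ from mine. If you want a reproducible draw, the random_state argument fixes the seed:
gdp.sample(5, random_state=42)   # the same five rows on every run (42 is an arbitrary seed)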
The index isn't very sensible. This is time series data (the unit of observation is a year), so the date seems like a good index. How do we set the index?
gdp_new_index = gdp.set_index('DATE') # We could use 'inplace = True' if we didn't need a copy.
print(gdp_new_index.head())
               GDPA   GPDIA    GCEA  EXPGSA  IMPGSA
DATE
1929-01-01  104.556  17.170   9.622   5.939   5.556
1930-01-01   92.160  11.428  10.273   4.444   4.121
1931-01-01   77.391   6.549  10.169   2.906   2.905
1932-01-01   59.522   1.819   8.946   1.975   1.932
1933-01-01   57.154   2.276   8.875   1.987   1.929
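Two quick asides (both standard pandas behavior): passing inplace=True modifies a DataFrame directly instead of returning a copy, and the parse_dates argument of read_csv() converts the date strings into proper datetime values as the file is read.
gdp_inplace = gdp.copy()                      # work on a copy so gdp itself is unchanged
gdp_inplace.set_index('DATE', inplace=True)   # no assignment needed; modifies gdp_inplace directly

# parse_dates=True asks read_csv to parse the index column as datetimes
gdp_dates = pd.read_csv('gdp_components.csv', index_col=0, parse_dates=True)
print(gdp_dates.index.dtype)                  # datetime64[ns] rather than object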
We can also set the index as we read in the file. Let's take a look at the read_csv() function.
pd.read_csv?
I'm seeing a lot of good stuff here: index_col, usecols, header, sep... some stuff I don't know about, too. When reading in messy files, these extra arguments may come in handy.
Let's give index_col a try.
gdp_2 = pd.read_csv('gdp_components.csv', index_col=0) # Treat the CSV like a DataFrame. Count columns starting with 0.
gdp_2.head()
GDPA | GPDIA | GCEA | EXPGSA | IMPGSA | |
---|---|---|---|---|---|
DATE | |||||
1929-01-01 | 104.556 | 17.170 | 9.622 | 5.939 | 5.556 |
1930-01-01 | 92.160 | 11.428 | 10.273 | 4.444 | 4.121 |
1931-01-01 | 77.391 | 6.549 | 10.169 | 2.906 | 2.905 |
1932-01-01 | 59.522 | 1.819 | 8.946 | 1.975 | 1.932 |
1933-01-01 | 57.154 | 2.276 | 8.875 | 1.987 | 1.929 |
Navigating your file structure
We dumped our file into our current working directory so we could just ask for the file name gdp_components.csv in read_csv(). What is our current working directory (cwd)?
path_to_cwd = os.getcwd() # getcwd() is part of the os package we imported earlier
print(path_to_cwd)
C:\Users\jt83241\Dropbox\Teaching\Undergrad\Data Science for Business and Economics\Lectures
When we gave read_csv() gdp_components.csv, it looked in our cwd for the file. Let's try something more complicated. Make a copy of the gdp_components file and paste it into your Data folder. Rename the file gdp_components_moved.csv.
gdp_moved = pd.read_csv('gdp_components_moved.csv')
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
    ...(long traceback through pandas' read_csv internals omitted)...
FileNotFoundError: [Errno 2] No such file or directory: 'gdp_components_moved.csv'
Of course this doesn't work. The file is not in our cwd. It's good to see what that kind of error message looks like. We need to pass read_csv() the path to the file. The path is the hierarchy of folders that contains the file. In my case, the path is
C:\Users\jt83241\Dropbox\Teaching\Undergrad\Data Science for Business and Economics\Lectures
Note that there is a / each time we list a new folder. We can also use ./ where the dot means the cwd. Ok, I lied a little: Windows uses backslashes while the rest of the world uses forward slashes. The reasons are mostly historical.
gdp_moved = pd.read_csv('./Data/gdp_components_moved.csv')
gdp_moved.head()
We could have manipulated some strings to get to this, too. This approach might be useful if you needed to read in many files from the same place, maybe using a for loop and a list of file names. (There's a sketch of that below.)
path_to_cwd = os.getcwd()
file_name = 'gdp_components_moved.csv'
path_to_data_file = path_to_cwd + '/Data/' + file_name
print(path_to_data_file)
C:\Users\jt83241\Dropbox\Teaching\Undergrad\Data Science for Business and Economics\Lectures/Data/gdp_components_moved.csv
gdp_moved = pd.read_csv(path_to_data_file, index_col=0)
gdp_moved.head()
GDPA | GPDIA | GCEA | EXPGSA | IMPGSA | |
---|---|---|---|---|---|
DATE | |||||
1929-01-01 | 104.556 | 17.170 | 9.622 | 5.939 | 5.556 |
1930-01-01 | 92.160 | 11.428 | 10.273 | 4.444 | 4.121 |
1931-01-01 | 77.391 | 6.549 | 10.169 | 2.906 | 2.905 |
1932-01-01 | 59.522 | 1.819 | 8.946 | 1.975 | 1.932 |
1933-01-01 | 57.154 | 2.276 | 8.875 | 1.987 | 1.929 |
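Here is a sketch of the for-loop idea, assuming a list of file names that all live in the Data folder (os.path.join inserts the slashes for us, and pd.concat stacks the resulting DataFrames):
file_names = ['gdp_components_moved.csv']   # imagine a longer list of files here
frames = []
for f in file_names:
    frames.append(pd.read_csv(os.path.join(os.getcwd(), 'Data', f), index_col=0))
combined = pd.concat(frames)                # stack the DataFrames into one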
Practice: Reading (and Writing) CSVs
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck.
- Try out the to_csv() method of DataFrame. Save gdp_moved as 'gdp_moved_2.csv' in your cwd. [You can use ? if you need help.]
gdp_moved.to_csv('gdp_moved_2.csv')
- Use to_csv() again to save gdp_moved to the Data folder. Name it 'gdp_moved_3.csv'.
gdp_moved.to_csv('./Data/gdp_moved_3.csv')
Are your files in the correct places?
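One way to check without leaving the notebook, using the os package we imported earlier:
print(os.path.exists('gdp_moved_2.csv'))          # True if the file landed in the cwd
print(os.path.exists('./Data/gdp_moved_3.csv'))   # True if it landed in the Data folder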
Isn't this supposed to be practice reading in CSV files? Right. Let's do some of that.
- Use gdp_moved_3.csv to create a DataFrame named gdp_growth. Set the index to the dates. Print out the first 10 years of data.
gdp_growth = pd.read_csv('./Data/gdp_moved_3.csv', index_col=0)
gdp_growth.head(10)
GDPA | GPDIA | GCEA | EXPGSA | IMPGSA | |
---|---|---|---|---|---|
DATE | |||||
1929-01-01 | 104.556 | 17.170 | 9.622 | 5.939 | 5.556 |
1930-01-01 | 92.160 | 11.428 | 10.273 | 4.444 | 4.121 |
1931-01-01 | 77.391 | 6.549 | 10.169 | 2.906 | 2.905 |
1932-01-01 | 59.522 | 1.819 | 8.946 | 1.975 | 1.932 |
1933-01-01 | 57.154 | 2.276 | 8.875 | 1.987 | 1.929 |
1934-01-01 | 66.800 | 4.296 | 10.721 | 2.561 | 2.239 |
1935-01-01 | 74.241 | 7.370 | 11.151 | 2.769 | 2.982 |
1936-01-01 | 84.830 | 9.391 | 13.398 | 3.007 | 3.154 |
1937-01-01 | 93.003 | 12.967 | 13.119 | 4.039 | 3.961 |
1938-01-01 | 87.352 | 7.944 | 14.170 | 3.811 | 2.845 |
- Rename 'GDPA' to 'gdp' and rename 'GCEA' to 'gov'
gdp_growth.rename(columns={'GDPA':'gdp', 'GCEA':'gov'}, inplace=True)
print(gdp_growth.head())
                gdp   GPDIA     gov  EXPGSA  IMPGSA
DATE
1929-01-01  104.556  17.170   9.622   5.939   5.556
1930-01-01   92.160  11.428  10.273   4.444   4.121
1931-01-01   77.391   6.549  10.169   2.906   2.905
1932-01-01   59.522   1.819   8.946   1.975   1.932
1933-01-01   57.154   2.276   8.875   1.987   1.929
Reading Excel spreadsheets
Reading spreadsheets isn't much different from reading csv files. But since workbooks are more complicated than csv files, we have a few more options to consider.
If you haven't already, copy over debt.xlsx to your cwd. Let's open it in Excel and have a look at it...
There's a lot going on here: missing data, some #N/A stuff, and several header rows. Let's get to work.
debt = pd.read_excel('./Data/debt.xlsx')
debt.head(10)
FRED Graph Observations | Unnamed: 1 | Unnamed: 2 | Unnamed: 3 | |
---|---|---|---|---|
0 | Federal Reserve Economic Data | NaN | NaN | NaN |
1 | Link: https://fred.stlouisfed.org | NaN | NaN | NaN |
2 | Help: https://fred.stlouisfed.org/help-faq | NaN | NaN | NaN |
3 | Economic Research Division | NaN | NaN | NaN |
4 | Federal Reserve Bank of St. Louis | NaN | NaN | NaN |
5 | NaN | NaN | NaN | NaN |
6 | GDPA | Gross Domestic Product, Billions of Dollars, A... | NaN | NaN |
7 | GFDEBTN | Federal Debt: Total Public Debt, Millions of D... | NaN | NaN |
8 | DGS10 | 10-Year Treasury Constant Maturity Rate, Perce... | NaN | NaN |
9 | NaN | NaN | NaN | NaN |
This type of thing is probably pretty common, so my guess is someone wrote pandas code to load the data more efficiently. Let's learn more about pd.read_excel to see if we can figure out a better way to load the data.
pd.read_excel?
Let's use the header argument to specify the row to use as the column names and -- as usual -- the notation is zero-based.
debt = pd.read_excel('./Data/debt.xlsx', header = 12)
print(debt.head())
print('\n')
print(debt.tail())
  observation_date     GDPA  GFDEBTN  DGS10
0       1929-01-01  104.556      NaN    NaN
1       1930-01-01   92.160      NaN    NaN
2       1931-01-01   77.391      NaN    NaN
3       1932-01-01   59.522      NaN    NaN
4       1933-01-01   57.154      NaN    NaN


   observation_date       GDPA      GFDEBTN     DGS10
85       2014-01-01  17521.747  17799837.00  2.539560
86       2015-01-01  18219.297  18344212.75  2.138287
87       2016-01-01  18707.189  19549200.50  1.837440
88       2017-01-01  19485.394  20107155.25  2.329480
89       2018-01-01        NaN          NaN       NaN
That's looking good. Notice that Pandas added NaN for the missing data and for those #N/A entries. We will have to deal with those at some point. The header parameter is part of read_csv(), too.
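An aside: if a file used markers that Pandas did not recognize automatically, the na_values argument (available in both read_excel() and read_csv()) lists extra strings to treat as missing. A quick sketch; the markers here are just illustrative:
debt_custom = pd.read_excel('./Data/debt.xlsx', header=12, na_values=['#N/A', 'missing'])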
We didn't specify which sheet in the workbook to load, so Pandas took the first one. The documentation says we can ask for sheets by name.
debt_q = pd.read_excel('./Data/debt.xlsx', header=12, sheet_name='quarterly')
print(debt_q.head())
print('\n')
print(debt_q.tail())
  observation_date  GFDEBTN  DGS10      GDP
0       1947-01-01      NaN    NaN  243.164
1       1947-04-01      NaN    NaN  245.968
2       1947-07-01      NaN    NaN  249.585
3       1947-10-01      NaN    NaN  259.745
4       1948-01-01      NaN    NaN  265.742


    observation_date     GFDEBTN     DGS10        GDP
281       2017-04-01  19844554.0  2.260952  19359.123
282       2017-07-01  20244900.0  2.241429  19588.074
283       2017-10-01  20492747.0  2.371452  19831.829
284       2018-01-01  21089643.0  2.758525  20041.047
285       2018-04-01  21195070.0  2.920625  20411.924
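A related trick: passing sheet_name=None loads every sheet at once, returning a dict that maps sheet names to DataFrames.
all_sheets = pd.read_excel('./Data/debt.xlsx', header=12, sheet_name=None)
print(all_sheets.keys())               # one key per sheet in the workbook
debt_q2 = all_sheets['quarterly']      # grab a single sheet from the dict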
The documentation also tells us that we can ask for just a subset of the columns when reading in a file (csv or xlsx) using the usecols argument. This takes either integers or Excel column letters.
# Take the first and third columns of sheet 'quarterly'
interest_rates = pd.read_excel('./Data/debt.xlsx', header=12, sheet_name='quarterly', usecols=[0,2])
interest_rates.head()
observation_date | DGS10 | |
---|---|---|
0 | 1947-01-01 | NaN |
1 | 1947-04-01 | NaN |
2 | 1947-07-01 | NaN |
3 | 1947-10-01 | NaN |
4 | 1948-01-01 | NaN |
Practice: Reading Excel
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck.
- Read in the quarterly data from 'debt.xlsx' and keep only the columns with the date, GDP, and GFDEBTN. Name your new DataFrame fed_debt.
fed_debt = pd.read_excel('./Data/debt.xlsx', header=12, sheet_name='quarterly', usecols=[0,1,3])
fed_debt.head()
observation_date | GFDEBTN | GDP | |
---|---|---|---|
0 | 1947-01-01 | NaN | 243.164 |
1 | 1947-04-01 | NaN | 245.968 |
2 | 1947-07-01 | NaN | 249.585 |
3 | 1947-10-01 | NaN | 259.745 |
4 | 1948-01-01 | NaN | 265.742 |
- Oops, I wanted to set the observation_date to the index. Go back and add that to your solution to 1.
fed_debt = pd.read_excel('./Data/debt.xlsx', header=12, sheet_name='quarterly', usecols=[0,1,3], index_col=0)
fed_debt.head()
GFDEBTN | GDP | |
---|---|---|
observation_date | ||
1947-01-01 | NaN | 243.164 |
1947-04-01 | NaN | 245.968 |
1947-07-01 | NaN | 249.585 |
1947-10-01 | NaN | 259.745 |
1948-01-01 | NaN | 265.742 |
- What is 'GFDEBTN'? It is the federal debt, in millions. Rename this variable to 'DEBT'
fed_debt.rename(columns={'GFDEBTN':'DEBT'}, inplace=True)
fed_debt.head()
DEBT | GDP | |
---|---|---|
observation_date | ||
1947-01-01 | NaN | 243.164 |
1947-04-01 | NaN | 245.968 |
1947-07-01 | NaN | 249.585 |
1947-10-01 | NaN | 259.745 |
1948-01-01 | NaN | 265.742 |
- Create a variable named debt_ratio that is the debt-to-GDP ratio. Debt is in millions and GDP is in billions. Adjust accordingly. Print the last eight rows.
fed_debt['debt_ratio'] = (fed_debt['DEBT']/1000)/fed_debt['GDP']   # debt is in millions, GDP in billions
fed_debt.sample(8)   # a random peek at 8 rows; fed_debt.tail(8) would give literally the last eight
DEBT | GDP | debt_ratio | |
---|---|---|---|
observation_date | |||
1970-01-01 | 372007.0 | 1051.200 | 0.353888 |
1995-04-01 | 4951372.0 | 7580.997 | 0.653129 |
1966-07-01 | 324748.0 | 819.638 | 0.396209 |
2001-10-01 | 5943439.0 | 10660.294 | 0.557530 |
2007-01-01 | 8849665.0 | 14208.569 | 0.622840 |
1966-10-01 | 329319.0 | 833.302 | 0.395198 |
1952-10-01 | NaN | 380.812 | NaN |
1981-01-01 | 964531.0 | 3124.206 | 0.308728 |
There are a lot of missing debt values. Did Pandas throw an error? No. Pandas knows (in some cases) how to work around missing data.
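For instance, most summary methods quietly skip the NaNs; they take a skipna argument that defaults to True.
print(fed_debt['debt_ratio'].mean())               # NaNs are ignored in the average
print(fed_debt['debt_ratio'].mean(skipna=False))   # force NaNs to count: the result is NaN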
- Summarize the debt_ratio variable using the .describe() method. What is its max level? Its min?
print(fed_debt['debt_ratio'].describe())
count    210.000000
mean       0.564994
std        0.227520
min        0.306033
25%        0.355102
50%        0.555767
75%        0.641648
max        1.052562
Name: debt_ratio, dtype: float64
Reading data from the web
Both read_csv() and read_excel() also accept a URL in place of a file path, so we can pull files straight from the internet.
# Read in the Penn World Table data
url = "http://www.rug.nl/ggdc/docs/pwt81.xlsx"
pwt = pd.read_excel(url, sheet_name= "Data")
pwt.head()
countrycode | country | currency_unit | year | rgdpe | rgdpo | pop | emp | avh | hc | ... | csh_g | csh_x | csh_m | csh_r | pl_c | pl_i | pl_g | pl_x | pl_m | pl_k | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | AGO | Angola | Kwanza | 1950 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
1 | AGO | Angola | Kwanza | 1951 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
2 | AGO | Angola | Kwanza | 1952 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
3 | AGO | Angola | Kwanza | 1953 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
4 | AGO | Angola | Kwanza | 1954 | NaN | NaN | NaN | NaN | NaN | NaN | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
5 rows × 47 columns
That took a few seconds --- this is a pretty big file.
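Since the download is slow, one option (a sketch, with a cache file name of my choosing) is to save a local copy the first time and read from disk on later runs:
pwt.to_csv('./Data/pwt81_cached.csv')                       # save once...
pwt = pd.read_csv('./Data/pwt81_cached.csv', index_col=0)   # ...and reload locally next time
Let's try an example from McKinney's book: baby name counts for a year. I wonder what was trendy in 1880.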
baby_url = 'https://raw.githubusercontent.com/wesm/pydata-book/2nd-edition/datasets/babynames'
# What was trendy in 1880?
old_names = pd.read_csv(baby_url + '/yob1880.txt')
old_names.head()
Mary | F | 7065 | |
---|---|---|---|
0 | Anna | F | 2604 |
1 | Emma | F | 2003 |
2 | Elizabeth | F | 1939 |
3 | Minnie | F | 1746 |
4 | Margaret | F | 1578 |
We've lost Mary, which looks pretty popular. It was used as a column label. Not good. Let's fix this by specifying header=None to tell the computer not to import a header (i.e., don't import column names).
old_names = pd.read_csv(baby_url + '/yob1880.txt', header=None)
old_names.head()
0 | 1 | 2 | |
---|---|---|---|
0 | Mary | F | 7065 |
1 | Anna | F | 2604 |
2 | Emma | F | 2003 |
3 | Elizabeth | F | 1939 |
4 | Minnie | F | 1746 |
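An alternative (also in the read_csv signature) is the names argument, which supplies the column labels up front:
old_names_named = pd.read_csv(baby_url + '/yob1880.txt', header=None,
                              names=['name', 'sex', 'count'])   # labels of our choosing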
old_names = pd.read_csv(baby_url + '/yob2009.txt', header=None)   # now the 2009 file
old_names.head()
0 | 1 | 2 | |
---|---|---|---|
0 | Isabella | F | 22222 |
1 | Emma | F | 17830 |
2 | Olivia | F | 17374 |
3 | Sophia | F | 16869 |
4 | Ava | F | 15826 |
old_names.rename(columns={0:'name', 1:'sex', 2:'count'}, inplace=True)
old_names.head()
name | sex | count | |
---|---|---|---|
0 | Isabella | F | 22222 |
1 | Emma | F | 17830 |
2 | Olivia | F | 17374 |
3 | Sophia | F | 16869 |
4 | Ava | F | 15826 |
- What are the two most popular female names in 2009? Try the .sort_values() method but be sure to learn about it first...
# 1. Isolate female names
old_names_female=old_names[old_names['sex']=='F']
# 2. Sort in descending order
old_names_female_sorted = old_names_female.sort_values(by=['count'], ascending=False)
old_names_female_sorted.head(2)
name | sex | count | |
---|---|---|---|
0 | Isabella | F | 22222 |
1 | Emma | F | 17830 |
- What are the two least popular female names in 2009? Similar drill. If you got the answer to the previous question, this will be easy.
old_names_female_sorted.tail(2)
name | sex | count | |
---|---|---|---|
18193 | Gilda | F | 5 |
20122 | Zyriel | F | 5 |
PISA Scores
Run the following code to create a DataFrame with mean scores in math, reading, and science.
url = 'http://dx.doi.org/10.1787/888932937035'
pisa = pd.read_excel(url,
usecols=[0,1,9,13]) # select columns of interest
Note that we're not setting the index yet. There is some junk at the top and bottom of the spreadsheet.
- Use the skiprows and skipfooter arguments to find the good data.
url = 'http://dx.doi.org/10.1787/888932937035'
pisa = pd.read_excel(url,
skiprows=18, # skip the first 18 rows
skipfooter=7, # skip the last 7
usecols=[0,1,9,13], # select columns of interest
#index_col=0, # set the index as the first column
#header=[0,1] # set the variable names
)
pisa
Unnamed: 0 | Mathematics | Reading | Science | |
---|---|---|---|---|
0 | NaN | Mean score in PISA 2012 | Mean score in PISA 2012 | Mean score in PISA 2012 |
1 | NaN | NaN | NaN | NaN |
2 | OECD average | 494.046447 | 496.462864 | 501.159793 |
3 | NaN | NaN | NaN | NaN |
4 | Shanghai-China | 612.675536 | 569.588408 | 580.117831 |
... | ... | ... | ... | ... |
64 | Jordan | 385.595556 | 399.034747 | 409.367215 |
65 | Colombia | 376.488601 | 403.402534 | 398.678632 |
66 | Qatar | 376.448399 | 387.504131 | 383.642553 |
67 | Indonesia | 375.114452 | 396.120095 | 381.911485 |
68 | Peru | 368.102547 | 384.151223 | 373.113448 |
69 rows × 4 columns
- There are a lot of missing values. Let's clean up the data. Use the .dropna() method to get rid of rows with NaN. As always, be sure to learn about the function so you can use it properly.
pisa2 = pisa.dropna()
pisa2
Unnamed: 0 | Mathematics | Reading | Science | |
---|---|---|---|---|
2 | OECD average | 494.046447 | 496.462864 | 501.159793 |
4 | Shanghai-China | 612.675536 | 569.588408 | 580.117831 |
5 | Singapore | 573.468314 | 542.215834 | 551.493157 |
6 | Hong Kong-China | 561.241096 | 544.600086 | 554.937434 |
7 | Chinese Taipei | 559.824796 | 523.118904 | 523.314904 |
... | ... | ... | ... | ... |
64 | Jordan | 385.595556 | 399.034747 | 409.367215 |
65 | Colombia | 376.488601 | 403.402534 | 398.678632 |
66 | Qatar | 376.448399 | 387.504131 | 383.642553 |
67 | Indonesia | 375.114452 | 396.120095 | 381.911485 |
68 | Peru | 368.102547 | 384.151223 | 373.113448 |
66 rows × 4 columns
- Make the country names the index.
pisa2.set_index('Unnamed: 0', inplace=True)
pisa2
Mathematics | Reading | Science | |
---|---|---|---|
Unnamed: 0 | |||
OECD average | 494.046447 | 496.462864 | 501.159793 |
Shanghai-China | 612.675536 | 569.588408 | 580.117831 |
Singapore | 573.468314 | 542.215834 | 551.493157 |
Hong Kong-China | 561.241096 | 544.600086 | 554.937434 |
Chinese Taipei | 559.824796 | 523.118904 | 523.314904 |
... | ... | ... | ... |
Jordan | 385.595556 | 399.034747 | 409.367215 |
Colombia | 376.488601 | 403.402534 | 398.678632 |
Qatar | 376.448399 | 387.504131 | 383.642553 |
Indonesia | 375.114452 | 396.120095 | 381.911485 |
Peru | 368.102547 | 384.151223 | 373.113448 |
66 rows × 3 columns
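That leftover 'Unnamed: 0' label on the index is ugly. One way to tidy it (the name 'country' is my choice):
pisa2.index.name = 'country'   # relabel the index in place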
- What percent is the US PISA score compared to the "OECD average"?
print(pisa2.loc['United States']/pisa2.loc['OECD average'])
Mathematics    0.974335
Reading        1.002254
Science        0.992517
dtype: object
- Challenging. How correlated are PISA math, reading, and science scores with each other? Write the correlation matrix to a file called 'pisa_corr.csv' using .to_csv(). This is a challenging question because, depending on how you read in the data, your columns are probably of type 'object' and corr() won't work. Convert the three columns to numbers and then find the correlation matrix using the .corr() method.
pisa2.columns = ['math', 'read', 'sci'] # changing the column names (not necessary)
# Brute force
for var in pisa2.columns:
    pisa2[var] = pisa2[var].astype('float')
pisa2.corr()
C:\Users\jt83241\AppData\Local\Temp\ipykernel_20352\3479768300.py:5: SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  pisa2[var] = pisa2[var].astype('float')
math | read | sci | |
---|---|---|---|
math | 1.000000 | 0.959806 | 0.972131 |
read | 0.959806 | 1.000000 | 0.978559 |
sci | 0.972131 | 0.978559 | 1.000000 |
# Fancier approach. The "apply" method applies the function "pd.to_numeric" to each column
# in the list ['math', 'read', 'sci']. We can "apply" any function, even ones we've created.
pisa2=pisa2[['math', 'read', 'sci']].apply(pd.to_numeric)
pisa2.corr()
math | read | sci | |
---|---|---|---|
math | 1.000000 | 0.959806 | 0.972131 |
read | 0.959806 | 1.000000 | 0.978559 |
sci | 0.972131 | 0.978559 | 1.000000 |
# Save the correlation matrix (not the data itself) as csv
pisa2.corr().to_csv('./pisa_corr.csv')