Lecture 7: The "Pandas" Package [SUGGESTED SOLUTIONS]
We have a handle on python now: we understand the data structures and enough about working with them to move on to material more directly relevant to data analysis. That means digging into pandas, the package that provides the core data structures and functions for data work.
Pandas is a package. We have seen the numpy and scipy packages before, but we haven't discussed packages in any detail yet. Let's brush up on packages before we jump into pandas.
Today, we will cover: packages, an introduction to pandas and its data structures, DataFrames in more depth, and calculations on DataFrames.
1. Packages
We can think of 'python' as a core set of libraries (the python standard libraries) that contain the basic python data structures (int, str, list, dict, etc.) and functions (print(), len(), sum(), etc.). We can add additional data structures and functions to core python. Developers group related functions and data structures into packages, which we add to core python to increase its functionality.
On winstat, we are working with the Anaconda distribution of python. The Anaconda distribution bundles together basic python with common and useful packages such as
- numpy: functions for numerical computing (ln(), exp(), sqrt(),...)
- scipy: functions for math and engineering (statistical functions, signal processing, linear algebra...)
- matplotlib: a plotting interface (bar, scatter, time series plots,...)
- pandas: functions and data structures for data analysis (data management, merging, grouping...)
At the end of the semester, when you move away from winstat, you can install Anaconda (it's free) on your own computer and continue your python adventures.
Anaconda already installed the packages for us, but we still need to tell python that we want to use the packages. This way, we only load into memory the packages we need.
To add a package, we use the import statement. For example, to load the numpy package, we use:
import numpy as np
The statement says to add the numpy package (import it) and to give it a shortened name of 'np'. We don't have to use the name np --- we could name it George --- but np is short and informative.
Why do we want a shorter name than numpy? When we use a function from the numpy package, we use the dot notation. Here, we use the log function:
y = 10
x = np.log(y)
The np. tells python to look for the log() function in the package called np. Python knows that np means numpy, and everything works. [See why calling it George would have looked strange?]
We only need to import a package once. We typically group all of our imports at the beginning of the code. Here we go:
import pandas as pd #load the pandas package and call it pd
import numpy as np #load the numpy package and call it np
When you run the cell, you might notice that the text to the left of the cell looked like In[*] for a while. That * means that python was working. It takes a few seconds to load all those functions. Now run whos to see what's in the namespace:
whos
Variable   Type      Data/Info
------------------------------
np         module    <module 'numpy' from 'C:\<...>ges\\numpy\\__init__.py'>
pd         module    <module 'pandas' from 'C:<...>es\\pandas\\__init__.py'>
We see two module variables ready for action. As usual, we can use ? to check them out.
np?
random seems useful. Let's see what's in there...
np.random?
Let's generate some random numbers.
np.random.random_sample(3)
array([0.20313687, 0.47875593, 0.28859266])
A full list of the packages included in the Anaconda distribution can be found in the Anaconda documentation. If we need a package that is not already a part of Anaconda, we will need to install it. More on that later, but the big idea is that python can be extended by installing and adding packages.
2. Pandas
Pandas (panel data) is our workhorse package for data analysis. In addition to all the things we can do with core python, pandas gives us many powerful functions and data types for working with data. If you are familiar with STATA, you will find many parallels between python and STATA. This makes sense: both are meant to handle panel data. Pandas was invented by Wes McKinney (sound familiar?) at the quantitative hedge fund AQR. One of the reasons we have chosen python for this course is its prevalence in the financial sector.
To make the pandas package available, we import it into our code. We have already imported it above, but we can do it again. It won't hurt.
import pandas as pd # The most common short name for pandas is pd. We will go with that.
Pandas data structures: DataFrames
Pandas is built around two main data structures: the series and the dataframe. A dataframe is a 'rectangular' data structure, like a Microsoft Excel spreadsheet: it is made up of rows and columns. [But it is much, much more powerful than a spreadsheet.]
Let's create a dataframe. To keep things simple (they will get complicated later!) we will create a dataframe from a dict.
data_dict = {'year': [1990, 1995, 2000, 2005 ], 'gdp':[5.9, 7.6, 10.2, 13.0], 'cons':[3.8, 4.9, 6.8, 8.7]}
print('First, print the dict:')
print(data_dict) # print the dict
df = pd.DataFrame(data_dict) # create a data frame called df from the dict.
# use the 'pd.' syntax to call the DataFrame constructor
print('\nNext, print the DataFrame and its type\n')
print(df) # Print the dataframe. Again, print() just knows what to do!
print('\n', type(df))
First, print the dict:
{'year': [1990, 1995, 2000, 2005], 'gdp': [5.9, 7.6, 10.2, 13.0], 'cons': [3.8, 4.9, 6.8, 8.7]}

Next, print the DataFrame and its type

   year   gdp  cons
0  1990   5.9   3.8
1  1995   7.6   4.9
2  2000  10.2   6.8
3  2005  13.0   8.7

<class 'pandas.core.frame.DataFrame'>
Nice. Compared to the dict, the dataframe is much easier to read when printed. Let's break it down:
df = pd.DataFrame(data_dict)
We are creating a DataFrame object. We can see that when we print out df's type. The stuff to the left of DataFrame in the printout is the hierarchy: DataFrame is part of the frame group, which is part of the core group of the package pandas. The hierarchy is not important for our work.
We call the DataFrame() (note the caps) function from the pandas (pd.) package. The DataFrame() function creates dataframes from other objects. If we don't pass an argument, it creates an empty DataFrame.
df2 = pd.DataFrame()
print(df2)
Empty DataFrame
Columns: []
Index: []
We will use this function sometimes, but, more often, we will be building our DataFrames from spreadsheets, csv files, or api (application programming interface) calls directly to servers to retrieve data. Very powerful stuff.
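As a preview, those reading functions look like the sketch below. The file names gdp_data.csv and gdp_data.xlsx are hypothetical; the files would need to exist for the code to run.

df_csv = pd.read_csv('gdp_data.csv')    # build a DataFrame from a csv file (hypothetical file name)
df_xlsx = pd.read_excel('gdp_data.xlsx') # build a DataFrame from a spreadsheet (hypothetical file name)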
Columns and rows in a DataFrame
A DataFrame is made up of columns and rows. How big is our DataFrame? We use the shape attribute of a DataFrame to see this. [Reminder: attributes and methods are associated with objects. Attributes return features of the object, and methods are like functions that act on the object.]
print('The shape is:', df.shape) # shape returns the (rows, columns) of the DataFrame
print('The size is:', df.size) # I often mess up and try this when I want the 'size' of a DataFrame. What is it telling us?
The shape is: (4, 3)
The size is: 12
What type does df.shape return?
Back to our DataFrame. Let's print it out again, but this time, without the print function.
df
|   | year | gdp  | cons |
|---|------|------|------|
| 0 | 1990 | 5.9  | 3.8  |
| 1 | 1995 | 7.6  | 4.9  |
| 2 | 2000 | 10.2 | 6.8  |
| 3 | 2005 | 13.0 | 8.7  |
The jupyter notebook adds some shading and formatting, making it even easier to read.
The DataFrame Index
The left-most 'column' of numbers is not a column. It is the index that pandas assigned to our DataFrame: similar to the row 'numbers' of a spreadsheet. I put numbers in quotes, because an index can be more than just a sequence of integers. In this DataFrame, it makes sense for our index to be the year variable. A little later, we will discuss changing the index. For now, we will go with what pandas gave us.
The index is important; we will work with it more in a bit.
Pandas data structures: Series
We reference a column in the DataFrame by its name. Notice the similarity to the syntax for referencing a dict key.
print(df['gdp'])
0     5.9
1     7.6
2    10.2
3    13.0
Name: gdp, dtype: float64
When we print the column, we get the index, the column values, and the type of data contained in the column. The 'gdp' values are floats. Suppose we instead ask for the type of the column itself --- not what's in the column. This reveals the second major data structure in pandas:
gdp = df['gdp']
print(gdp)
print(type(gdp))
0     5.9
1     7.6
2    10.2
3    13.0
Name: gdp, dtype: float64
<class 'pandas.core.series.Series'>
When we extract a single column from a DataFrame, we are given a Series.
print(type(df['gdp']))
<class 'pandas.core.series.Series'>
We can think of a Series as a DataFrame with only one variable: It is one column and an index. In the same way, we can think of a DataFrame as a collection of Series that all have the same index.
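We can also build a Series directly, in the same spirit as building a DataFrame from a dict. A minimal sketch, reusing the gdp numbers from above:

gdp_series = pd.Series([5.9, 7.6, 10.2, 13.0], index=[1990, 1995, 2000, 2005], name='gdp') # a Series has values, an index, and (optionally) a name
print(gdp_series)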
Learning more about a DataFrame
We have seen the size and shape attributes of DataFrame. Let's look at a few more. If we want to see what variables we have in the DataFrame (imagine you are working with a dataset containing 1,000 variables...) we use the columns attribute.
print(df.columns)
print(type(df.columns))
Index(['year', 'gdp', 'cons'], dtype='object')
<class 'pandas.core.indexes.base.Index'>
So the columns attribute returns the list of columns in a somewhat complicated way. (It is returning an Index object that is useful in some situations.) We can use the tolist() method to create a list object. We can do something similar with the index as well. The axes attribute tells us about both at once.
print(df.columns.tolist())
print(df.index.tolist())
print(df.axes)
['year', 'gdp', 'cons']
[0, 1, 2, 3]
[RangeIndex(start=0, stop=4, step=1), Index(['year', 'gdp', 'cons'], dtype='object')]
We can learn about the data types in a DataFrame with dtypes [note the 's'].
print(df.dtypes)
year      int64
gdp     float64
cons    float64
dtype: object
Practice: DataFrames
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck.
Below is a dict with data about U.S. states.
state_data = {'state':['CA','MI','WI','MN'], 'pop':[37,9.8,5.7, '5.3'], 'size':[163.7, 96.7, 65.5, 86.9], 'bird':['Quail', 'Redbreast Robin', 'American Robin', 'Loon']}
- Convert the dict to a DataFrame named states.
- What are the dtypes of the variables? Does anything look funny?
states = pd.DataFrame(state_data)
print(states.dtypes)
# Why is population an object and not a float? The pop list holds the string '5.3'
# alongside numbers, so pandas falls back to the generic object dtype.
state     object
pop       object
size     float64
bird      object
dtype: object
- How many rows and columns are in this DataFrame?
print("state's rows and columns =", states.shape)
state's rows and columns = (4, 4)
- Print out the index as a list. Is there a variable that might make a more sensible index?
print("the index is ", states.index.tolist())
the index is [0, 1, 2, 3]
3. DataFrames
More on dataframe columns
To extract a column from a DataFrame, we have been using states['bird']. This is the preferred way to do so, in that it always works. You will come across two other ways to extract a column.
- The first uses a . like: states.bird.
- The second uses the iloc method: states.iloc[:,3].
print(states['bird'])
0              Quail
1    Redbreast Robin
2     American Robin
3               Loon
Name: bird, dtype: object
print(states.bird)
0              Quail
1    Redbreast Robin
2     American Robin
3               Loon
Name: bird, dtype: object
print(states.iloc[:,3]) #iloc works like a 2-D slice with iloc[rows, cols].
0              Quail
1    Redbreast Robin
2     American Robin
3               Loon
Name: bird, dtype: object
All three returned the same column. Why do I not recommend the second two approaches?
Let's try to grab the size column using the second method.
print(states.size) # Ask for the column 'size'
16
Well, that's not what I expected. On further review, however, it is exactly what I should have expected. size is an attribute of DataFrame. We saw this earlier: it returns the number of elements in the DataFrame. Any column name that conflicts with a DataFrame method or attribute is going to cause problems.
What about iloc? Suppose I decide to rearrange the order of the columns. (We will see why the following code works shortly.)
states = states[['state', 'size', 'bird', 'pop']]
states
|   | state | size  | bird            | pop |
|---|-------|-------|-----------------|-----|
| 0 | CA    | 163.7 | Quail           | 37  |
| 1 | MI    | 96.7  | Redbreast Robin | 9.8 |
| 2 | WI    | 65.5  | American Robin  | 5.7 |
| 3 | MN    | 86.9  | Loon            | 5.3 |
# What does this statement return?
print(states.iloc[:,3]) #iloc works like a 2-D slice with iloc[rows, cols].
0     37
1    9.8
2    5.7
3    5.3
Name: pop, dtype: object
Not surprisingly, this way of referencing a column is not robust to reordering the columns. We might still use iloc here and there, but we do so at our own peril.
Subsets of columns
Okay, back to referencing columns. We can take several columns by passing a list of the column names.
cols_to_get = ['state', 'bird', 'pop']
got_cols = states[cols_to_get]
got_cols
type(got_cols)
pandas.core.frame.DataFrame
When we take more than one column, we are creating a DataFrame object. A more compact notation creates the list in place.
got_cols_2 = states[['state', 'bird', 'pop']]
got_cols_2
|   | state | bird            | pop |
|---|-------|-----------------|-----|
| 0 | CA    | Quail           | 37  |
| 1 | MI    | Redbreast Robin | 9.8 |
| 2 | WI    | American Robin  | 5.7 |
| 3 | MN    | Loon            | 5.3 |
Renaming columns
Often data sets come with poor variable names. How do we rename a column? If we are only changing a few variables, the dictionary approach works well.
# Is it population or state soda? Let's get a better name on that variable.
states = states.rename(columns={'pop':'population'})
states
|   | state | size  | bird            | population |
|---|-------|-------|-----------------|------------|
| 0 | CA    | 163.7 | Quail           | 37         |
| 1 | MI    | 96.7  | Redbreast Robin | 9.8        |
| 2 | WI    | 65.5  | American Robin  | 5.7        |
| 3 | MN    | 86.9  | Loon            | 5.3        |
The only column that was changed was the pop column. Let's take a closer look at the syntax.
states = states.rename(columns={'pop':'population'})
- We are calling the rename() method from the DataFrame object.
- We pass rename() the argument columns (passing index would rename the index).
- We use a dict to give the {old name : new name} key-value pairs.
Notice that we had to assign the result of states.rename() back to states. The rename() method does not act on the original data, but creates a copy, which we assign back to the states variable.
We can ask rename() to perform the action on the original data with the inplace argument. [You can see if a method supports in-place operations by checking the documentation, or by using ? in the jupyter notebook.]
#Let's rename population again...
states.rename(columns={'population':'people'}, inplace=True)
states
|   | state | size  | bird            | people |
|---|-------|-------|-----------------|--------|
| 0 | CA    | 163.7 | Quail           | 37     |
| 1 | MI    | 96.7  | Redbreast Robin | 9.8    |
| 2 | WI    | 65.5  | American Robin  | 5.7    |
| 3 | MN    | 86.9  | Loon            | 5.3    |
If we need to apply a method or function to all the column names, we can operate directly on df.columns.
# Reminder, what does df.columns return?
print(states.columns)
# All caps
states.columns = [col_name.upper() for col_name in states.columns]
states
Index(['state', 'size', 'bird', 'people'], dtype='object')
|   | STATE | SIZE  | BIRD            | PEOPLE |
|---|-------|-------|-----------------|--------|
| 0 | CA    | 163.7 | Quail           | 37     |
| 1 | MI    | 96.7  | Redbreast Robin | 9.8    |
| 2 | WI    | 65.5  | American Robin  | 5.7    |
| 3 | MN    | 86.9  | Loon            | 5.3    |
We can now see that the columns attribute returns an iterable object. Like a range or a list, states.columns knows to dole out the column names when queried by a for loop. Nice.
More on DataFrame Rows
How do we take a row, or a subset of rows? We can resort to iloc again...
print(states.iloc[2,:]) #Take the third row, all columns
STATE                 WI
SIZE                65.5
BIRD      American Robin
PEOPLE               5.7
Name: 2, dtype: object
...but this is still subject to problems with reordering rows. It is also less likely that we want to take a row in a specific position. It is more likely we want a row corresponding to a conditional.
states[states['STATE']=='WI'] # take the row corresponding to Wisconsin
|   | STATE | SIZE | BIRD           | PEOPLE |
|---|-------|------|----------------|--------|
| 2 | WI    | 65.5 | American Robin | 5.7    |
Let's break that down. First, we ask which rows have STATE equal to WI.
print( states['STATE']=='WI' )
print(type(states['STATE']=='WI'))
0    False
1    False
2     True
3    False
Name: STATE, dtype: bool
<class 'pandas.core.series.Series'>
That is a Series of bools. When we give the Series to the DataFrame, it grabs only the rows that are True.
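The same idea handles more complicated conditions. A sketch (the size cutoff of 80 is made up for this example): combine boolean Series with & (and) or | (or), wrapping each condition in parentheses.

print(states[(states['SIZE'] > 80) & (states['BIRD'] == 'Loon')]) # rows where both conditions are True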
Changing the row index
We can simplify our lives and get rid of an index that doesn't tell us anything useful about the data by changing the index. The unit of observation in states is a state. That seems like a good index.
print('DataFrame before resetting the index:')
print(states)
states_new_index = states.set_index('STATE') # make the column 'STATE' the new index
# assign the newly indexed DF to a new variable
print('\nDataFrame after resetting the index:')
print(states_new_index)
DataFrame before resetting the index:
  STATE   SIZE             BIRD PEOPLE
0    CA  163.7            Quail     37
1    MI   96.7  Redbreast Robin    9.8
2    WI   65.5   American Robin    5.7
3    MN   86.9             Loon    5.3

DataFrame after resetting the index:
        SIZE             BIRD PEOPLE
STATE
CA     163.7            Quail     37
MI      96.7  Redbreast Robin    9.8
WI      65.5   American Robin    5.7
MN      86.9             Loon    5.3
The index is now the state abbreviations and not the sequence of integers. What does this buy us? A simpler way to reference a row, using the loc method. This is loc, not iloc!
Notice that how we used set_index did not operate "in place." Instead, it created a copy that I assigned to the new variable. We could have instead used inplace=True in the set_index() method.
print(states_new_index.loc['WI']) # ask for the row corresponding to WI
SIZE                65.5
BIRD      American Robin
PEOPLE               5.7
Name: WI, dtype: object
Very nice. Suppose you regret your choice of index. Are you ruined? Nope.
states_undo = states_new_index.reset_index() # reset the index to integers and return state to a column
states_undo
|   | STATE | SIZE  | BIRD            | PEOPLE |
|---|-------|-------|-----------------|--------|
| 0 | CA    | 163.7 | Quail           | 37     |
| 1 | MI    | 96.7  | Redbreast Robin | 9.8    |
| 2 | WI    | 65.5  | American Robin  | 5.7    |
| 3 | MN    | 86.9  | Loon            | 5.3    |
Deleting rows and columns
To remove a row or column, we use the drop() method. The drop method is our first introduction to the axis argument. DataFrames have two axes. We have been calling them rows and columns, but pandas thinks of them as rows = 0 and columns = 1.
In the following we want to remove the column 'SIZE', but there could be a row index 'SIZE', too, so we need to tell drop where to look for 'SIZE'. Again, we use inplace if we don't want to create a copy.
states_subset = states.drop('SIZE', axis = 1)
states_subset
|   | STATE | BIRD            | PEOPLE |
|---|-------|-----------------|--------|
| 0 | CA    | Quail           | 37     |
| 1 | MI    | Redbreast Robin | 9.8    |
| 2 | WI    | American Robin  | 5.7    |
| 3 | MN    | Loon            | 5.3    |
Deleting a row works just as you would expect...
states_no_mn = states.drop(3, axis = 0) # Why is this code not very robust?
states_no_mn
|   | STATE | SIZE  | BIRD            | PEOPLE |
|---|-------|-------|-----------------|--------|
| 0 | CA    | 163.7 | Quail           | 37     |
| 1 | MI    | 96.7  | Redbreast Robin | 9.8    |
| 2 | WI    | 65.5  | American Robin  | 5.7    |
Practice: DataFrames
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck.
- Explain why states = states[['state', 'size', 'bird', 'pop']] reordered the DataFrame states.
# We passed a list of column names to the DataFrame. Pandas returned the columns
# in the order they appeared in the list, and we assigned the result back to states.
- Create a new DataFrame from the dict data_dict we used earlier. Name it data.
data = pd.DataFrame(data_dict)
data
|   | year | gdp  | cons |
|---|------|------|------|
| 0 | 1990 | 5.9  | 3.8  |
| 1 | 1995 | 7.6  | 4.9  |
| 2 | 2000 | 10.2 | 6.8  |
| 3 | 2005 | 13.0 | 8.7  |
- Change the index to be the year variable and print the DataFrame. Use the inplace argument.
data.set_index('year', inplace=True)
data
| year | gdp  | cons |
|------|------|------|
| 1990 | 5.9  | 3.8  |
| 1995 | 7.6  | 4.9  |
| 2000 | 10.2 | 6.8  |
| 2005 | 13.0 | 8.7  |
- Change the remaining column names to 'gross domestic product' and 'consumption'.
data.columns = ['gross domestic product', 'consumption'] # the columns are gdp, then cons, so the new names must follow that order
data
| year | gross domestic product | consumption |
|------|------------------------|-------------|
| 1990 | 5.9                    | 3.8         |
| 1995 | 7.6                    | 4.9         |
| 2000 | 10.2                   | 6.8         |
| 2005 | 13.0                   | 8.7         |
- Change the column names to title case. Try a list comprehension over the .columns attribute, modifying each string in the loop with the title() method.
data.columns = [col.title() for col in data.columns]
data
| year | Gross Domestic Product | Consumption |
|------|------------------------|-------------|
| 1990 | 5.9                    | 3.8         |
| 1995 | 7.6                    | 4.9         |
| 2000 | 10.2                   | 6.8         |
| 2005 | 13.0                   | 8.7         |
- Extract the rows corresponding to 1990 and 2000 and assign them to a DataFrame named data_dec. Hint: When we wanted to take two columns, we passed a list of the column names...
data_dec = data.loc[[1990, 2000]]
data_dec
| year | Gross Domestic Product | Consumption |
|------|------------------------|-------------|
| 1990 | 5.9                    | 3.8         |
| 2000 | 10.2                   | 6.8         |
- Create the same DataFrame from data by deleting the rows that do not correspond to 1990 and 2000. Name the new DataFrame data_dec_2.
data_dec_2 = data.drop([1995, 2005], axis=0) # drop() also accepts a list of row labels, so one call does it
data_dec_2
| year | Gross Domestic Product | Consumption |
|------|------------------------|-------------|
| 1990 | 5.9                    | 3.8         |
| 2000 | 10.2                   | 6.8         |
4. Calculations on DataFrames
Now that we understand DataFrames, let's look at basic calculations on DataFrames. Don't worry, we'll cover more advanced calculations later.
The big idea here is that we never want to loop over the rows of a DataFrame to perform a calculation.
Pandas gives us lots of ways to perform simple and complex computations on a DataFrame. When we use pandas' tools, there is a lot of optimized code working in the background. When we loop over the DataFrame on our own, we lose all that fast code and our program grinds away slowly.
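To make the contrast concrete, here is the same calculation done both ways on a throwaway DataFrame (demo is a name invented for this sketch). On two rows the speed difference is invisible; on a million rows it is enormous.

demo = pd.DataFrame({'gdp': [5.9, 7.6], 'cons': [3.8, 4.9]}) # a tiny throwaway DataFrame
for i in demo.index:                                         # the slow way: loop over the rows ourselves
    demo.loc[i, 'share_slow'] = demo.loc[i, 'cons'] / demo.loc[i, 'gdp']
demo['share_fast'] = demo['cons'] / demo['gdp']              # the fast way: let pandas handle the whole column at once
print(demo)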
A DataFrame naturally understands how to perform operations element-wise. For example, let's compute the share of consumption in GDP. We get started by loading up our little data dictionary and creating a DataFrame.
data = pd.DataFrame(data_dict) # reset data
data['cons_share'] = data['cons'] / data['gdp'] # this will divide cons by gdp for each row
data
|   | year | gdp  | cons | cons_share |
|---|------|------|------|------------|
| 0 | 1990 | 5.9  | 3.8  | 0.644068   |
| 1 | 1995 | 7.6  | 4.9  | 0.644737   |
| 2 | 2000 | 10.2 | 6.8  | 0.666667   |
| 3 | 2005 | 13.0 | 8.7  | 0.669231   |
Notice the left-hand side of the assignment. I am creating a new variable (column) in the DataFrame and assigning it the consumption share.
# Oops, I wanted it in percentage
data['cons_share'] = data['cons_share']*100
data
|   | year | gdp  | cons | cons_share |
|---|------|------|------|------------|
| 0 | 1990 | 5.9  | 3.8  | 64.406780  |
| 1 | 1995 | 7.6  | 4.9  | 64.473684  |
| 2 | 2000 | 10.2 | 6.8  | 66.666667  |
| 3 | 2005 | 13.0 | 8.7  | 66.923077  |
The +, -, /, * operators all work element-wise. As we have seen, multiplying and dividing by a scalar works fine, too.
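The operators combine whole columns element-wise, too. A quick sketch (assigning to a standalone variable, so data itself is unchanged; non_cons is a name invented for this example):

non_cons = data['gdp'] - data['cons'] # element-wise: for each row, gdp minus cons
print(non_cons)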
DataFrame methods for simple operations
DataFrame has many methods for computing various statistics on the data. Note that some of them take an axis argument: you can compute sum() across a row or a column.
# Sums
print('Sum across columns')
print(data.sum(axis=1)) # summing across columns. Not terribly useful here.
print('\nSum across rows')
print(data.sum(axis=0)) # summing across rows. Cumulative GDP, consumption
print('\nSum up gdp')
print(data['gdp'].sum(axis=0)) # Sum a single column.
# Means
print('\nMean of each column')
print(data.mean(axis=0))
print('\nMean gdp and cons')
print(data[['gdp', 'cons']].mean(axis=0))
Sum across columns
0    2064.106780
1    2071.973684
2    2083.666667
3    2093.623077
dtype: float64

Sum across rows
year          7990.000000
gdp             36.700000
cons            24.200000
cons_share     262.470207
dtype: float64

Sum up gdp
36.7

Mean of each column
year          1997.500000
gdp              9.175000
cons             6.050000
cons_share      65.617552
dtype: float64

Mean gdp and cons
gdp     9.175
cons    6.050
dtype: float64
Try TAB completion or the documentation to see the available methods.
Here are a few: sum, mean, var, std, skew, rank, quantile, mode, min, max, kurtosis, cumsum, cumprod...
These will be even more powerful once we learn how to group data within a DataFrame and compute statistics by group.
One very useful one...
print(data.describe()) # a good place to start with a new data set
data
              year       gdp      cons  cons_share
count     4.000000   4.00000  4.000000    4.000000
mean   1997.500000   9.17500  6.050000   65.617552
std       6.454972   3.10309  2.157931    1.363750
min    1990.000000   5.90000  3.800000   64.406780
25%    1993.750000   7.17500  4.625000   64.456958
50%    1997.500000   8.90000  5.850000   65.570175
75%    2001.250000  10.90000  7.275000   66.730769
max    2005.000000  13.00000  8.700000   66.923077
|   | year | gdp  | cons | cons_share |
|---|------|------|------|------------|
| 0 | 1990 | 5.9  | 3.8  | 64.406780  |
| 1 | 1995 | 7.6  | 4.9  | 64.473684  |
| 2 | 2000 | 10.2 | 6.8  | 66.666667  |
| 3 | 2005 | 13.0 | 8.7  | 66.923077  |
Practice: DataFrames
Take a few minutes and try the following. Feel free to chat with those around you if you get stuck.
- Compute the mean of the consumption share for 1990 and 1995. You might try using loc[] with two arguments: loc[rows, cols].
data.set_index('year', inplace=True)
mean_c = data.loc[1990:1995,'cons_share'].mean()
print(data.loc[1990:1995,'cons_share'])
print(mean_c)
year
1990    64.406780
1995    64.473684
Name: cons_share, dtype: float64
64.44023193577164
- Try desc = data.describe(). What is the return type?
desc = data.describe()
print(type(desc))
print(desc)
<class 'pandas.core.frame.DataFrame'>
            gdp      cons  cons_share
count   4.00000  4.000000    4.000000
mean    9.17500  6.050000   65.617552
std     3.10309  2.157931    1.363750
min     5.90000  3.800000   64.406780
25%     7.17500  4.625000   64.456958
50%     8.90000  5.850000   65.570175
75%    10.90000  7.275000   66.730769
max    13.00000  8.700000   66.923077
- Looking ahead, try out the following code. What does it do? Can you find the files? What types are they? What is inside of them?
desc.to_csv('desc.csv')
desc.to_excel('desc.xlsx')
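One way to check what was written (a sketch, assuming the cell above ran and put desc.csv in the working directory): read the file back in.

desc_check = pd.read_csv('desc.csv', index_col=0) # index_col=0 restores the row labels (count, mean, ...) as the index
print(desc_check)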