Pandas is useful for doing data analysis in Python. Today we will discuss how to install Pandas, cover some of the basic concepts of Pandas DataFrames, and then walk through some of the common Pandas use cases.
What is Pandas?
Pandas is a Python library for doing data analysis. Typically you will use it for working with one-dimensional Series data, or two-dimensional data called DataFrames. This might include:
- Tabular data like SQL tables or Excel spreadsheets
- Ordered and unordered time series data
- Matrix data with row and column labels
- Other observational / statistical data sets
We will get into more specific examples later on.
Pandas is built on top of Numpy, so you can think of it as an extension of that library. For that reason, it is good to have an understanding of Numpy before proceeding, although that is not a strict requirement.
What is Numpy?
Numpy is short for Numerical Python. It is a Python library for working with multidimensional arrays and matrices. A few of the useful features of Numpy are listed below, with a short code sketch after the list:
- Creating multi-dimensional arrays
- Indexing and searching arrays
- Array arithmetic
- Comparing arrays with different dimensions
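Here is a minimal sketch of those features in action, using made-up values purely for illustration:

import numpy as np

#Creating a multi-dimensional array
a = np.array([[1, 2, 3], [4, 5, 6]])

#Indexing and searching arrays
print(a[1, 2])          #element at row 1, column 2 -> 6
print(np.where(a > 3))  #indices of elements greater than 3

#Array arithmetic
print(a * 2)            #multiply every element by 2

#Comparing arrays with different dimensions (broadcasting)
print(a > np.array([1, 3, 5]))  #the 1-D array is compared against each row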
Installing Pandas
The easiest way to install Pandas is to use Pip. You can run pip3 install pandas or pip install pandas, depending on whether you are using Python 3 or Python 2. I recommend using Python 3 since Python 2 will be end of life in the next few years. If you don't have Pip installed, check out our article on Python Basics for instructions on how to install Pip.
As an alternative to using Pip, you can also install Anaconda. Anaconda includes a whole suite of data science tools, one of which is Pandas. It also includes Numpy.
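Either way, once the install finishes, a quick way to confirm that everything is working is to import the libraries and print their versions:

import pandas as pd
import numpy as np

#If these imports succeed, Pandas and Numpy are installed correctly
print(pd.__version__)
print(np.__version__)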
Using Pandas
For our examples, we are going to use the 2016 presidential polling data from fivethirtyeight.com. The data can be found here. We will hard-code the URL into our script, so downloading the file ahead of time is optional, although your script will run faster if you do. Some of the more common use cases for Pandas are:
- Filtering Data
- Summarizing Data
- Plotting Data
- Pivoting Data
The first thing to do is import our dependencies and then read in the CSV file with our data:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

#Create a data frame by using the read_csv function from Pandas
df = pd.read_csv("http://projects.fivethirtyeight.com/general-model/president_general_polls_2016.csv")
Filtering
You could then run print(df) and see all of the data from the CSV, but that is not very useful. We should focus in on the data we really care about. This dataset contains polling data from various surveys across all 50 states. Let's pick a specific pollster and a specific state:
#Create a new dataframe with only YouGov polls from the state of California
df_filtered = df[(df.state=='California') & (df.pollster=='YouGov')]
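If you want a quick sanity check that the filter worked, you can peek at the first few rows of a handful of columns (these column names come from the same dataset used throughout this article):

#Show the first few rows of the filtered data frame
print(df_filtered[['startdate', 'pollster', 'state', 'adjpoll_clinton', 'adjpoll_trump']].head())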
Plotting
Next, let’s plot the poll results of both Trump and Clinton:
df_filtered["adjpoll_clinton"].plot() df_filtered["adjpoll_trump"].plot() plt.show()
Your result should look something like this:
That is useful, but it would be more helpful if we could add some labels.
We can add the legend parameter to identify each line:
df_filtered["adjpoll_clinton"].plot(legend=True) df_filtered["adjpoll_trump"].plot(legend=True)
Your chart should now look more like this:
That looks even better. Once we start going beyond this point, I think it is a lot easier to use matplotlib directly for more advanced plotting. Here is a similar plot done using matplotlib:
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df = pd.read_csv("http://projects.fivethirtyeight.com/general-model/president_general_polls_2016.csv")
df = df.sort_values('startdate', ascending=False)
plt.plot(df['startdate'], df['adjpoll_clinton'], label='adjpoll_clinton')
plt.plot(df['startdate'], df['adjpoll_trump'], label='adjpoll_trump')
plt.legend()
plt.ylabel('Approval')
plt.xticks(rotation=45)
plt.show()
Here is the result:
As you can see above, we start by importing our libraries, then reading our CSV file. We then sort our values based on the start date of the poll, then we plot both the Clinton and Trump approval ratings, giving each line a label. We add a legend by calling plt.legend(). We add the label on the left side of the graph using the plt.ylabel command. We then rotate the dates along the bottom by 45 degrees with the plt.xticks command. Finally we show our graph with the plt.show() command.
When you do plotting with Pandas, it is just using matplotlib under the hood anyway. So what we have done here is step back and call matplotlib directly, but it is still the same library doing the work.
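You can also mix the two styles: the Pandas .plot() method returns a matplotlib Axes object that you can keep customizing with matplotlib calls. Here is a small sketch using the same filtered data frame as before:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("http://projects.fivethirtyeight.com/general-model/president_general_polls_2016.csv")
df_filtered = df[(df.state=='California') & (df.pollster=='YouGov')]

#Pandas plotting returns a matplotlib Axes object
ax = df_filtered["adjpoll_clinton"].plot(legend=True)
df_filtered["adjpoll_trump"].plot(ax=ax, legend=True)

#Keep customizing the same Axes with matplotlib calls
ax.set_ylabel('Approval')
plt.show()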
Pivoting
Pivoting data means reshaping it so that the unique values in one column become the columns of a new table. It is a good way to get a different perspective on your data, and it is better than simply tilting your head to the left. We will use the same dataset as the previous section in our examples. Just like before, we will start by importing our libraries:
import pandas as pd
Next we read our CSV file and create our data frame:
df = pd.read_csv("http://projects.fivethirtyeight.com/general-model/president_general_polls_2016.csv")
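Before pivoting, it can be helpful to see which population types actually appear in the data (the exact values will depend on the dataset):

#List the unique values in the population column
print(df['population'].unique())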
Next we want to see what Registered Voters are saying vs Likely Voters in our samples. So we are going to pivot, using the population column for our new columns:
df.pivot(columns='population',values='adjpoll_clinton')
Your output should look similar to this:
Using this pivot table you can see the approval ratings for Clinton among likely voters and registered voters. Those NaNs get in the way, so let's get the average of each column:
df.pivot(columns='population',values='adjpoll_clinton').mean(skipna=True)
In the above command we added the .mean() function with the skipna=True option. This takes the average of each column, but skips all of the NaN values.
Your output should look similar to this:
Here is all of our pivot table code consolidated:
import pandas as pd

df = pd.read_csv("http://projects.fivethirtyeight.com/general-model/president_general_polls_2016.csv")

#Filter to only show data from the state of California
df = df[(df.state=='California')]

#Pivot to show the lv/rv data as the columns
print(df.pivot(columns='population',values='adjpoll_clinton'))

#Show the averages for lv and rv (likely voters, registered voters)
print(df.pivot(columns='population',values='adjpoll_clinton').mean(skipna=True))
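If the pivot operation still feels abstract, here is a tiny example on made-up data (not from the polling CSV) showing how the unique values of one column become the new columns:

import pandas as pd

#Made-up example data, just to illustrate how pivot reshapes a table
toy = pd.DataFrame({
    'population': ['lv', 'rv', 'lv', 'rv'],
    'poll': [45.0, 44.0, 47.0, 46.0],
})

#Each unique value in 'population' becomes its own column
print(toy.pivot(columns='population', values='poll'))
print(toy.pivot(columns='population', values='poll').mean(skipna=True))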
Summarizing
It can be daunting to look at a large dataset. However, Pandas gives you some nice tools for summarizing the data so you don't have to try to take on the entire dataset at once.
To start, we have the min, max, and mean functions. These functions do as they say and return the minimum, maximum, and average values. You can see examples of each below, using our pivot table from the previous section:
df.pivot(columns='population',values='adjpoll_clinton').mean(skipna=True)
df.pivot(columns='population',values='adjpoll_clinton').max(skipna=True)
df.pivot(columns='population',values='adjpoll_clinton').min(skipna=True)
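If you want the median (the middle value, as opposed to the mean), it works the same way, for example:

df.pivot(columns='population',values='adjpoll_clinton').median(skipna=True)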
Next it might be helpful to know the number of unique values you have in a dataset:
df.pivot(columns='population',values='adjpoll_clinton').nunique()
Or if you just want a quick summary, you can use the describe function:
df.pivot(columns='population',values='adjpoll_clinton').describe()
The output of the describe function is the most useful of these, as it combines many of the previous functions we talked about. Your output will look similar to this: