Python can be used to create websites, automate tasks, and do numerical analysis. The Python library that is most often used for analysis is called Pandas (short for Panel Analysis).
After completing this tutorial, you will know:
Let’s get started.
This tutorial is divided into 2 parts:
Pandas is all about DataFrames (df), which are analogous to tables in Excel or SQL, or to R’s data.frame. Dataframes are made up of Series (i.e. columns) and can be manipulated with simple or complex transformations like filtering, aggregating, joining, and pivoting/unpivoting.
We’ll import a sample data set with daily sales in the US marketplace for 2017, grouped by GL and including a column for product family (big4). To read the text file into memory, we provide the path to the file. You only need to provide the path relative to the folder that contains your code. For example, this Jupyter notebook is in a folder called Workbooks. So when we specify the path as ‘Support_Files/sales_fake.txt’, Pandas will look for the file called sales_fake.txt inside Workbooks/Support_Files/. Alternatively, you can always hard code the entire path (e.g. ‘C:/some_folder/some_subfolder/sales_fake.txt’) if you are unsure how to use relative paths.
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
# Read data into Python from .txt
sales = pd.read_csv('data/sales_fake.csv')
# View the first 10 rows of the dataframe (see also: tail)
sales.head(10)
The .head() method above is just one of the many methods that can be applied to Pandas dataframes. In general, these methods return some transformation of the dataframe to which they are applied. Take a few minutes and play around with some of the methods below, using the sales dataframe.
See https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html under “Methods” for a complete list.
Some useful methods:
.count() .describe() .info() .mean() .median() .nunique() .plot() .shift() .transpose()
Dataframes also have certain attributes which are accessed similarly, but without the (). What do these attributes represent?
.columns .dtypes .empty .shape .size .values
### CODE FOR ABOVE EXERCISE ###
# Select just a single column (i.e. Series object) from a dataframe:
sales['units']
# The above series isn't very useful by itself, since we lose the date information. To select a subset of columns, pass in a list of columns:
col_list = ['order_day', 'gl', 'ops']
sales[col_list].head()
# Or equivalently:
sales[['order_day', 'gl', 'ops']].head()
# What are all the unique product families? use unique() on a single series:
sales['product_family'].unique()
# Or get the count of observations from each product family. Note that the .value_counts() method only works if applied to a Series - not a DataFrame.
# What happens if you try to call .value_counts() on a DataFrame?
sales['product_family'].value_counts()
# Print the total sales for the year. The .sum() method works on both Series and DataFrames.
sales['ops'].sum()
# Operations on the series that make up a dataframe generally work as you'd expect.
# E.g. add a column for country
sales['country'] = 'US'
sales.head()
# Adding calculated columns is easy, too. First, calculate the ASP and print the results:
sales['ops'] / sales['units']
### CODE FOR ABOVE EXERCISE ###
The .loc() method is highly useful for filtering dfs. You can think of the condition(s) inside .loc as being roughly equivalent to the “ifs” in Excel’s sumifs function, or the conditions in a WHERE clause in SQL.
# Filter only for the pantry GL and name this view pantry:
pantry = sales.loc[sales['gl'] == 'Amazon_Pantry']
pantry.head()
# to add multiple conditions, wrap each condition in () and combine them together with &
sales.loc[(sales['order_day'] == '2017-07-01') & (sales['gl'] == 'Apparel')]
### CODE FOR ABOVE EXERCISE ###
# In the background, Pandas uses a Python package called matplotlib to make plotting convenient.
# Since our toys df is indexed on date, Pandas automatically treats date as the x-axis.
toys = sales[sales['gl']=='Toys'].copy()
toys.set_index('order_day', inplace=True)
toys['ops'].plot()
Check out the API documentation for the pd.Series.plot method to find other allowable values of the “kind” argument:
http://pandas.pydata.org/pandas-docs/version/0.22/generated/pandas.Series.plot.html
### CODE FOR ABOVE EXERCISE ###
# Or we can plot multiple metrics with a shared axis using subplots=True.
# With a single line of code, Pandas can create detailed visualizations
toys['asp'] = toys['ops'] / toys['units']
toys[['units', 'ops', 'asp']].plot(subplots=True)
# Histograms are simple to create
toys['units'].hist(bins=20)
# If you want to export a DataFrame back to a text file, the syntax to do this is similar to pd.read_csv().
# First, define the path / location where you want the file to be saved (including the filename):
path_to_save_data = 'data/sales_fake_output.csv'
sales.to_csv(path_to_save_data, index=False)
# Sometimes you might want to quickly look at the data in Excel.
# Add it to the clipboard and paste it wherever you want!
sales.to_clipboard()
sales.gl.unique()
Use the sales DataFrame to answer the following questions.
# First read in the data
sales = pd.read_csv('data/sales_fake.csv')
This section lists some ideas for extending the tutorial that you may wish to explore.
This section provides more resources on the topic if you are looking to go deeper.
In this tutorial, you were introduced to the Pandas library. Specifically, you learned:
In the next section, you will use Pandas to import and work with data.