For many use cases, you’ll only need Pandas, since it contains some of the functionality of the openpyxl package right out of the box. Other times, you’ll want to import openpyxl
directly. In this case, we’ll explicitly import openpyxl
, as this allows us to write data to an existing Excel file.
Openpyxl is a library for manipulating Excel workbooks and worksheets. In this example we read in a workbook, navigate to a specific sheet and save the values on that sheet as a Pandas DataFrame. Then we can use Pandas methods to clean the data and save it back to the same workbook in a new worksheet.
The advantages of Python in this case are:
Excel is good for small, quick analyses. It allows you to easily see the data in a familiar format. It can present data in tables and basic charts.
Python out preforms Excel if you are working with lots of data (Excel max row count is 1 million), applying formulas (especially lookups, matches, and sumifs), and using array functions.
There are data wrangling features in Python that are not available in Excel such as stack/unstack, unpivot, and melt.
If a worksheet relies heavily on VBA, it is probably a good candidate for converting to Python. Doing the analysis/transforms in Python will be faster and avoid manual errors which can easily happen in Excel.
In this module, we will:
! pip install -r requirements.txt
import pandas as pd
import openpyxl
path_to_data = 'data/excel_input.xlsx'
path_to_output = 'data/excel_output_new_file.xlsx'
print('The path to the Excel file we are going to read is:')
print(path_to_data)
print('\nThe path to the new Excel file we are going to create is:')
print(path_to_output)
retail = pd.read_excel(path_to_data, sheet_name='retail')
mfn = pd.read_excel(path_to_data, sheet_name='mfn')
print('retail data:\n')
print(retail)
print('\nmfn data:\n')
print(mfn)
In this case, we’ll create another data frame that will become a tab in Excel
combined = pd.concat([retail, mfn])
print('combined data:\n')
print(combined)
combined.to_excel(path_to_output, sheet_name='combined', index=False)
If used incorrectly, this can cause data loss in your original file. If you aren’t sure, create a backup of your input file first.
# create a workbook object using openpyxl
workbook = openpyxl.load_workbook(path_to_data)
# create a pandas ExcelWriter object from the same source file
writer = pd.ExcelWriter(path_to_data, engine='openpyxl')
# link the two objects together
writer.book = workbook
# save
combined.to_excel(writer, sheet_name='combined', index=False)
writer.save()
This example cleans a data file. Instructions for doing this cleaning using native Excel are included in the Workbook.
import pandas as pd
import openpyxl
filename = 'data/XL-02-PC-05-Cleaning-Up-Data-Before.xlsx'
wb = openpyxl.load_workbook(filename)
ws = wb['Data']
df = pd.DataFrame(ws.values)
# Makes first line the column names
def make_header(df: pd.DataFrame)-> pd.DataFrame:
df.columns = df.iloc[0]
df.drop(df.index[0], inplace=True)
return df
df = make_header(df)
# Expanding data in single columns to multiple columns
df[['First Name', 'Last Name']]= df['Customer Name'].str.split(" ", n=1, expand=True)
df[['Address', 'City', 'State', 'ZIP']]= df['Address, City, State, and ZIP'].str.split(",", n=3, expand=True)
# Changing the case of new columns
df['Address'] = df['Address'].str.title()
df['City'] = df['City'].str.title()
df['State'] = df['State'].str.upper()
from openpyxl.utils.dataframe import dataframe_to_rows
# Change the file name if you want to preserve the origional version
wb.create_sheet(index=0, title='Cleaned Data')
ws = wb['Cleaned Data']
for r in dataframe_to_rows(df, index=False, header=True):
ws.append(r)
wb.save(filename)
wb.close()
This section lists some ideas for extending the tutorial that you may wish to explore.
This section provides more resources on the topic if you are looking to go deeper.
In this tutorial, you worked with Openpyxl. Specifically, you learned: