Every data visualization and information design project involves data cleaning and preparation. Love it or hate it (most people feel the latter), ‘data munging’ is a necessary step and a distinct skill in the creation of good work. The Python library Pandas provides a terrific set of tools to do just that.
The Wikipedia page for Pandas describes it this way:
In computer programming, pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term “panel data”, an econometrics term for data sets that include observations over multiple time periods for the same individuals. (Wikipedia)
Here’s a quick example of how I am using Pandas for the UN Agency Funding data visualizations I am working on.
Data Cleaning and Prep Steps in the Python Notebook
The notebook below is broken out into 5 different steps.
- Load the pandas library and set some options to aid in viewing the data and results
- Load in the CSV files. Pandas makes it very easy to load several types of data, or even data directly from an SQL database. So nice.
- Combine all the data files into a single data frame. It is possible to skip this step by loading everything into a single data frame initially, but I like keeping things explicit, simple, and easy to debug. The shape attribute at the end of this step shows us how many rows and columns are in our data frame.
- Here we create a new pandas object called ‘agencies’ to hold a list of the unique names in the ‘Agency description’ column. The len() function then tells us there are 43 unique entries in the agencies object.
- The final step is to sort this list alphabetically and then print out each agency name on its own line.
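The five steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook itself: the in-memory CSV text, the variable names (df_a, df_b, funding), the display options, and the ‘Year’ and ‘Amount’ columns are all made-up placeholders so the snippet runs on its own. Only the ‘Agency description’ column name and the ‘agencies’ object come from the description above.

```python
import io

import pandas as pd

# Step 1: load pandas and set display options (these particular options are an assumption).
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 200)

# Step 2: load the CSV files. Placeholder data stands in for the real files here;
# with files on disk you would pass a path to pd.read_csv instead.
csv_a = io.StringIO(
    "Agency description,Year,Amount\n"
    "Agency C,2018,100\n"
    "Agency A,2018,250\n"
)
csv_b = io.StringIO(
    "Agency description,Year,Amount\n"
    "Agency C,2019,120\n"
    "Agency B,2019,80\n"
)
df_a = pd.read_csv(csv_a)
df_b = pd.read_csv(csv_b)

# Step 3: combine everything into a single data frame and check its size.
funding = pd.concat([df_a, df_b], ignore_index=True)
print(funding.shape)  # (rows, columns)

# Step 4: collect the unique agency names and count them.
agencies = funding["Agency description"].unique()
print(len(agencies))

# Step 5: sort alphabetically and print one name per line.
agencies_sorted = sorted(agencies)
for name in agencies_sorted:
    print(name)
```

With the real data, the count printed in step 4 would be the 43 agencies mentioned above; the placeholder data here contains only three.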
That’s it! Pretty straightforward. For this example problem it would be just as easy to run it through Excel, but using Pandas here shows us the tip of the iceberg of what is possible with just a few lines of code.
Some helpful links and resources
Pandas’ own documentation is always a good place to start: https://pandas.pydata.org/docs/getting_started/index.html
Who doesn’t love a good cheatsheet!? DataCamp provides a good one along with some example code here: https://www.datacamp.com/community/blog/python-pandas-cheat-sheet