Every data visualization and information design project involves data cleaning and preparation. Love it or hate it (most people feel the latter), ‘data munging’ is a necessary step and a unique skill in the creation of good work. The Python library Pandas provides a terrific set of tools to do just that.

The Wikipedia page for Pandas describes it as follows:

In computer programming, pandas is a software library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. It is free software released under the three-clause BSD license. The name is derived from the term “panel data”, an econometrics term for data sets that include observations over multiple time periods for the same individuals.

Wikipedia

Here’s a quick example of how I am using Pandas for the UN Agency Funding data visualizations I am working on.

UN Agency Funding Visualization showing $23B in spending during 2015.

Data Cleaning and Prep Steps in the Python Notebook

The notebook below is broken out into five steps.

  1. Load the pandas library and set some options to aid in viewing the data and results
  2. Load in the csv files. Pandas makes it very easy to load in several types of data or even data directly from an SQL database. So nice.
  3. Combine all the data files into a single data frame. It is possible to remove this step by loading everything into a single data frame initially, but I like keeping things explicit, simple, and easy to debug. The shape attribute at the end of this step shows us how many rows and columns are in our data frame.
  4. Here we create a new object called ‘agencies’ to hold a list of the unique names in the ‘Agency description’ column. The len() function then tells us there are 43 unique entries in the agencies object.
  5. The final step is to sort this list alphabetically and then print out each agency name on its own line.
Screencap of the notebook running on Gist
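
For readers who cannot see the screencap, the five steps can be sketched roughly as follows. This is not the actual notebook: the two in-memory CSV snippets stand in for the real files, and the variable names, the sample rows, and the Year and Amount columns are made up for illustration. Only the ‘Agency description’ column name comes from the steps above.

```python
import io

import pandas as pd

# Step 1: load pandas and set display options that make wide tables easier to read.
pd.set_option("display.max_columns", None)
pd.set_option("display.width", 120)

# Step 2: load the csv files. These in-memory strings are hypothetical
# stand-ins; with real files you would pass a file path to pd.read_csv.
CSV_2014 = """Agency description,Year,Amount
UNICEF,2014,100
WHO,2014,80
"""
CSV_2015 = """Agency description,Year,Amount
UNICEF,2015,120
UNDP,2015,90
"""
df_2014 = pd.read_csv(io.StringIO(CSV_2014))
df_2015 = pd.read_csv(io.StringIO(CSV_2015))

# Step 3: combine everything into a single data frame and check its size.
funding = pd.concat([df_2014, df_2015], ignore_index=True)
print(funding.shape)  # (rows, columns)

# Step 4: collect the unique agency names and count them.
agencies = funding["Agency description"].unique()
print(len(agencies))

# Step 5: sort the names alphabetically and print one per line.
for name in sorted(agencies):
    print(name)
```

With this toy data the combined frame has 4 rows and 3 columns and there are 3 unique agencies. Run against the real files, step 4 is where the count of 43 would appear.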

That’s it! Pretty straightforward. For this example problem it would be just as easy to run the data through Excel, but using Pandas here shows us the tip of the iceberg of what is possible with just a few lines of code.

Some helpful links and resources

Pandas’ own documentation is always a good place to start: https://pandas.pydata.org/docs/getting_started/index.html

Daniel Chen (@chendaniely) has several great conference presentations on YouTube.
https://www.youtube.com/watch?v=5rNu16O3YNE

Who doesn’t love a good cheatsheet!? DataCamp provides a good one along with some example code here: https://www.datacamp.com/community/blog/python-pandas-cheat-sheet