Amazing Python Data Workflow with Poetry, Pandas, and Jupyter

Poetry makes it easy to install Pandas and Jupyter to perform data analyses.

Poetry is a robust dependency management system and makes it easy to make Python libraries accessible in Jupyter notebooks.

The workflow outlined in this post produces projects that run easily on other machines. Your teammates can run poetry install to set up an identical Jupyter development environment on their computers.

Python dependency management is hard, especially for projects with notebooks. You'll often find yourself in dependency hell when trying to set up someone else's repo with Jupyter notebooks. This workflow saves you from dependency hell.

This post shows how to manage environments with Poetry, but you can also use conda of course.

Create a project

Install Poetry and run the poetry new blake command to create a project called blake. All the code covered in this post is in a GitHub repo, but it's best to run all the commands on your local machine, so you learn more.

Change into the blake directory with cd blake and examine the file structure:

blake/
  blake/
    __init__.py
  tests/
    __init__.py
    test_blake.py
  pyproject.toml
  README.rst

We'll investigate the contents of these files later in this post.

Install Pandas and Jupyter

Run poetry add pandas jupyter ipykernel to install the dependencies that are required for running notebooks on your local machine.

This command downloads the packages into a directory like ~/Library/Caches/pypoetry/virtualenvs/blake-Y_2IcspR-py3.7/ (this is the macOS location; the exact path differs by operating system and project). This directory is referred to as the "virtual environment" of your project.

If you cloned the blake repo, you could simply run poetry install to set up the virtual environment.

Notebook workflow

Run poetry shell in your Terminal to create a subshell within the virtual environment. This is the key step that lets you run a Jupyter notebook with all the right project dependencies.

Run jupyter notebook to open the project with Jupyter in your browser.

Click New => Folder to create a folder called notebooks/.

Go to the notebooks folder and click New => Notebook: Python 3 to create a notebook.

Click Untitled at the top of the page that opens and rename the notebook to some_pandas_fun.

Run 2 + 2 in the first cell to make sure the notebook can run a basic Python command.
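
It can also help to confirm that the kernel really runs inside the project's virtual environment. The cell below is a minimal sketch that uses only the standard library and assumes a reasonably recent venv or virtualenv; the helper name in_virtualenv is made up for this illustration.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when the interpreter runs inside a virtual environment."""
    # Inside a virtual environment, sys.prefix points at the environment,
    # while sys.base_prefix still points at the base Python installation.
    return sys.prefix != sys.base_prefix


# Path of the Python interpreter that backs this kernel.
print(sys.executable)
print("virtual environment:", in_virtualenv())
```

If the printed interpreter path is not inside Poetry's virtualenvs directory, the notebook was probably started outside of poetry shell.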

Then run this series of commands in the subsequent cells to create a Pandas DataFrame.

import pandas as pd

# Each inner list is one row: [name, age]
data = [['tom', 10], ['nick', 15], ['juli', 14]]
df = pd.DataFrame(data, columns=['Name', 'Age'])
print(df)

Accessing application code in notebook

The application code goes in the blake/ directory. Create a blake/super_important.py file with a some_message function.

def some_message():
    return "I like dancing reggaeton"

Create another Jupyter notebook, import the function with from blake.super_important import some_message, and call some_message() to make sure it's accessible.


Testing application code

Create a tests/test_super_important.py file that verifies the some_message function is working properly.

# Import the function explicitly instead of using a wildcard import
from blake.super_important import some_message


def test_some_message():
    assert some_message() == "I like dancing reggaeton"

Run the test suite with the poetry run pytest tests/ command.

As you can see, Poetry makes it easy to organize the application code, notebooks, and tests for a project.

Write a Parquet file

Let's create a CSV file and then write it out as a Parquet file from a notebook.

Pandas needs a Parquet engine such as PyArrow to write Parquet files, so run poetry add pyarrow to include the dependency.

Create the data/coffee.csv file with this data:

coffee_type,has_milk
black,false
latte,true
americano,false

Create a notebooks/csv_to_parquet.ipynb file that'll convert the CSV to a Parquet file.

Here's the code:

import pandas as pd
import os

df = pd.read_csv(os.environ['HOME'] + '/Documents/code/my_apps/blake/data/coffee.csv')
out_dirname = os.environ['HOME'] + '/Documents/code/my_apps/blake/tmp'
os.makedirs(out_dirname, exist_ok=True)
df.to_parquet(out_dirname + '/coffee.parquet')
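
The absolute paths above are tied to one machine's directory layout. A possible alternative, sketched here under the assumption that the notebook lives somewhere inside the project, is to locate the project root by searching upwards for pyproject.toml. The helper names find_project_root and coffee_paths are hypothetical and not part of Poetry or Pandas.

```python
from pathlib import Path


def find_project_root(start: Path) -> Path:
    """Walk upwards from start until a directory with pyproject.toml is found."""
    start = start.resolve()
    for candidate in [start, *start.parents]:
        if (candidate / "pyproject.toml").is_file():
            return candidate
    raise FileNotFoundError(f"no pyproject.toml found at or above {start}")


def coffee_paths(start: Path):
    """Return the CSV input path and Parquet output path relative to the root."""
    root = find_project_root(start)
    return root / "data" / "coffee.csv", root / "tmp" / "coffee.parquet"


# In a notebook, Path.cwd() is normally the folder that holds the notebook:
# csv_path, parquet_path = coffee_paths(Path.cwd())
```

With paths built this way, the same notebook should run unchanged wherever a teammate clones the repo.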

Read a Parquet file

Create another notebook, for example notebooks/read_parquet.ipynb, and read the Parquet file into a DataFrame.

Then count how many different types of coffee contain milk in the dataset.

import pandas as pd
import os

df = pd.read_parquet(os.environ['HOME'] + '/Documents/code/my_apps/blake/tmp/coffee.parquet')
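# has_milk is a boolean column, so it can be used directly as a filter
coffees_with_milk = df[df['has_milk']]
# count() returns the number of non-null values in each column
coffees_with_milk.count()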
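
Since count() reports one number per column, a single total may be easier to read. The following sketch keeps the coffee rows inline so it runs without the files on disk; the variable names are chosen for illustration only.

```python
import io

import pandas as pd

# Same rows as data/coffee.csv, kept inline so the cell runs on its own.
csv_text = "coffee_type,has_milk\nblack,false\nlatte,true\namericano,false\n"
df = pd.read_csv(io.StringIO(csv_text))

# read_csv parses the true/false strings into a boolean column.
print(df["has_milk"].dtype)

# count() returns one non-null count per column of the filtered DataFrame.
per_column = df[df["has_milk"]].count()
print(per_column)

# Summing the boolean column gives a single number of coffees with milk.
milk_count = int(df["has_milk"].sum())
print(milk_count)
```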

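Parquet is a columnar, typed, and compressed file format, which makes it a better fit than CSV for most data analyses: reads are usually faster and column types survive the round trip. Use this design pattern to build Parquet files so you can run your analyses more quickly.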

pyproject.toml

Here are the contents of the pyproject.toml file:

[tool.poetry]
name = "blake"
version = "0.1.0"
description = ""
authors = ["MrPowers"]

[tool.poetry.dependencies]
python = "^3.7"
pandas = "^1.1.1"
jupyter = "^1.0.0"
ipykernel = "^5.3.4"
pyarrow = "^1.0.1"

[tool.poetry.dev-dependencies]
pytest = "^5.2"

[build-system]
requires = ["poetry>=0.12"]
build-backend = "poetry.masonry.api"

The project dependencies clearly specify the versions that are required to run this project. The dev-dependencies specify additional dependencies that are required when running the test suite (i.e. when running poetry run pytest tests/).
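
The ^ in front of each version is a caret constraint: it allows updates that leave the left-most non-zero version component unchanged, so ^1.1.1 means at least 1.1.1 and below 2.0.0. The sketch below illustrates that rule for plain numeric versions only. The functions caret_range and satisfies are written for this post and are not Poetry's implementation; pre-release tags and other operators are not handled.

```python
def caret_range(constraint: str):
    """Return (lower, upper) version tuples for a caret constraint like '^1.1.1'."""
    if not constraint.startswith("^"):
        raise ValueError("expected a caret constraint such as '^1.1.1'")
    given = [int(part) for part in constraint[1:].split(".")]
    if not 1 <= len(given) <= 3:
        raise ValueError("expected one to three numeric components")
    # Index of the left-most non-zero component; if every given component
    # is zero, the last given component is the one that gets bumped.
    bump = next((i for i, n in enumerate(given) if n != 0), len(given) - 1)
    lower = tuple(given + [0] * (3 - len(given)))
    upper = tuple(given[:bump] + [given[bump] + 1] + [0] * (2 - bump))
    return lower, upper


def satisfies(version: str, constraint: str) -> bool:
    """Check lower <= version < upper for a plain X.Y.Z version string."""
    parts = [int(part) for part in version.split(".")]
    padded = tuple(parts + [0] * (3 - len(parts)))
    lower, upper = caret_range(constraint)
    return lower <= padded < upper


for spec in ["^3.7", "^1.1.1", "^5.2"]:
    print(spec, caret_range(spec))
```

Under this rule, a constraint such as ^1.1.1 accepts 1.5.3 but rejects 2.0.0.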

poetry.lock

Here's how Poetry builds the virtual environment when poetry install is run:

If the poetry.lock file exists, use the exact dependency versions specified in the lock file to build the virtual environment
If the poetry.lock file doesn't exist, then use the pyproject.toml file to resolve the dependencies, build a lock file, and set up the virtual environment
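
The two cases can be written down as a small decision function. This is only an illustration of the rule described above, not Poetry's actual code, and the name describe_install_plan is hypothetical.

```python
from pathlib import Path


def describe_install_plan(project_dir: Path) -> str:
    """Say which of the two install paths applies to project_dir."""
    if not (project_dir / "pyproject.toml").is_file():
        raise FileNotFoundError("pyproject.toml is required in the project directory")
    if (project_dir / "poetry.lock").is_file():
        # Lock file present: the pinned versions are used as-is.
        return "install the exact versions pinned in poetry.lock"
    # No lock file: versions are resolved first, then recorded.
    return "resolve versions from pyproject.toml, then write poetry.lock"


# Example: describe_install_plan(Path.cwd()) when run from the project root.
```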

You should check the lock file into source control so collaborators can build a virtual environment that's identical to what you're using.

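If you're ever having trouble with the virtual environment or lock file, you can delete them and recreate them with poetry install. Keep in mind that deleting the lock file makes Poetry resolve the dependencies again, so you may end up with newer versions than before. Don't manually modify the lock file or virtual environment.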

Conclusion

Poetry allows for an amazing local workflow with Pandas and Jupyter.

Poetry virtual environments are clean, easy to use, and save you from dependency hell.

Poetry also makes it easy to work with Python cluster computing libraries like Dask and PySpark.

The Poetry workflow outlined in this post is especially useful because it extends easily to a team of multiple developers. Follow the steps in this guide and upload your code to GitHub, and your teammates can clone the repo and run poetry install to create an identical development environment on their machines.

Poetry saves your teammates from suffering through Python dependency hell. Make your projects easy to use and you'll get more users and adoption!