MungingData - Page 2 of 13 - Piles of precious data

Add Category Column to pandas DataFrame with cut

mrpowers December 5, 2021

This post explains how to add a category column to a pandas DataFrame with cut(). cut() makes it easy to categorize numerical values into buckets. Let’s look at a […]

Managing Dask Software Environments with Conda

mrpowers November 24, 2021

This post shows you how to set up conda on your machine and explains why it’s the best way to manage software environments for Dask projects. This blog post says […]

Splitting Large CSV files with Python

mrpowers November 24, 2021

This blog post demonstrates different approaches for splitting a large CSV file into smaller CSV files and outlines the costs and benefits of each approach. TL;DR It’s faster to […]

7 Steps for rejecting meeting invites

mrpowers November 17, 2021

Meetings are the main way to kill your productivity as a creative professional. Two strategically timed meetings can eliminate your maker’s hours for an entire day. Rejecting meeting invites to […]

Self Publishing High Quality Programming Books

mrpowers October 14, 2021

This post describes a workflow for self-publishing programming books that readers will love. Writing a book seems like a daunting task, but it’s less intimidating if each chapter is […]

Reading Delta Lakes into pandas DataFrames

mrpowers October 11, 2021

This post explains how to read Delta Lakes into pandas DataFrames. The delta-rs library makes this incredibly easy and doesn’t require any Spark dependencies. Let’s look at some simple examples, […]

Testing Pandas Code

mrpowers August 9, 2021

This post explains how to test pandas code with the built-in test helper methods and with the beavis functions that give more readable error messages. Unit testing helps you write […]

Working with PySpark ArrayType Columns

mrpowers June 28, 2021

This post explains how to create DataFrames with ArrayType columns and how to perform common data processing operations. Array columns are one of the most useful column types, but they’re […]

Defining PySpark Schemas with StructType and StructField

mrpowers June 26, 2021

This post explains how to define PySpark schemas and when this design pattern is useful. It’ll also explain when defining schemas seems wise but can actually be safely avoided. Schemas […]

Renaming Columns in Pandas DataFrames

mrpowers June 23, 2021

This article explains how to rename one or more columns in a pandas DataFrame. There are several ways to rename columns and you’ll often want to perform this […]