Reading Delta Lakes into pandas DataFrames
This post explains how to read Delta Lakes into pandas DataFrames. The delta-rs library makes this incredibly easy and doesn’t require any Spark dependencies. Let’s look at some simple examples, […]
This post explains how to test pandas code with the built-in test helper methods and with the beavis functions that give more readable error messages. Unit testing helps you write […]
This post explains how to create DataFrames with ArrayType columns and how to perform common data processing operations. Array columns are one of the most useful column types, but they’re […]
This post explains how to define PySpark schemas and when this design pattern is useful. It’ll also explain when defining schemas seems wise, but can actually be safely avoided. Schemas […]
This article explains how to rename one or more columns in a pandas DataFrame. There are several ways to rename columns, and you’ll often want to perform this […]
This post explains how to add constant columns to PySpark DataFrames with lit and typedLit. You’ll see examples where these functions are useful and when these functions are invoked implicitly. […]
This blog post shows you how to gracefully handle null in PySpark and how to avoid null input errors. Mismanaging the null case is a common source of errors and […]
This post explains how to create a SparkSession with getOrCreate and how to reuse the SparkSession with getActiveSession. You need a SparkSession to read data stored in files, when manually […]
This post explains how to write a Dask DataFrame to CSV files. You’ll see how to write CSV files, customize the filename, change the compression, and append files to an […]
Dask DataFrames are composed of multiple partitions and, by default, are written out as multiple files, one per partition. This post explains the different approaches to writing a Dask DataFrame to […]