Delta lakes are versioned, so you can easily revert to old versions of the data. In some instances, Delta Lake needs to store multiple versions of the data to enable […]
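
The excerpt only hints at the mechanics, so here is a minimal sketch of reading an old version via Delta Lake's `versionAsOf` reader option. The table path `/tmp/delta/events` is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-time-travel")
  .getOrCreate()

// Load an older snapshot of the table by version number
// (/tmp/delta/events is a hypothetical path)
val oldSnapshot = spark.read
  .format("delta")
  .option("versionAsOf", 0)
  .load("/tmp/delta/events")
```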

This post explains how to compact small files in Delta lakes with Spark. Data lakes can accumulate a lot of small files, especially when they’re incrementally updated. Small files cause […]
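For flavor, a minimal sketch of one common compaction pattern: read the table, repartition to fewer files, and overwrite with the `dataChange` option set to false so the commit is recorded as a pure rewrite rather than new data. The path and partition count are hypothetical, and `spark` is assumed to be an active SparkSession.

```scala
val path = "/tmp/delta/events" // hypothetical table path

// spark: an active SparkSession (e.g. from spark-shell)
spark.read
  .format("delta")
  .load(path)
  .repartition(8) // collapse many small files into 8 larger ones
  .write
  .format("delta")
  .option("dataChange", "false") // mark the commit as a rewrite, not new data
  .mode("overwrite")
  .save(path)
```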

This post describes how to programmatically compact Parquet files in a folder. Incremental updates frequently result in lots of small files that can be slow to read. It’s best to […]
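A minimal sketch of one way to do this with plain Parquet, assuming hypothetical input and output folders. Unlike a Delta table, a bare Parquet folder can’t be safely overwritten while it’s being read, so the compacted files are written to a separate location first.

```scala
val inputPath = "/tmp/parquet/incremental" // hypothetical folder of small files
val outputPath = "/tmp/parquet/compacted"  // hypothetical destination folder

// spark: an active SparkSession (e.g. from spark-shell)
spark.read
  .parquet(inputPath)
  .repartition(4) // target number of compacted output files
  .write
  .mode("overwrite")
  .parquet(outputPath)
```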

Spark makes it easy to broadcast maps and perform hash lookups in a cluster computing environment. This post explains how to broadcast maps and how to use these broadcast variables […]
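A minimal sketch of the broadcast-and-lookup pattern, using a made-up state-abbreviation map: `broadcast` ships the map to every executor once, and a UDF performs the hash lookup against `.value` for each row. Again, `spark` is assumed to be an active SparkSession.

```scala
import org.apache.spark.sql.functions._
import spark.implicits._

val stateAbbrevs = Map("New York" -> "NY", "California" -> "CA") // hypothetical lookup map
val bStateAbbrevs = spark.sparkContext.broadcast(stateAbbrevs)

// Each executor reads the broadcasted map locally instead of
// re-serializing it with every task
val abbrev = udf((state: String) => bStateAbbrevs.value.getOrElse(state, "unknown"))

val df = Seq("New York", "California", "Texas").toDF("state")
df.withColumn("state_abbrev", abbrev($"state")).show()
```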