This post describes how to programatically compact Parquet files in a folder. Incremental updates frequently result in lots of small files that can be slow to read. It’s best to […]

Spark makes it easy to broadcast maps and perform hash lookups in a cluster computing environment. This post explains how to broadcast maps and how to use these broadcasted variables […]

Spark DataFrame columns support arrays, which are great for data sets that have an arbitrary length. This blog post will demonstrate Spark methods that return ArrayType columns, describe how to […]