Apache Spark

The term “column equality” refers to two different things in Spark: When a column is equal to a particular value (typically when filtering) When all the values in two columns […]

Spark DataFrame columns support maps, which are great for key / value pairs with an arbitrary length. This blog post describes how to create MapType columns, demonstrates built-in functions to […]

Apache Spark is a big data engine that has quickly become one of the biggest distributed processing frameworks in the world. It’s used by all the big financial institutions and […]

This post describes how to programatically compact Parquet files in a folder. Incremental updates frequently result in lots of small files that can be slow to read. It’s best to […]