Apache Spark

Apache Spark is a big data engine that has quickly become one of the most widely used distributed processing frameworks in the world. It’s used by many of the big financial institutions and […]

This post describes how to programmatically compact Parquet files in a folder. Incremental updates frequently result in lots of small files that can be slow to read. It’s best to […]
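A minimal sketch of the idea: read all the small Parquet files in a folder, repartition into a handful of larger files, and write them out to a new location. The paths, partition count, and object name here are illustrative placeholders, not the post's actual code.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical example: compact a folder of small Parquet files
object CompactParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("compact-parquet")
      .getOrCreate()

    // Read every small Parquet file in the source folder (placeholder path)
    val df = spark.read.parquet("/data/events_incremental")

    // repartition(n) shuffles the rows evenly into n output files,
    // so the rewritten folder contains a few large files instead of many tiny ones
    df.repartition(4)
      .write
      .mode("overwrite")
      .parquet("/data/events_compacted")

    spark.stop()
  }
}
```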

Spark makes it easy to broadcast maps and perform hash lookups in a cluster computing environment. This post explains how to broadcast maps and how to use these broadcast variables […]
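A minimal sketch of the technique, assuming a small in-memory lookup table: broadcast the Map to every executor, then do hash lookups against it inside a UDF. The map contents, column names, and object name are assumptions for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical example: broadcast a small Map and look values up per row
object BroadcastMapLookup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-map-lookup")
      .getOrCreate()
    import spark.implicits._

    // Small lookup table we want available on every executor
    val countryNames = Map("US" -> "United States", "CA" -> "Canada")
    val countryNamesB = spark.sparkContext.broadcast(countryNames)

    // UDF that performs a hash lookup against the broadcast map
    val lookupCountry = udf((code: String) =>
      countryNamesB.value.getOrElse(code, "unknown"))

    val df = Seq("US", "CA", "MX").toDF("country_code")
    df.withColumn("country_name", lookupCountry(col("country_code"))).show()

    spark.stop()
  }
}
```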