Compacting Parquet Files
This post describes how to programmatically compact Parquet files in a folder. Incremental updates frequently result in lots of small files that can be slow to read. It’s best to […]
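As a taste of the approach, here’s a minimal compaction sketch, assuming a running SparkSession and hypothetical input/output paths; it reads the folder of small files and rewrites it as a few larger files:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Read the folder full of small Parquet files
val df = spark.read.parquet("/data/events")

// Rewrite the data as a small number of larger files.
// The target file count (10) is a placeholder; in practice it's
// derived from the total size of the data.
df.repartition(10)
  .write
  .mode("overwrite")
  .parquet("/data/events_compacted")
```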
Spark makes it easy to broadcast maps and perform hash lookups in a cluster computing environment. This post explains how to broadcast maps and how to use these broadcast variables […]
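Here’s a minimal sketch of the pattern, assuming a running SparkSession; the lookup map and column names are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Small lookup map that easily fits in memory
val countryNames = Map("CN" -> "China", "FR" -> "France")

// Broadcast ships one read-only copy to each executor
val bCountryNames = spark.sparkContext.broadcast(countryNames)

// UDF that performs a hash lookup against the broadcast map
val countryNameUdf = udf((code: String) => bCountryNames.value.get(code))

val df = Seq("CN", "FR", "XX").toDF("country_code")
df.withColumn("country_name", countryNameUdf($"country_code")).show()
```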
Spark Structured Streaming and Trigger.Once make it easy to run incremental updates. Spark uses a checkpoint directory to identify the data that’s already been processed and only analyzes the new […]
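A minimal sketch of the pattern, with made-up paths and schema; the checkpointLocation option is what lets Spark pick up where the previous run left off:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().getOrCreate()

// File sources require an explicit schema when streaming
val schema = StructType(Seq(StructField("event", StringType)))

val df = spark.readStream
  .schema(schema)
  .parquet("/data/incoming")

// Trigger.Once processes all unseen data in a single batch, then stops
df.writeStream
  .trigger(Trigger.Once)
  .format("parquet")
  .option("checkpointLocation", "/data/checkpoints/incoming")
  .option("path", "/data/output")
  .start()
  .awaitTermination()
```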
Delta Lake is a wonderful technology that adds powerful features to Parquet data lakes. This blog post demonstrates how to create and incrementally update Delta lakes. We will learn how […]
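A minimal sketch of both steps, assuming the Delta Lake library is on the classpath and using a made-up path:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Create a Delta lake from an initial batch of data
val df = Seq((1, "a"), (2, "b")).toDF("id", "letter")
df.write.format("delta").save("/data/delta/letters")

// Incrementally update the lake by appending another batch
val newDf = Seq((3, "c")).toDF("id", "letter")
newDf.write.format("delta").mode("append").save("/data/delta/letters")
```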
Spark can use the disk partitioning of files to greatly speed up certain filtering operations. This post explains the difference between memory and disk partitioning, describes how to analyze physical […]
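A minimal sketch of disk partitioning, with made-up data and paths; partitionBy controls the folder layout on disk, which is what enables the file skipping:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df = Seq(("CN", 1), ("FR", 2), ("CN", 3)).toDF("country", "id")

// partitionBy writes one folder per country value on disk
df.write.partitionBy("country").parquet("/data/people")

// A filter on the partition column lets Spark skip entire folders;
// explain() surfaces this as a PartitionFilters entry in the plan
spark.read
  .parquet("/data/people")
  .filter($"country" === "CN")
  .explain()
```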
The Spark console is a great way to run Spark code on your local machine. You can easily create a DataFrame and play around with code in the Spark console […]
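For example, the shell pre-creates a SparkSession and imports its implicits, so you can build a DataFrame immediately:

```scala
// Start the console with: $SPARK_HOME/bin/spark-shell
// A SparkSession named `spark` is created automatically and
// spark.implicits._ is already imported.

val df = Seq("hello", "spark", "console").toDF("word")
df.show()
```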
The SparkSession is used to create and read DataFrames. It’s used whenever you create a DataFrame in your test suite or whenever you read a Parquet / CSV data lake […]
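A minimal sketch of creating a session, the same way a test suite typically would; the app name, master setting, and paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate returns the existing session if one is already running,
// so repeated calls across tests are cheap
val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("spark-session-example")
  .getOrCreate()

// The session is also the entry point for reading data lakes
val parquetDf = spark.read.parquet("/data/some_lake") // hypothetical path
val csvDf = spark.read.option("header", "true").csv("/data/some_lake_csv")
```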
Mill is an SBT alternative that can be used to build Spark projects. This post explains how to create a Spark project with Mill and why you might want to […]
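A minimal build.sc sketch for a Spark project, with a placeholder module name and placeholder versions:

```scala
// build.sc
import mill._
import mill.scalalib._

object sparkProject extends ScalaModule {
  def scalaVersion = "2.12.15" // placeholder version

  def ivyDeps = Agg(
    // Spark dependency; the version is a placeholder
    ivy"org.apache.spark::spark-sql:3.3.0"
  )
}
```

With this layout, `mill sparkProject.compile` compiles the module and `mill sparkProject.test` would run a test module if one is defined.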
Spark DataFrame columns support arrays, which are great for fields that contain a variable number of values. This blog post will demonstrate Spark methods that return ArrayType columns, describe how to […]
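A minimal sketch with made-up data; split is one of the methods that returns an ArrayType column, and array_contains is one way to work with the result:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_contains, split}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df = Seq("apple,banana", "cherry").toDF("fruits_csv")

// split returns an ArrayType(StringType) column
val withArray = df.withColumn("fruits", split($"fruits_csv", ","))

// array_contains keeps rows whose array holds a given element
withArray.filter(array_contains($"fruits", "apple")).show()
```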
SBT is an interactive build tool that is used to run tests and package your projects as JAR files. SBT lets you create a project in a text editor and […]
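A minimal build.sbt sketch for a Spark project, with placeholder versions; `sbt test` runs the test suite and `sbt package` builds the JAR:

```scala
// build.sbt
name := "spark-example"
scalaVersion := "2.12.15" // placeholder version

libraryDependencies ++= Seq(
  // Spark is "provided" because the cluster supplies it at runtime
  "org.apache.spark" %% "spark-sql" % "3.3.0" % "provided",
  "org.scalatest" %% "scalatest" % "3.2.15" % Test
)
```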