Compacting Parquet Files
This post describes how to programmatically compact Parquet files in a folder. Incremental updates frequently result in lots of small files that can be slow to read. It’s best to […]
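As a taste of the approach, here’s a minimal compaction sketch, assuming a running SparkSession and hypothetical input/output paths; it reads the folder of small files and rewrites it as a few larger files:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Read the folder full of small Parquet files
val df = spark.read.parquet("/data/events")

// Rewrite the data as a small number of larger files.
// The target file count (10) is a placeholder; in practice it's
// derived from the total size of the data.
df.repartition(10)
  .write
  .mode("overwrite")
  .parquet("/data/events_compacted")
```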
Spark makes it easy to broadcast maps and perform hash lookups in a cluster computing environment. This post explains how to broadcast maps and how to use these broadcast variables […]
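Here’s a minimal sketch of the pattern, assuming a running SparkSession; the lookup map and column names are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Small lookup map that easily fits in memory
val countryNames = Map("CN" -> "China", "FR" -> "France")

// Broadcast ships one read-only copy to each executor
val bCountryNames = spark.sparkContext.broadcast(countryNames)

// UDF that performs a hash lookup against the broadcast map
val countryNameUdf = udf((code: String) => bCountryNames.value.get(code))

val df = Seq("CN", "FR", "XX").toDF("country_code")
df.withColumn("country_name", countryNameUdf($"country_code")).show()
```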
Spark Structured Streaming and Trigger.Once make it easy to run incremental updates. Spark uses a checkpoint directory to identify the data that’s already been processed and only analyzes the new […]
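A minimal sketch of the pattern, with made-up paths and schema; the checkpointLocation option is what lets Spark pick up where the previous run left off:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val spark = SparkSession.builder().getOrCreate()

// File sources require an explicit schema when streaming
val schema = StructType(Seq(StructField("event", StringType)))

val df = spark.readStream
  .schema(schema)
  .parquet("/data/incoming")

// Trigger.Once processes all unseen data in a single batch, then stops
df.writeStream
  .trigger(Trigger.Once)
  .format("parquet")
  .option("checkpointLocation", "/data/checkpoints/incoming")
  .option("path", "/data/output")
  .start()
  .awaitTermination()
```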
Delta Lake is a wonderful technology that adds powerful features to Parquet data lakes. This blog post demonstrates how to create and incrementally update Delta lakes. We will learn how […]
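A minimal sketch of both steps, assuming the Delta Lake library is on the classpath and using a made-up path:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

// Create a Delta lake from an initial batch of data
val df = Seq((1, "a"), (2, "b")).toDF("id", "letter")
df.write.format("delta").save("/data/delta/letters")

// Incrementally update the lake by appending another batch
val newDf = Seq((3, "c")).toDF("id", "letter")
newDf.write.format("delta").mode("append").save("/data/delta/letters")
```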
Spark can use the disk partitioning of files to greatly speed up certain filtering operations. This post explains the difference between memory and disk partitioning, describes how to analyze physical […]
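A minimal sketch of disk partitioning, with made-up data and paths; partitionBy controls the folder layout on disk, which is what enables the file skipping:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df = Seq(("CN", 1), ("FR", 2), ("CN", 3)).toDF("country", "id")

// partitionBy writes one folder per country value on disk
df.write.partitionBy("country").parquet("/data/people")

// A filter on the partition column lets Spark skip entire folders;
// explain() surfaces this as a PartitionFilters entry in the plan
spark.read
  .parquet("/data/people")
  .filter($"country" === "CN")
  .explain()
```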
The Spark console is a great way to run Spark code on your local machine. You can easily create a DataFrame and play around with code in the Spark console […]
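For example, the shell pre-creates a SparkSession and imports its implicits, so you can build a DataFrame immediately:

```scala
// Start the console with: $SPARK_HOME/bin/spark-shell
// A SparkSession named `spark` is created automatically and
// spark.implicits._ is already imported.

val df = Seq("hello", "spark", "console").toDF("word")
df.show()
```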
The SparkSession is used to create and read DataFrames. It’s used whenever you create a DataFrame in your test suite or whenever you read a Parquet / CSV data lake […]
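A minimal sketch of creating a session, the same way a test suite typically would; the app name, master setting, and paths are placeholders:

```scala
import org.apache.spark.sql.SparkSession

// getOrCreate returns the existing session if one is already running,
// so repeated calls across tests are cheap
val spark = SparkSession
  .builder()
  .master("local[*]")
  .appName("spark-session-example")
  .getOrCreate()

// The session is also the entry point for reading data lakes
val parquetDf = spark.read.parquet("/data/some_lake") // hypothetical path
val csvDf = spark.read.option("header", "true").csv("/data/some_lake_csv")
```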
Mill is an SBT alternative that can be used to build Spark projects. This post explains how to create a Spark project with Mill and why you might want to […]
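A minimal build.sc sketch for a Spark project, with a placeholder module name and placeholder versions:

```scala
// build.sc
import mill._
import mill.scalalib._

object sparkProject extends ScalaModule {
  def scalaVersion = "2.12.15" // placeholder version

  def ivyDeps = Agg(
    // Spark dependency; the version is a placeholder
    ivy"org.apache.spark::spark-sql:3.3.0"
  )
}
```

With this layout, `mill sparkProject.compile` compiles the module and `mill sparkProject.test` would run a test module if one is defined.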
Spark DataFrame columns support arrays, which are great for fields that contain a variable number of values. This blog post will demonstrate Spark methods that return ArrayType columns, describe how to […]
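A minimal sketch with made-up data; split is one of the methods that returns an ArrayType column, and array_contains is one way to work with the result:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_contains, split}

val spark = SparkSession.builder().getOrCreate()
import spark.implicits._

val df = Seq("apple,banana", "cherry").toDF("fruits_csv")

// split returns an ArrayType(StringType) column
val withArray = df.withColumn("fruits", split($"fruits_csv", ","))

// array_contains keeps rows whose array holds a given element
withArray.filter(array_contains($"fruits", "apple")).show()
```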
SBT is an interactive build tool that is used to run tests and package your projects as JAR files. SBT lets you create a project in a text editor and […]
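A minimal build.sbt sketch for a Spark project, with placeholder versions; `sbt test` runs the test suite and `sbt package` builds the JAR:

```scala
// build.sbt
name := "spark-example"
scalaVersion := "2.12.15" // placeholder version

libraryDependencies ++= Seq(
  // Spark is "provided" because the cluster supplies it at runtime
  "org.apache.spark" %% "spark-sql" % "3.3.0" % "provided",
  "org.scalatest" %% "scalatest" % "3.2.15" % Test
)
```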