October 2018 - MungingData

Compacting Files with Spark to Address the Small File Problem

mrpowers, October 21, 2018

Spark runs slowly when it reads data from a lot of small files in S3. You can make your Spark code run faster by creating a job that compacts small […]

Introduction to Spark Broadcast Joins

mrpowers, October 17, 2018

Spark broadcast joins are well suited to joining a large DataFrame with a small DataFrame. They are not suitable for joining two large DataFrames, because the broadcast side must fit in memory on every executor. This post explains how to do […]

Deduplicating and Collapsing Records in Spark DataFrames

mrpowers, October 6, 2018

This blog post explains how to filter duplicate records from Spark DataFrames with the dropDuplicates() and killDuplicates() methods. It also demonstrates how to collapse duplicate records into a single row […]