Working with dates and times in Spark
Spark supports DateType and TimestampType columns and defines a rich API of functions to make working with dates and times easy. This blog post will demonstrate how to make DataFrames […]
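As a taste of that API, here is a minimal sketch (the column names and sample dates are illustrative) that parses strings into DateType and derives new date columns with the built-in to_date, date_add, and datediff functions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local").appName("dates").getOrCreate()
import spark.implicits._

// parse strings into DateType columns and derive new date columns
val df = Seq("2019-01-01", "2019-06-15").toDF("raw")
  .withColumn("parsed", to_date(col("raw"), "yyyy-MM-dd"))
  .withColumn("next_week", date_add(col("parsed"), 7))
  .withColumn("days_since", datediff(current_date(), col("parsed")))

df.show()
```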
Apache Spark makes it easy to build data lakes that are optimized for AWS Athena queries. This blog post will demonstrate that it’s easy to follow the AWS Athena tuning […]
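A sketch of the kind of layout those tuning tips point toward: partitioned, snappy-compressed Parquet. The bucket paths and the event_date partition column below are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("athena-lake").getOrCreate()

// hypothetical paths; partitioning on a commonly filtered column lets
// Athena skip entire directories, and snappy Parquet keeps scans cheap
val df = spark.read.parquet("s3a://some-bucket/raw")

df.repartition(col("event_date"))
  .write
  .partitionBy("event_date")
  .option("compression", "snappy")
  .parquet("s3a://some-bucket/lake")
```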
Spark code will run faster with some data lake layouts than others. For example, Spark will run slowly if the data lake uses gzip compression and has unequally sized files (especially […]
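One common fix, sketched below with hypothetical paths: rewrite a gzipped CSV lake (gzipped text files are not splittable, so each file is read by a single task) as equally sized, splittable snappy Parquet files:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("relayout").getOrCreate()

// hypothetical paths: convert gzipped CSV into snappy Parquet so Spark
// can parallelize reads properly
val df = spark.read.option("header", "true").csv("s3a://some-bucket/gzipped-csv")

df.repartition(200) // equal-size partitions give roughly equal-size output files
  .write
  .option("compression", "snappy")
  .parquet("s3a://some-bucket/parquet-lake")
```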
Spark runs slowly when it reads data from a lot of small files in S3. You can make your Spark code run faster by creating a job that compacts small […]
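A compaction job can be as simple as the following sketch (the paths and file count are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compaction").getOrCreate()

// hypothetical compaction job: read the folder full of small files and
// rewrite it with fewer, larger files so future reads launch fewer tasks
val df = spark.read.parquet("s3a://some-bucket/small-files")

df.coalesce(8) // pick a count that yields files of a few hundred MB to ~1GB
  .write
  .mode("overwrite")
  .parquet("s3a://some-bucket/compacted")
```

coalesce avoids a full shuffle, which is usually what you want for a pure compaction pass; reach for repartition instead when the existing partitions are badly skewed.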
Spark broadcast joins are perfect for joining a large DataFrame with a small DataFrame. Broadcast joins cannot be used when joining two large DataFrames. This post explains how to do […]
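Here is a minimal, self-contained example of the broadcast() hint (the toy data is illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().master("local").appName("bcast").getOrCreate()
import spark.implicits._

val large = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
val lookup = Seq((1, "US"), (2, "CA")).toDF("id", "country")

// broadcast() ships the small table to every executor so the big table
// never has to shuffle for the join
val joined = large.join(broadcast(lookup), Seq("id"), "left")
joined.explain() // the physical plan should show BroadcastHashJoin
```

Spark will also broadcast automatically when the small side falls under spark.sql.autoBroadcastJoinThreshold, but the explicit hint makes the intent clear in the code.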
This blog post explains how to filter duplicate records from Spark DataFrames with the dropDuplicates() and killDuplicates() methods. It also demonstrates how to collapse duplicate records into a single row […]
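dropDuplicates() is the built-in method (the post pairs it with killDuplicates(), which comes from a helper library rather than Spark itself), and a minimal illustration of the built-in looks like this:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("dedupe").getOrCreate()
import spark.implicits._

val df = Seq(("ted", 10), ("ted", 10), ("mary", 7)).toDF("name", "score")

df.dropDuplicates().show()       // drops rows duplicated across every column
df.dropDuplicates("name").show() // keeps one arbitrary row per name
```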
Spark programmers only need to know a small subset of the Scala API to be productive. Scala has a reputation for being a difficult language to learn and that scares […]
sbt-assembly makes it easy to shade dependencies in your Spark projects when you create fat JAR files. This blog post will explain why it’s useful to shade dependencies and will […]
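For a flavor of what shading looks like, here is a build.sbt sketch (assuming a recent sbt with the slash syntax; the renamed packages are illustrative):

```scala
// build.sbt sketch; sbt-assembly must be on the plugins classpath
assembly / assemblyShadeRules := Seq(
  // rewrite Guava's packages inside the fat JAR so they cannot clash with
  // the Guava version that ships with the Spark runtime
  ShadeRule.rename("com.google.common.**" -> "shaded.com.google.common.@1").inAll
)
```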
Spark SQL functions make it easy to perform DataFrame analyses. This post will show you how to use the built-in Spark SQL functions and how to build your own SQL […]
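A short sketch mixing a built-in function with a hand-rolled UDF (the data and column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local").appName("funcs").getOrCreate()
import spark.implicits._

val df = Seq(("alice", 3), ("bob", 4)).toDF("name", "n")

// built-in Spark SQL function
val withUpper = df.withColumn("shout", upper(col("name")))

// user-defined function built from a plain Scala function
val isEven = udf((n: Int) => n % 2 == 0)
withUpper.withColumn("n_is_even", isEven(col("n"))).show()
```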
Spark DataFrames are similar to tables in relational databases – they store data in columns and rows and support a variety of operations to manipulate the data. Here’s an example […]
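The elided example isn't recoverable here, but a minimal DataFrame demo in the same spirit (toy data, illustrative column names) might look like:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local").appName("df-basics").getOrCreate()
import spark.implicits._

// columns and rows, like a relational table
val people = Seq(("alice", 34), ("bob", 45)).toDF("name", "age")

people.filter(col("age") > 40).show() // familiar table-style operations
```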