Optimizing Data Lakes for Apache Spark
Spark code will run faster with certain data lakes than others. For example, Spark will run slowly if the data lake uses gzip compression and has unequally sized files (especially […]
Spark runs slowly when it reads data from a lot of small files in S3. You can make your Spark code run faster by creating a job that compacts small […]
Spark broadcast joins are perfect for joining a large DataFrame with a small DataFrame. Broadcast joins cannot be used when joining two large DataFrames. This post explains how to do […]
This blog post explains how to filter duplicate records from Spark DataFrames with the dropDuplicates() and killDuplicates() methods. It also demonstrates how to collapse duplicate records into a single row […]
Spark programmers only need to know a small subset of the Scala API to be productive. Scala has a reputation for being a difficult language to learn and that scares […]
sbt-assembly makes it easy to shade dependencies in your Spark projects when you create fat JAR files. This blog post will explain why it’s useful to shade dependencies and will […]
Spark SQL functions make it easy to perform DataFrame analyses. This post will show you how to use the built-in Spark SQL functions and how to build your own SQL […]
Spark DataFrames are similar to tables in relational databases – they store data in columns and rows and support a variety of operations to manipulate the data. Here’s an example […]
Spark codebases can easily become a collection of order-dependent custom transformations (see this blog post for background on custom transformations). Your library will be difficult to use if many […]
Spark Structured Streaming and Trigger.Once can be used to incrementally update Spark extracts with ease. An extract that updates incrementally will take the same amount of time as a normal […]