Working with dates and times in Spark
Spark supports DateType and TimestampType columns and defines a rich API of functions to make working with dates and times easy. This blog post will demonstrate how to make DataFrames […]
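As a taste of that API, here is a minimal sketch (the column names and sample dates are illustrative) that parses strings into DateType and derives new date columns with the built-in to_date, date_add, and datediff functions:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local").appName("dates").getOrCreate()
import spark.implicits._

// parse strings into DateType columns and derive new date columns
val df = Seq("2019-01-01", "2019-06-15").toDF("raw")
  .withColumn("parsed", to_date(col("raw"), "yyyy-MM-dd"))
  .withColumn("next_week", date_add(col("parsed"), 7))
  .withColumn("days_since", datediff(current_date(), col("parsed")))

df.show()
```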
Apache Spark makes it easy to build data lakes that are optimized for AWS Athena queries. This blog post will demonstrate that it’s easy to follow the AWS Athena tuning […]
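A sketch of the kind of layout those tuning tips point toward: partitioned, snappy-compressed Parquet. The bucket paths and the event_date partition column below are hypothetical:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("athena-lake").getOrCreate()

// hypothetical paths; partitioning on a commonly filtered column lets
// Athena skip entire directories, and snappy Parquet keeps scans cheap
val df = spark.read.parquet("s3a://some-bucket/raw")

df.repartition(col("event_date"))
  .write
  .partitionBy("event_date")
  .option("compression", "snappy")
  .parquet("s3a://some-bucket/lake")
```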
Spark code will run faster with some data lake layouts than others. For example, Spark will run slowly if the data lake uses gzip compression and has unequally sized files (especially […]
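One common fix, sketched below with hypothetical paths: rewrite a gzipped CSV lake (gzipped text files are not splittable, so each file is read by a single task) as equally sized, splittable snappy Parquet files:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("relayout").getOrCreate()

// hypothetical paths: convert gzipped CSV into snappy Parquet so Spark
// can parallelize reads properly
val df = spark.read.option("header", "true").csv("s3a://some-bucket/gzipped-csv")

df.repartition(200) // equal-size partitions give roughly equal-size output files
  .write
  .option("compression", "snappy")
  .parquet("s3a://some-bucket/parquet-lake")
```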
Spark runs slowly when it reads data from a lot of small files in S3. You can make your Spark code run faster by creating a job that compacts small […]
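A compaction job can be as simple as the following sketch (the paths and file count are hypothetical):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("compaction").getOrCreate()

// hypothetical compaction job: read the folder full of small files and
// rewrite it with fewer, larger files so future reads launch fewer tasks
val df = spark.read.parquet("s3a://some-bucket/small-files")

df.coalesce(8) // pick a count that yields files of a few hundred MB to ~1GB
  .write
  .mode("overwrite")
  .parquet("s3a://some-bucket/compacted")
```

coalesce avoids a full shuffle, which is usually what you want for a pure compaction pass; reach for repartition instead when the existing partitions are badly skewed.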
Spark broadcast joins are perfect for joining a large DataFrame with a small DataFrame. Broadcast joins cannot be used when joining two large DataFrames. This post explains how to do […]
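Here is a minimal, self-contained example of the broadcast() hint (the toy data is illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

val spark = SparkSession.builder().master("local").appName("bcast").getOrCreate()
import spark.implicits._

val large = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")
val lookup = Seq((1, "US"), (2, "CA")).toDF("id", "country")

// broadcast() ships the small table to every executor so the big table
// never has to shuffle for the join
val joined = large.join(broadcast(lookup), Seq("id"), "left")
joined.explain() // the physical plan should show BroadcastHashJoin
```

Spark will also broadcast automatically when the small side falls under spark.sql.autoBroadcastJoinThreshold, but the explicit hint makes the intent clear in the code.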
This blog post explains how to filter duplicate records from Spark DataFrames with the dropDuplicates() and killDuplicates() methods. It also demonstrates how to collapse duplicate records into a single row […]
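dropDuplicates() is the built-in method (the post pairs it with killDuplicates(), which comes from a helper library rather than Spark itself), and a minimal illustration of the built-in looks like this:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local").appName("dedupe").getOrCreate()
import spark.implicits._

val df = Seq(("ted", 10), ("ted", 10), ("mary", 7)).toDF("name", "score")

df.dropDuplicates().show()       // drops rows duplicated across every column
df.dropDuplicates("name").show() // keeps one arbitrary row per name
```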
Spark programmers only need to know a small subset of the Scala API to be productive. Scala has a reputation for being a difficult language to learn and that scares […]
sbt-assembly makes it easy to shade dependencies in your Spark projects when you create fat JAR files. This blog post will explain why it’s useful to shade dependencies and will […]
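For a flavor of what shading looks like, here is a build.sbt sketch (assuming a recent sbt with the slash syntax; the renamed packages are illustrative):

```scala
// build.sbt sketch; sbt-assembly must be on the plugins classpath
assembly / assemblyShadeRules := Seq(
  // rewrite Guava's packages inside the fat JAR so they cannot clash with
  // the Guava version that ships with the Spark runtime
  ShadeRule.rename("com.google.common.**" -> "shaded.com.google.common.@1").inAll
)
```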
Spark SQL functions make it easy to perform DataFrame analyses. This post will show you how to use the built-in Spark SQL functions and how to build your own SQL […]
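A short sketch mixing a built-in function with a hand-rolled UDF (the data and column names are illustrative):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().master("local").appName("funcs").getOrCreate()
import spark.implicits._

val df = Seq(("alice", 3), ("bob", 4)).toDF("name", "n")

// built-in Spark SQL function
val withUpper = df.withColumn("shout", upper(col("name")))

// user-defined function built from a plain Scala function
val isEven = udf((n: Int) => n % 2 == 0)
withUpper.withColumn("n_is_even", isEven(col("n"))).show()
```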
Spark DataFrames are similar to tables in relational databases – they store data in columns and rows and support a variety of operations to manipulate the data. Here’s an example […]
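The elided example isn't recoverable here, but a minimal DataFrame demo in the same spirit (toy data, illustrative column names) might look like:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local").appName("df-basics").getOrCreate()
import spark.implicits._

// columns and rows, like a relational table
val people = Seq(("alice", 34), ("bob", 45)).toDF("name", "age")

people.filter(col("age") > 40).show() // familiar table-style operations
```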