Exploring DataFrames with summary and describe
The summary and describe methods make it easy to explore the contents of a DataFrame at a high level. This post shows you how to use these methods. TL;DR – […]
This blog post explains how to compute the percentile, approximate percentile and median of a column in Spark. There are a variety of different ways to perform these computations and […]
Apache Spark code can be written with the Scala, Java, Python, or R APIs. Scala and Python are the most popular APIs. This blog post performs a detailed comparison of […]
Datasets are available to Spark Scala/Java users and offer more type safety than DataFrames. Python and R infer types at runtime, so these APIs cannot support Datasets. This post […]
This post shows how to create beginningOfMonthDate and endOfMonthDate functions by leveraging the native Spark datetime functions. These native functions are not easy to use, so it’s important […]
You can use native Spark functions to compute the beginning and end dates for a week, but the code isn’t intuitive. This blog post demonstrates how to wrap the complex […]
This post explains how to migrate your Scala projects to Spark 3. It covers the high level steps and doesn’t get into all the details. Migrating PySpark projects is easier. […]
This blog post explains how to read a Google Sheet into a Spark DataFrame with the spark-google-spreadsheets library. Google Sheets is not a good place to store a lot of […]
frameless is a great library for writing Datasets with expressive types. The library helps users write correct code with descriptive compile time errors instead of runtime errors with long stack […]
This blog explains how to write out a DataFrame to a single file with Spark. It also describes how to write out data in a file with a specific name, […]