MungingData - Piles of precious data

exists and forall PySpark array functions

mrpowers May 1, 2021

This blog post demonstrates how to check whether any element in a PySpark array meets a condition with exists, or whether all elements in an array meet a condition with […]
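To make the "any element" versus "all elements" distinction concrete, here is a minimal plain-Python sketch of the two checks. It is not the PySpark API: the helper names `exists` and `forall` and the `is_even` predicate are assumptions chosen for illustration, and the helpers operate on ordinary Python lists rather than DataFrame array columns.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def exists(values: Iterable[T], predicate: Callable[[T], bool]) -> bool:
    """Return True if at least one element satisfies the predicate."""
    return any(predicate(v) for v in values)


def forall(values: Iterable[T], predicate: Callable[[T], bool]) -> bool:
    """Return True if every element satisfies the predicate."""
    return all(predicate(v) for v in values)


def is_even(n: int) -> bool:
    """Hypothetical predicate used only for this illustration."""
    return n % 2 == 0


some_numbers = [1, 3, 4]
any_even = exists(some_numbers, is_even)  # one element (4) is even
all_even = forall(some_numbers, is_even)  # 1 and 3 are odd
```

In this sketch `any_even` is `True` and `all_even` is `False`, since only one of the three elements passes the predicate.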

Exploring DataFrames with summary and describe

mrpowers April 16, 2021

The summary and describe methods make it easy to explore the contents of a DataFrame at a high level. This post shows you how to use these methods. TL;DR – […]

Calculating Percentile, Approximate Percentile, and Median with Spark

mrpowers April 11, 2021

This blog post explains how to compute the percentile, approximate percentile, and median of a column in Spark. There are several ways to perform these computations and […]

Scala is a Maintenance Nightmare

mrpowers March 21, 2021

This post explains why Scala projects are difficult to maintain. Scala is a powerful programming language that can make certain small teams hyper-productive. Scala can also slow productivity by drowning […]

Scala Spark vs Python PySpark: Which is better?

mrpowers February 8, 2021

Apache Spark code can be written with the Scala, Java, Python, or R APIs. Scala and Python are the most popular APIs. This blog post performs a detailed comparison of […]

Type 2 Slowly Changing Dimension Upserts with Delta Lake

mrpowers January 30, 2021

This post explains how to perform type 2 upserts for slowly changing dimension tables with Delta Lake. We’ll start out by covering the basics of type 2 SCDs and when […]

Spark Datasets: Advantages and Limitations

mrpowers January 27, 2021

Datasets are available to Spark Scala/Java users and offer more type safety than DataFrames. Python and R infer types at runtime, so these APIs cannot support Datasets. This post […]

Calculating Month Start and End Dates with Spark

mrpowers January 2, 2021

This post shows how to create beginningOfMonthDate and endOfMonthDate functions by leveraging the native Spark datetime functions. The native Spark datetime functions are not easy to use, so it’s important […]
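As a rough illustration of what month-start and month-end helpers compute, the following sketch uses only the Python standard library. It is not the Spark implementation the post describes: the snake_case names `beginning_of_month_date` and `end_of_month_date` are assumed Python counterparts of the function names in the teaser, and they operate on plain `datetime.date` values rather than DataFrame columns.

```python
import calendar
from datetime import date


def beginning_of_month_date(d: date) -> date:
    """Return the first day of the month that contains d."""
    return d.replace(day=1)


def end_of_month_date(d: date) -> date:
    """Return the last day of the month that contains d."""
    # monthrange returns (weekday of the first day, number of days in month).
    days_in_month = calendar.monthrange(d.year, d.month)[1]
    return d.replace(day=days_in_month)


sample = date(2021, 1, 2)
month_start = beginning_of_month_date(sample)  # date(2021, 1, 1)
month_end = end_of_month_date(sample)          # date(2021, 1, 31)
```

Using `calendar.monthrange` keeps leap years correct without hard-coding month lengths, so February 2020 ends on the 29th.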

Calculating Week Start and Week End Dates with Spark

mrpowers December 31, 2020

You can use native Spark functions to compute the beginning and end dates for a week, but the code isn’t intuitive. This blog post demonstrates how to wrap the complex […]
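The underlying date arithmetic can be sketched in plain Python. This is an analogue for illustration, not the Spark code from the post: the helper names `week_start_date` and `week_end_date` are hypothetical, and the sketch assumes weeks run Monday through Sunday, which may differ from the convention the post uses.

```python
from datetime import date, timedelta


def week_start_date(d: date) -> date:
    """Return the Monday of the week containing d (assumed convention)."""
    # date.weekday() is 0 for Monday through 6 for Sunday.
    return d - timedelta(days=d.weekday())


def week_end_date(d: date) -> date:
    """Return the Sunday of the week containing d (assumed convention)."""
    return week_start_date(d) + timedelta(days=6)


sample = date(2020, 12, 31)          # a Thursday
week_start = week_start_date(sample)  # date(2020, 12, 28)
week_end = week_end_date(sample)      # date(2021, 1, 3)
```

Note that the week containing December 31, 2020 crosses a year boundary, which is the kind of edge case worth covering when wrapping this logic in a helper.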

Wrapping Java Code with Clean Scala Interfaces

mrpowers December 23, 2020

This post explains how to wrap a Java library with a Scala interface. You can instantiate Java classes directly in Scala, but it’s best to wrap the Java code, so […]