exists and forall PySpark array functions
This blog post demonstrates how to find if any element in a PySpark array meets a condition with exists or if all elements in an array meet a condition with […]
This blog post demonstrates how to find if any element in a PySpark array meets a condition with exists or if all elements in an array meet a condition with […]
The summary and describe methods make it easy to explore the contents of a DataFrame at a high level. This post shows you how to use these methods. TL;DR – […]
This blog post explains how to compute the percentile, approximate percentile and median of a column in Spark. There are a variety of different ways to perform these computations and […]
This post explains why Scala projects are difficult to maintain. Scala is a powerful programming language that can make certain small teams hyper-productive. Scala can also slow productivity by drowning […]
Apache Spark code can be written with the Scala, Java, Python, or R APIs. Scala and Python are the most popular APIs. This blog post performs a detailed comparison of […]
This post explains how to perform type 2 upserts for slowly changing dimension tables with Delta Lake. We’ll start out by covering the basics of type 2 SCDs and when […]
Datasets are available to Spark Scala/Java users and offer more type safety than DataFrames. Python and R infer types during runtime, so these APIs cannot support the Datasets. This post […]
This post shows how to create beginningOfMonthDate and endOfMonthDate functions by leveraging the native Spark datetime functions. The native Spark datetime functions are not easy to use, so it’s important […]
You can use native Spark functions to compute the beginning and end dates for a week, but the code isn’t intuitive. This blog post demonstrates how to wrap the complex […]
This post explains how to wrap a Java library with a Scala interface. You can instantiate Java classes directly in Scala, but it’s best to wrap the Java code, so […]