Calculating Month Start and End Dates with Spark
This post shows how to create beginningOfMonthDate and endOfMonthDate functions by leveraging the native Spark datetime functions.
The native Spark datetime functions aren't easy to use, so it's important to build abstractions on top of the standard library: calling the standard functions directly results in code that's hard to read.
Month end
Spark has a built-in function, last_day, that calculates the last day of the month, but it's poorly named. Let's give it a more descriptive name so our code is readable.
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.last_day

def endOfMonthDate(col: Column): Column = {
  last_day(col)
}
You can access this function via the spark-daria library if you don't want to define it yourself.
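If you'd rather add spark-daria as a dependency than define the function yourself, the build.sbt entry looks roughly like this (the version is a placeholder, not a real release number; check the spark-daria releases for the current one):

// build.sbt: replace <version> with a real spark-daria release
libraryDependencies += "com.github.mrpowers" %% "spark-daria" % "<version>"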
Suppose you have the following data:
+----------+
| some_date|
+----------+
|2016-09-10|
|2020-01-01|
|2016-01-10|
| null|
+----------+
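If you want to follow along, here's one way to build this DataFrame (a minimal sketch, assuming a SparkSession named spark is in scope):

import org.apache.spark.sql.functions.col
import spark.implicits._

// Build from strings and cast, so the null row is easy to express
val df = Seq("2016-09-10", "2020-01-01", "2016-01-10", null)
  .toDF("some_date")
  .withColumn("some_date", col("some_date").cast("date"))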
Append an end_of_month column to the DataFrame:
import com.github.mrpowers.spark.daria.sql.functions._
df
.withColumn("end_of_month", endOfMonthDate(col("some_date")))
.show()
Observe the results:
+----------+------------+
| some_date|end_of_month|
+----------+------------+
|2016-09-10| 2016-09-30|
|2020-01-01| 2020-01-31|
|2016-01-10| 2016-01-31|
| null| null|
+----------+------------+
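The wrapper also reads nicely inside expressions. Here's a hypothetical is_month_end flag that checks whether a date already falls on the last day of its month:

// true when some_date is already the last day of its month (null stays null)
df
  .withColumn("is_month_end", col("some_date") === endOfMonthDate(col("some_date")))
  .show()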
Month start
You can calculate the start of the month with the trunc or date_trunc functions. Suppose you have the following DataFrame with a date column:
+----------+
| some_date|
+----------+
|2017-11-25|
|2017-12-21|
|2017-09-12|
| null|
+----------+
Here are the two different ways to calculate the beginning of the month:
datesDF
.withColumn("beginning_of_month_date", trunc(col("some_date"), "month"))
.withColumn("beginning_of_month_time", date_trunc("month" ,col("some_date")))
.show()
+----------+-----------------------+-----------------------+
| some_date|beginning_of_month_date|beginning_of_month_time|
+----------+-----------------------+-----------------------+
|2017-11-25| 2017-11-01| 2017-11-01 00:00:00|
|2017-12-21| 2017-12-01| 2017-12-01 00:00:00|
|2017-09-12| 2017-09-01| 2017-09-01 00:00:00|
| null| null| null|
+----------+-----------------------+-----------------------+
Important observations:
- trunc returns a date column and date_trunc returns a timestamp column
- trunc takes col("some_date") as the first argument and date_trunc takes col("some_date") as the second argument. They're inconsistent.
- date_trunc sounds like it should be returning a date column. It's not named well.
Let's define beginningOfMonthDate and beginningOfMonthTime functions that are more intuitive.
import org.apache.spark.sql.functions.{date_trunc, trunc}

def beginningOfMonthDate(col: Column): Column = {
  trunc(col, "month")
}

def beginningOfMonthTime(col: Column): Column = {
  date_trunc("month", col)
}
These functions let us write code that's easier to read:
datesDF
.withColumn("beginning_of_month_date", beginningOfMonthDate(col("some_date")))
.withColumn("beginning_of_month_time", beginningOfMonthTime(col("some_date")))
.show()
These functions are defined in spark-daria.
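The abstractions also compose well with the native functions. Here's a hypothetical days_left_in_month column built with datediff:

import org.apache.spark.sql.functions.{col, datediff}

datesDF
  .withColumn(
    "days_left_in_month",
    datediff(endOfMonthDate(col("some_date")), col("some_date"))
  )
  .show()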
Next steps
Spark's standard datetime functions aren't the best, but they're still better than building UDFs with the java.time library.
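For comparison, here's a rough sketch of what the UDF approach looks like for the month start calculation. It needs explicit null handling, and Catalyst can't optimize it the way it optimizes the native functions:

import java.sql.Date
import org.apache.spark.sql.functions.udf

// Hand-rolled month start; more code and slower than trunc
val beginningOfMonthUdf = udf((d: Date) =>
  if (d == null) null else Date.valueOf(d.toLocalDate.withDayOfMonth(1))
)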
Using the spark-daria datetime abstractions is the best way to create readable code.
The spark-daria datetime function names are based on Rails (specifically ActiveSupport), which provides a well-designed datetime API.
See this post for a detailed explanation on how spark-daria makes computing the week start / week end / next weekday easy. These are examples of core datetime functionality that should be abstracted in an open source library. You shouldn't need to reinvent the wheel and write core datetime logic in your applications.