Calculating Percentile, Approximate Percentile, and Median with Spark

This blog post explains how to compute the percentile, approximate percentile and median of a column in Spark.

There are a variety of different ways to perform these computations and it's good to know all the approaches because they touch different important sections of the Spark API.

Percentile

You can calculate the exact percentile with the percentile SQL function.

Suppose you have the following DataFrame:

+--------+
|some_int|
+--------+
|       0|
|      10|
+--------+

Calculate the 50th percentile:

df
  .agg(expr("percentile(some_int, 0.5)").as("50_percentile"))
  .show()

+-------------+
|50_percentile|
+-------------+
|          5.0|
+-------------+

Using expr to write SQL strings when using the Scala API isn't ideal. It's better to invoke Scala functions, but the percentile function isn't defined in the Scala API.

The bebe library fills in the Scala API gaps and provides easy access to functions like percentile.

df
  .agg(bebe_percentile(col("some_int"), lit(0.5)).as("50_percentile"))
  .show()

+-------------+
|50_percentile|
+-------------+
|          5.0|
+-------------+

bebe_percentile is implemented as a Catalyst expression, so it's just as performant as the SQL percentile function.

Approximate Percentile

Create a DataFrame with the integers between 1 and 1,000.

val df1 = (1 to 1000).toDF("some_int")

Use the approx_percentile SQL method to calculate the 50th percentile:

df1
  .agg(expr("approx_percentile(some_int, array(0.5))").as("approx_50_percentile"))
  .show()

+--------------------+
|approx_50_percentile|
+--------------------+
|               [500]|
+--------------------+

This expr hack isn't ideal. We don't like including SQL strings in our Scala code.

Let's use the bebe_approx_percentile method instead.

df1
  .select(bebe_approx_percentile(col("some_int"), array(lit(0.5))).as("approx_50_percentile"))
  .show()

+--------------------+
|approx_50_percentile|
+--------------------+
|               [500]|
+--------------------+

bebe lets you write code that's a lot nicer and easier to reuse.

Median

The median is the value where fifty percent or the data values fall at or below it. Therefore, the median is the 50th percentile.

Source

We've already seen how to calculate the 50th percentile, or median, both exactly and approximately.

Conclusion

The Spark percentile functions are exposed via the SQL API, but aren't exposed via the Scala or Python APIs.

Invoking the SQL functions with the expr hack is possible, but not desirable. Formatting large SQL strings in Scala code is annoying, especially when writing code that's sensitive to special characters (like a regular expression).

It's best to leverage the bebe library when looking for this functionality. The bebe functions are performant and provide a clean interface for the user.