This blog post explains how to compute the percentile, approximate percentile and median of a column in Spark.
There are a variety of ways to perform these computations, and it’s good to know all the approaches because they touch different parts of the Spark API.
Percentile
You can calculate the exact percentile with the percentile SQL function.
Suppose you have the following DataFrame:
+--------+
|some_int|
+--------+
|       0|
|      10|
+--------+
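Here’s a minimal sketch for building this DataFrame, assuming a SparkSession named spark is in scope:

import spark.implicits._

// two-row DataFrame used in the examples below
val df = Seq(0, 10).toDF("some_int")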
Calculate the 50th percentile:
df
  .agg(expr("percentile(some_int, 0.5)").as("50_percentile"))
  .show()
+-------------+
|50_percentile|
+-------------+
|          5.0|
+-------------+
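The percentile SQL function also accepts an array of percentages, so you can compute several percentiles in a single aggregation. A quick sketch (the quartiles column name is just illustrative):

df
  .agg(expr("percentile(some_int, array(0.25, 0.5, 0.75))").as("quartiles"))
  .show(false)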
Using expr to write SQL strings when using the Scala API isn’t ideal. It’s better to invoke Scala functions, but the percentile function isn’t defined in the Scala API.
The bebe library fills in the Scala API gaps and provides easy access to functions like percentile.
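The snippets below assume the Spark SQL functions and the bebe functions have been imported. The exact bebe import depends on how the library packages its functions object, so treat the second line as a placeholder and check the bebe README:

import org.apache.spark.sql.functions._
// assumed import path for the bebe functions; verify against the bebe README
import org.apache.spark.sql.BebeFunctions._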
df
  .agg(bebe_percentile(col("some_int"), lit(0.5)).as("50_percentile"))
  .show()
+-------------+
|50_percentile|
+-------------+
|          5.0|
+-------------+
bebe_percentile is implemented as a Catalyst expression, so it’s just as performant as the SQL percentile function.
Approximate Percentile
Create a DataFrame with the integers between 1 and 1,000.
val df1 = (1 to 1000).toDF("some_int")
Use the approx_percentile SQL function to calculate the 50th percentile:
df1
  .agg(expr("approx_percentile(some_int, array(0.5))").as("approx_50_percentile"))
  .show()
+--------------------+
|approx_50_percentile|
+--------------------+
|               [500]|
+--------------------+
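approx_percentile also accepts an optional accuracy argument; larger values trade memory for better precision (Spark defaults it to 10000). A sketch with a lower accuracy setting:

df1
  .agg(expr("approx_percentile(some_int, array(0.5), 100)").as("approx_50_percentile"))
  .show()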
This expr hack isn’t ideal. We don’t like including SQL strings in our Scala code.
Let’s use the bebe_approx_percentile method instead.
df1
  .select(bebe_approx_percentile(col("some_int"), array(lit(0.5))).as("approx_50_percentile"))
  .show()
+--------------------+
|approx_50_percentile|
+--------------------+
|               [500]|
+--------------------+
bebe lets you write code that’s a lot nicer and easier to reuse.
Median
The median is the value at or below which fifty percent of the data values fall. Therefore, the median is the 50th percentile.
We’ve already seen how to calculate the 50th percentile, or median, both exactly and approximately.
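To make that explicit, either of the following returns the median of some_int, the first exactly and the second approximately (this simply restates the earlier examples with a single 0.5 percentage):

// exact median
df1.agg(expr("percentile(some_int, 0.5)").as("median")).show()
// approximate median
df1.agg(expr("approx_percentile(some_int, 0.5)").as("approx_median")).show()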
Conclusion
The Spark percentile functions are exposed via the SQL API, but aren’t exposed via the Scala or Python APIs.
Invoking the SQL functions with the expr hack is possible, but not desirable. Formatting large SQL strings in Scala code is annoying, especially when writing code that’s sensitive to special characters (like a regular expression).
It’s best to leverage the bebe library when looking for this functionality. The bebe functions are performant and provide a clean interface for the user.