Exploring DataFrames with summary and describe
The summary
and describe
methods make it easy to explore the contents of a DataFrame at a high level.
This post shows you how to use these methods.
TL;DR - summary
is more useful than describe
. You can get the same result with agg
, but summary
will save you from writing a lot of code.
describe
Suppose you have the following DataFrame.
+----+-------+
|num1|letters|
+----+-------+
| 1| aa|
| 2| aa|
| 9| bb|
| 5| cc|
+----+-------+
Use describe
to compute some summary statistics on the DataFrame.
df.describe().show()
+-------+-----------------+-------+
|summary| num1|letters|
+-------+-----------------+-------+
| count| 4| 4|
| mean| 4.25| null|
| stddev|3.593976442141304| null|
| min| 1| aa|
| max| 9| cc|
+-------+-----------------+-------+
You can limit the describe
statistics for a subset of columns:
df.describe("num1").show()
+-------+-----------------+
|summary| num1|
+-------+-----------------+
| count| 4|
| mean| 4.25|
| stddev|3.593976442141304|
| min| 1|
| max| 9|
+-------+-----------------+
This option isn't very useful. df.select("num1").describe().show()
would give the same result and is more consistent with the rest of the Spark API.
Let's turn out attention to summary
, a better designed method that provides more useful options.
summary
Suppose you have the same starting DataFrame from before.
+----+-------+
|num1|letters|
+----+-------+
| 1| aa|
| 2| aa|
| 9| bb|
| 5| cc|
+----+-------+
Calculate the summary statistics for all columns in the DataFrame.
df.summary().show()
+-------+-----------------+-------+
|summary| num1|letters|
+-------+-----------------+-------+
| count| 4| 4|
| mean| 4.25| null|
| stddev|3.593976442141304| null|
| min| 1| aa|
| 25%| 1| null|
| 50%| 2| null|
| 75%| 5| null|
| max| 9| cc|
+-------+-----------------+-------+
Let's customize the output to return the count, 33rd percentile, 50th percentile, and 66th percentile.
df.summary("count", "33%", "50%", "66%").show()
+-------+----+-------+
|summary|num1|letters|
+-------+----+-------+
| count| 4| 4|
| 33%| 2| null|
| 50%| 2| null|
| 66%| 5| null|
+-------+----+-------+
Limit the custom summary to the num1
column cause it doesn't make sense to compute percentiles for string columns.
df.select("num1").summary("count", "33%", "50%", "66%").show()
+-------+----+
|summary|num1|
+-------+----+
| count| 4|
| 33%| 2|
| 50%| 2|
| 66%| 5|
+-------+----+
I worked with the Spark core team to add some additional options to this method, so as of Spark 3.2, you'll also be able to compute the exact and approximate count distinct.
Here's how to get the exact count and distinct count for each column:
df.summary("count", "count_distinct").show()
Here's how to get the approximate count distinct, which will run faster:
df.summary("count", "approx_count_distinct").show()
agg
We can use agg
to manually compute the summary statistics for columns in the DataFrame. Here's how to calculate the distinct count for each column in the DataFrame.
df.agg(countDistinct("num1"), countDistinct("letters")).show()
+-----------+--------------+
|count(num1)|count(letters)|
+-----------+--------------+
| 4| 3|
+-----------+--------------+
Here's how to calculate the distinct count and the max for each column in the DataFrame:
val counts = df.agg(
lit("countDistinct").as("colName"),
countDistinct("num1").as("num1"),
countDistinct("letters").as("letters"))
val maxes = df.agg(
lit("max").as("colName"),
max("num1").as("num1"),
max("letters").as("letters"))
counts.union(maxes).show()
+-------------+----+-------+
| colName|num1|letters|
+-------------+----+-------+
|countDistinct| 4| 3|
| max| 9| cc|
+-------------+----+-------+
The code gets verbose quick. summary
is great cause it prevents you from writing a lot of code.
Conclusion
summary
is great for high level exploratory data analysis.
For more detailed exploratory data analysis, see the deequ library.
Ping me if you're interested and I'll add an extensible version of summary
to spark-daria. We should have a dariaSummary
method that can be invoked like this df.dariaSummary(countDistinct, max)
and automatically generate a custom summary, without forcing the user to write tons of code.