Spark Datasets: Advantages and Limitations
Datasets are available to Spark Scala/Java users and offer more type safety than DataFrames.
Python and R infer types at runtime, so those APIs can't support Datasets.
This post demonstrates how to create Datasets and describes the advantages of this data structure.
toDS
Create a City case class, instantiate some objects, and then build a Dataset:
import spark.implicits._ // provides toDS / toDF (already in scope in the spark-shell)

case class City(englishName: String, continent: String)

val cities = Seq(
  City("beijing", "asia"),
  City("new york", "north america"),
  City("paris", "europe")
).toDS()
cities.show()
+-----------+-------------+
|englishName|    continent|
+-----------+-------------+
|    beijing|         asia|
|   new york|north america|
|      paris|       europe|
+-----------+-------------+
The cities Dataset is of type org.apache.spark.sql.Dataset[City].
createDataset
The cities Dataset can also be created with the createDataset method:
case class City(englishName: String, continent: String)

val cities2 = spark.createDataset(
  Seq(
    City("beijing", "asia"),
    City("new york", "north america"),
    City("paris", "europe")
  )
)
cities2 is also of type org.apache.spark.sql.Dataset[City].
Converting from DataFrame to Dataset
Let's create a DataFrame of trees and then convert it to a Dataset. Start by creating the DataFrame.
val treesDF = Seq(
  ("Oak", "deciduous"),
  ("Hemlock", "evergreen"),
  ("Apple", "angiosperms")
).toDF("tree_name", "tree_type")
treesDF.show()
+---------+-----------+
|tree_name|  tree_type|
+---------+-----------+
|      Oak|  deciduous|
|  Hemlock|  evergreen|
|    Apple|angiosperms|
+---------+-----------+
treesDF is an org.apache.spark.sql.DataFrame.
Define a case class and use as to convert the DataFrame to a Dataset.
case class Tree(tree_name: String, tree_type: String)
val treesDS = treesDF.as[Tree]
treesDS is an org.apache.spark.sql.Dataset[Tree].
DataFrame is an alias for Dataset[Row]
DataFrame is defined as a Dataset[Row] in the Spark codebase with this line: type DataFrame = Dataset[Row].
org.apache.spark.sql.Row is a generic object that can be instantiated with arguments of any type.
import org.apache.spark.sql.Row
val oneRow = Row("hi", 34)
val anotherRow = Row(34.2, "cool", 4)
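Because DataFrame is just an alias, a DataFrame can be assigned to a Dataset[Row] reference without any conversion. Here's a minimal check that reuses the treesDF defined above:

import org.apache.spark.sql.{Dataset, Row}

// DataFrame and Dataset[Row] are the same type, so this assignment compiles with no conversion.
val treesAsDataset: Dataset[Row] = treesDF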
Case classes, on the other hand, cannot be instantiated with arbitrary arguments. This code will error out:
case class Furniture(furniture_type: String, color: String)
Furniture("bed", 33)
Here's the error message:
error: type mismatch;
 found   : Int(33)
 required: String
Furniture("bed", 33)
                 ^
Scala throws a compile-time error when you try to instantiate an object with the wrong type.
Your text editor will complain about this code, so you don't need to wait until runtime to discover the error.
Runtime vs compile-time errors
Here's some invalid code to create a DataFrame that'll error out at runtime:
val shoesDF = Seq(
  ("nike", "black"),
  ("puma", 42)
).toDF("brand", "color")
Here's the runtime error: java.lang.ClassNotFoundException: scala.Any.
Note that the runtime error is not descriptive, so the bug is hard to trace. The runtime error isn't caught by your text editor either.
Let's write some similarly invalid code to create a Dataset.
case class Shoe(brand: String, color: String)

val shoesDS = Seq(
  Shoe("nike", "black"),
  Shoe("puma", 42)
).toDS()
The Dataset API gives a much better error message:
error: type mismatch;
 found   : Int(42)
 required: String
Shoe("puma", 42)
             ^
This is a compile-time error, so it'll be caught by your text editor.
Advantages of Datasets
Datasets catch some bugs at compile time that DataFrames don't surface until runtime.
Runtime bugs can be a real nuisance for big data jobs.
You don't want to run a job for 4 hours, only to have it error out with a silly runtime bug. It's better to catch bugs in your text editor, before they become production job errors.
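For example, field access on a Dataset is checked by the compiler, whereas DataFrame column references are just strings that get resolved at runtime. Here's a small sketch, reusing the cities Dataset from above:

// Field access on a Dataset[City] is checked at compile time.
val upperNames = cities.map(city => city.englishName.toUpperCase)

// A misspelled field won't compile:
// cities.map(city => city.englishNme) // error: value englishNme is not a member of City

// The equivalent DataFrame code compiles fine and only fails at runtime with an AnalysisException:
// cities.toDF().select($"englishNme")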
Disadvantages of Datasets
Spark automatically converts Datasets to DataFrames when performing operations like adding columns.
Adding columns is a common operation. You can go through the effort of defining a case class to build a Dataset, but all that type safety is lost with a simple withColumn operation.
Here's an example:
case class Sport(name: String, uses_ball: Boolean)

val sportsDS = Seq(
  Sport("basketball", true),
  Sport("polo", true),
  Sport("hockey", false)
).toDS()
sportsDS is of type org.apache.spark.sql.Dataset[Sport].
Append a short_name column to the Dataset and view the results.
import org.apache.spark.sql.functions._
val res = sportsDS.withColumn("short_name", substring($"name", 1, 3))
res.show()
+----------+---------+----------+
|      name|uses_ball|short_name|
+----------+---------+----------+
|basketball|     true|       bas|
|      polo|     true|       pol|
|    hockey|    false|       hoc|
+----------+---------+----------+
res is of type org.apache.spark.sql.DataFrame.
We'd need to define another case class and convert res back to a Dataset if we'd like to get the type safety benefits back.
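Here's a sketch of what that looks like, with a hypothetical SportWithShortName case class that mirrors the columns of res:

// Hypothetical case class matching the schema of res.
case class SportWithShortName(name: String, uses_ball: Boolean, short_name: String)

// Convert the DataFrame back to a typed Dataset.
val sportsWithShortNameDS = res.as[SportWithShortName]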
Not all operations convert Datasets to DataFrames. For example, filtering does not convert Datasets to DataFrames:
val nonBallSports = sportsDS.where($"uses_ball" === false)
nonBallSports is still of type org.apache.spark.sql.Dataset[Sport].
Typed Datasets
The frameless library offers typed Datasets that are more type safe than Spark Datasets, but they come with additional limitations.
Lots of Spark workflows operate on wide tables with transformations that append tens or hundreds of columns. Creating a new case class whenever a column is added isn't practical for these workflows.
Why PySpark doesn't have Datasets
We've demonstrated that Scala will throw compile-time errors when case classes are instantiated with invalid arguments.
Python is not compile-time type safe, so it throws runtime exceptions when classes are instantiated with invalid arguments. The Dataset API cannot be added to PySpark because of this Python language limitation.
Spark Datasets aren't so type safe
Some nonsensical operations are caught at runtime
Create a Dataset with an integer column and try to add four months to the integer.
case class Cat(name: String, favorite_number: Int)

val catsDS = Seq(
  Cat("fluffy", 45)
).toDS()
catsDS.withColumn("meaningless", add_months($"favorite_number", 4)).show()
Here's the error message: org.apache.spark.sql.AnalysisException: cannot resolve 'add_months(`favorite_number`, 4)' due to data type mismatch: argument 1 requires date type, however, '`favorite_number`' is of int type.;;
AnalysisExceptions are thrown at runtime, so this isn't the compile-time error you'd expect from a type-safe API.
Other nonsensical operations return null
Let's run the date_trunc function on a StringType column and observe the result.
catsDS.withColumn("meaningless", date_trunc("name", lit("cat"))).show()
+------+---------------+-----------+
|  name|favorite_number|meaningless|
+------+---------------+-----------+
|fluffy|             45|       null|
+------+---------------+-----------+
Some Spark functions just return null when the operation is meaningless. Neither "name" nor lit("cat") is a meaningful input for date_trunc, so this operation will always return null.
Some operations return meaningless results
Let's create a Dataset with a date column and then reverse the date:
import java.sql.Date

case class Birth(hospitalName: String, birthDate: Date)

val birthsDS = Seq(
  Birth("westchester", Date.valueOf("2014-01-15"))
).toDS()
birthsDS.withColumn("meaningless", reverse($"birthDate")).show()
+------------+----------+-----------+
|hospitalName| birthDate|meaningless|
+------------+----------+-----------+
| westchester|2014-01-15| 51-10-4102|
+------------+----------+-----------+
At the very least, we'd expect this code to error out at runtime with an org.apache.spark.sql.AnalysisException.
A type-safe implementation would throw a compile-time error when the reverse function is passed a DateType value.
Conclusion
Spark Datasets offer more type safety than DataFrames, but they're hard to stick with. Spark will automatically convert your Datasets to DataFrames when you perform common operations, like adding a column.
Even when you're using Datasets, you don't get much type safety. The org.apache.spark.sql.Column objects don't have type information, so they're akin to Scala Any values.
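For example, every column pulled from the typed sportsDS defined earlier has the same static type, no matter what the column actually holds:

import org.apache.spark.sql.Column

// Both values are plain Columns; the String vs Boolean distinction isn't tracked in the type.
val nameCol: Column = sportsDS("name")
val usesBallCol: Column = sportsDS("uses_ball")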
Some org.apache.spark.sql.functions throw an AnalysisException when they're supplied with meaningless input, but others just return junk responses. Diligently test your Spark code to catch bugs before they reach production.