Introduction to Spark DataFrames
Spark DataFrames are similar to tables in relational databases - they store data in columns and rows and support a variety of operations to manipulate the data.
Here's an example of a DataFrame that contains information about cities.
| city    | country   | population |
|---------|-----------|------------|
| Boston  | USA       | 0.67       |
| Dubai   | UAE       | 3.1        |
| Cordoba | Argentina | 1.39       |
This blog post will discuss creating DataFrames, defining schemas, adding columns, and filtering rows.
Creating DataFrames
You can import the Spark implicits and create a DataFrame with the toDF() method.
import spark.implicits._
val df = Seq(
("Boston", "USA", 0.67),
("Dubai", "UAE", 3.1),
("Cordoba", "Argentina", 1.39)
).toDF("city", "country", "population")
You can view the contents of a DataFrame with the show() method.
df.show()
+-------+---------+----------+
| city| country|population|
+-------+---------+----------+
| Boston| USA| 0.67|
| Dubai| UAE| 3.1|
|Cordoba|Argentina| 1.39|
+-------+---------+----------+
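Note that if you call toDF() with no arguments, Spark falls back to default column names (_1, _2, and so on), so it's usually worth naming the columns up front:
val unnamedDF = Seq(
  ("Boston", "USA", 0.67)
).toDF()
unnamedDF.show()
+------+---+----+
|    _1| _2|  _3|
+------+---+----+
|Boston|USA|0.67|
+------+---+----+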
Each DataFrame column has name, dataType, and nullable properties. The column can contain null values if the nullable property is set to true.
The printSchema() method provides an easily readable view of the DataFrame schema.
df.printSchema()
root
|-- city: string (nullable = true)
|-- country: string (nullable = true)
|-- population: double (nullable = false)
Adding columns
Columns can be added to a DataFrame with the withColumn() method.
Let's add an is_big_city column to the DataFrame that's true if the city has more than one million people.
import org.apache.spark.sql.functions.col
val df2 = df.withColumn("is_big_city", col("population") > 1)
df2.show()
+-------+---------+----------+-----------+
| city| country|population|is_big_city|
+-------+---------+----------+-----------+
| Boston| USA| 0.67| false|
| Dubai| UAE| 3.1| true|
|Cordoba|Argentina| 1.39| true|
+-------+---------+----------+-----------+
DataFrames are immutable, so withColumn() returns a new DataFrame and does not mutate the original. Let's confirm that df is still the same with df.show().
+-------+---------+----------+
| city| country|population|
+-------+---------+----------+
| Boston| USA| 0.67|
| Dubai| UAE| 3.1|
|Cordoba|Argentina| 1.39|
+-------+---------+----------+
df does not contain the is_big_city column, so we've confirmed that withColumn() did not mutate df.
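The new column doesn't have to be a simple comparison. withColumn() pairs nicely with Spark's when() and otherwise() functions for conditional logic. Here's a minimal sketch that derives a city_size label (a column name made up for this example) from the population:
import org.apache.spark.sql.functions.when

val df3 = df.withColumn(
  "city_size",
  when(col("population") > 1, "big").otherwise("small")
)
when() returns a Column, so it can be used anywhere a column expression is expected.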
Filtering rows
The filter() method returns a new DataFrame that contains only the rows matching a given predicate.
df.filter(col("population") > 1).show()
+-------+---------+----------+
| city| country|population|
+-------+---------+----------+
| Dubai| UAE| 3.1|
|Cordoba|Argentina| 1.39|
+-------+---------+----------+
It's a little hard to read code with multiple method calls on the same line, so let's break this code up across multiple lines.
df
.filter(col("population") > 1)
.show()
We can also assign the filtered DataFrame to a separate variable rather than chaining method calls.
val filteredDF = df.filter(col("population") > 1)
filteredDF.show()
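Filter predicates can also be combined with the && and || column operators (and === for equality). For example, this keeps the big cities outside the USA:
df
  .filter(col("population") > 1 && col("country") =!= "USA")
  .show()
=!= is Spark's not-equal column operator (older Spark versions use !==).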
More on schemas
As previously discussed, the DataFrame schema can be pretty printed to the console with the printSchema() method. The schema method returns a code representation of the DataFrame schema.
df.schema
StructType(
StructField(city, StringType, true),
StructField(country, StringType, true),
StructField(population, DoubleType, false)
)
Each column of a Spark DataFrame is modeled as a StructField object with name, dataType, and nullable properties. The entire DataFrame schema is modeled as a StructType, which is a collection of StructField objects.
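Since a StructType is just a collection of StructField objects, you can iterate over it to inspect column properties programmatically. A minimal sketch:
df.schema.foreach { field =>
  println(s"${field.name}: ${field.dataType} (nullable = ${field.nullable})")
}
city: StringType (nullable = true)
country: StringType (nullable = true)
population: DoubleType (nullable = false)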
Let's create a schema for a DataFrame that has first_name and age columns.
import org.apache.spark.sql.types._
val personSchema = StructType(
  Seq(
    StructField("first_name", StringType, true),
    StructField("age", DoubleType, true)
  )
)
Spark's programming interface makes it easy to define the exact schema you'd like for your DataFrames.
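One place a hand-built schema pays off is when reading files: passing it to the reader skips Spark's schema inference and guarantees the column types you expect. A sketch using the personSchema defined above and a hypothetical CSV file:
val peopleDF = spark
  .read
  .option("header", "true")
  .schema(personSchema)
  .csv("/path/to/people.csv") // hypothetical path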
Creating DataFrames with createDataFrame()
The toDF() method for creating Spark DataFrames is quick, but it's limited because it doesn't let you define your schema (it infers the schema for you). The createDataFrame() method lets you define your DataFrame schema.
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
val animalData = Seq(
Row(30, "bat"),
Row(2, "mouse"),
Row(25, "horse")
)
val animalSchema = List(
StructField("average_lifespan", IntegerType, true),
StructField("animal_type", StringType, true)
)
val animalDF = spark.createDataFrame(
spark.sparkContext.parallelize(animalData),
StructType(animalSchema)
)
animalDF.show()
+----------------+-----------+
|average_lifespan|animal_type|
+----------------+-----------+
| 30| bat|
| 2| mouse|
| 25| horse|
+----------------+-----------+
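As an aside, createDataFrame() also has an overload that accepts a java.util.List[Row] directly, so the parallelize() step is optional:
import scala.jdk.CollectionConverters._ // scala.collection.JavaConverters._ on Scala 2.12 and earlier

val animalDF2 = spark.createDataFrame(
  animalData.asJava,
  StructType(animalSchema)
)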
Read this blog post if you'd like more information on different approaches to create Spark DataFrames.
We can call animalDF.printSchema() to confirm that the schema was created as specified.
root
|-- average_lifespan: integer (nullable = true)
|-- animal_type: string (nullable = true)
Next Steps
DataFrames are the fundamental building blocks of Spark. All machine learning and streaming analyses are built on top of the DataFrame API. Make sure you master DataFrames before diving into more advanced parts of the Spark API.