Working with Spark MapType Columns
Spark DataFrame columns support maps, which are great for key/value pairs of arbitrary length.
This blog post describes how to create MapType columns, demonstrates built-in functions to manipulate MapType columns, and explains when to use maps in your analyses.
Make sure to read Writing Beautiful Spark Code for a detailed overview of how to use MapType columns in production applications.
Scala maps
Let's begin with a little refresher on Scala maps.
Create a Scala map that connects some English and Spanish words.
val wordMapping = Map("one" -> "uno", "dog" -> "perro")
Fetch the value associated with the dog key:
wordMapping("dog") // "perro"
Creating MapType columns
Let's create a DataFrame with a MapType column.
val singersDF = spark.createDF(
  List(
    ("sublime", Map(
      "good_song" -> "santeria",
      "bad_song" -> "doesn't exist")
    ),
    ("prince_royce", Map(
      "good_song" -> "darte un beso",
      "bad_song" -> "back it up")
    )
  ), List(
    ("name", StringType, true),
    ("songs", MapType(StringType, StringType, true), true)
  )
)
singersDF.show(false)
+------------+----------------------------------------------------+
|name |songs |
+------------+----------------------------------------------------+
|sublime |[good_song -> santeria, bad_song -> doesn't exist] |
|prince_royce|[good_song -> darte un beso, bad_song -> back it up]|
+------------+----------------------------------------------------+
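The createDF helper used throughout this post comes from the spark-daria library, not vanilla Spark. If you'd rather avoid the extra dependency, a roughly equivalent DataFrame can be built with toDF (a sketch that assumes the session's implicits are in scope):
import spark.implicits._

val singersDF = Seq(
  ("sublime", Map("good_song" -> "santeria", "bad_song" -> "doesn't exist")),
  ("prince_royce", Map("good_song" -> "darte un beso", "bad_song" -> "back it up"))
).toDF("name", "songs")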
Let's examine the DataFrame schema and verify that the songs column has a MapType:
singersDF.printSchema()
root
|-- name: string (nullable = true)
|-- songs: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
We can see that songs is a MapType column.
Let's explore some built-in Spark methods that make it easy to work with MapType columns.
Fetching values from maps with element_at()
Let's use the singersDF DataFrame and append song_to_love as a column.
singersDF
  .withColumn("song_to_love", element_at(col("songs"), "good_song"))
  .show(false)
+------------+----------------------------------------------------+-------------+
|name |songs |song_to_love |
+------------+----------------------------------------------------+-------------+
|sublime |[good_song -> santeria, bad_song -> doesn't exist] |santeria |
|prince_royce|[good_song -> darte un beso, bad_song -> back it up]|darte un beso|
+------------+----------------------------------------------------+-------------+
The element_at() function fetches a value from a MapType column.
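With default settings, element_at() returns null when the key isn't present in the map, and you can also look values up with getItem() on the column. A quick sketch:
singersDF
  .withColumn("missing_song", element_at(col("songs"), "no_such_key")) // null, key not in the map
  .withColumn("same_lookup", col("songs").getItem("good_song"))        // equivalent lookup via getItem
  .show(false)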
Appending MapType columns
We can use the map() method defined in org.apache.spark.sql.functions to append a MapType column to a DataFrame.
val countriesDF = spark.createDF(
  List(
    ("costa_rica", "sloth"),
    ("nepal", "red_panda")
  ), List(
    ("country_name", StringType, true),
    ("cute_animal", StringType, true)
  )
).withColumn(
  "some_map",
  map(col("country_name"), col("cute_animal"))
)
countriesDF.show(false)
+------------+-----------+---------------------+
|country_name|cute_animal|some_map |
+------------+-----------+---------------------+
|costa_rica |sloth |[costa_rica -> sloth]|
|nepal |red_panda |[nepal -> red_panda] |
+------------+-----------+---------------------+
Let's verify that some_map is a MapType column:
countriesDF.printSchema()
root
|-- country_name: string (nullable = true)
|-- cute_animal: string (nullable = true)
|-- some_map: map (nullable = false)
| |-- key: string
| |-- value: string (valueContainsNull = true)
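map() takes alternating key and value expressions, so you can also mix literal keys built with lit() and column values to create a map with fixed keys. A small sketch along those lines:
countriesDF
  .withColumn(
    "labeled_map",
    map(lit("country"), col("country_name"), lit("animal"), col("cute_animal"))
  )
  .show(false)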
Creating MapType columns from two ArrayType columns
We can create a MapType column from two ArrayType columns.
val df = spark.createDF(
  List(
    (Array("a", "b"), Array(1, 2)),
    (Array("x", "y"), Array(33, 44))
  ), List(
    ("letters", ArrayType(StringType, true), true),
    ("numbers", ArrayType(IntegerType, true), true)
  )
).withColumn(
  "strange_map",
  map_from_arrays(col("letters"), col("numbers"))
)
df.show(false)
+-------+--------+------------------+
|letters|numbers |strange_map |
+-------+--------+------------------+
|[a, b] |[1, 2] |[a -> 1, b -> 2] |
|[x, y] |[33, 44]|[x -> 33, y -> 44]|
+-------+--------+------------------+
Let's take a look at the df schema and verify that strange_map is a MapType column:
df.printSchema()
root
|-- letters: array (nullable = true)
| |-- element: string (containsNull = true)
|-- numbers: array (nullable = true)
| |-- element: integer (containsNull = true)
|-- strange_map: map (nullable = true)
| |-- key: string
| |-- value: integer (valueContainsNull = true)
The Spark way of converting two arrays to a map is different from the "regular Scala" way of converting two arrays to a map.
Converting Arrays to Maps with Scala
Here's how you'd convert two collections to a map with Scala.
val list1 = List("a", "b")
val list2 = List(1, 2)
list1.zip(list2).toMap // Map(a -> 1, b -> 2)
We could wrap this code in a user-defined function (UDF) and define our own map_from_arrays function if we wanted.
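A rough sketch of what such a hypothetical UDF might look like (the mapFromArraysUdf name is just for illustration):
// A hypothetical UDF that mimics map_from_arrays by zipping the two arrays
val mapFromArraysUdf = udf { (keys: Seq[String], values: Seq[Int]) =>
  keys.zip(values).toMap
}

df.withColumn("strange_map", mapFromArraysUdf(col("letters"), col("numbers")))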
In general, it's best to rely on the standard Spark library instead of defining our own UDFs.
The key takeaway is that the Spark way of solving a problem is often different from the Scala way. Read the API docs and always try to solve your problems the Spark way.
Merging maps with map_concat()
map_concat() can be used to combine multiple MapType columns into a single MapType column.
val df = spark.createDF(
  List(
    (Map("a" -> "aaa", "b" -> "bbb"), Map("c" -> "ccc", "d" -> "ddd"))
  ), List(
    ("some_data", MapType(StringType, StringType, true), true),
    ("more_data", MapType(StringType, StringType, true), true)
  )
)

df
  .withColumn("all_data", map_concat(col("some_data"), col("more_data")))
  .show(false)
+--------------------+--------------------+----------------------------------------+
|some_data |more_data |all_data |
+--------------------+--------------------+----------------------------------------+
|[a -> aaa, b -> bbb]|[c -> ccc, d -> ddd]|[a -> aaa, b -> bbb, c -> ccc, d -> ddd]|
+--------------------+--------------------+----------------------------------------+
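map_concat() isn't limited to two inputs; it accepts any number of map columns, so you can fold several maps together in one pass. A sketch that also tacks on a small literal map built with map():
df
  .withColumn(
    "all_data",
    map_concat(col("some_data"), col("more_data"), map(lit("e"), lit("eee")))
  )
  .show(false)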
Using StructType columns instead of MapType columns
Let's create a DataFrame that stores information about athletes.
val athletesDF = spark.createDF(
  List(
    ("lebron",
      Map(
        "height" -> "6.67",
        "units" -> "feet"
      )
    ),
    ("messi",
      Map(
        "height" -> "1.7",
        "units" -> "meters"
      )
    )
  ), List(
    ("name", StringType, true),
    ("stature", MapType(StringType, StringType, true), true)
  )
)
athletesDF.show(false)
+------+--------------------------------+
|name |stature |
+------+--------------------------------+
|lebron|[height -> 6.67, units -> feet] |
|messi |[height -> 1.7, units -> meters]|
+------+--------------------------------+
athletesDF.printSchema()
root
|-- name: string (nullable = true)
|-- stature: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
stature is a MapType column, but we can also store stature as a StructType column.
val data = Seq(
  Row("lebron", Row("6.67", "feet")),
  Row("messi", Row("1.7", "meters"))
)

val schema = StructType(
  List(
    StructField("player_name", StringType, true),
    StructField(
      "stature",
      StructType(
        List(
          StructField("height", StringType, true),
          StructField("unit", StringType, true)
        )
      ),
      true
    )
  )
)

val athletesDF = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  schema
)
athletesDF.show(false)
+-----------+-------------+
|player_name|stature |
+-----------+-------------+
|lebron |[6.67, feet] |
|messi |[1.7, meters]|
+-----------+-------------+
athletesDF.printSchema()
root
|-- player_name: string (nullable = true)
|-- stature: struct (nullable = true)
| |-- height: string (nullable = true)
| |-- unit: string (nullable = true)
Sometimes both StructType and MapType columns can solve the same problem and you can choose between the two.
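One practical difference: struct fields are part of the schema and can be referenced with dot notation, whereas map values are looked up by key at runtime. A quick sketch (the commented line refers to the earlier MapType version of athletesDF):
// StructType version: height is a named field in the schema
athletesDF.select(col("player_name"), col("stature.height")).show(false)

// MapType version: the key is only resolved at runtime
// athletesDF.select(col("name"), element_at(col("stature"), "height")).show(false)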
Writing MapType columns to disk
The CSV file format cannot handle MapType columns.
This code will error out.
val outputPath = new java.io.File("./tmp/csv_with_map/").getCanonicalPath
spark.createDF(
  List(
    (Map("a" -> "aaa", "b" -> "bbb"))
  ), List(
    ("some_data", MapType(StringType, StringType, true), true)
  )
).write.csv(outputPath)
Here's the error message:
writing to disk
- cannot write maps to disk with the CSV format *** FAILED ***
org.apache.spark.sql.AnalysisException: CSV data source does not support map<string,string> data type.;
at org.apache.spark.sql.execution.datasources.DataSourceUtils$$anonfun$verifySchema$1.apply(DataSourceUtils.scala:69)
at org.apache.spark.sql.execution.datasources.DataSourceUtils$$anonfun$verifySchema$1.apply(DataSourceUtils.scala:67)
at scala.collection.Iterator$class.foreach(Iterator.scala:891)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1334)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
at org.apache.spark.sql.execution.datasources.DataSourceUtils$.verifySchema(DataSourceUtils.scala:67)
at org.apache.spark.sql.execution.datasources.DataSourceUtils$.verifyWriteSchema(DataSourceUtils.scala:34)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:100)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
MapType columns can be written out with the Parquet file format. This code runs just fine:
val outputPath = new java.io.File("./tmp/parquet_with_map/").getCanonicalPath
spark.createDF(
  List(
    (Map("a" -> "aaa", "b" -> "bbb"))
  ), List(
    ("some_data", MapType(StringType, StringType, true), true)
  )
).write.parquet(outputPath)
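If you're stuck with the CSV format, one possible workaround (a sketch, with a hypothetical output path) is to serialize the map to a JSON string with to_json() before writing:
val csvOutputPath = new java.io.File("./tmp/csv_with_json_map/").getCanonicalPath

spark.createDF(
  List(
    (Map("a" -> "aaa", "b" -> "bbb"))
  ), List(
    ("some_data", MapType(StringType, StringType, true), true)
  )
)
  .withColumn("some_data_json", to_json(col("some_data"))) // e.g. {"a":"aaa","b":"bbb"}
  .drop("some_data")
  .write
  .csv(csvOutputPath)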
Conclusion
MapType columns are a great way to store key/value pairs of arbitrary length in a DataFrame column.
Spark 2.4 added a lot of native functions that make it easier to work with MapType columns. Prior to Spark 2.4, developers were overly reliant on UDFs for manipulating MapType columns.
StructType columns can often be used instead of a MapType column. Study both of these column types closely so you can understand the tradeoffs and intelligently select the best column type for your analysis.