Just Enough Scala for Spark Programmers

Spark programmers only need to know a small subset of the Scala language to be productive.

Scala has a reputation for being a difficult language to learn, and that scares some developers away from Spark. This guide covers the Scala language features that Spark programmers need.

Spark programmers need to know how to write Scala functions, encapsulate functions in objects, and namespace objects in packages. It’s not a lot to learn – I promise!

Scala function basics

This section describes how to write vanilla Scala functions and Spark SQL functions.

Here is a Scala function that adds two numbers:

def sum(num1: Int, num2: Int): Int = {
  num1 + num2
}

We can invoke this function as follows:

sum(10, 5) // returns 15

Let’s write a Spark SQL function that adds two numbers together:

import org.apache.spark.sql.Column

def sumColumns(num1: Column, num2: Column): Column = {
  num1 + num2
}

Let’s create a DataFrame in the Spark shell and run the sumColumns() function.

val numbersDF = Seq(
  (10, 4),
  (3, 4),
  (8, 4)
).toDF("some_num", "another_num")

numbersDF
  .withColumn(
    "the_sum",
    sumColumns(col("some_num"), col("another_num"))
  )
  .show()
+--------+-----------+-------+
|some_num|another_num|the_sum|
+--------+-----------+-------+
|      10|          4|     14|
|       3|          4|      7|
|       8|          4|     12|
+--------+-----------+-------+

Spark SQL functions take org.apache.spark.sql.Column arguments whereas vanilla Scala functions take native Scala data type arguments like Int or String.
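
If you need to mix a native Scala value into a Column expression, wrap it with lit() from org.apache.spark.sql.functions. Here's a minimal sketch (the addTen() function and the plus_ten column name are just for illustration):

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, lit}

// illustrative example: lit(10) wraps a native Int in a Column
def addTen(num: Column): Column = {
  num + lit(10)
}

numbersDF.withColumn("plus_ten", addTen(col("some_num")))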

Currying functions

Scala allows for functions to take multiple parameter lists, which is formally known as currying. This section explains how to use currying with vanilla Scala functions and why currying is important for Spark programmers.

def myConcat(word1: String)(word2: String): String = {
  word1 + word2
}

Here’s how to invoke the myConcat() function.

myConcat("beautiful ")("picture") // returns "beautiful picture"

myConcat() is invoked with two sets of arguments.
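
Because the parameter lists are separate, the first one can be applied on its own to build a new function. Here's a quick sketch (concatBeautiful is just an illustrative name):

// partially apply the first parameter list to get a String => String function
val concatBeautiful = myConcat("beautiful ") _

concatBeautiful("picture") // returns "beautiful picture"

This partial application trick is what makes currying useful with Spark's transform() method, as the next example shows.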

Spark has a Dataset#transform() method that makes it easy to chain DataFrame transformations.

Here’s an example of a DataFrame transformation function:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

def withCat(name: String)(df: DataFrame): DataFrame = {
  df.withColumn("cat", lit(s"$name meow"))
}

DataFrame transformation functions can take an arbitrary number of arguments in the first parameter list and must take a single DataFrame argument in the second parameter list.

Let’s create a DataFrame in the Spark shell and run the withCat() function.

val stuffDF = Seq(
  "chair",
  "hair",
  "bear"
).toDF("thing")

stuffDF
  .transform(withCat("darla"))
  .show()
+-----+----------+
|thing|       cat|
+-----+----------+
|chair|darla meow|
| hair|darla meow|
| bear|darla meow|
+-----+----------+

Most Spark code can be organized as Spark SQL functions or as custom DataFrame transformations.
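
Custom transformations shine when they're chained. Here's a rough sketch that defines a second transformation and chains it with withCat() (the withDog() name is just for illustration):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.lit

// illustrative second transformation, same shape as withCat()
def withDog(name: String)(df: DataFrame): DataFrame = {
  df.withColumn("dog", lit(s"$name woof"))
}

stuffDF
  .transform(withCat("darla"))
  .transform(withDog("rex"))
  .show()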

object

Spark functions can be stored in objects.

Let’s create a SomethingWeird object that defines a vanilla Scala function, a Spark SQL function, and a custom DataFrame transformation.

import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.functions._

object SomethingWeird {

  // vanilla Scala function
  def hi(): String = {
    "welcome to planet earth"
  }

  // Spark SQL function
  def trimUpper(col: Column): Column = {
    trim(upper(col))
  }

  // custom DataFrame transformation
  def withScary()(df: DataFrame): DataFrame = {
    df.withColumn("scary", lit("boo!"))
  }

}

Let’s create a DataFrame in the Spark shell and run the trimUpper() and withScary() functions.

val wordsDF = Seq(
  "niCE",
  "  CaR",
  "BAR  "
).toDF("word")

wordsDF
  .withColumn("trim_upper_word", SomethingWeird.trimUpper(col("word")))
  .transform(SomethingWeird.withScary())
  .show()
+-----+---------------+-----+
| word|trim_upper_word|scary|
+-----+---------------+-----+
| niCE|           NICE| boo!|
|  CaR|            CAR| boo!|
|BAR  |            BAR| boo!|
+-----+---------------+-----+

Objects are useful for grouping related Spark functions.

trait

Traits can be mixed into objects to add commonly used methods or values. We can define a SparkSessionWrapper trait that defines a spark variable to give objects easy access to the SparkSession object.

import org.apache.spark.sql.SparkSession

trait SparkSessionWrapper extends Serializable {

  lazy val spark: SparkSession = {
    SparkSession.builder().master("local").appName("spark session").getOrCreate()
  }

}

The Serializable trait is mixed into SparkSessionWrapper so that objects extending the trait can be serialized and shipped to Spark executors without hitting Task not serializable errors.

Let’s create a SpecialDataLake object that mixes in the SparkSessionWrapper trait to provide easy access to a data lake.

import org.apache.spark.sql.DataFrame

object SpecialDataLake extends SparkSessionWrapper {

  def dataLake(): DataFrame = {
    spark.read.parquet("some_secret_s3_path")
  }

}
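
Any object that extends SparkSessionWrapper gets the spark value for free. Here's a rough sketch of another object that uses it to build a DataFrame (the ExampleDataMaker name is just for illustration):

import org.apache.spark.sql.DataFrame

object ExampleDataMaker extends SparkSessionWrapper {

  def someNumbersDF(): DataFrame = {
    // spark.implicits._ provides the toDF() method
    import spark.implicits._
    Seq(1, 2, 3).toDF("number")
  }

}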

package

Packages are used to namespace Scala code. Per the Databricks Scala style guide, packages should follow Java naming conventions.

For example, the Databricks spark-redshift project uses the com.databricks.spark.redshift namespace.

The Spark project uses the org.apache.spark namespace and spark-daria uses the com.github.mrpowers.spark.daria namespace.

Here's an example of code that's defined in a package in spark-daria:

package com.github.mrpowers.spark.daria.sql

import org.apache.spark.sql.Column
import org.apache.spark.sql.functions._

object functions {

  def singleSpace(col: Column): Column = {
    trim(regexp_replace(col, " +", " "))
  }

}

The package structure should mimic the file structure of the project.
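
Code in other projects or notebooks can then pull in the packaged functions with a regular import. Here's a minimal usage sketch (df and the column names are just for illustration):

import com.github.mrpowers.spark.daria.sql.functions._
import org.apache.spark.sql.functions.col

// collapse repeated whitespace in an illustrative messy_text column
df.withColumn("clean_text", singleSpace(col("messy_text")))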

Implicit classes

Implicit classes can be used to extend Spark core classes with additional methods.

Let’s add a lower() method to the Column class that converts all the strings in a column to lower case.

package com.github.mrpowers.spark.daria.sql

import org.apache.spark.sql.Column

object FunctionsAsColumnExt {

  implicit class ColumnMethods(col: Column) {

    def lower(): Column = {
      org.apache.spark.sql.functions.lower(col)
    }

  }

}

After running import com.github.mrpowers.spark.daria.sql.FunctionsAsColumnExt._, you can call the lower() method directly on Column objects.

col("some_string").lower()

Implicit classes should be avoided in general. I only monkey patch core classes in the spark-daria project. Feel free to send pull requests if you have any good ideas for other extensions.

Next steps

There are a couple of other Scala features that are useful when writing Spark code, but this blog post covers 90%+ of common use cases.

You don’t need to understand functional programming or advanced Scala language features to be a productive Spark programmer.

In fact, staying away from UDFs and native Scala code is a best practice.
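
Here's a rough sketch of why: the UDF version below is a black box to the Catalyst optimizer and throws on null input unless you guard against it, while the native version is built from functions Spark can optimize and that already handle nulls (trimUpperUdf is just an illustrative name):

import org.apache.spark.sql.functions.{col, trim, udf, upper}

// UDF version: opaque to the optimizer, throws a NullPointerException on null strings
val trimUpperUdf = udf((s: String) => s.trim.toUpperCase)

// native version: optimizer-friendly and null-safe
wordsDF.withColumn("clean_word", trim(upper(col("word"))))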

Focus on mastering the native Spark API and you’ll be a productive big data engineer in no time!
