
Introduction to SBT for Spark Programmers

SBT is an interactive build tool that is used to run tests and package your projects as JAR files.

SBT lets you create a project in a text editor and package it, so it can be run in a cloud cluster computing environment (like Databricks).

SBT has a comprehensive Getting started guide, but let's be honest - who wants to read a book on a build tool?

This guide teaches Spark programmers what they need to know about SBT and skips all the other details!

Sample code

I recommend cloning the spark-daria project on your local machine, so you can run the SBT commands as you read this post.

Running SBT commands

SBT commands can be run from the command line or from the SBT shell.

For example, here's how to run the test suite from Bash: sbt test.

Alternatively, we can open the SBT shell by running sbt in Bash and then simply run test.

Run exit to leave the SBT shell.
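Putting that together, a quick interactive session looks something like this:

$ sbt
> test
> exit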

build.sbt

The SBT build definition is specified in the build.sbt file.

This is where you'll add code to specify your dependencies, the Scala version, how to build your JAR files, how to manage memory, etc.

One of the few things that's not specified in the build.sbt file is the SBT version itself. The SBT version is specified in the project/build.properties file, for example:

sbt.version=1.2.8
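For context, a minimal build.sbt for a Spark project looks roughly like this (the project name and version numbers here are just placeholders):

name := "spark-daria"
version := "0.28.0"
scalaVersion := "2.11.12"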

libraryDependencies

You can specify libraryDependencies in your build.sbt file to fetch libraries from Maven or JitPack.

Here's how to add Spark SQL and Spark ML to a project:

libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided"
libraryDependencies += "org.apache.spark" %% "spark-mllib" % "2.4.0" % "provided"

SBT provides a shortcut syntax if we'd like to clean up our build.sbt file a bit.

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided",
  "org.apache.spark" %% "spark-mllib" % "2.4.0" % "provided"
)
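Libraries on Maven Central are found automatically. If a dependency is hosted on JitPack, you'll also need to point SBT at that repository with a resolver, something like this:

resolvers += "jitpack" at "https://jitpack.io"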

"provided" dependencies are already included in the environment where we run our code.

Here's an example of some test dependencies that are only used when we run our test suite:

libraryDependencies += "com.lihaoyi" %% "utest" % "0.6.3" % "test"
libraryDependencies += "MrPowers" % "spark-fast-tests" % "0.17.1-s_2.11" % "test"

Read this post on Building JAR files for a more detailed discussion on provided and test dependencies.

sbt test

You can run your test suite with the sbt test command.

You can set environment variables in your test suite by adding this line to your build.sbt file: envVars in Test := Map("PROJECT_ENV" -> "test"). Read the blog post on Environment Specific Config in Spark Projects for more details about this design pattern.
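As a rough sketch of that pattern, your application code can read the environment variable with sys.env and fall back to a default when it's not set (the Config object and the "production" default here are just illustrative):

object Config {
  // "test" when running sbt test (per the envVars setting above), "production" otherwise
  val projectEnv: String = sys.env.getOrElse("PROJECT_ENV", "production")

  def isTest: Boolean = projectEnv == "test"
}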

You can run a single test file when using Scalatest with this command:

sbt "test:testOnly *LoginServiceSpec"

This command is easier to run from the SBT shell:

> testOnly *LoginServiceSpec

Here is how to run a single test file when using uTest:

> testOnly -- com.github.mrpowers.spark.daria.sql.DataFrameExtTest

Complicated SBT commands are generally easier to run from the SBT shell, so you don't need to think about proper quoting.

Read this Stackoverflow thread if you'd like to run a single test with Scalatest and this blog post if you'd like to run a single test with uTest.

sbt doc

The sbt doc command generates HTML documentation for your project.

You can open the documentation on your local machine with open target/scala-2.11/api/index.html after it's been generated.

You should diligently mark functions and objects as private if they're not part of the API. sbt doc won't generate documentation for private members.

Codebases are always easier to understand when the public API is clearly defined.
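For example, a public method with a Scaladoc comment will show up in the generated docs, while a private helper won't (hypothetical code, just to illustrate):

object TextUtils {

  /** Converts a CamelCase string to snake_case, e.g. "FunStuff" becomes "fun_stuff". */
  def toSnakeCase(s: String): String =
    camelBoundariesToUnderscores(s).toLowerCase

  // private members are excluded from the sbt doc output
  private def camelBoundariesToUnderscores(s: String): String =
    s.replaceAll("([A-Z])", "_$1").stripPrefix("_")
}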

For more information, read the spark-style-guide Documentation guidelines and the Documenting Spark Code with Scaladoc blog post.

sbt console

The sbt console command starts the Scala interpreter with easy access to all your project files.

Let's run sbt console in the spark-daria project and then invoke the StringHelpers.snakify() method.

scala> com.github.mrpowers.spark.daria.utils.StringHelpers.snakify("FunStuff") // fun_stuff

Running sbt console is similar to running the Spark shell with the spark-daria JAR file attached. Here's how to start the Spark shell with that JAR file attached:

./spark-shell --jars ~/Documents/code/my_apps/spark-daria/target/scala-2.11/spark-daria-assembly-0.28.0.jar

The same code from before also works in the Spark shell:

scala> com.github.mrpowers.spark.daria.utils.StringHelpers.snakify("FunStuff") // fun_stuff

This blog post provides more details on how to use the Spark shell.

The sbt console is sometimes useful for playing around with code, but the test suite is usually better. Don't "test" your code in the console and neglect writing real tests.

sbt package / sbt assembly

sbt package builds a thin JAR file (only includes the project files). For spark-daria, the sbt package command builds the target/scala-2.11/spark-daria-0.28.0.jar file.

sbt assembly builds a fat JAR file (includes all the project and dependency files). For spark-daria, the sbt assembly command builds the target/scala-2.11/spark-daria-assembly-0.28.0.jar file.
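Note that sbt assembly isn't built into SBT; it's provided by the sbt-assembly plugin, which is enabled in project/plugins.sbt with a line along these lines (the version shown is just an example):

addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.9")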

Read this blog post on Building Spark JAR files for a detailed discussion on how sbt package and sbt assembly differ. To further customize JAR files, read this blog post on shading dependencies.

You should be comfortable with developing Spark code in a text editor, packaging your project as a JAR file, and attaching your JAR file to a cloud cluster for production analyses.

sbt clean

The sbt clean command deletes all of the generated files in the target/ directory.

This command deletes the documentation generated by sbt doc and the JAR files generated by sbt package / sbt assembly.

It's good to run sbt clean frequently, so you don't accumulate a lot of legacy clutter in the target/ directory.

Next steps

SBT is a great build tool for Spark projects.

It lets you easily run tests, generate documentation, and package code as JAR files.

In a future post, we'll investigate how Mill can be used as a build tool for Spark projects.