Building Spark Projects with Mill
Mill is a SBT alternative that can be used to build Spark projects.
This post explains how to create a Spark project with Mill and why you might want to use it instead of SBT.
Project structure
Here's is the directory structure of the mill_spark_example
project:
mill_spark_example/
foo/
src/
Example.scala
test/
src/
ExampleTests.scala
out/
build.sc
The build.sc
file specifies the project dependencies, similar to the build.sbt
file for SBT projects.
Running a test
Let's add a simple DataFrame transformation to the foo/src/Example.scala
file:
package foo
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions._
import com.github.mrpowers.spark.daria.sql.functions.removeAllWhitespace
object Example {
def withGreeting()(df: DataFrame): DataFrame = {
df.withColumn("greeting", removeAllWhitespace(lit("hello YOU !")))
}
}
Now let's add a test in foo/test/src/ExampleTests.scala
:
package foo
import utest._
import org.apache.spark.sql.types._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import com.github.mrpowers.spark.fast.tests.DatasetComparer
import com.github.mrpowers.spark.daria.sql.SparkSessionExt._
object ExampleTests extends TestSuite with SparkSessionTestWrapper with DatasetComparer {
val tests = Tests {
import spark.implicits._
"withGreeting" - {
val sourceDF = spark.createDF(
List(
("jose"),
("li"),
("luisa")
), List(
("name", StringType, true)
)
)
val actualDF = sourceDF.transform(Example.withGreeting())
val expectedDF = Seq(
("jose", "helloYOU!"),
("li", "helloYOU!"),
("luisa", "helloYOU!")
).toDF("name", "greeting")
assertSmallDatasetEquality(actualDF, expectedDF, ignoreNullable = true)
}
}
}
The foo/src/Example.scala
needs access to Spark SQL and spark-daria. The foo/test/src/ExampleTests.scala
file needs access to Spark SQL, uTest, spark-daria, and spark-fast-tests.
Let's create a simple build.sc
file that specifies the application and test dependencies.
import mill._
import mill.scalalib._
import coursier.maven.MavenRepository
object foo extends ScalaModule {
def scalaVersion = "2.11.12"
def repositories = super.repositories ++ Seq(
MavenRepository("http://dl.bintray.com/spark-packages/maven")
)
def ivyDeps = Agg(
ivy"org.apache.spark::spark-sql:2.3.0",
ivy"mrpowers:spark-daria:0.26.1-s_2.11",
)
object test extends Tests{
def ivyDeps = Agg(
ivy"org.apache.spark::spark-sql:2.3.0",
ivy"com.lihaoyi::utest:0.6.0",
ivy"MrPowers:spark-fast-tests:0.17.1-s_2.11",
ivy"mrpowers:spark-daria:0.26.1-s_2.11",
)
def testFrameworks = Seq("utest.runner.Framework")
}
}
We can run our tests with the mill foo.test
command.
Thin JAR file
The mill foo.jar
command builds a thin JAR file that gets written out to out/foo/jar/dest/out.jar
.
We can use the jar
command to see that the JAR file only includes the Example
code:
$ jar tvf out/foo/jar/dest/out.jar
49 Thu Apr 04 11:21:08 EDT 2019 META-INF/MANIFEST.MF
1265 Thu Apr 04 09:24:30 EDT 2019 foo/Example$.class
843 Thu Apr 04 09:24:30 EDT 2019 foo/Example.class
Fat JAR file
The mill foo.assembly
command builds a fat JAR file that gets written out to out/foo/assembly/dest/out.jar
.
The fat JAR file contains all the Spark, Scala, spark-daria, and project classes.
Let's update the build.sc
file, so the assembly JAR file does not contain any Spark SQL or Scala classes.
import mill._
import mill.scalalib._
import mill.modules.Assembly
import coursier.maven.MavenRepository
object foo extends ScalaModule {
def scalaVersion = "2.11.12"
def repositories = super.repositories ++ Seq(
MavenRepository("http://dl.bintray.com/spark-packages/maven")
)
def compileIvyDeps = Agg(
ivy"org.apache.spark::spark-sql:2.3.0"
)
def ivyDeps = Agg(
ivy"mrpowers:spark-daria:0.26.1-s_2.11"
)
def assemblyRules = Assembly.defaultRules ++
Seq("scala/.*", "org\\.apache\\.spark/.*")
.map(Assembly.Rule.ExcludePattern.apply)
object test extends Tests{
def ivyDeps = Agg(
ivy"org.apache.spark::spark-sql:2.3.0",
ivy"com.lihaoyi::utest:0.6.0",
ivy"MrPowers:spark-fast-tests:0.17.1-s_2.11",
ivy"mrpowers:spark-daria:0.26.1-s_2.11",
)
def testFrameworks = Seq("utest.runner.Framework")
}
}
Is Mill better?
TODO
- Figure out if Mill runs a test suite faster
- Figure out if Mill can generate JAR files faster