Environment Specific Config in Spark Scala Projects
Environment config files return different values for the test, development, staging, and production environments.
In Spark projects, you will often want a variable to point to a local CSV file in the test environment and a CSV file in S3 in the production environment.
This post will demonstrate how to add environment config to your projects and how to set environment variables to change the environment.
Basic use case
Let's create a Config object with one Map[String, String] with test configuration and another Map[String, String] with production config.
package com.github.mrpowers.spark.spec.sql

object Config {

  var test: Map[String, String] = {
    Map(
      "libsvmData" -> new java.io.File("./src/test/resources/sample_libsvm_data.txt").getCanonicalPath,
      "somethingElse" -> "hi"
    )
  }

  var production: Map[String, String] = {
    Map(
      "libsvmData" -> "s3a://my-cool-bucket/fun-data/libsvm.txt",
      "somethingElse" -> "whatever"
    )
  }

  var environment = sys.env.getOrElse("PROJECT_ENV", "production")

  def get(key: String): String = {
    if (environment == "test") {
      test(key)
    } else {
      production(key)
    }
  }

}
The Config.get() method will grab values from the test or production map depending on the PROJECT_ENV value.
Let's use the sbt console command to demonstrate this.
$ PROJECT_ENV=test sbt console
scala> com.github.mrpowers.spark.spec.sql.Config.get("somethingElse")
res0: String = hi
Let's restart the SBT console and run the same code in the production environment.
$ PROJECT_ENV=production sbt console
scala> com.github.mrpowers.spark.spec.sql.Config.get("somethingElse")
res0: String = whatever
Here is how the Config object can be used to fetch a file in your GitHub repository in the test environment and also fetch a file from S3 in the production environment.
val training = spark
  .read
  .format("libsvm")
  .load(Config.get("libsvmData"))
This solution is elegant and does not clutter our application code with environment logic.
Environment specific code anti-pattern
Here is an example of how you should not add environment paths to your code.
var environment = sys.env.getOrElse("PROJECT_ENV", "production")
val training = if (environment == "test") {
  spark
    .read
    .format("libsvm")
    .load(new java.io.File("./src/test/resources/sample_libsvm_data.txt").getCanonicalPath)
} else {
  spark
    .read
    .format("libsvm")
    .load("s3a://my-cool-bucket/fun-data/libsvm.txt")
}
An anti-pattern is a common response to a recurring problem that is usually ineffective and risks being highly counterproductive. - source
You should never write code with different execution paths in the production and test environments because then your test suite won't really be testing the actual code that's run in production.
Overriding config
The Config.test and Config.production maps are defined as variables (with the var keyword), so they can be overridden.
scala> import com.github.mrpowers.spark.spec.sql.Config
scala> Config.get("somethingElse")
res1: String = hi
scala> Config.test = Config.test ++ Map("somethingElse" -> "give me clean air")
scala> Config.get("somethingElse")
res2: String = give me clean air
Giving users the ability to swap out config on the fly makes your codebase more flexible for a variety of use cases.
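One thing to watch with on-the-fly overrides is that they stay in place until something resets them, which can leak from one test into the next. Here is a minimal sketch of a helper that applies an override and then restores the original map. The ConfigOverrides object and its withTestOverrides method are hypothetical names made up for this example, and the Config object is a trimmed stand-in for the one defined earlier so the snippet runs on its own.

```scala
// Trimmed stand-in for the Config object from this post (no file paths),
// included only so this sketch is self-contained.
object Config {
  var test: Map[String, String] = Map("somethingElse" -> "hi")
  var production: Map[String, String] = Map("somethingElse" -> "whatever")
  var environment: String = sys.env.getOrElse("PROJECT_ENV", "production")

  def get(key: String): String =
    if (environment == "test") test(key) else production(key)
}

// Hypothetical helper, not part of the original implementation.
object ConfigOverrides {
  // Merge the overrides into Config.test, run the block, then put the
  // previous map back, even if the block throws.
  def withTestOverrides[A](overrides: Map[String, String])(block: => A): A = {
    val original = Config.test
    Config.test = original ++ overrides
    try block
    finally Config.test = original
  }
}
```

With PROJECT_ENV set to test, calling ConfigOverrides.withTestOverrides(Map("somethingElse" -> "give me clean air")) { Config.get("somethingElse") } returns the overridden value, and a later Config.get("somethingElse") goes back to the original one.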
Setting the PROJECT_ENV variable for test runs
The Config object uses the production environment by default. You're not going to want to have to remember to set PROJECT_ENV to test every time you run your test suite (e.g. you don't want to type PROJECT_ENV=test sbt test).
You can update your build.sbt file as follows to set PROJECT_ENV to test whenever the test suite is run.
fork in Test := true
envVars in Test := Map("PROJECT_ENV" -> "test")
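If your project is on sbt 1.x, the same two settings can also be written with the slash syntax that newer sbt releases use in place of the "in" form. This is a sketch of the equivalent build.sbt lines. The fork setting matters here because envVars is only applied to a forked JVM.

```scala
// build.sbt, sbt 1.x slash syntax (equivalent to the "in Test" form)
Test / fork := true
Test / envVars := Map("PROJECT_ENV" -> "test")
```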
Big thanks to the StackOverflow community for helping me figure this out.
Other implementations
This StackOverflow thread discusses other solutions.
One answer relies on an external library, one is in Java, and one doesn't allow for overrides. I will add an answer with the implementation discussed in this blog post now.
Next steps
Feel free to extend this solution to account for other environments. For example, you might want to add a staging environment that uses different paths to test code before it's run in production.
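As a rough sketch of one way to do that, the two maps can be folded into a single map keyed by environment name. The MultiEnvConfig name and the staging value are made up for this example. Note one behavior change from the Config object in this post: an unrecognized PROJECT_ENV value raises an error here instead of silently falling back to production.

```scala
// Hypothetical variant of the Config object with one map per environment.
object MultiEnvConfig {
  // The "staging" entry is a placeholder value invented for this sketch.
  var environments: Map[String, Map[String, String]] = Map(
    "test" -> Map("somethingElse" -> "hi"),
    "staging" -> Map("somethingElse" -> "almost"),
    "production" -> Map("somethingElse" -> "whatever")
  )

  // Still defaults to production when PROJECT_ENV is not set.
  var environment: String = sys.env.getOrElse("PROJECT_ENV", "production")

  def get(key: String): String = {
    // Fail loudly on an unknown environment name rather than guessing.
    val settings = environments.getOrElse(
      environment,
      throw new IllegalArgumentException(s"Unknown PROJECT_ENV: $environment")
    )
    settings(key)
  }
}
```

Because environments is a var, a single environment can still be overridden the same way as before, for example with MultiEnvConfig.environments = MultiEnvConfig.environments + ("test" -> Map("somethingElse" -> "new value")).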
Just remember to follow best practices and avoid the config anti-pattern that can litter your codebase and reduce the protection offered by your test suite.
Adding Config objects to your functions adds a dependency you might not want. In a future blog post, we'll discuss how dependency injection can abstract these Config dependencies and how the Config object can be leveraged to access smart defaults - the best of both worlds!
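To give a rough idea of what that could look like, here is one possible shape, offered only as a sketch under assumptions and not necessarily the approach the follow-up will take. The function takes the config value as a parameter and uses Config only as the default argument. The Describer object and its describe method are hypothetical, and Config is again a trimmed stand-in so the snippet is self-contained.

```scala
// Trimmed stand-in for the Config object from this post.
object Config {
  var test: Map[String, String] = Map("somethingElse" -> "hi")
  var production: Map[String, String] = Map("somethingElse" -> "whatever")
  var environment: String = sys.env.getOrElse("PROJECT_ENV", "production")

  def get(key: String): String =
    if (environment == "test") test(key) else production(key)
}

// Hypothetical example function: the value is injected as an argument,
// and Config only supplies the default when the caller passes nothing.
object Describer {
  def describe(somethingElse: String = Config.get("somethingElse")): String =
    s"somethingElse is set to $somethingElse"
}
```

Callers that are happy with the environment's value can call Describer.describe() with no arguments, while callers that want full control can pass their own string and never touch Config.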