Publishing Spark Projects with JitPack
JitPack is a package repository that provides easy access to your Spark projects that are checked into GitHub. JitPack is easier to use than Maven for open source projects and […]
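As a rough illustration of the consumer side (the coordinates below are placeholders, not a real project): once a tagged release exists on GitHub, a downstream sbt project can pull it in through the JitPack resolver.

```scala
// build.sbt of a project that depends on a Spark library published via JitPack.
// "com.github.SomeUser", "spark-stuff", and "v0.1.0" are placeholder coordinates.
resolvers += "jitpack" at "https://jitpack.io"

libraryDependencies += "com.github.SomeUser" % "spark-stuff" % "v0.1.0"
```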
Logistic regression models are a powerful way to predict binary outcomes (e.g. winning a game or surviving a shipwreck). Multiple explanatory variables (aka “features”) are used to train the model […]
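A minimal sketch of the idea with Spark MLlib; the column names and the `training` / `testData` DataFrames are assumptions for illustration.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Assume `training` and `testData` are DataFrames with numeric feature columns
// and a binary "label" column (1.0 = survived, 0.0 = did not survive).
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "fare", "pclass"))
  .setOutputCol("features")

val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")

val model = lr.fit(assembler.transform(training))
val predictions = model.transform(assembler.transform(testData))
```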
Environment config files return different values for the test, development, staging, and production environments. In Spark projects, you will often want a variable to point to a local CSV file […]
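One common way to set this up (a sketch; the paths, bucket name, and `PROJECT_ENV` variable are placeholders) is a config object keyed by environment:

```scala
// Sketch of an environment-aware config object.
// File paths and the environment variable name are placeholders.
object Config {
  val test: Map[String, String] = Map(
    "peopleDataPath" -> "./src/test/resources/people.csv"
  )

  val production: Map[String, String] = Map(
    "peopleDataPath" -> "s3a://some-bucket/people.csv"
  )

  // PROJECT_ENV defaults to "test" so the suite runs locally without any setup.
  def get(key: String): String =
    if (sys.env.getOrElse("PROJECT_ENV", "test") == "production") production(key)
    else test(key)
}

// Usage: spark.read.option("header", "true").csv(Config.get("peopleDataPath"))
```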
Spark JAR files let you package a project into a single file so it can be run on a Spark cluster. Many developers write Spark code in browser […]
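For sbt projects, the sbt-assembly plugin is one common way to produce such a fat JAR (a sketch; the plugin and Spark versions are illustrative):

```scala
// project/plugins.sbt — pull in the sbt-assembly plugin (version is illustrative).
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")

// build.sbt — mark Spark as "provided" so the cluster's own Spark is used at runtime
// and is not bundled into the JAR.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.0" % "provided"

// Running `sbt assembly` then writes a single JAR under target/scala-*/ that can be
// handed to spark-submit.
```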
The Spark rlike method allows you to write powerful string matching algorithms with regular expressions (regexp). This blog post will outline tactics to detect strings that match multiple different patterns […]
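A small sketch of the technique (the DataFrame and column names are made up): regexp alternation lets a single rlike call match several patterns at once.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

// Flag rows whose "animal" column matches either of two patterns;
// the "|" alternation lets one rlike call cover both.
val df = Seq("cat", "dog", "catfish", "mouse").toDF("animal")

val flagged = df.withColumn("cat_or_dog", col("animal").rlike("^cat|^dog"))
```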
The spark-slack library can be used to speak notifications to Slack from your Spark programs and handle Slack Slash command responses. You can speak Slack notifications to alert stakeholders when […]
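The underlying idea, stripped down to plain JDK calls rather than the spark-slack API (the webhook URL, helper name, and message are placeholders):

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Plain-JDK sketch of the concept: post a message to a Slack incoming webhook.
// This is only an illustration, not the spark-slack API.
def notifySlack(webhookUrl: String, message: String): Unit = {
  val conn = new URL(webhookUrl).openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setRequestProperty("Content-Type", "application/json")
  conn.setDoOutput(true)
  val out = conn.getOutputStream
  out.write(s"""{"text": "$message"}""".getBytes(StandardCharsets.UTF_8))
  out.close()
  conn.getResponseCode // send the request and read the status
  conn.disconnect()
}

// e.g. after a long-running Spark job finishes:
// notifySlack(sys.env("SLACK_WEBHOOK_URL"), "Datamart refresh finished")
```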
The uTest Scala testing framework can be used to elegantly test your Spark code. The other popular Scala testing frameworks (ScalaTest and Specs2) provide multiple different ways to solve the […]
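A sketch of what a uTest suite exercising Spark SQL might look like (the suite name, SparkSession setup, and the expression under test are assumptions):

```scala
import utest._
import org.apache.spark.sql.SparkSession

object StringFunsTests extends TestSuite {

  // Local SparkSession shared by the tests in this suite.
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local")
    .appName("utest-example")
    .getOrCreate()

  val tests = Tests {
    test("upper-cases a column") {
      import spark.implicits._
      val result = Seq("hi").toDF("word")
        .selectExpr("upper(word) as word")
        .collect()
        .map(_.getString(0))
      assert(result.sameElements(Array("HI")))
    }
  }
}
```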
PySpark code should generally be organized as single-purpose DataFrame transformations that can be chained together for production analyses (e.g. generating a datamart). This blog post demonstrates how to monkey […]
Implicit classes or the Dataset#transform method can be used to chain DataFrame transformations in Spark. This blog post will demonstrate how to chain DataFrame transformations and explain why the Dataset#transform […]
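A minimal sketch of the transform-chaining style (the function names, columns, and sample data are invented for illustration):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

// Two single-purpose transformations...
def withGreeting(df: DataFrame): DataFrame =
  df.withColumn("greeting", lit("hello"))

def withFarewell(df: DataFrame): DataFrame =
  df.withColumn("farewell", lit("goodbye"))

// ...chained with Dataset#transform instead of nested function calls.
val result = Seq("alice", "bob").toDF("name")
  .transform(withGreeting)
  .transform(withFarewell)
```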