This book will teach you how to be a proficient Apache Spark programmer with minimal effort.
Other books focus on the theoretical underpinnings of Spark. This book skips the theory and only covers the practical knowledge needed to write production-grade Spark code.
Table of Contents
- Using the Spark shell
- Introduction to DataFrames
- Just enough Scala for Spark programmers
- Column methods
- SQL functions
- Creating Spark DataFrames
- Using null in Spark DataFrames
- User defined functions
- Chaining DataFrame transformations
- Schema independent DataFrame transformations
- Dates / times (DateType, TimestampType)
- Defining DataFrame Schemas with StructField and StructType
- ArrayType columns
- MapType columns
- Managing the SparkSession, The DataFrame Entry Point
- CSV file format
- Parquet file format
- JSON file format
- Managing partitions with repartition and coalesce
- Different types of DataFrame joins
- Broadcast joins
- Introduction to SBT for Spark programmers
- Building Spark JAR files
- Testing with uTest
- Window functions
- Dependency injection
- Best practices
- Logistic Regressions
- Extending core classes
- Customizing logs
Scala programming language
Scala is a vast, multi-paradigm programming language that’s notoriously difficult to learn.
Scala can be used as a functional language, an object oriented language, or a mix of both.
You only need to know a small fraction of the Scala programming language to be a productive Spark developer. You need to know how to write Scala functions, define packages, and import namespaces. Complex Scala programming topics can be ignored completely.
I knew two programmers who began their Spark learning journey by taking the Functional Programming Principles in Scala course by Martin Odersky. Odersky created the Scala programming language and is incredibly intelligent. His course is amazing, but very hard, so my friends felt intimidated by Spark. How could they possibly learn Spark if they could barely make it through a beginner's course on Scala?
As it turns out, Spark programmers don’t need to know anything about advanced Scala language features or functional programming, so courses like Functional Programming Principles in Scala are complete overkill.
Check out the Just enough Scala for Spark programmers chapter to see just how little Scala is necessary for Spark.
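To make the "small fraction of Scala" concrete, here is a minimal sketch that uses only the three skills mentioned above: defining a package, importing namespaces, and writing a function. The package name `com.example.transformations` and the object and function names (`Greetings`, `withGreeting`, `GreetingsExample`) are made up for illustration, and the sketch assumes the `spark-sql` library is on the classpath.

```scala
// Hypothetical package name; use whatever matches your project layout
package com.example.transformations

// Import only the namespaces this file needs
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object Greetings {
  // An ordinary Scala function: takes a DataFrame, returns a new DataFrame
  // with a constant "greeting" column appended
  def withGreeting(df: DataFrame): DataFrame =
    df.withColumn("greeting", lit("hi"))
}

object GreetingsExample {
  def main(args: Array[String]): Unit = {
    // Local SparkSession, intended for experimentation only
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("greetings-example")
      .getOrCreate()
    import spark.implicits._

    // Small illustrative DataFrame with made-up data
    val df = Seq("alice", "bob").toDF("name")

    // Apply the function with Dataset.transform and print the result
    df.transform(Greetings.withGreeting).show()

    spark.stop()
  }
}
```

Nothing here relies on advanced language features: there are no implicits to write, no type classes, and no higher-kinded types, just a package, a few imports, and a function.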
PySpark or Scala?
Choosing between Python and Scala would normally be a big technology decision (e.g. if you were building a web application), but it’s not as important for Spark because the APIs are so similar.
Let’s look at some Scala code that adds a column to a DataFrame:
import org.apache.spark.sql.functions._
df.withColumn("greeting", lit("hi"))
Here’s the same code in Python:
from pyspark.sql.functions import lit
df.withColumn("greeting", lit("hi"))
PySpark code looks a lot like Scala code.
The community is shifting towards PySpark, so that’s a good place to get started, but it’s not a mission-critical decision. DataFrame code written in either language runs on the same Spark engine at the end of the day!
Running Spark code
Spinning up your own Spark clusters is complicated. You need to install Spark, create the driver node, create the worker nodes, and make sure messages are properly being sent across machines in the cluster.
Databricks lets you easily spin up a cluster with Spark installed, so you don’t need to worry about provisioning packages or cluster management. Databricks also has cool features like autoscaling clusters.
If you’re just getting started with Spark, it’s probably best to pay a bit more and use Databricks.
Theoretical stuff to ignore
Spark is a big data engine that’s built on theoretical cluster computing principles.
You don’t need to know how Spark works to solve problems with Spark!
Most books cover a lot of theoretical Spark before teaching the practical basics.
This book aims to provide the easiest possible introduction to Apache Spark by starting with the practical basics. Enjoy!