Writing Beautiful Apache Spark 2 Code with Scala

Apache Spark is a powerful big data engine that can process massive datasets.

Beautiful Spark Code will teach you how to harness the power of Spark quickly.

Writing reusable, testable, and maintainable Spark code can be hard…

Existing Spark books and tutorials don’t teach you how to write great Spark code… they’re focused on explaining the API and theoretical cluster computing concepts.

This book is different and will give you the following skills:

  • How to wrap Spark code in Scala functions (see the sketch after this list)
  • The minimum amount of Scala to be productive with Spark
  • Manipulating DataFrames
  • Structuring private libraries and applications
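
As a taste of that first skill, here is a minimal sketch of a DataFrame transformation wrapped in a plain Scala function (the column and function names are hypothetical):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, lower, trim}

    // A DataFrame transformation wrapped in a plain Scala function,
    // so it can be unit tested and reused across jobs.
    def withCleanName(df: DataFrame): DataFrame =
      df.withColumn("clean_name", trim(lower(col("name"))))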

Get the book here.

DataFrames only

This book only covers the DataFrame API and does not discuss RDDs. DataFrames were introduced back in Spark 1.3 and became the primary abstraction in Spark 2; they are almost always easier to work with than RDDs. This book doesn’t bog you down with lower-level APIs.
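
As a quick illustration of why, here is the same word count written with each API (a sketch, assuming a SparkSession named spark):

    import spark.implicits._

    // DataFrame API: declarative and concise.
    val wordsDf = Seq("ant", "bee", "ant").toDF("word")
    wordsDf.groupBy("word").count().show()

    // RDD API: the same result, but with manual tuples and reduce logic.
    val wordsRdd = spark.sparkContext.parallelize(Seq("ant", "bee", "ant"))
    wordsRdd.map(word => (word, 1)).reduceByKey(_ + _).collect().foreach(println)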

What about PySpark?

PySpark is very different from Scala Spark.

It’s annoying when books try to cover both PySpark and the Scala API because all readers are forced to look at a language they don’t care about.

Entire chapters of this book are completely irrelevant to PySpark users (e.g. building JAR files and chaining functions with the DataFrame transform() method).
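
For readers unfamiliar with it, here is a rough sketch of the transform() chaining style in Scala (the functions and the peopleDf DataFrame are hypothetical):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, lit, upper}

    // Each step is a DataFrame => DataFrame function, so transform()
    // lets them compose into a readable pipeline.
    def withGreeting(df: DataFrame): DataFrame =
      df.withColumn("greeting", lit("hello"))

    def withCapitalizedName(df: DataFrame): DataFrame =
      df.withColumn("name_cap", upper(col("name")))

    val result = peopleDf // assumes an existing DataFrame with a "name" column
      .transform(withGreeting)
      .transform(withCapitalizedName)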

Click here if you’re interested in a book on writing beautiful PySpark code – I will write the book if there are enough interested readers.

Methods are given context

Other Spark books introduce functionality and show how to perform certain actions without giving context: WHY the functionality is important or WHEN it should be applied.

Most Spark books will introduce the coalesce() method, explain how many arguments it takes, and show you how to invoke the function. They don’t explain why coalesce() is really important and when it should be used (after filtering a big DataFrame into a smaller one).
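
For example, here is a hedged sketch of that pattern (the path and sizes are illustrative, and a SparkSession named spark is assumed):

    import org.apache.spark.sql.functions.col

    // Filtering leaves the small result spread across the original partitions;
    // coalesce() shrinks the partition count so the write produces one file
    // instead of thousands of tiny ones.
    val bigDf = spark.range(0, 100000000L).toDF("id")
    val smallDf = bigDf.where(col("id") < 100)
    smallDf.coalesce(1).write.parquet("/tmp/small_ids")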

Books should not be API documentation

Spark has wonderful programmatic documentation that you’ll refer to all the time as a Spark developer.

Some Spark books read like a narrative on the programmatic API docs.

This book explains the most important parts of the Spark API and provides context.

All book examples are self-contained

Some Spark books rely on large external datasets, which can be annoying.

All examples in this book are self-contained. You don’t need to chase broken links to data files or wait for a huge ZIP file to download before you can work through the book.
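
The examples build their data inline, in roughly this style (a sketch; the data is illustrative and a SparkSession named spark is assumed):

    import spark.implicits._

    // A tiny DataFrame created in memory: no downloads, no external files.
    val peopleDf = Seq(
      ("alice", 25),
      ("bob", 31)
    ).toDF("name", "age")

    peopleDf.show()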

All code is organized in a GitHub repo

All the code lives in a neatly organized, tested GitHub repo that follows Spark programming best practices.

All code snippets in the book are easy to reproduce on your local machine.

Doesn’t rely on external code libraries

Some Spark books use examples that require external dependencies like Hive or YARN. This book does not have any external dependencies other than Spark.
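
A project in this style needs little more than a build.sbt along these lines (a sketch; the project name and version numbers are illustrative):

    // build.sbt: Spark is the only library dependency.
    name := "spark-examples"

    scalaVersion := "2.11.12"

    libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.0" % "provided"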

Doesn’t cover the streaming or machine learning APIs

Spark has SQL, RDD, streaming, and machine learning APIs. Each API is complex and warrants a standalone book. Advanced Analytics with Spark, for example, is entirely dedicated to Spark’s machine learning algorithms.

Books that cover all the APIs are typically huge and difficult to follow. It’s just too much content for a single book.

This book only covers the Spark SQL API. In any case, writing beautiful Spark SQL code is a prerequisite for writing great streaming and machine learning code. This book will set you down a good path.

Get the book
