Apache Spark Archives - Page 2 of 6

Expressively Typed Spark Datasets with Frameless

mrpowers October 26, 2020 0

frameless is a great library for writing Datasets with expressive types. The library helps users write correct code by surfacing descriptive compile-time errors instead of runtime errors with long stack […]

Writing out single files with Spark (CSV or Parquet)

mrpowers June 18, 2020 0

This blog post explains how to write a DataFrame out to a single file with Spark. It also describes how to write data out to a file with a specific name, […]

Important Considerations when filtering in Spark with filter and where

mrpowers April 20, 2020 8

This blog post explains how to filter in Spark and discusses the vital factors to consider when filtering. Poorly executed filtering operations are a common bottleneck in Spark analyses. You […]

Executing Spark code with expr and eval

mrpowers March 28, 2020 2

You can execute Spark column functions with a clever combination of expr and eval(). This technique lets you execute Spark functions without having to create a DataFrame. This makes it […]

Spark Column Equality

mrpowers March 10, 2020 0

The term “column equality” refers to two different things in Spark: (1) when a column is equal to a particular value (typically when filtering), and (2) when all the values in two columns […]

Working with Spark MapType Columns

mrpowers January 15, 2020 0

Spark DataFrame columns support maps, which are great for storing an arbitrary number of key/value pairs. This blog post describes how to create MapType columns, demonstrates built-in functions to […]

Designing Scala Packages and Imports for Readable Spark Code

mrpowers November 24, 2019 0

This blog post explains how to import core Spark and Scala libraries like spark-daria into your projects. It’s important for library developers to organize package namespaces so it’s easy for […]

Best Apache Spark Books

mrpowers November 13, 2019 0

Apache Spark is a big data engine that has quickly become one of the most widely used distributed processing frameworks in the world. It’s used by all the big financial institutions and […]

Using HyperLogLog for count distinct computations with Spark

mrpowers November 4, 2019 9

This blog post explains how to use the HyperLogLog algorithm to perform fast count distinct operations. HyperLogLog sketches can be generated with spark-alchemy, loaded into Postgres databases, and queried with […]

Partitioning on Disk with partitionBy

mrpowers October 19, 2019 7

Spark writers can partition data on disk with partitionBy. Some queries can run 50 to 100 times faster on a partitioned data lake, so partitioning is vital […]

MungingData
