Publishing Spark Projects with JitPack
JitPack is a package repository that provides easy access to your Spark projects that are checked into GitHub. JitPack is easier to use than Maven for open source projects and […]
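As a rough illustration of the consumer side (the coordinates below are placeholders, not a real project): once a tagged release exists on GitHub, a downstream sbt project can pull it in through the JitPack resolver.

```scala
// build.sbt of a project that depends on a Spark library published via JitPack.
// "com.github.SomeUser", "spark-stuff", and "v0.1.0" are placeholder coordinates.
resolvers += "jitpack" at "https://jitpack.io"

libraryDependencies += "com.github.SomeUser" % "spark-stuff" % "v0.1.0"
```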
Logistic regression models are a powerful way to predict binary outcomes (e.g. winning a game or surviving a shipwreck). Multiple explanatory variables (aka “features”) are used to train the model […]
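A minimal sketch of the idea with Spark MLlib; the column names and the `training` / `testData` DataFrames are assumptions for illustration.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler

// Assume `training` and `testData` are DataFrames with numeric feature columns
// and a binary "label" column (1.0 = survived, 0.0 = did not survive).
val assembler = new VectorAssembler()
  .setInputCols(Array("age", "fare", "pclass"))
  .setOutputCol("features")

val lr = new LogisticRegression()
  .setLabelCol("label")
  .setFeaturesCol("features")

val model = lr.fit(assembler.transform(training))
val predictions = model.transform(assembler.transform(testData))
```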
Environment config files return different values for the test, development, staging, and production environments. In Spark projects, you will often want a variable to point to a local CSV file […]
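One common way to set this up (a sketch; the paths, bucket name, and `PROJECT_ENV` variable are placeholders) is a config object keyed by environment:

```scala
// Sketch of an environment-aware config object.
// File paths and the environment variable name are placeholders.
object Config {
  val test: Map[String, String] = Map(
    "peopleDataPath" -> "./src/test/resources/people.csv"
  )

  val production: Map[String, String] = Map(
    "peopleDataPath" -> "s3a://some-bucket/people.csv"
  )

  // PROJECT_ENV defaults to "test" so the suite runs locally without any setup.
  def get(key: String): String =
    if (sys.env.getOrElse("PROJECT_ENV", "test") == "production") production(key)
    else test(key)
}

// Usage: spark.read.option("header", "true").csv(Config.get("peopleDataPath"))
```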
Spark JAR files let you package a project into a single file so it can be run on a Spark cluster. Many developers write Spark code in browser […]
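For sbt projects, the sbt-assembly plugin is one common way to produce such a fat JAR (a sketch; the plugin and Spark versions are illustrative):

```scala
// project/plugins.sbt — pull in the sbt-assembly plugin (version is illustrative).
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "1.2.0")

// build.sbt — mark Spark as "provided" so the cluster's own Spark is used at runtime
// and is not bundled into the JAR.
libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.3.0" % "provided"

// Running `sbt assembly` then writes a single JAR under target/scala-*/ that can be
// handed to spark-submit.
```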
The Spark rlike method allows you to write powerful string matching algorithms with regular expressions (regexp). This blog post will outline tactics to detect strings that match multiple different patterns […]
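A small sketch of the technique (the DataFrame and column names are made up): regexp alternation lets a single rlike call match several patterns at once.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

// Flag rows whose "animal" column matches either of two patterns;
// the "|" alternation lets one rlike call cover both.
val df = Seq("cat", "dog", "catfish", "mouse").toDF("animal")

val flagged = df.withColumn("cat_or_dog", col("animal").rlike("^cat|^dog"))
```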
The spark-slack library can be used to speak notifications to Slack from your Spark programs and handle Slack Slash command responses. You can speak Slack notifications to alert stakeholders when […]
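The underlying idea, stripped down to plain JDK calls rather than the spark-slack API (the webhook URL, helper name, and message are placeholders):

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Plain-JDK sketch of the concept: post a message to a Slack incoming webhook.
// This is only an illustration, not the spark-slack API.
def notifySlack(webhookUrl: String, message: String): Unit = {
  val conn = new URL(webhookUrl).openConnection().asInstanceOf[HttpURLConnection]
  conn.setRequestMethod("POST")
  conn.setRequestProperty("Content-Type", "application/json")
  conn.setDoOutput(true)
  val out = conn.getOutputStream
  out.write(s"""{"text": "$message"}""".getBytes(StandardCharsets.UTF_8))
  out.close()
  conn.getResponseCode // send the request and read the status
  conn.disconnect()
}

// e.g. after a long-running Spark job finishes:
// notifySlack(sys.env("SLACK_WEBHOOK_URL"), "Datamart refresh finished")
```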
The uTest Scala testing framework can be used to elegantly test your Spark code. The other popular Scala testing frameworks (ScalaTest and Specs2) provide multiple different ways to solve the […]
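A sketch of what a uTest suite exercising Spark SQL might look like (the suite name, SparkSession setup, and the expression under test are assumptions):

```scala
import utest._
import org.apache.spark.sql.SparkSession

object StringFunsTests extends TestSuite {

  // Local SparkSession shared by the tests in this suite.
  lazy val spark: SparkSession = SparkSession.builder()
    .master("local")
    .appName("utest-example")
    .getOrCreate()

  val tests = Tests {
    test("upper-cases a column") {
      import spark.implicits._
      val result = Seq("hi").toDF("word")
        .selectExpr("upper(word) as word")
        .collect()
        .map(_.getString(0))
      assert(result.sameElements(Array("HI")))
    }
  }
}
```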
PySpark code should generally be organized as single-purpose DataFrame transformations that can be chained together for production analyses (e.g. generating a datamart). This blog post demonstrates how to monkey […]
Implicit classes or the Dataset#transform method can be used to chain DataFrame transformations in Spark. This blog post will demonstrate how to chain DataFrame transformations and explain why the Dataset#transform […]
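A minimal sketch of the transform-chaining style (the function names, columns, and sample data are invented for illustration):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

val spark = SparkSession.builder().master("local").getOrCreate()
import spark.implicits._

// Two single-purpose transformations...
def withGreeting(df: DataFrame): DataFrame =
  df.withColumn("greeting", lit("hello"))

def withFarewell(df: DataFrame): DataFrame =
  df.withColumn("farewell", lit("goodbye"))

// ...chained with Dataset#transform instead of nested function calls.
val result = Seq("alice", "bob").toDF("name")
  .transform(withGreeting)
  .transform(withFarewell)
```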