2020 - Page 3 of 5 - MungingData

Fetching Random Values from PySpark Arrays / Columns

mrpowers July 26, 2020 0

This post shows you how to fetch a random value from a PySpark array or from a set of columns. It’ll also show you how to add a column to […]

Building DAGs / Directed Acyclic Graphs with Python

mrpowers July 25, 2020 0

Directed Acyclic Graphs (DAGs) are a critical data structure for data science / data engineering workflows. DAGs are used extensively by popular projects like Apache Airflow and Apache Spark. This […]

Avoiding Dots / Periods in PySpark Column Names

mrpowers July 22, 2020 0

Dots / periods in PySpark column names need to be escaped with backticks which is tedious and error-prone. This blog post explains the errors and bugs you’re likely to see […]

Converting a PySpark Map / Dictionary to Multiple Columns

mrpowers July 22, 2020 2

Python dictionaries are stored in PySpark map columns (the pyspark.sql.types.MapType class). This blog post explains how to convert a map into multiple columns. You’ll want to break up a map […]

Renaming Multiple PySpark DataFrame columns (withColumnRenamed, select, toDF)

mrpowers July 19, 2020 0

This blog post explains how to rename one or all of the columns in a PySpark DataFrame. You’ll often want to rename columns in a DataFrame. Here are some examples: […]

Writing out single files with Spark (CSV or Parquet)

mrpowers June 18, 2020 0

This blog explains how to write out a DataFrame to a single file with Spark. It also describes how to write out data in a file with a specific name, […]

Testing PySpark Code

mrpowers June 13, 2020 1

This blog post explains how to test PySpark code with the chispa helper library. Writing fast PySpark tests that provide your codebase with adequate coverage is surprisingly easy when you […]

PySpark Dependency Management and Wheel Packaging with Poetry

mrpowers June 1, 2020 5

This blog post explains how to create a PySpark project with Poetry, the best Python dependency management system. It’ll also explain how to package PySpark projects as wheel files, so […]

Adding executables to your PATH for fun

mrpowers April 21, 2020 1

Adding executables to your PATH is fun, easy, and a great way to learn about how your machine works. This pattern is especially useful for long commands that you need […]

Important Considerations when filtering in Spark with filter and where

mrpowers April 20, 2020 8

This blog post explains how to filter in Spark and discusses the vital factors to consider when filtering. Poorly executed filtering operations are a common bottleneck in Spark analyses. You […]