PySpark Archives - MungingData

Install PySpark, Delta Lake, and Jupyter Notebooks on Mac with conda

mrpowers June 1, 2022

This blog post explains how to install PySpark, Delta Lake, and Jupyter Notebooks on a Mac. This setup will let you easily run Delta Lake computations on your local machine […]

Working with PySpark ArrayType Columns

mrpowers June 28, 2021

This post explains how to create DataFrames with ArrayType columns and how to perform common data processing operations. Array columns are one of the most useful column types, but they’re […]

Defining PySpark Schemas with StructType and StructField

mrpowers June 26, 2021

This post explains how to define PySpark schemas and when this design pattern is useful. It’ll also explain when defining schemas seems wise, but can actually be safely avoided. Schemas […]

Adding constant columns with lit and typedLit to PySpark DataFrames

mrpowers June 22, 2021

This post explains how to add constant columns to PySpark DataFrames with lit and typedLit. You’ll see examples where these functions are useful and when these functions are invoked implicitly. […]

Navigating None and null in PySpark

mrpowers June 21, 2021

This blog post shows you how to gracefully handle null in PySpark and how to avoid null input errors. Mismanaging the null case is a common source of errors and […]

Creating and reusing the SparkSession with PySpark

mrpowers June 19, 2021

This post explains how to create a SparkSession with getOrCreate and how to reuse the SparkSession with getActiveSession. You need a SparkSession to read data stored in files, when manually […]

select and add columns in PySpark

mrpowers May 6, 2021

This post shows you how to select a subset of the columns in a DataFrame with select. It also shows how select can be used to add and rename columns. […]

Filtering PySpark Arrays and DataFrame Array Columns

mrpowers May 4, 2021

This post explains how to filter values from a PySpark array column. It also explains how to filter DataFrames with array columns (i.e. reduce the number of rows in a […]

Combining PySpark DataFrames with union and unionByName

mrpowers May 4, 2021

Multiple PySpark DataFrames can be combined into a single DataFrame with union and unionByName. union works when the columns of both DataFrames being combined are in the same order. It […]

Combining PySpark arrays with concat, union, except and intersect

mrpowers May 1, 2021

This post shows the different ways to combine multiple PySpark arrays into a single array. These operations were difficult prior to Spark 2.4, but now there are built-in functions that […]