Avoiding Dots / Periods in PySpark Column Names
Dots / periods in PySpark column names need to be escaped with backticks which is tedious and error-prone. This blog post explains the errors and bugs you’re likely to see […]
Dots / periods in PySpark column names need to be escaped with backticks which is tedious and error-prone. This blog post explains the errors and bugs you’re likely to see […]
Python dictionaries are stored in PySpark map columns (the pyspark.sql.types.MapType class). This blog post explains how to convert a map into multiple columns. You’ll want to break up a map […]
This blog post explains how to rename one or all of the columns in a PySpark DataFrame. You’ll often want to rename columns in a DataFrame. Here are some examples: […]
This blog explains how to write out a DataFrame to a single file with Spark. It also describes how to write out data in a file with a specific name, […]
This blog post explains how to test PySpark code with the chispa helper library. Writing fast PySpark tests that provide your codebase with adequate coverage is surprisingly easy when you […]
This blog post explains how to create a PySpark project with Poetry, the best Python dependency management system. It’ll also explain how to package PySpark projects as wheel files, so […]
Adding executables to your PATH is fun, easy, and a great way to learn about how your machine works. This pattern is especially useful for long commands that you need […]
This blog post explains how to filter in Spark and discusses the vital factors to consider when filtering. Poorly executed filtering operations are a common bottleneck in Spark analyses. You […]
pyenv lets you manage multiple versions of Python on your computer. This blog post focuses on how pyenv uses the shim design pattern to provide a wonderful user experience (it […]
Spark 3 has new array functions that make working with ArrayType columns much easier. Spark developers previously needed to use UDFs to perform complicated array functions. The new Spark functions […]