Renaming Multiple PySpark DataFrame columns (withColumnRenamed, select, toDF)

This blog post explains how to rename one or all of the columns in a PySpark DataFrame.

You’ll often want to rename columns in a DataFrame. Here are some examples:

remove all spaces from the DataFrame columns
convert all the columns to snake_case
replace the dots in column names with underscores

Lots of approaches to this problem are not scalable if you want to rename a lot of columns. Other solutions call withColumnRenamed a lot which may cause performance issues or cause StackOverflowErrors.

This blog post outlines solutions that are easy to use and create simple analysis plans, so the Catalyst optimizer doesn’t need to do hard optimization work.

Renaming a single column using withColumnRenamed

Renaming a single column is easy with withColumnRenamed. Suppose you have the following DataFrame:

+----------+------------+
|first_name|likes_soccer|
+----------+------------+
|      jose|        true|
+----------+------------+

You can rename the likes_soccer column to likes_football with this code:

df.withColumnRenamed("likes_soccer", "likes_football").show()

+----------+--------------+
|first_name|likes_football|
+----------+--------------+
|      jose|          true|
+----------+--------------+

withColumnRenamed can also be used to rename all the columns in a DataFrame, but that’s not a performant approach. Let’s look at how to rename multiple columns in a performant manner.

Renaming multiple columns

The quinn library has a with_columns_renamed function that renames all the columns in a DataFrame.

Suppose you have the following DataFrame:

+-------------+-----------+
|i like cheese|yummy stuff|
+-------------+-----------+
|         jose|          a|
|           li|          b|
|          sam|          c|
+-------------+-----------+

Here’s how to replace all the whitespace in the column names with underscores:

import quinn

def spaces_to_underscores(s):
    return s.replace(" ", "_")

actual_df = quinn.with_columns_renamed(spaces_to_underscores)(source_df)

actual_df.show()

+-------------+-----------+
|i_like_cheese|yummy_stuff|
+-------------+-----------+
|         jose|          a|
|           li|          b|
|          sam|          c|
+-------------+-----------+

This code generates an efficient parsed logical plan:

== Parsed Logical Plan ==
'Project ['`i.like.cheese` AS i_like_cheese#12, '`yummy.stuff` AS yummy_stuff#13]
+- LogicalRDD [i.like.cheese#8, yummy.stuff#9], false

== Analyzed Logical Plan ==
i_like_cheese: string, yummy_stuff: string
Project [i.like.cheese#8 AS i_like_cheese#12, yummy.stuff#9 AS yummy_stuff#13]
+- LogicalRDD [i.like.cheese#8, yummy.stuff#9], false

== Optimized Logical Plan ==
Project [i.like.cheese#8 AS i_like_cheese#12, yummy.stuff#9 AS yummy_stuff#13]
+- LogicalRDD [i.like.cheese#8, yummy.stuff#9], false

== Physical Plan ==
*(1) Project [i.like.cheese#8 AS i_like_cheese#12, yummy.stuff#9 AS yummy_stuff#13]
+- Scan ExistingRDD[i.like.cheese#8,yummy.stuff#9]

The parsed logical plan and the optimized logical plan are the same so the Spark Catalyst optimizer does not have to do any hard work.

with_columns_renamed takes two sets of arguments, so it can be chained with the DataFrame transform method. This code will give you the same result:

source_df.transform(quinn.with_columns_renamed(spaces_to_underscores))

The transform method is included in the PySpark 3 API. If you’re using Spark 2, you need to monkey patch transform onto the DataFrame class, as described in this the blog post.

with_columns_renamed source code

Here’s the source code for the with_columns_renamed method:

def with_columns_renamed(fun):
    def _(df):
        cols = list(map(
            lambda col_name: F.col("`{0}`".format(col_name)).alias(fun(col_name)),
            df.columns
        ))
        return df.select(*cols)
    return _

The code creates a list of the new column names and runs a single select operation. As you’ve already seen, this code generates an efficient parsed logical plan.

Notice that this code does not run withColumnRenamed multiple times, like other implementations suggest. Calling withColumnRenamed many times is a performance bottleneck.

toDF for renaming columns

Let’s use toDF to replace the whitespace with underscores (same objective, different implementation).

+-------------+-----------+
|i like cheese|yummy stuff|
+-------------+-----------+
|         jose|          a|
|           li|          b|
|          sam|          c|
+-------------+-----------+

Here’s how we can update the column names with toDF:

df.toDF(*(c.replace(' ', '_') for c in df.columns)).show()

+-------------+-----------+
|i_like_cheese|yummy_stuff|
+-------------+-----------+
|         jose|          a|
|           li|          b|
|          sam|          c|
+-------------+-----------+

This approach generates an efficient parsed plan.

== Parsed Logical Plan ==
Project [i.like.cheese#0 AS i_like_cheese#11, yummy.stuff#1 AS yummy_stuff#12]
+- LogicalRDD [i.like.cheese#0, yummy.stuff#1], false

== Analyzed Logical Plan ==
i_like_cheese: string, yummy_stuff: string
Project [i.like.cheese#0 AS i_like_cheese#11, yummy.stuff#1 AS yummy_stuff#12]
+- LogicalRDD [i.like.cheese#0, yummy.stuff#1], false

== Optimized Logical Plan ==
Project [i.like.cheese#0 AS i_like_cheese#11, yummy.stuff#1 AS yummy_stuff#12]
+- LogicalRDD [i.like.cheese#0, yummy.stuff#1], false

== Physical Plan ==
*(1) Project [i.like.cheese#0 AS i_like_cheese#11, yummy.stuff#1 AS yummy_stuff#12]

You could also wrap this code in a function and give it a method signature so it can be chained with the transform method.

withColumnRenamed antipattern when renaming multiple columns

You can call withColumnRenamed multiple times, but this isn’t a good solution because it creates a complex parsed logical plan.

Here the withColumnRenamed implementation:

def rename_cols(df):
    for column in df.columns:
        new_column = column.replace('.','_')
        df = df.withColumnRenamed(column, new_column)
    return df

rename_cols(df).explain(True)

Here are the logical plans:

== Parsed Logical Plan ==
Project [i_like_cheese#31, yummy.stuff#28 AS yummy_stuff#34]
+- Project [i.like.cheese#27 AS i_like_cheese#31, yummy.stuff#28]
   +- LogicalRDD [i.like.cheese#27, yummy.stuff#28], false

== Analyzed Logical Plan ==
i_like_cheese: string, yummy_stuff: string
Project [i_like_cheese#31, yummy.stuff#28 AS yummy_stuff#34]
+- Project [i.like.cheese#27 AS i_like_cheese#31, yummy.stuff#28]
   +- LogicalRDD [i.like.cheese#27, yummy.stuff#28], false

== Optimized Logical Plan ==
Project [i.like.cheese#27 AS i_like_cheese#31, yummy.stuff#28 AS yummy_stuff#34]
+- LogicalRDD [i.like.cheese#27, yummy.stuff#28], false

== Physical Plan ==
*(1) Project [i.like.cheese#27 AS i_like_cheese#31, yummy.stuff#28 AS yummy_stuff#34]

The parsed and analyzed logical plans are more complex than what we’ve seen before. In this case, the Spark Catalyst optimizer is smart enough to perform optimizations and generate the same optimized logical plan, but parsing logical plans takes time as described in this blog post. It’s best to write code that’s easy for Catalyst to optimize. When you can, write code that has simple parsed logical plans.

Renaming some columns

Suppose you have the following DataFrame with column names that use British English. You’d like to convert these column names to American English (change chips to french_fries and petrol to gas). You don’t want to rename or remove columns that aren’t being remapped to American English – you only want to change certain column names.

+------+-----+------+
| chips|   hi|petrol|
+------+-----+------+
|potato|hola!| disel|
+------+-----+------+

The quinn with_some_columns_renamed function makes it easy to rename some columns.

import quinn

mapping = {"chips": "french_fries", "petrol": "gas"}

def british_to_american(s):
    return mapping[s]

def change_col_name(s):
    return s in mapping

actual_df = quinn.with_some_columns_renamed(british_to_american, change_col_name)(source_df)

actual_df.show()

+------------+-----+-----+
|french_fries|   hi|  gas|
+------------+-----+-----+
|      potato|hola!|disel|
+------------+-----+-----+

This code generates an efficient parsed logical plan:

== Parsed Logical Plan ==
'Project ['chips AS french_fries#32, unresolvedalias('hi, None), 'petrol AS gas#33]
+- LogicalRDD [chips#16, hi#17, petrol#18], false

== Analyzed Logical Plan ==
french_fries: string, hi: string, gas: string
Project [chips#16 AS french_fries#32, hi#17, petrol#18 AS gas#33]
+- LogicalRDD [chips#16, hi#17, petrol#18], false

== Optimized Logical Plan ==
Project [chips#16 AS french_fries#32, hi#17, petrol#18 AS gas#33]
+- LogicalRDD [chips#16, hi#17, petrol#18], false

== Physical Plan ==
*(1) Project [chips#16 AS french_fries#32, hi#17, petrol#18 AS gas#33]

The with_some_columns_renamed function takes two arguments:

The first argument is a function specifies how the strings should be modified
The second argument is a function that returns True if the string should be modified and False otherwise

Replacing dots with underscores in column names

You should always replace dots with underscores in PySpark column names, as explained in this post.

You can apply the methodologies you’ve learned in this blog post to easily replace dots with underscores.

Next steps

It’s important to write code that renames columns efficiently in Spark. You’ll often start an analysis by read from a datasource and renaming the columns. If you use an inefficient renaming implementation, you’re parsed logical plan will start out complex and will only get more complicated as you layer on more DataFrame transformations.

Complicated parsed logical plans are difficult for the Catalyst optimizer to optimize.

Make sure to read this blog post on chaining DataFrame transformations, so you’re comfortable renaming columns with a function that’s passed to the DataFrame#transform method. Writing elegant PySpark code will help you keep your notebooks clean and easy to read.

MungingData

Piles of precious data