select and add columns in PySpark

This post shows you how to select a subset of the columns in a DataFrame with select. It also shows how select can be used to add and rename columns. Most PySpark users don’t know how to truly harness the power of select.

This post also shows how to add a column with withColumn. Newbie PySpark developers often run withColumn multiple times to add multiple columns because there isn’t a withColumns method. We will see why chaining multiple withColumn calls is an anti-pattern and how to avoid this pattern with select.

This post starts with basic use cases and then advances to the lesser-known, powerful applications of these methods.

select basic use case

Create a DataFrame with three columns.

df = spark.createDataFrame(
    [("jose", 1, "mexico"), ("li", 2, "china"), ("sandy", 3, "usa")],
    ["name", "age", "country"],
)
df.show()
+-----+---+-------+
| name|age|country|
+-----+---+-------+
| jose|  1| mexico|
|   li|  2|  china|
|sandy|  3|    usa|
+-----+---+-------+

Select the age and name columns:

df.select("age", "name").show()
+---+-----+
|age| name|
+---+-----+
|  1| jose|
|  2|   li|
|  3|sandy|
+---+-----+

The select method takes column names as arguments.

If you try to select a column that doesn’t exist in the DataFrame, your code will error out. Here’s the error you’ll see if you run df.select("age", "name", "whatever").

def deco(*a, **kw):
    try:
        return f(*a, **kw)
    except py4j.protocol.Py4JJavaError as e:
        converted = convert_exception(e.java_exception)
        if not isinstance(converted, UnknownException):
            # Hide where the exception came from that shows a non-Pythonic
            # JVM exception message.
           raise converted from None
           pyspark.sql.utils.AnalysisException: cannot resolve '`whatever`' given input columns: [age, country, name];
           'Project [age#77L, name#76, 'whatever]
           +- LogicalRDD [name#76, age#77L, country#78], false

Get used to parsing PySpark stack traces!

The select method can also take an array of column names as the argument.

df.select(["country", "name"]).show()
+-------+-----+
|country| name|
+-------+-----+
| mexico| jose|
|  china|   li|
|    usa|sandy|
+-------+-----+

You can also select based on an array of column objects:

df.select([col("age")]).show()
+---+
|age|
+---+
|  1|
|  2|
|  3|
+---+

Keep reading to see how selecting on an array of column object allows for advanced use cases, like renaming columns.

withColumn basic use case

withColumn adds a column to a DataFrame.

Create a DataFrame with two columns:

df = spark.createDataFrame(
    [("jose", 1), ("li", 2), ("luisa", 3)], ["name", "age"]
)
df.show()
+-----+---+
| name|age|
+-----+---+
| jose|  1|
|   li|  2|
|luisa|  3|
+-----+---+

Append a greeting column to the DataFrame with the string hello:

df.withColumn("greeting", lit("hello")).show()
+-----+---+--------+
| name|age|greeting|
+-----+---+--------+
| jose|  1|   hello|
|   li|  2|   hello|
|luisa|  3|   hello|
+-----+---+--------+

Now let’s use withColumn to append an upper_name column that uppercases the name column.

df.withColumn("upper_name", upper(col("name"))).show()
+-----+---+----------+
| name|age|upper_name|
+-----+---+----------+
| jose|  1|      JOSE|
|   li|  2|        LI|
|luisa|  3|     LUISA|
+-----+---+----------+

withColumn is often used to append columns based on the values of other columns.

Add multiple columns (withColumns)

There isn’t a withColumns method, so most PySpark newbies call withColumn multiple times when they need to add multiple columns to a DataFrame.

Create a simple DataFrame:

df = spark.createDataFrame(
    [("cali", "colombia"), ("london", "uk")],
    ["city", "country"],
)
df.show()
+------+--------+
|  city| country|
+------+--------+
|  cali|colombia|
|london|      uk|
+------+--------+

Here’s how to append two columns with constant values to the DataFrame using select:

actual = df.select(["*", lit("val1").alias("col1"), lit("val2").alias("col2")])
actual.show()
+------+--------+----+----+
|  city| country|col1|col2|
+------+--------+----+----+
|  cali|colombia|val1|val2|
|london|      uk|val1|val2|
+------+--------+----+----+

The * selects all of the existing DataFrame columns and the other columns are appended. This design pattern is how select can append columns to a DataFrame, just like withColumn.

The code is a bit verbose, but it’s better than the following code that calls withColumn multiple times:

df.withColumn("col1", lit("val1")).withColumn("col2", lit("val2"))

There is a hidden cost of withColumn and calling it multiple times should be avoided.

The Spark contributors are considering adding withColumns to the API, which would be the best option. That’d give the community a clean and performant way to add multiple columns.

Snake case all columns

Create a DataFrame with annoyingly named columns:

annoying = spark.createDataFrame(
    [(3, "mystery"), (23, "happy")],
    ["COOL NUMBER", "RELATED EMOTION"],
)
annoying.show()
+-----------+---------------+
|COOL NUMBER|RELATED EMOTION|
+-----------+---------------+
|          3|        mystery|
|         23|          happy|
+-----------+---------------+

Gross.

Write some code that’ll convert all the column names to snake_case:

def to_snake_case(s):
    return s.lower().replace(" ", "_")

cols = [col(s).alias(to_snake_case(s)) for s in annoying.columns]
annoying.select(cols).show()
+-----------+---------------+
|cool_number|related_emotion|
+-----------+---------------+
|          3|        mystery|
|         23|          happy|
+-----------+---------------+

Some DataFrames have hundreds or thousands of columns, so it’s important to know how to rename all the columns programatically with a loop, followed by a select.

Remove dots from all column names

Create a DataFrame with dots in the column names:

annoying = spark.createDataFrame(
    [(3, "mystery"), (23, "happy")],
    ["cool.number", "related.emotion"],
)
annoying.show()
+-----------+---------------+
|cool.number|related.emotion|
+-----------+---------------+
|          3|        mystery|
|         23|          happy|
+-----------+---------------+

Remove the dots from the column names and replace them with underscores.

cols = [col("`" + s + "`").alias(s.replace(".", "_")) for s in annoying.columns]
annoying.select(cols).show()
+-----------+---------------+
|cool_number|related_emotion|
+-----------+---------------+
|          3|        mystery|
|         23|          happy|
+-----------+---------------+

Notice that this code hacks in backticks around the column name or else it’ll error out (simply calling col(s) will cause an error in this case). These backticks are needed whenever the column name contains periods. Super annoying.

You should never have dots in your column names as discussed in this post. Dots in column names cause weird bugs. Always get rid of dots in column names whenever you see them.

Conclusion

The select method can be used to grab a subset of columns, rename columns, or append columns. It’s a powerful method that has a variety of applications.

withColumn is useful for adding a single column. It shouldn’t be chained when adding multiple columns (fine to chain a few times, but shouldn’t be chained hundreds of times). You now know how to append multiple columns with select, so you can avoid chaining withColumn calls.

Hopefully withColumns is added to the PySpark codebase so it’s even easier to add multiple columns.

Registration

Comments are closed, but trackbacks and pingbacks are open.