Combining PySpark arrays with concat, union, except and intersect

This post shows the different ways to combine multiple PySpark arrays into a single array.

These operations were difficult prior to Spark 2.4, but now there are built-in functions that make combining arrays easy.

concat

concat joins two array columns into a single array.

Creating a DataFrame with two array columns so we can demonstrate with an example.

df = spark.createDataFrame(
    [(["a", "a", "b", "c"], ["c", "d"])], ["arr1", "arr2"]
)
df.show()

+------------+------+
|        arr1|  arr2|
+------------+------+
|[a, a, b, c]|[c, d]|
+------------+------+

Concatenate the two arrays with concat:

res = df.withColumn("arr_concat", concat(col("arr1"), col("arr2")))
res.show()

+------------+------+------------------+
|        arr1|  arr2|        arr_concat|
+------------+------+------------------+
|[a, a, b, c]|[c, d]|[a, a, b, c, c, d]|
+------------+------+------------------+

Notice that arr_concat contains duplicate values.

We can remove the duplicates with array_distinct:

df.withColumn(
    "arr_concat_distinct", array_distinct(concat(col("arr1"), col("arr2")))
).show()

+------------+------+-------------------+
|        arr1|  arr2|arr_concat_distinct|
+------------+------+-------------------+
|[a, a, b, c]|[c, d]|       [a, b, c, d]|
+------------+------+-------------------+

Let’s look at another way to return a distinct concatenation of two arrays that isn’t as verbose.

array_union

array_union combines two arrays, without any duplicates.

res = df.withColumn("arr_union", array_union(col("arr1"), col("arr2")))
res.show()

+------------+------+------------+
|        arr1|  arr2|   arr_union|
+------------+------+------------+
|[a, a, b, c]|[c, d]|[a, b, c, d]|
+------------+------+------------+

We can get the same result by nesting concat in array_distinct, but that’s less efficient and unnecessarily verbose.

array_intersect

array_intersect returns the elements that are in both arrays.

res = df.withColumn("arr_intersect", array_intersect(col("arr1"), col("arr2")))
res.show()

+------------+------+-------------+
|        arr1|  arr2|arr_intersect|
+------------+------+-------------+
|[a, a, b, c]|[c, d]|          [c]|
+------------+------+-------------+

In our example, c is the only element that’s in both arr1 and arr2.

array_except

array_except returns a distinct list of the elements that are in arr1, but not in arr2.

res = df.withColumn("arr_except", array_except(col("arr1"), col("arr2")))
res.show()

+------------+------+----------+
|        arr1|  arr2|arr_except|
+------------+------+----------+
|[a, a, b, c]|[c, d]|    [a, b]|
+------------+------+----------+

a and b are in arr1 and not in arr2.

Conclusion

Several functions were added in PySpark 2.4 that make it significantly easier to work with array columns.

Earlier versions of Spark required you to write UDFs to perform basic array functions which was tedious.

This post doesn’t cover all the important array functions. Make sure to also learn about the exists and forall functions and the transform / filter functions.

You’ll be a PySpark array master once you’re comfortable with these functions.

MungingData

Piles of precious data