This rule identifies instances where multiple `withColumn` calls are used in succession to add or modify columns in a PySpark DataFrame. It suggests using `withColumns` instead, which is more efficient and concise.

== Why is this an issue?

Using `withColumn` multiple times can lead to inefficient code, as each call creates a new Spark Logical Plan. `withColumns` allows adding or modifying multiple columns in a single operation, improving performance.

=== What is the potential impact?

Creating a new column can be a costly operation, as Spark has to loop over every row to compute the new column value.
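
The extra plans created by chained `withColumn` calls can be inspected with `DataFrame.explain()`. The sketch below (an illustration, not one of the rule's code examples) prints the plans for both approaches; the exact output depends on the Spark version, but each chained `withColumn` wraps the plan in an additional projection, while a single `withColumns` call declares all new columns at once.

[source,python]
----
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1, 2], [2, 3]], ["id", "value"])

# Each withColumn call wraps the previous plan in another projection.
chained = df.withColumn("a", col("value") + 1).withColumn("b", col("value") + 2)
chained.explain(extended=True)  # prints parsed, analyzed, optimized and physical plans

# A single withColumns call adds all new columns in one projection.
combined = df.withColumns({"a": col("value") + 1, "b": col("value") + 2})
combined.explain(extended=True)
----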

=== Exceptions

`withColumn` can be used multiple times sequentially on a DataFrame when computing consecutive columns requires the presence of the previous ones.
In this case, consecutive `withColumn` calls are an acceptable solution.

[source,python]
----
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1,2],[2,3]], ["id", "value"])
df_with_new_cols = df.withColumn("squared_value", col("value") * col("value")).withColumn("cubic_value", col("squared_value") * col("value")) # Compliant
----

== How to fix it

To fix this issue, the `withColumns` method should be used instead of multiple consecutive calls to the `withColumn` method.

=== Code examples

==== Noncompliant code example

[source,python,diff-id=1,diff-type=noncompliant]
----
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1,2],[2,3]], ["id", "value"])
df_with_new_cols = df.withColumn("value_plus_1", col("value") + 1).withColumn("value_plus_2", col("value") + 2).withColumn("value_plus_3", col("value") + 3) # Noncompliant
----

==== Compliant solution

[source,python,diff-id=1,diff-type=compliant]
----
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1,2],[2,3]], ["id", "value"])
df_with_new_cols = df.withColumns({  # Compliant
    "value_plus_1": col("value") + 1,
    "value_plus_2": col("value") + 2,
    "value_plus_3": col("value") + 3,
})
----

== Resources

=== Documentation

* PySpark withColumn Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumn.html[pyspark.sql.DataFrame.withColumn]
* PySpark withColumns Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumns.html[pyspark.sql.DataFrame.withColumns]

=== Articles & blog posts

* Medium blog - https://blog.devgenius.io/why-to-avoid-multiple-chaining-of-withcolumn-function-in-spark-job-35ee8e09daaa[Why to avoid multiple chaining of withColumn() function in Spark job.]