This rule identifies instances where multiple `withColumn` calls are used in succession to add or modify columns in a PySpark DataFrame. It suggests using `withColumns` instead, which is more efficient and concise.

== Why is this an issue?

Using `withColumn` multiple times can lead to inefficient code, as each call creates a new Spark Logical Plan. `withColumns` allows adding or modifying multiple columns in a single operation, improving performance.

=== What is the potential impact?

Creating a new column can be a costly operation, as Spark has to loop over every row to compute the new column value.
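
The extra plans created by chained `withColumn` calls can be inspected with `DataFrame.explain()`. The sketch below (an illustration, not one of the rule's code examples) prints the plans for both approaches; the exact output depends on the Spark version, but each chained `withColumn` wraps the plan in an additional projection, while a single `withColumns` call declares all new columns at once.

[source,python]
----
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1, 2], [2, 3]], ["id", "value"])

# Each withColumn call wraps the previous plan in another projection.
chained = df.withColumn("a", col("value") + 1).withColumn("b", col("value") + 2)
chained.explain(extended=True)  # prints parsed, analyzed, optimized and physical plans

# A single withColumns call adds all new columns in one projection.
combined = df.withColumns({"a": col("value") + 1, "b": col("value") + 2})
combined.explain(extended=True)
----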

=== Exceptions

`withColumn` can be used multiple times sequentially on a DataFrame when computing consecutive columns requires the presence of the previous ones.
In this case, consecutive `withColumn` calls are an acceptable solution.

[source,python]
----
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1,2],[2,3]], ["id", "value"])
df_with_new_cols = df.withColumn("squared_value", col("value") * col("value")).withColumn("cubic_value", col("squared_value") * col("value")) # Compliant
----

== How to fix it

To fix this issue, the `withColumns` method should be used instead of multiple consecutive calls to the `withColumn` method.

=== Code examples

==== Noncompliant code example

[source,python,diff-id=1,diff-type=noncompliant]
----
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1,2],[2,3]], ["id", "value"])
df_with_new_cols = df.withColumn("value_plus_1", col("value") + 1).withColumn("value_plus_2", col("value") + 2).withColumn("value_plus_3", col("value") + 3) # Noncompliant
----

==== Compliant solution

[source,python,diff-id=1,diff-type=compliant]
----
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1,2],[2,3]], ["id", "value"])
df_with_new_cols = df.withColumns({  # Compliant
    "value_plus_1": col("value") + 1,
    "value_plus_2": col("value") + 2,
    "value_plus_3": col("value") + 3,
})
----

== Resources

=== Documentation

* PySpark withColumn Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumn.html[pyspark.sql.DataFrame.withColumn]
* PySpark withColumns Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumns.html[pyspark.sql.DataFrame.withColumns]

=== Articles & blog posts

* Medium blog - https://blog.devgenius.io/why-to-avoid-multiple-chaining-of-withcolumn-function-in-spark-job-35ee8e09daaa[Why to avoid multiple chaining of withColumn() function in Spark job.]