This rule identifies instances where multiple `withColumn` calls are used in succession to add or modify columns in a PySpark DataFrame. It suggests using `withColumns` instead, which is more efficient and concise.

== Why is this an issue?
Using `withColumn` multiple times can lead to inefficient code, as each call creates a new Spark Logical Plan. `withColumns` allows adding or modifying multiple columns in a single operation, improving performance.
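
To see the difference, one option is to print the logical plan with `DataFrame.explain`: each chained `withColumn` call typically appears as its own projection step in the parsed plan, while a single `withColumns` call adds all the new columns in one projection. The snippet below is a minimal sketch assuming a local Spark session; the column names are purely illustrative.

[source,python]
----
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1, 2], [2, 3]], ["id", "value"])

# Chained calls: the parsed logical plan shows one projection per withColumn call.
chained = df.withColumn("value_plus_1", col("value") + 1).withColumn("value_plus_2", col("value") + 2)
chained.explain(extended=True)

# A single withColumns call adds both columns in one projection.
grouped = df.withColumns({"value_plus_1": col("value") + 1, "value_plus_2": col("value") + 2})
grouped.explain(extended=True)
----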

=== What is the potential impact?
Creating a new column can be a costly operation, as Spark has to iterate over every row to compute the new column's value.

=== Exceptions
`withColumn` can be used multiple times sequentially on a DataFrame when computing consecutive columns requires the presence of the previous ones.
In this case, consecutive `withColumn` calls are an acceptable solution.

[source,python]
----
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1,2],[2,3]], ["id", "value"])
df_with_new_cols = df.withColumn("squared_value", col("value") * col("value")).withColumn("cubic_value", col("squared_value") * col("value")) # Compliant
----

== How to fix it
To fix this issue, use the `withColumns` method instead of multiple consecutive calls to `withColumn`.

=== Code examples

==== Noncompliant code example

[source,python,diff-id=1,diff-type=noncompliant]
----
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1,2],[2,3]], ["id", "value"])
df_with_new_cols = df.withColumn("value_plus_1", col("value") + 1).withColumn("value_plus_2", col("value") + 2).withColumn("value_plus_3", col("value") + 3) # Noncompliant
----

==== Compliant solution

[source,python,diff-id=1,diff-type=compliant]
----
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1,2],[2,3]], ["id", "value"])
df_with_new_cols = df.withColumns({ # Compliant
"value_plus_1": col("value") + 1,
"value_plus_2": col("value") + 2,
"value_plus_3": col("value") + 3,
})
----

== Resources
=== Documentation
* PySpark withColumn Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumn.html[pyspark.sql.DataFrame.withColumn]
* PySpark withColumns Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.withColumns.html[pyspark.sql.DataFrame.withColumns]

=== Articles & blog posts
* Medium blog - https://blog.devgenius.io/why-to-avoid-multiple-chaining-of-withcolumn-function-in-spark-job-35ee8e09daaa[Why to avoid multiple chaining of withColumn() function in Spark job.]