Create rule S7182: The subset argument should be provided when using PySpark DataFrame dropDuplicates (#4615)

* Create Rule S7182: The `subset` argument should be provided when using PySpark DataFrame `dropDuplicates`



---------

Co-authored-by: joke1196 <joke1196@users.noreply.github.com>
Co-authored-by: David Kunzmann <david.kunzmann@sonarsource.com>
github-actions[bot] 2025-02-20 11:20:42 +01:00 committed by GitHub
parent fdf295d151
commit 9d7de6d39d
3 changed files with 162 additions and 0 deletions


@ -0,0 +1,2 @@
{
}


@ -0,0 +1,26 @@
{
"title": "The \"subset\" argument should be provided when using PySpark DataFrame \"dropDuplicates\" method",
"type": "CODE_SMELL",
"status": "ready",
"remediation": {
"func": "Constant\/Issue",
"constantCost": "5min"
},
"tags": [
"pyspark",
"data-science"
],
"defaultSeverity": "Major",
"ruleSpecification": "RSPEC-7182",
"sqKey": "S7182",
"scope": "All",
"defaultQualityProfiles": ["Sonar way"],
"quickfix": "partial",
"code": {
"impacts": {
"MAINTAINABILITY": "MEDIUM",
"RELIABILITY": "MEDIUM"
},
"attribute": "CONVENTIONAL"
}
}


@ -0,0 +1,134 @@
This rule raises an issue when no value is provided to the `subset` parameter of PySpark DataFrame's `dropDuplicates` method.
== Why is this an issue?
In PySpark, the `dropDuplicates` method is used to remove duplicate rows from a DataFrame.
By default, if no column names are provided, `dropDuplicates` will consider all columns to identify duplicates.
This default is defensive and avoids removing rows that are only partially identical, but it can also lead to:
* unintended results. The simplest example is calling `dropDuplicates` on a DataFrame that holds
a unique id per row. It is easy to forget that the id is part of the DataFrame, and because every id is unique, the output DataFrame ends up identical to the input DataFrame.
For example, applying `dropDuplicates` on the following DataFrame will not remove any rows:
[source, text]
----
+---+-----+---+
| id| name|age|
+---+-----+---+
| 1|Alice| 29|
| 2| Bob| 29|
| 3|Alice| 29|
| 4|Alice| 30|
| 5| Bob| 29|
+---+-----+---+
----
* performance inefficiencies. Identifying duplicates is a costly operation, as Spark has to compare the selected columns across all rows; including every column makes this comparison as expensive as it can be, as the sketch below illustrates.
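As a rough illustration, assuming a small example DataFrame, comparing the query plans of the two calls shows how the `subset` argument determines which columns take part in the deduplication (the exact plans depend on the Spark version and configuration):
[source,python]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "Alice", 29), (2, "Bob", 29)], ["id", "name", "age"])

df.dropDuplicates().explain()                 # deduplicates on id, name and age
df.dropDuplicates(["name", "age"]).explain()  # deduplicates on name and age only
----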
To ensure clarity, prevent incorrect results, and optimize performance,
it is a good practice to specify the column names when using `dropDuplicates`.
This rule will raise issues on `pyspark.sql.DataFrame.dropDuplicates`, `pyspark.sql.DataFrame.drop_duplicates`
and `pyspark.sql.DataFrame.dropDuplicatesWithinWatermark`.
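For instance, each of the following calls would be flagged because no `subset` is given (a minimal sketch; `dropDuplicatesWithinWatermark` additionally requires a streaming DataFrame with a watermark and is therefore only hinted at in a comment):
[source,python]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(1, "Alice", 29)], ["id", "name", "age"])

df.dropDuplicates()   # Noncompliant: no subset provided
df.drop_duplicates()  # Noncompliant: alias of dropDuplicates, same default behavior

# dropDuplicatesWithinWatermark applies to streaming DataFrames that define a watermark, e.g.
# streaming_df.withWatermark("event_time", "10 minutes").dropDuplicatesWithinWatermark()  # Noncompliant
----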
=== Exceptions
If, however, the intent is to remove duplicates based on all columns, the `distinct` method can be used, or
`None` can be passed explicitly to the `subset` parameter. This way the intention is clear, and the rule will not raise an issue.
[source,python]
----
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = ...
df = spark.createDataFrame(data, ["id", "name", "age"])
df_dedup = df.dropDuplicates(None) # Compliant
df_dedup = df.dropDuplicates(subset=None) # Compliant
df_dedup = df.distinct() # Compliant
----
== How to fix it
To fix this issue, provide the column names to the `subset` parameter of the `dropDuplicates` method or use the `distinct` method instead.
=== Code examples
==== Noncompliant code example
[source,python,diff-id=1,diff-type=noncompliant]
----
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [
(1, "Alice", 29),
(2, "Bob", 29),
(3, "Alice", 29),
(4, "Alice", 30),
(5, "Bob", 29)
]
df = spark.createDataFrame(data, ["id", "name", "age"])
df_dedup = df.dropDuplicates() # Noncompliant: no column names are specified
----
The code example above results in no rows being removed:
[cols="1,3,1",options="header"]
|===
|id |name |age
| 1|Alice| 29
| 2| Bob| 29
| 3|Alice| 29
| 4|Alice| 30
| 5| Bob| 29
|===
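Since every `id` in this DataFrame is unique, a quick count comparison confirms that nothing was dropped (a small sketch reusing `df` and `df_dedup` from the example above):
[source,python]
----
assert df_dedup.count() == df.count()  # all 5 rows remain: each row is unique once the id column is considered
----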
==== Compliant solution
[source,python,diff-id=1,diff-type=compliant]
----
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
data = [
(1, "Alice", 29),
(2, "Bob", 29),
(3, "Alice", 29),
(4, "Alice", 30),
(5, "Bob", 29)
]
df = spark.createDataFrame(data, ["id", "name", "age"])
df_dedup = df.dropDuplicates(subset=["name", "age"]) # Compliant
----
In this example, duplicates are removed based on the `name` and `age` columns:
[cols="1,3,1",options="header"]
|===
|id |name |age
| 1|Alice| 29
| 2| Bob| 29
| 4|Alice| 30
|===
== Resources
=== Documentation
* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.dropDuplicates.html[pyspark.sql.DataFrame.dropDuplicates]
=== Articles & blog posts
* StrataScratch blog - https://www.stratascratch.com/blog/how-to-drop-duplicates-in-pyspark/[How to drop duplicates in PySpark]
* Medium blog - https://medium.com/@santosh_beora/distinct-and-dropduplicates-in-pyspark-fedb1e9e8738[distinct() and dropDuplicates() in PySpark]