This rule raises an issue when no value is provided to the `subset` parameter of PySpark DataFrame's `dropDuplicates` method.

== Why is this an issue?

In PySpark, the `dropDuplicates` method is used to remove duplicate rows from a DataFrame. By default, if no column names are provided, `dropDuplicates` considers all columns when identifying duplicates. This default is defensive and avoids removing rows that are only partially similar, but it can also lead to:

* unintended results. The simplest example is trying to remove duplicates from a DataFrame that holds a unique id per row. It is easy to forget that the id is part of the DataFrame, and because every id is distinct, the output DataFrame is identical to the input. For example, applying `dropDuplicates` on the following DataFrame will not remove any rows:
+
[source,text]
----
+---+-----+---+
| id| name|age|
+---+-----+---+
|  1|Alice| 29|
|  2|  Bob| 29|
|  3|Alice| 29|
|  4|Alice| 30|
|  5|  Bob| 29|
+---+-----+---+
----

* performance inefficiencies. Identifying duplicates is a costly operation, as Spark has to compare every column of each row against the others. Restricting the comparison to a subset of columns reduces that work.

To ensure clarity, prevent incorrect results, and optimize performance, it is good practice to specify the column names when using `dropDuplicates`.

This rule raises issues on `pyspark.sql.DataFrame.dropDuplicates`, `pyspark.sql.DataFrame.drop_duplicates` and `pyspark.sql.DataFrame.dropDuplicatesWithinWatermark`.

=== Exceptions

If, however, the intent is to remove duplicates based on all columns, the `distinct` method can be used, or `None` can be passed explicitly to the `subset` parameter. This way the intention is clear, and this rule will not raise any issues.

[source,python]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = ...
df = spark.createDataFrame(data, ["id", "name", "age"])

df_dedup = df.dropDuplicates(None)         # Compliant
df_dedup = df.dropDuplicates(subset=None)  # Compliant
df_dedup = df.distinct()                   # Compliant
----

== How to fix it

To fix this issue, provide the column names to the `subset` parameter of the `dropDuplicates` method, or use the `distinct` method instead.
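The same fix applies to streaming deduplication: `dropDuplicatesWithinWatermark` (available since Spark 3.5) accepts the same `subset` parameter. The sketch below is a minimal, illustrative example: the built-in `rate` source and its `timestamp`/`value` columns merely stand in for any streaming DataFrame on which a watermark has been defined.

[source,python]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Illustrative streaming source: the "rate" source emits `timestamp`
# and `value` columns; any streaming DataFrame with an event-time
# watermark works the same way.
events = (
    spark.readStream.format("rate").load()
    .withWatermark("timestamp", "10 minutes")
)

# Duplicates are identified by the `value` column only, within the
# bounds tracked by the watermark.
deduped = events.dropDuplicatesWithinWatermark(subset=["value"])
----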
=== Code examples

==== Noncompliant code example

[source,python,diff-id=1,diff-type=noncompliant]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    (1, "Alice", 29),
    (2, "Bob", 29),
    (3, "Alice", 29),
    (4, "Alice", 30),
    (5, "Bob", 29)
]
df = spark.createDataFrame(data, ["id", "name", "age"])

df_dedup = df.dropDuplicates()  # Noncompliant: no column names are specified
----

The code example above results in no rows being removed:

[cols="1,3,1"]
|===
|id |name |age

|1 |Alice |29
|2 |Bob |29
|3 |Alice |29
|4 |Alice |30
|5 |Bob |29
|===

==== Compliant solution

[source,python,diff-id=1,diff-type=compliant]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    (1, "Alice", 29),
    (2, "Bob", 29),
    (3, "Alice", 29),
    (4, "Alice", 30),
    (5, "Bob", 29)
]
df = spark.createDataFrame(data, ["id", "name", "age"])

df_dedup = df.dropDuplicates(subset=["name", "age"])  # Compliant
----

In this example, duplicates are removed based on the `name` and `age` columns:

[cols="1,3,1"]
|===
|id |name |age

|1 |Alice |29
|2 |Bob |29
|4 |Alice |30
|===

== Resources

=== Documentation

* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.DataFrame.dropDuplicates.html[pyspark.sql.DataFrame.dropDuplicates]

=== Articles & blog posts

* Stratascratch blog - https://www.stratascratch.com/blog/how-to-drop-duplicates-in-pyspark/[How to drop duplicates in PySpark]
* Medium blog - https://medium.com/@santosh_beora/distinct-and-dropduplicates-in-pyspark-fedb1e9e8738[distinct() and dropDuplicates() in PySpark]