Create rule S7195: PySpark lit(None) should be used when populating empty columns (#4638)

2025-02-19 10:58:21 +00:00 · 2025-02-19 10:58:21 +00:00 · c046fc94c4
commit c046fc94c4
parent 9966f12d52
3 changed files with 93 additions and 0 deletions
--- a/rules/S7195/metadata.json
+++ b/rules/S7195/metadata.json
@@ -0,0 +1,2 @@
+{
+}
--- a/rules/S7195/python/metadata.json
+++ b/rules/S7195/python/metadata.json
@@ -0,0 +1,25 @@
+{
+  "title": "PySpark lit(None) should be used when populating empty columns",
+  "type": "CODE_SMELL",
+  "status": "ready",
+  "remediation": {
+    "func": "Constant\/Issue",
+    "constantCost": "5min"
+  },
+  "tags": [
+    "data-science",
+    "pyspark"
+  ],
+  "defaultSeverity": "Major",
+  "ruleSpecification": "RSPEC-7195",
+  "sqKey": "S7195",
+  "scope": "All",
+  "defaultQualityProfiles": ["Sonar way"],
+  "quickfix": "unknown",
+  "code": {
+    "impacts": {
+      "RELIABILITY": "MEDIUM"
+    },
+    "attribute": "CONVENTIONAL"
+  }
+}
--- a/rules/S7195/python/rule.adoc
+++ b/rules/S7195/python/rule.adoc
@@ -0,0 +1,66 @@
+This rule raises an issue when a column of a PySpark DataFrame is populated with `lit('')`.
+
+== Why is this an issue?
+
+In PySpark, when populating a DataFrame column with empty or null values, it is recommended to use `lit(None)`.
+Using a literal such as `lit('')` as a placeholder for absent values stores a real empty string in every row, so null checks treat the column as populated, which can lead to data misinterpretation and inconsistencies.
+
+Using `lit(None)` makes it explicit that the column is intentionally populated with null values, which keeps the codebase clear and consistent.
+It also preserves the ability to use functions such as `isnull` or `isnotnull` to check for null values in the DataFrame.
+
+== How to fix it
+
+To fix this issue, replace `lit('')` with `lit(None)` when populating a DataFrame column with empty/null values.
+
+=== Code examples
+
+==== Noncompliant code example
+
+[source,python,diff-id=1,diff-type=noncompliant]
+----
+from pyspark.sql import SparkSession
+from pyspark.sql.functions import lit
+
+spark = SparkSession.builder.appName("Example").getOrCreate()
+
+data = [
+    (1, "Alice"),
+    (2, "Bob"),
+    (3, "Charlie")
+]
+
+df = spark.createDataFrame(data, ["id", "name"])
+
+df_with_empty_column = df.withColumn("middle_name", lit('')) # Noncompliant: usage of lit('') to represent an empty value
+----
+
+==== Compliant solution
+
+[source,python,diff-id=1,diff-type=compliant]
+----
+from pyspark.sql import SparkSession
+from pyspark.sql.functions import lit
+
+spark = SparkSession.builder.appName("Example").getOrCreate()
+
+data = [
+    (1, "Alice"),
+    (2, "Bob"),
+    (3, "Charlie")
+]
+
+df = spark.createDataFrame(data, ["id", "name"])
+
+df_with_empty_column = df.withColumn("middle_name", lit(None)) # Compliant
+----
+
+== Resources
+=== Documentation
+
+* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.lit.html#pyspark-sql-functions-lit[pyspark-sql-functions-lit]
+* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.isnull.html#pyspark-sql-functions-isnull[pyspark-sql-functions-isnull]
+
+=== Standards
+
+* Palantir PySpark Style Guide - https://github.com/palantir/pyspark-style-guide?tab=readme-ov-file#empty-columns[empty-columns]
+