Create rule S7195: PySpark lit(None) should be used when populating empty columns (#4638)

github-actions[bot] 2025-02-19 10:58:21 +00:00 committed by GitHub
parent 9966f12d52
commit c046fc94c4
GPG Key ID: B5690EEEBB952194
3 changed files with 93 additions and 0 deletions


@ -0,0 +1,2 @@
{
}


@ -0,0 +1,25 @@
{
"title": "PySpark lit(None) should be used when populating empty columns",
"type": "CODE_SMELL",
"status": "ready",
"remediation": {
"func": "Constant\/Issue",
"constantCost": "5min"
},
"tags": [
"data-science",
"pyspark"
],
"defaultSeverity": "Major",
"ruleSpecification": "RSPEC-7195",
"sqKey": "S7195",
"scope": "All",
"defaultQualityProfiles": ["Sonar way"],
"quickfix": "unknown",
"code": {
"impacts": {
"RELIABILITY": "MEDIUM"
},
"attribute": "CONVENTIONAL"
}
}


@ -0,0 +1,66 @@
This rule raises an issue when a column of a PySpark DataFrame is populated with `lit('')`.
== Why is this an issue?
In PySpark, when populating a DataFrame column with empty or null values, it is recommended to use `lit(None)`.
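Using literals such as `lit('')` as a placeholder for absent values can lead to data misinterpretation and inconsistencies: an empty string is a valid, non-null value, so null checks do not match it and aggregations such as `count` treat it as present.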
The usage of `lit(None)` ensures clarity and consistency in the codebase, making it explicit that the column is intentionally populated with null values.
Using `lit(None)` also preserves the ability to use functions such as `isnull` or `isnotnull` to check for null values in the DataFrame.
== How to fix it
To fix this issue, replace `lit('')` with `lit(None)` when populating a DataFrame column with empty/null values.
=== Code examples
==== Noncompliant code example
[source,python,diff-id=1,diff-type=noncompliant]
----
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark = SparkSession.builder.appName("Example").getOrCreate()
data = [
(1, "Alice"),
(2, "Bob"),
(3, "Charlie")
]
df = spark.createDataFrame(data, ["id", "name"])
df_with_empty_column = df.withColumn("middle_name", lit('')) # Noncompliant: usage of lit('') to represent an empty value
----
==== Compliant solution
[source,python,diff-id=1,diff-type=compliant]
----
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit
spark = SparkSession.builder.appName("Example").getOrCreate()
data = [
(1, "Alice"),
(2, "Bob"),
(3, "Charlie")
]
df = spark.createDataFrame(data, ["id", "name"])
df_with_empty_column = df.withColumn("middle_name", lit(None)) # Compliant
----
== Resources
=== Documentation
* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.lit.html#pyspark-sql-functions-lit[pyspark-sql-functions-lit]
* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.isnull.html#pyspark-sql-functions-isnull[pyspark-sql-functions-isnull]
=== Standards
* Palantir PySpark Style Guide - https://github.com/palantir/pyspark-style-guide?tab=readme-ov-file#empty-columns[empty-columns]