This rule raises an issue when a column of a PySpark DataFrame is populated with `lit('')`.

== Why is this an issue?

In PySpark, when populating a DataFrame column with empty or null values, it is recommended to use `lit(None)`.
Using a literal such as `lit('')` as a placeholder for absent values can lead to data misinterpretation and inconsistencies, because an empty string is a regular, non-null value: a placeholder cannot be told apart from a string that is genuinely empty.

Using `lit(None)` makes it explicit that the column is intentionally populated with null values, which keeps the codebase clear and consistent.
It also preserves the ability to use functions such as `isnull` or `isnotnull` to check for null values in the DataFrame.

== How to fix it

To fix this issue, replace `lit('')` with `lit(None)` when populating a DataFrame column with empty or null values.

=== Code examples

==== Noncompliant code example

[source,python,diff-id=1,diff-type=noncompliant]
----
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("Example").getOrCreate()

data = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie")
]

df = spark.createDataFrame(data, ["id", "name"])

df_with_empty_column = df.withColumn("middle_name", lit(''))  # Noncompliant: usage of lit('') to represent an empty value
----

==== Compliant solution

[source,python,diff-id=1,diff-type=compliant]
----
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.appName("Example").getOrCreate()

data = [
    (1, "Alice"),
    (2, "Bob"),
    (3, "Charlie")
]

df = spark.createDataFrame(data, ["id", "name"])

df_with_empty_column = df.withColumn("middle_name", lit(None))  # Compliant
----

== Resources

=== Documentation

* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.lit.html#pyspark-sql-functions-lit[pyspark-sql-functions-lit]
* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.isnull.html#pyspark-sql-functions-isnull[pyspark-sql-functions-isnull]

=== Standards

* Palantir PySpark Style Guide - https://github.com/palantir/pyspark-style-guide?tab=readme-ov-file#empty-columns[empty-columns]