67 lines
2.2 KiB
Plaintext
67 lines
2.2 KiB
Plaintext
This rule raises an issue when a column of a PySpark DataFrame is populated with `lit('')`.
|
|
|
|
== Why is this an issue?
|
|
|
|
In PySpark, when populating a DataFrame column with empty or null values, it is recommended to use `lit(None)`.
|
|
Using literals such as `lit('')` as a placeholder for absent values can lead to data misinterpretation and inconsistencies.
|
|
|
|
The usage of `lit(None)` ensures clarity and consistency in the codebase, making it explicit that the column is intentionally populated with null values.
|
|
Using `lit(None)` also preserves the ability to use functions such as `isnull` or `isnotnull` to check for null values in the DataFrame.
|
|
|
|
== How to fix it
|
|
|
|
To fix this issue, replace `lit('')` with `lit(None)` when populating a DataFrame column with empty/null values.
|
|
|
|
=== Code examples
|
|
|
|
==== Noncompliant code example
|
|
|
|
[source,python,diff-id=1,diff-type=noncompliant]
|
|
----
|
|
from pyspark.sql import SparkSession
|
|
from pyspark.sql.functions import lit
|
|
|
|
spark = SparkSession.builder.appName("Example").getOrCreate()
|
|
|
|
data = [
|
|
(1, "Alice"),
|
|
(2, "Bob"),
|
|
(3, "Charlie")
|
|
]
|
|
|
|
df = spark.createDataFrame(data, ["id", "name"])
|
|
|
|
df_with_empty_column = df.withColumn("middle_name", lit('')) # Noncompliant: usage of lit('') to represent en empty value
|
|
----
|
|
|
|
==== Compliant solution
|
|
|
|
[source,python,diff-id=1,diff-type=compliant]
|
|
----
|
|
from pyspark.sql import SparkSession
|
|
from pyspark.sql.functions import lit
|
|
|
|
spark = SparkSession.builder.appName("Example").getOrCreate()
|
|
|
|
data = [
|
|
(1, "Alice"),
|
|
(2, "Bob"),
|
|
(3, "Charlie")
|
|
]
|
|
|
|
df = spark.createDataFrame(data, ["id", "name"])
|
|
|
|
df_with_empty_column = df.withColumn("middle_name", lit(None)) # Compliant
|
|
----
|
|
|
|
== Resources
|
|
=== Documentation
|
|
|
|
* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.lit.html#pyspark-sql-functions-lit[pyspark-sql-functions-lit]
|
|
* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.isnull.html#pyspark-sql-functions-isnull[pyspark-sql-functions-isnull]
|
|
|
|
=== Standards
|
|
|
|
* Palantir PySpark Style Guide - https://github.com/palantir/pyspark-style-guide?tab=readme-ov-file#empty-columns[empty-columns]
|
|
|