Create rule S7187: PySpark Pandas DataFrame columns should not use a reserved name (#4622)
* Create rule S7187: PySpark Pandas DataFrame columns should not use a reserved name

Co-authored-by: joke1196 <joke1196@users.noreply.github.com>
Co-authored-by: David Kunzmann <david.kunzmann@sonarsource.com>
This commit is contained in:
parent f26dc7084d
commit ba18ae7f08
2 rules/S7187/metadata.json Normal file
@@ -0,0 +1,2 @@
{
}
25 rules/S7187/python/metadata.json Normal file
@@ -0,0 +1,25 @@
{
  "title": "PySpark Pandas DataFrame columns should not use a reserved name",
  "type": "CODE_SMELL",
  "status": "ready",
  "remediation": {
    "func": "Constant\/Issue",
    "constantCost": "5min"
  },
  "tags": [
    "data-science",
    "pyspark"
  ],
  "defaultSeverity": "Major",
  "ruleSpecification": "RSPEC-7187",
  "sqKey": "S7187",
  "scope": "All",
  "defaultQualityProfiles": ["Sonar way"],
  "quickfix": "infeasible",
  "code": {
    "impacts": {
      "RELIABILITY": "MEDIUM"
    },
    "attribute": "CONVENTIONAL"
  }
}
42 rules/S7187/python/rule.adoc Normal file
@@ -0,0 +1,42 @@
This rule raises an issue when a PySpark Pandas DataFrame column name is set to a reserved name.

== Why is this an issue?

PySpark offers powerful APIs to work with Pandas DataFrames in a distributed environment.
While the integration between PySpark and Pandas is seamless, there are some caveats that should be taken into account.

The Spark Pandas API uses some special column names for internal purposes.
These column names have a leading `++__++` and a trailing `++__++`.
Therefore, when naming or renaming columns with PySpark and Pandas,
using such reserved column names is discouraged, as they are not guaranteed to yield the expected results.
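The reserved-name convention described above can be sketched as a plain-Python check. This helper is hypothetical (it is not part of PySpark nor of this rule's implementation); it only illustrates the leading and trailing `++__++` pattern the rule looks for:

```python
import re

# Hypothetical helper (not part of PySpark): flags column names that use
# the reserved pattern -- a leading and a trailing double underscore,
# e.g. "__value__".
RESERVED_NAME = re.compile(r"^__.*__$")

def is_reserved_column_name(name: str) -> bool:
    """Return True when `name` has both a leading and a trailing `__`."""
    return bool(RESERVED_NAME.match(name))
```

Under this sketch, a name such as `'__value__'` is flagged as reserved, while `'value'`, `'__value'`, and `'value__'` are not.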
|
|
||||||
|
== How to fix it
|
||||||
|
|
||||||
|
To fix this issue provide a column name without leading and trailing `++__++`.
|
||||||
|
|
||||||
|
=== Code examples
|
||||||
|
|
||||||
|
==== Noncompliant code example
|
||||||
|
|
||||||
|
[source,python,diff-id=1,diff-type=noncompliant]
|
||||||
|
----
|
||||||
|
import pyspark.pandas as ps
|
||||||
|
|
||||||
|
df = ps.DataFrame({'__value__': [1, 2, 3]}) # Noncompliant: __value__ is a reserved column name
|
||||||
|
----
|
||||||
|
|
||||||
|
==== Compliant solution
|
||||||
|
|
||||||
|
[source,python,diff-id=1,diff-type=compliant]
|
||||||
|
----
|
||||||
|
import pyspark.pandas as ps
|
||||||
|
|
||||||
|
df = ps.DataFrame({'value': [1, 2, 3]}) # Compliant
|
||||||
|
----
|
||||||
|
|
||||||
== Resources
=== Documentation

* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/best_practices.html#avoid-reserved-column-names[Best Practices]