Create rule S7187: PySpark Pandas DataFrame columns should not use a reserved name (#4622)
* Create rule S7187: PySpark Pandas DataFrame columns should not use a reserved name

Co-authored-by: joke1196 <joke1196@users.noreply.github.com>
Co-authored-by: David Kunzmann <david.kunzmann@sonarsource.com>
This commit is contained in:
parent f26dc7084d
commit ba18ae7f08
2 rules/S7187/metadata.json Normal file
@@ -0,0 +1,2 @@
{
}
25 rules/S7187/python/metadata.json Normal file
@@ -0,0 +1,25 @@
{
  "title": "PySpark Pandas DataFrame columns should not use a reserved name",
  "type": "CODE_SMELL",
  "status": "ready",
  "remediation": {
    "func": "Constant\/Issue",
    "constantCost": "5min"
  },
  "tags": [
    "data-science",
    "pyspark"
  ],
  "defaultSeverity": "Major",
  "ruleSpecification": "RSPEC-7187",
  "sqKey": "S7187",
  "scope": "All",
  "defaultQualityProfiles": ["Sonar way"],
  "quickfix": "infeasible",
  "code": {
    "impacts": {
      "RELIABILITY": "MEDIUM"
    },
    "attribute": "CONVENTIONAL"
  }
}
42 rules/S7187/python/rule.adoc Normal file
@@ -0,0 +1,42 @@
This rule raises an issue when a PySpark Pandas DataFrame column name is set to a reserved name.

== Why is this an issue?

PySpark offers powerful APIs to work with Pandas DataFrames in a distributed environment.
While the integration between PySpark and Pandas is seamless, there are some caveats that should be taken into account.

The Spark Pandas API uses some special column names for internal purposes.
These column names have a leading `++__++` and a trailing `++__++`.
Therefore, when naming or renaming columns with PySpark and Pandas,
using such reserved column names is discouraged, as they are not guaranteed to yield the expected results.
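The reserved-name convention described above can be sketched as a plain-Python check. This helper is hypothetical (it is not part of PySpark nor of this rule's implementation); it only illustrates the leading and trailing `++__++` pattern the rule looks for:

```python
import re

# Hypothetical helper (not part of PySpark): flags column names that use
# the reserved pattern -- a leading and a trailing double underscore,
# e.g. "__value__".
RESERVED_NAME = re.compile(r"^__.*__$")

def is_reserved_column_name(name: str) -> bool:
    """Return True when `name` has both a leading and a trailing `__`."""
    return bool(RESERVED_NAME.match(name))
```

Under this sketch, a name such as `'__value__'` is flagged as reserved, while `'value'`, `'__value'`, and `'value__'` are not.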
|
|
||||||
|
== How to fix it
|
||||||
|
|
||||||
|
To fix this issue provide a column name without leading and trailing `++__++`.
|
||||||
|
|
||||||
|
=== Code examples
|
||||||
|
|
||||||
|
==== Noncompliant code example
|
||||||
|
|
||||||
|
[source,python,diff-id=1,diff-type=noncompliant]
|
||||||
|
----
|
||||||
|
import pyspark.pandas as ps
|
||||||
|
|
||||||
|
df = ps.DataFrame({'__value__': [1, 2, 3]}) # Noncompliant: __value__ is a reserved column name
|
||||||
|
----
|
||||||
|
|
||||||
|
==== Compliant solution
|
||||||
|
|
||||||
|
[source,python,diff-id=1,diff-type=compliant]
|
||||||
|
----
|
||||||
|
import pyspark.pandas as ps
|
||||||
|
|
||||||
|
df = ps.DataFrame({'value': [1, 2, 3]}) # Compliant
|
||||||
|
----
|
||||||
|
|
||||||
== Resources
=== Documentation

* PySpark Documentation - https://spark.apache.org/docs/latest/api/python/user_guide/pandas_on_spark/best_practices.html#avoid-reserved-column-names[Best Practices]