Fix after review

Guillaume Dequenne 2025-01-31 14:26:07 +01:00
parent 17a753a84e
commit acb69ec40d


@@ -8,7 +8,7 @@ PySpark is designed to handle large-scale data processing in a distributed manner
For this reason, it is generally advisable to avoid using `toPandas` unless you are certain that the dataset is small enough to be handled comfortably by a single machine. Instead, consider using Spark's built-in functions and capabilities to perform data processing tasks in a distributed manner.
-If conversion to Pandas is necessary, ensure that the dataset size is manageable and that the conversion is justified by specific requirements, such as integration with libraries that require Pandas DataFrames.
+If the conversion to Pandas is necessary, ensure that the dataset size is manageable and that the conversion is justified by specific requirements, such as integration with libraries that require Pandas DataFrames.
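The guidance above, converting only when the dataset is known to be small, can be sketched as a guard around `toPandas`. This is an illustrative helper, not part of the rule or of PySpark's API; the name `safe_to_pandas` and the `max_rows` threshold are hypothetical:

```python
def safe_to_pandas(df, max_rows=100_000):
    """Convert a PySpark DataFrame to pandas only when it is small
    enough to fit comfortably on the driver.

    `max_rows` is an illustrative threshold, not a PySpark setting.
    """
    # count() runs a distributed job but returns a single number,
    # so it is cheap compared with collecting the full dataset.
    n = df.count()
    if n > max_rows:
        raise ValueError(
            f"DataFrame has {n} rows; refusing to call toPandas() "
            f"above the {max_rows}-row limit"
        )
    return df.toPandas()
```

A caller would write `pandas_df = safe_to_pandas(df)` at the point where a pandas-only library genuinely requires a pandas DataFrame.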
=== Exceptions
@@ -31,7 +31,8 @@ To fix this issue, consider using PySpark built-in capabilities without relying
# Converting a PySpark DataFrame to a Pandas DataFrame
df = spark.read.csv("my_data.csv")
pandas_df = df.toPandas() # Noncompliant: May cause memory issues with large datasets
-print(pandas_df)
+filtered_df = pandas_df[pandas_df['id'] > 1]
+print(filtered_df)
----
==== Compliant solution