Fix after review
parent 17a753a84e
commit acb69ec40d
@@ -8,7 +8,7 @@ PySpark is designed to handle large-scale data processing in a distributed manner
For this reason, it is generally advisable to avoid using `toPandas` unless you are certain that the dataset is small enough to be handled comfortably by a single machine. Instead, consider using Spark's built-in functions and capabilities to perform data processing tasks in a distributed manner.
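As a rough sketch of that advice (assuming a CSV file named `my_data.csv` with an `id` column, and building the `SparkSession` explicitly since none is defined in this excerpt), the processing can stay entirely within Spark:

----
# Sketch only: my_data.csv and its id column are assumed for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("distributed-example").getOrCreate()
df = spark.read.csv("my_data.csv", header=True, inferSchema=True)

# Filtering and aggregation run distributed across the executors;
# no single machine ever needs to hold the whole dataset in memory.
result = df.filter(F.col("id") > 1).groupBy("id").count()
result.show()
----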
-If conversion to Pandas is necessary, ensure that the dataset size is manageable and that the conversion is justified by specific requirements, such as integration with libraries that require Pandas DataFrames.
+If the conversion to Pandas is necessary, ensure that the dataset size is manageable and that the conversion is justified by specific requirements, such as integration with libraries that require Pandas DataFrames.
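One way to keep such a conversion manageable, shown here only as a sketch reusing the `spark` session from the example above (the `MAX_ROWS` threshold is illustrative, not a Spark or rule setting), is to bound the number of rows shipped to the driver before calling `toPandas`:

----
# Sketch only: MAX_ROWS is an illustrative cap, not part of any API.
MAX_ROWS = 100_000

df = spark.read.csv("my_data.csv", header=True, inferSchema=True)

# limit() bounds what reaches the driver, so the conversion for a
# Pandas-only library operates on a dataset of known, manageable size.
pandas_df = df.limit(MAX_ROWS).toPandas()
print(pandas_df.head())
----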
=== Exceptions
@@ -31,7 +31,8 @@ To fix this issue, consider using PySpark built-in capabilities without relying
# Converting a PySpark DataFrame to a Pandas DataFrame
df = spark.read.csv("my_data.csv")
pandas_df = df.toPandas()  # Noncompliant: may cause memory issues with large datasets
print(pandas_df)
filtered_df = pandas_df[pandas_df['id'] > 1]
print(filtered_df)
----
==== Compliant solution
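The compliant block itself is cut off in this capture, so the following is only a plausible sketch, assuming the fix mirrors the noncompliant example but keeps the filtering in PySpark rather than converting to Pandas:

----
# Sketch only: the actual compliant example is truncated above.
df = spark.read.csv("my_data.csv")

filtered_df = df.filter(df['id'] > 1)  # Compliant: stays distributed
filtered_df.show()
----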