github-actions[bot] c43d5c93de
Create rule S6734: inplace=True should not be used when modifying a Pandas DataFrame (#2979)
You can preview this rule
[here](https://sonarsource.github.io/rspec/#/rspec/S6734/python)
(updated a few minutes after each push).

## Review

A dedicated reviewer checked the rule description successfully for:

- [ ] logical errors and incorrect information
- [ ] information gaps and missing content
- [ ] text style and tone
- [ ] PR summary and labels follow [the
guidelines](https://github.com/SonarSource/rspec/#to-modify-an-existing-rule)

---------

Co-authored-by: guillaume-dequenne-sonarsource <guillaume-dequenne-sonarsource@users.noreply.github.com>
Co-authored-by: Guillaume Dequenne <guillaume.dequenne@sonarsource.com>
2023-10-06 11:24:42 +02:00

92 lines
3.1 KiB
Plaintext

This rule raises an issue when the `inplace` parameter is set to `True` when modifying a Pandas DataFrame.
== Why is this an issue?
Using `inplace=True` when modifying a Pandas DataFrame means that the method will modify the DataFrame in place, rather than returning a new object:
[source,python]
----
df.an_operation(inplace=True)
----
When `inplace` is `False` (which is the default behavior), a new object is returned instead:
[source,python]
----
df2 = df.an_operation(inplace=False)
----
Generally speaking, the motivation for modifying an object in place is to improve efficiency by avoiding the creation of a copy of the original object. Unfortunately, many methods supporting the inplace keyword either cannot actually be done inplace, or make a copy as a consequence of the operations they perform, regardless of whether `inplace` is `True` or not. For example, the following methods can never operate in place:
* drop (dropping rows)
* dropna
* drop_duplicates
* sort_values
* sort_index
* eval
* query
Because of this, expecting efficiency gains through the use of `inplace=True` is not reliable.
Additionally, using `inplace=True` may trigger a `SettingWithCopyWarning` and make the overall intention of the code unclear. In the following example, modifying `df2` will not modify the original `df` dataframe, and a warning will be raised:
[source,python]
----
df = pd.DataFrame({'a': [3, 2, 1], 'b': ['x', 'y', 'z']})
df2 = df[df['a'] > 1]
df2['b'].replace({'x': 'abc'}, inplace=True)
# SettingWithCopyWarning:
# A value is trying to be set on a copy of a slice from a DataFrame
----
In general, side effects such as object mutation may be the source of subtle bugs and explicit reassignment is considered safer.
When intermediate results are not needed, method chaining is a more explicit alternative to the `inplace` parameter. For instance, one may write:
[source,python]
----
df.drop('City', axis=1, inplace=True)
df.sort_values('Name', inplace=True)
df.reset_index(drop=True, inplace=True)
----
Through method chaining, this previous example may be rewritten as:
[source,python]
----
result = df.drop('City', axis=1).sort_values('Name').reset_index(drop=True)
----
For these reasons, it is therefore recommended to avoid using `inplace=True` in favor of more explicit and less error-prone alternatives.
== How to fix it
To fix this issue, avoid using the `inplace=True` parameter. Either opt for method chaining when intermediary results are not needed, or for explicit reassignment when the intention is to perform a simple operation.
=== Code examples
==== Noncompliant code example
[source,python,diff-id=1,diff-type=noncompliant]
----
import pandas as pd
def foo():
df.drop(columns='A', inplace=True) # Noncompliant: Using inplace=True is error-prone and should be avoided
----
==== Compliant solution
[source,python,diff-id=1,diff-type=compliant]
----
import pandas as pd
def foo():
df = df.drop(columns='A') # OK: explicit reassignment
----
== Resources
=== Documentation
* Pandas Enhancement Proposal - https://github.com/pandas-dev/pandas/pull/51466[PDEP-8: In-place methods in pandas]