github-actions[bot] d0f831d4f6
Create rule S6735: When using pandas.merge or pandas.join, the parameters on, how and validate should be specified (#2982)
You can preview this rule
[here](https://sonarsource.github.io/rspec/#/rspec/S6735/python)
(updated a few minutes after each push).

## Review

A dedicated reviewer checked the rule description successfully for:

- [ ] logical errors and incorrect information
- [ ] information gaps and missing content
- [ ] text style and tone
- [ ] PR summary and labels follow [the
guidelines](https://github.com/SonarSource/rspec/#to-modify-an-existing-rule)

---------

Co-authored-by: joke1196 <joke1196@users.noreply.github.com>
Co-authored-by: David Kunzmann <david.kunzmann@sonarsource.com>
Co-authored-by: Guillaume Dequenne <guillaume.dequenne@sonarsource.com>
2023-10-06 11:48:43 +02:00

131 lines
5.3 KiB
Plaintext

This rule raises an issue when the parameters ``++how++``, ``++on++`` and ``++validate++`` are not provided when using ``++pandas.merge++`` or ``++pandas.join++``.
== Why is this an issue?
:merge_link: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html#pandas-dataframe-merge
:join_link: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.join.html#pandas-dataframe-join
The Pandas library provides a user-friendly API to concatenate two data frames together with the methods ``++merge++`` and ``++join++``.
When using these methods, it is possible to specify how the merge will be performed:
* The parameter ``++how++`` specifies the type of merge (``++left++``, ``++inner++``, ``++outer++``, etc..).
* The parameter ``++on++`` specifies the column(s) on which the merge will be performed.
* The parameter ``++validate++`` specifies a way to verify if the merge result is what was expected.
[source,python]
----
import pandas as pd
age_df = pd.DataFrame({"user_id":[1,2,4], "age":[42,45, 35]})
name_df = pd.DataFrame({"user_id":[1,2,3,4], "name":["a","b","c","d"]})
result = age_df.merge(name_df, on="user_id", how="right", validate="1:1")
----
In the example above, both data frames will be merged together based on the column ``++user_id++``, specified by the parameter ``++on++``.
The parameter ``++how++`` set to ``++right++`` states that the resulting data frame will contain all the ``++user_id++``s present in the data frame ``++name_df++``
(including ``++3++``, which is absent from ``++age_df++`` and will therefore be assigned a ``++NaN++`` value for the ``++age++`` column).
Lastly, setting the ``++validate++`` parameter to ``++1:1++`` means a check will be performed
to verify that the column used for the merge only contains unique keys in both data frames.
If this check fails a ``++MergeError++`` will be raised.
Here is the resulting data frame:
[cols="1,1,1,1"]
|===
|row |user_id | age | name
| 0 | 1 | 42 | a
| 1 | 2 | 45 | b
| 2 | 3 | NaN | c
| 3 | 4 | 35 | d
|===
More information about these methods and their parameters can be found in the pandas documentation: {merge_link}[merge] and {join_link}[join].
The ``++how++``, ``++on++`` and ``++validate++`` parameters are optional and pandas provides sensible default values.
This means ``++merge++`` could be used as follow:
[source,python]
----
import pandas as pd
age_df = pd.DataFrame({"user_id":[1,2,4], "age":[42,45, 35]})
name_df = pd.DataFrame({"user_id":[1,2,3,4], "name":["a","b","c","d"]})
result = age_df.merge(name_df)
----
In this example:
* The ``++how++`` parameter defaults to ``++inner++``.
* The ``++on++`` parameter defaults to the columns which have a similar name, in our case ``++user_id++`` .
* The ``++validate++`` parameter will be set to ``++many_to_many++``, meaning no validation will be performed.
Here is the resulting data frame:
[cols="1,1,1,1"]
|===
|row |user_id | age | name
| 0 | 1 | 42 | a
| 1 | 2 | 45 | b
| 2 | 4 | 35 | d
|===
While the example above is perfectly valid, using the ``++merge++`` and ``++join++`` methods without providing the ``++how++``,
``++on++`` and ``++validate++`` arguments has two main drawbacks:
* It makes the code intention unclear: without the ``++how++`` parameter set, it is unclear if the developer noticed that a ``++user_id++`` (``++3++``) will be missing from the resulting data frame, or if it is done on purpose.
* It makes the code harder to maintain: if one of the data frame would change its ``++user_id++`` column name to ``++id++``, the code would still run but the result would be entirely different.
In order to mitigate these drawbacks, setting the ``++how++`` parameter to ``++inner++`` would better convey that the intention is to only keep ``++user_id++``s present in both data frame.
Setting the ``++on++`` parameter to ``++user_id++`` could also avoid issues in the future, for example if the input data frames came to change.
The information provided by these parameters is extremely valuable, especially when another developer is in charge of refactoring or
debugging this particular piece of code.
This is why it is a good practice to provide the parameters ``++how++``, ``++on++`` (or ``++left_on++`` and ``++right_on++``) and ``++validate++`` to the pandas' ``++merge++`` and ``++join++``.
== How to fix it
To fix this issue provide the parameters ``++how++``, ``++on++`` (or ``++left_on++`` and ``++right_on++``) and ``++validate++`` to the ``++pd.merge++`` or ``++pd.join++`` methods.
=== Code examples
==== Noncompliant code example
[source,python,diff-id=1,diff-type=noncompliant]
----
import pandas as pd
def merge_dfs(age_df:pd.DataFrame, name_df:pd.DataFrame):
return age_df.merge(name_df) # Noncompliant: it is unclear on which column the merge should happen, as well as what is the expected result.
----
==== Compliant solution
[source,python,diff-id=1,diff-type=compliant]
----
import pandas as pd
def merge_dfs(age_df:pd.DataFrame, name_df:pd.DataFrame):
return age_df.merge(
name_df,
on="user_id",
how="inner",
validate="1:1"
) # Compliant
----
== Resources
=== Documentation
* Pandas Documentation - {merge_link}[pandas.DataFrame.merge]
* Pandas Documentation - {join_link}[pandas.DataFrame.join]