rspec/rules/S6741/python/rule.adoc

This rule raises an issue when the ``++pandas.DataFrame.values++`` attribute is used instead of the ``++pandas.DataFrame.to_numpy()++`` method.

== Why is this an issue?

The ``++values++`` attribute and the ``++to_numpy()++`` method in pandas both return a NumPy representation of the ``++DataFrame++``. However, ``++to_numpy()++`` is recommended over ``++values++`` for several reasons:

* *Future Compatibility:*
The ``++values++`` attribute is considered a legacy feature, while ``++to_numpy()++`` is the recommended way to extract data and is considered more future-proof.
* *Data type consistency:*
If the ``++DataFrame++`` has columns with different data types, NumPy will choose a common data type that can hold all the data. This may lead to loss of information, unexpected type conversions, or increased memory usage. ``++to_numpy()++`` lets you choose the resulting type explicitly by passing the ``++dtype++`` argument.
* *View vs Copy:*
The ``++values++`` attribute can return either a view or a copy of the data, depending on how the data is stored internally (for example, whether all columns share a single data type). This can lead to confusion when modifying the extracted data. In contrast, ``++to_numpy()++`` accepts a ``++copy++`` argument: passing ``++copy=True++`` forces it to always return a new NumPy array, so that changes made to the array do not affect the original ``++DataFrame++``.
* *Missing values control:*
``++to_numpy()++`` lets you specify, through the ``++na_value++`` argument, the value used for missing entries in the ``++DataFrame++``, while ``++values++`` offers no such control and keeps the default missing-value representation (typically ``++numpy.nan++``).
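
The arguments mentioned in the list above can be sketched as follows. This is a minimal illustration, not part of the rule itself: the ``++DataFrame++`` contents and the variable names are assumptions chosen for the example.

[source,python]
----
import numpy as np
import pandas as pd

# Illustrative data: an integer column and a float column with a missing value
df = pd.DataFrame({
    'X': [1, 2, 3],
    'Y': [0.5, None, 2.5],  # None is stored as NaN in a float64 column
})

# dtype: choose the element type of the resulting array explicitly
as_float32 = df.to_numpy(dtype='float32')

# copy=True: always get a new array, so writes do not reach the DataFrame
independent = df.to_numpy(copy=True)
independent[0, 0] = 99.0  # df is left unchanged

# na_value: replace missing entries during the conversion
filled = df.to_numpy(na_value=-1.0)
----

None of these options is available through the ``++values++`` attribute, which takes no arguments.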

== How to fix it
Use the ``++to_numpy()++`` method instead of the ``++values++`` attribute to get a NumPy representation of the ``++DataFrame++``.

=== Code examples

==== Noncompliant code example

[source,python,diff-id=1,diff-type=noncompliant]
----
import pandas as pd

df = pd.DataFrame({
    'X': ['A', 'B', 'A', 'C'],
    'Y': [10, 7, 12, 5]
})

arr = df.values  # Noncompliant: using the 'values' attribute is not recommended
----

==== Compliant solution

[source,python,diff-id=1,diff-type=compliant]
----
import pandas as pd

df = pd.DataFrame({
    'X': ['A', 'B', 'A', 'C'],
    'Y': [10, 7, 12, 5]
})

arr = df.to_numpy()  # Compliant
----


== Resources
=== Documentation

* Pandas Documentation - https://pandas.pydata.org/pandas-docs/version/2.1/reference/api/pandas.DataFrame.to_numpy.html[pandas.DataFrame.to_numpy()]
* Pandas Documentation - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.values.html[pandas.DataFrame.values]

Create rule S6741: The 'pandas.DataFrame.to_numpy()' method should be preferred to the 'pandas.DataFrame.values' attribute (#2992). Co-authored-by: maksim-grebeniuk-sonarsource <maksim-grebeniuk-sonarsource@users.noreply.github.com>, Maksim Grebeniuk <maksim.grebeniuk@sonarsource.com>, Guillaume Dequenne <guillaume.dequenne@sonarsource.com>. 2023-10-06 11:49:46 +02:00

Modify rule S6741: Fix broken pandas docs link (#3568) 2024-01-30 11:53:35 +01:00