rspec/rules/S6741/python/rule.adoc
2024-01-30 11:53:35 +01:00

This rule raises an issue when the ``++pandas.DataFrame.values++`` attribute is used instead of the ``++pandas.DataFrame.to_numpy()++`` method.
== Why is this an issue?
The ``++values++`` attribute and the ``++to_numpy()++`` method in pandas both provide a way to return a NumPy representation of the ``++DataFrame++``. However, there are some reasons why the ``++to_numpy()++`` method is recommended over the ``++values++`` attribute:
* *Future Compatibility:*
The ``++values++`` attribute is considered a legacy feature, while ``++to_numpy()++`` is the recommended method to extract data and is considered more future-proof.
* *Data type consistency:*
If the ``++DataFrame++`` has columns with different data types, NumPy will choose a common data type that can hold all the data. This may lead to loss of information, unexpected type conversions, or increased memory usage. The ``++to_numpy()++`` method lets you select the common type explicitly by passing the ``++dtype++`` argument.
* *View vs Copy:*
The ``++values++`` attribute can return a view or a copy of the data depending on whether the data needs to be transposed. This can lead to confusion when modifying the extracted data. In contrast, ``++to_numpy()++`` has a ``++copy++`` argument that forces it to always return a new NumPy array, ensuring that any changes you make won't affect the original ``++DataFrame++``.
* *Missing values control:*
The ``++to_numpy()++`` method lets you specify, through its ``++na_value++`` argument, the value used for missing values in the ``++DataFrame++``, while the ``++values++`` attribute will always use ``++numpy.nan++`` for missing values.
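
The ``++dtype++`` and missing-value controls described above can be sketched as follows. The sample ``++DataFrame++`` and the variable names are hypothetical and chosen only for illustration. The ``++dtype++`` and ``++na_value++`` arguments are parameters of ``++pandas.DataFrame.to_numpy()++``.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data: an integer column and a float column with one gap.
df = pd.DataFrame({
    'X': [1, 2, 3],
    'Y': [0.5, np.nan, 2.5]
})

# Without arguments, a common dtype is chosen for both columns (float64 here).
default_arr = df.to_numpy()

# 'dtype' selects the common type explicitly instead of relying on inference.
float32_arr = df.to_numpy(dtype='float32')

# 'na_value' sets the value that replaces missing entries in the result.
filled_arr = df.to_numpy(na_value=0.0)
```

Here the integer column is converted to floating point in every result, which is the kind of implicit conversion that an explicit ``++dtype++`` makes visible.
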
== How to fix it
Use the ``++to_numpy()++`` method instead of the ``++values++`` attribute to get a NumPy representation of the ``++DataFrame++``.
=== Code examples
==== Noncompliant code example
[source,python,diff-id=1,diff-type=noncompliant]
----
import pandas as pd

df = pd.DataFrame({
    'X': ['A', 'B', 'A', 'C'],
    'Y': [10, 7, 12, 5]
})

arr = df.values  # Noncompliant: using the 'values' attribute is not recommended
----
==== Compliant solution
[source,python,diff-id=1,diff-type=compliant]
----
import pandas as pd

df = pd.DataFrame({
    'X': ['A', 'B', 'A', 'C'],
    'Y': [10, 7, 12, 5]
})

arr = df.to_numpy()  # Compliant
----
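
When the extracted array is going to be modified, passing ``++copy=True++`` may be a useful addition to the fix. The following is a minimal sketch, assuming a small hypothetical all-integer ``++DataFrame++``. It is an illustration of the ``++copy++`` argument and not part of the rule itself.

```python
import pandas as pd

# Hypothetical data with a single dtype, where a view could otherwise be returned.
df = pd.DataFrame({
    'A': [1, 2],
    'B': [3, 4]
})

# copy=True requests a new array that does not share memory with the DataFrame.
arr = df.to_numpy(copy=True)

# Modifying the copy leaves the original DataFrame untouched.
arr[0, 0] = 99
```

After this runs, ``++df++`` still holds its original values, while ``++arr++`` carries the modification.
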
== Resources
=== Documentation
* Pandas Documentation - https://pandas.pydata.org/pandas-docs/version/2.1/reference/api/pandas.DataFrame.to_numpy.html[pandas.DataFrame.to_numpy()]
* Pandas Documentation - https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.values.html[pandas.DataFrame.values]