rspec/rules/S6969/python/rule.adoc

59 lines
2.2 KiB
Plaintext

This rule raises an issue when a Scikit-Learn Pipeline is created without specifying the `memory` argument.
== Why is this an issue?
When the `memory` argument is not specified, the pipeline will recompute the transformers every time the pipeline is fitted.
This can be time-consuming if the transformers are expensive to compute or if the dataset is large.
However, if the intent is to recompute the transformers everytime, the memory argument should be set explicitly to `None`. This way the intention is clear.
== How to fix it
Specify the `memory` argument when creating a Scikit-Learn Pipeline.
=== Code examples
==== Noncompliant code example
[source,python,diff-id=1,diff-type=noncompliant]
----
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', LinearDiscriminantAnalysis())
]) # Noncompliant: the memory parameter is not provided
----
==== Compliant solution
[source,python,diff-id=1,diff-type=compliant]
----
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
pipeline = Pipeline([
('scaler', StandardScaler()),
('classifier', LinearDiscriminantAnalysis())
], memory="cache_folder") # Compliant
----
=== Pitfalls
If the pipeline is used with different datasets, the cache may not be helpful and can consume a lot of space.
This is true when using `sklearn.model_selection.HalvingGridSearchCV` or `sklearn.model_selection.HalvingRandomSearchCV` because the size of the dataset changes every iteration when using the default configuration.
ifdef::env-github,rspecator-view[]
== Implementation specification
Check if the parameter is provided without checking the value.
Might be able to make a quickfix to add `memory=None` to make it less painful to fix.
Issue location : primary location on `Pipeline`.
endif::env-github,rspecator-view[]
== Resources
=== Documentation
* Scikit-Learn documentation - https://scikit-learn.org/stable/modules/compose.html#caching-transformers-avoid-repeated-computation[Pipeline]