59 lines
2.2 KiB
Plaintext
59 lines
2.2 KiB
Plaintext
This rule raises an issue when a Scikit-Learn Pipeline is created without specifying the `memory` argument.
|
|
|
|
== Why is this an issue?
|
|
|
|
When the `memory` argument is not specified, the pipeline will recompute the transformers every time the pipeline is fitted.
|
|
This can be time-consuming if the transformers are expensive to compute or if the dataset is large.
|
|
|
|
However, if the intent is to recompute the transformers everytime, the memory argument should be set explicitly to `None`. This way the intention is clear.
|
|
|
|
== How to fix it
|
|
Specify the `memory` argument when creating a Scikit-Learn Pipeline.
|
|
|
|
=== Code examples
|
|
|
|
==== Noncompliant code example
|
|
|
|
[source,python,diff-id=1,diff-type=noncompliant]
|
|
----
|
|
from sklearn.pipeline import Pipeline
|
|
from sklearn.preprocessing import StandardScaler
|
|
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
|
|
|
|
pipeline = Pipeline([
|
|
('scaler', StandardScaler()),
|
|
('classifier', LinearDiscriminantAnalysis())
|
|
]) # Noncompliant: the memory parameter is not provided
|
|
----
|
|
|
|
==== Compliant solution
|
|
|
|
[source,python,diff-id=1,diff-type=compliant]
|
|
----
|
|
from sklearn.pipeline import Pipeline
|
|
from sklearn.preprocessing import StandardScaler
|
|
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
|
|
|
|
pipeline = Pipeline([
|
|
('scaler', StandardScaler()),
|
|
('classifier', LinearDiscriminantAnalysis())
|
|
], memory="cache_folder") # Compliant
|
|
----
|
|
|
|
|
|
=== Pitfalls
|
|
If the pipeline is used with different datasets, the cache may not be helpful and can consume a lot of space.
|
|
This is true when using `sklearn.model_selection.HalvingGridSearchCV` or `sklearn.model_selection.HalvingRandomSearchCV` because the size of the dataset changes every iteration when using the default configuration.
|
|
|
|
ifdef::env-github,rspecator-view[]
|
|
== Implementation specification
|
|
Check if the parameter is provided without checking the value.
|
|
Might be able to make a quickfix to add `memory=None` to make it less painful to fix.
|
|
|
|
Issue location : primary location on `Pipeline`.
|
|
|
|
endif::env-github,rspecator-view[]
|
|
|
|
== Resources
|
|
=== Documentation
|
|
* Scikit-Learn documentation - https://scikit-learn.org/stable/modules/compose.html#caching-transformers-avoid-repeated-computation[Pipeline] |