This rule raises an issue when a `torch.utils.data.DataLoader` is instantiated without specifying the `num_workers` parameter.
== Why is this an issue?
In the PyTorch library, data loaders provide an interface for implementing common operations such as batching.
It is also possible to parallelize the data loading process by using multiple worker processes.
This can improve performance by increasing the number of batches being fetched in parallel, at the cost of higher memory usage.
This performance increase can also be attributed to avoiding the Global Interpreter Lock (GIL) in the Python interpreter.
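The effect of worker processes can be sketched as follows. This is a minimal, hypothetical example assuming PyTorch is installed; `SquaresDataset` is an illustrative stand-in for a real dataset whose `__getitem__` does expensive work such as I/O or decoding:

```python
from torch.utils.data import DataLoader, Dataset


class SquaresDataset(Dataset):
    """Toy map-style dataset; a stand-in for one with costly item loading."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, idx):
        return idx * idx  # stand-in for expensive I/O or decoding


# num_workers=2 starts two worker processes that fetch batches in parallel,
# so item loading happens outside the main interpreter and its GIL.
loader = DataLoader(SquaresDataset(8), batch_size=4, num_workers=2)
batches = [batch.tolist() for batch in loader]
print(batches)  # [[0, 1, 4, 9], [16, 25, 36, 49]]
```

With the default `num_workers=0`, the same items would be loaded sequentially in the main process; the batches are identical, only the loading strategy differs.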
== How to fix it
Specify the `num_workers` parameter when instantiating the `torch.utils.data.DataLoader` object.
The default value of `0` will use the main process to load the data, and might be faster for small datasets that can fit completely in memory.
For larger datasets, it is recommended to use a value of `1` or higher to parallelize the data loading process.
=== Code examples
==== Noncompliant code example
[source,python,diff-id=1,diff-type=noncompliant]
----
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor
train_dataset = datasets.MNIST(root='data', train=True, transform=ToTensor())
train_data_loader = DataLoader(train_dataset, batch_size=32)  # Noncompliant: the num_workers parameter is not specified
----
==== Compliant solution
[source,python,diff-id=1,diff-type=compliant]
----
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor
train_dataset = datasets.MNIST(root='data', train=True, transform=ToTensor())
train_data_loader = DataLoader(train_dataset, batch_size=32, num_workers=4)
----
== Resources
=== Documentation
* PyTorch documentation - https://pytorch.org/docs/stable/data.html#single-and-multi-process-data-loading[Single- and Multi-process Data Loading]
* PyTorch documentation - https://pytorch.org/tutorials/beginner/basics/data_tutorial.html[Datasets and DataLoaders]
ifdef::env-github,rspecator-view[]
(visible only on this page)
== Implementation specification
=== Message
Primary : Specify the `num_workers` parameter.
=== Issue location
Primary : Name of the instantiation
|
=== Quickfix
Fill in with the default parameter
endif::env-github,rspecator-view[]