This rule raises an issue when a `torch.utils.data.DataLoader` is instantiated without specifying the `num_workers` parameter.

== Why is this an issue?

In the PyTorch library, data loaders provide an interface through which common operations such as batching are implemented.
It is also possible to parallelize the data loading process by using multiple worker processes.
This can improve performance by increasing the number of batches being fetched in parallel, at the cost of higher memory usage.
This performance increase can also be attributed to avoiding the Global Interpreter Lock (GIL) in the Python interpreter.
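
As an illustration, here is a rough sketch that times one pass over a synthetic dataset with and without worker processes. The dataset, batch size, and worker count are illustrative assumptions, not recommendations; on a dataset this small, the multi-process version may well be slower, which matches the guidance below about small in-memory datasets.

[source,python]
----
import time

import torch
from torch.utils.data import DataLoader, TensorDataset

def time_one_epoch(num_workers):
    # Synthetic in-memory data; substitute a real dataset for meaningful timings.
    dataset = TensorDataset(torch.randn(10_000, 32), torch.randint(0, 10, (10_000,)))
    loader = DataLoader(dataset, batch_size=64, num_workers=num_workers)
    start = time.perf_counter()
    for _batch in loader:  # batches are fetched by worker processes when num_workers > 0
        pass
    return time.perf_counter() - start

if __name__ == "__main__":  # entry-point guard required when worker processes are spawned
    print("num_workers=0:", time_one_epoch(0))
    print("num_workers=4:", time_one_epoch(4))
----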

== How to fix it

Specify the `num_workers` parameter when instantiating the `torch.utils.data.DataLoader` object.

The default value of `0` loads the data in the main process, which can be faster for small datasets that fit completely in memory.

For larger datasets, it is recommended to use a value of `1` or higher to parallelize the data loading process.
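
One way to pick such a value is sketched below: bound the worker count by the number of available CPU cores. The cap of `4` and the placeholder dataset are assumptions made for illustration, not an official PyTorch guideline.

[source,python]
----
import os

import torch
from torch.utils.data import DataLoader, TensorDataset

# Placeholder in-memory dataset, used only to make the example self-contained.
dataset = TensorDataset(torch.randn(1_000, 8))

# Heuristic: never request more workers than available CPU cores.
num_workers = min(4, os.cpu_count() or 1)  # os.cpu_count() may return None
loader = DataLoader(dataset, batch_size=32, num_workers=num_workers)
----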

=== Code examples

==== Noncompliant code example

[source,python,diff-id=1,diff-type=noncompliant]
----
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

train_dataset = datasets.MNIST(root='data', train=True, transform=ToTensor())
train_data_loader = DataLoader(train_dataset, batch_size=32)  # Noncompliant: the num_workers parameter is not specified
----

==== Compliant solution

[source,python,diff-id=1,diff-type=compliant]
----
from torch.utils.data import DataLoader
from torchvision import datasets
from torchvision.transforms import ToTensor

train_dataset = datasets.MNIST(root='data', train=True, transform=ToTensor())
train_data_loader = DataLoader(train_dataset, batch_size=32, num_workers=4)
----

== Resources

=== Documentation

* PyTorch documentation - https://pytorch.org/docs/stable/data.html#single-and-multi-process-data-loading[Single- and Multi-process Data Loading]
* PyTorch documentation - https://pytorch.org/tutorials/beginner/basics/data_tutorial.html[Datasets and DataLoaders]

ifdef::env-github,rspecator-view[]

== Implementation specification
(visible only on this page)

=== Message

Primary : Specify the `num_workers` parameter.

=== Issue location

Primary : Name of the instantiation

=== Quickfix

Fill in with the default parameter

endif::env-github,rspecator-view[]