# Data Basics
## Introduction to DataLoader and Dataset
Read through the link.
## Common Objects in DataLoader
- Sampler: randomly chooses an index per iteration; when `batch_size` is not `None`, it yields a list of indices per iteration.
  - For `IterableDataset`, it keeps yielding `None` per iteration using `_InfiniteConstantSampler`.
- Fetcher: takes a single index or a batch of indices and returns the corresponding data from the Dataset. It invokes `collate_fn` over each batch of data and drops the remaining unfilled batch if `drop_last` is set. A usage sketch follows this list.
  - For `IterableDataset`, it simply takes the next batch-size elements as a batch.
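As a minimal sketch (the dataset and collate function below are hypothetical, not from this page), the example shows how these objects cooperate: a `RandomSampler` yields indices, the Fetcher reads samples from a map-style `Dataset`, and `collate_fn` assembles each batch.

```python
import torch
from torch.utils.data import Dataset, DataLoader, RandomSampler

# A toy map-style dataset: the Fetcher calls __getitem__ with the indices
# produced by the Sampler.
class SquaresDataset(Dataset):
    def __init__(self, n=10):
        self.data = [(i, i * i) for i in range(n)]

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return self.data[idx]

# A custom collate_fn: the Fetcher applies it to the list of samples that
# makes up one batch.
def collate(samples):
    xs, ys = zip(*samples)
    return torch.tensor(xs), torch.tensor(ys)

dataset = SquaresDataset()
loader = DataLoader(
    dataset,
    batch_size=4,
    sampler=RandomSampler(dataset),  # yields one index per iteration;
                                     # DataLoader groups them into batches of 4
    collate_fn=collate,
    drop_last=True,                  # drop the final, unfilled batch
)

for xs, ys in loader:
    print(xs, ys)
```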
## Data/Control flow in DataLoader
- Single process:

      Sampler
         |
      index/indices
         |
         V
      Fetcher
         |
      index/indices
         |
         V
      dataset
         |
         V
      collate_fn
         |
         V
      output
- Multiple processes:

      Sampler (Main process)
         |
      index/indices
         |
         V
      Index Multiprocessing Queue (one per worker)
         |
      index/indices
         |
         V
      Fetcher (Worker process)
         |
      index/indices
         |
         V
      dataset
         |
      Batch of data
         |
         V
      collate_fn
         |
         V
      Result Multiprocessing Queue
         |
      Data
         |
         V
      pin_memory_thread (Main process)
         |
         V
      output
This is only the general data and control flow in DataLoader; there are further detailed features on top of it, such as prefetching and worker status handling.
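As a hedged illustration of how the multi-process path above is enabled, the sketch below (hypothetical settings, reusing the `dataset` defined earlier) turns on worker processes, pinned-memory copies, and prefetching through standard `DataLoader` arguments.

```python
from torch.utils.data import DataLoader

loader = DataLoader(
    dataset,
    batch_size=32,
    num_workers=4,            # start 4 worker processes; indices travel over the
                              # per-worker index queues, batches come back on the
                              # result queue
    pin_memory=True,          # pin_memory_thread in the main process copies each
                              # batch into page-locked memory for faster GPU transfer
    prefetch_factor=2,        # each worker keeps up to 2 batches in flight
    persistent_workers=True,  # keep workers alive across epochs
)
```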
## Common gotchas for DataLoader
Most common questions about DataLoader come from using multiple workers, i.e. when multiprocessing is enabled.
- The default multiprocessing start method differs across platforms (see https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods).
  - Control randomness per worker using `worker_init_fn`. Otherwise, DataLoader either becomes non-deterministic when using spawn, or every worker shares the same random state (e.g. NumPy's global RNG) when using fork. See the first sketch after this list.
  - COW in fork: copy-on-write effectively becomes copy-on-access in Python because of reference counting. The simplest fix when implementing a Dataset is to store data as a Tensor or NumPy array instead of arbitrary Python objects like list and dict.
- Difference between Map-style Dataset and Iterable-style Dataset
  - A Map-style Dataset can rely on the indices sampled in the main process, so it gets automatic sharding across workers.
  - An Iterable-style Dataset requires users to implement sharding manually inside the `__iter__` method using `torch.utils.data.get_worker_info()`. Please check the example (and the second sketch after this list).
  - Shuffling is not enabled for an Iterable-style Dataset. If needed, users have to implement the shuffle utilities inside the `IterableDataset` class. (This is solved by the TorchData project.)
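A minimal sketch of the `worker_init_fn` point above, assuming NumPy and Python's `random` are also used inside the Dataset; it follows the pattern described in the PyTorch data docs of deriving per-worker seeds from `torch.initial_seed()`.

```python
import random
import numpy as np
import torch
from torch.utils.data import DataLoader

def seed_worker(worker_id):
    # PyTorch already seeds each worker's torch RNG (base_seed + worker_id);
    # reuse that seed for NumPy and Python's `random`, which otherwise inherit
    # the parent's state under fork or are seeded arbitrarily under spawn.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

# `dataset` as before; fixing the generator's seed makes runs reproducible.
loader = DataLoader(dataset, batch_size=32, num_workers=4,
                    worker_init_fn=seed_worker,
                    generator=torch.Generator().manual_seed(0))
```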
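And a sketch of manual sharding for an Iterable-style Dataset, closely following the `get_worker_info()` example in the PyTorch documentation; the class name and range arguments are illustrative.

```python
import math
import torch
from torch.utils.data import IterableDataset, DataLoader, get_worker_info

class RangeIterableDataset(IterableDataset):
    """Iterates over [start, end), sharding the range across workers."""
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:          # single-process loading: yield everything
            lo, hi = self.start, self.end
        else:                     # split the range evenly per worker
            per_worker = math.ceil((self.end - self.start) / info.num_workers)
            lo = self.start + info.id * per_worker
            hi = min(lo + per_worker, self.end)
        return iter(range(lo, hi))

loader = DataLoader(RangeIterableDataset(0, 10), batch_size=4, num_workers=2)
print([batch.tolist() for batch in loader])
```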
## Introduction to next-generation Data API (TorchData)
Read through link and link.

Expected features:
- Automatic/Dynamic sharding
- Determinism Control
- Snapshotting
- DataFrame integration
- etc.
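A short, hedged sketch of the DataPipe style these features build on; it assumes the separate `torchdata` package is installed and uses names from its 0.x API (`IterableWrapper`, `shuffle`, `sharding_filter`, `map`), which may differ in later releases.

```python
from torchdata.datapipes.iter import IterableWrapper

# Build a small pipeline by chaining DataPipes.
pipe = IterableWrapper(range(20))    # wrap a plain iterable as an IterDataPipe
pipe = pipe.shuffle()                # shuffling is a composable operation here,
                                     # unlike a hand-written IterableDataset
pipe = pipe.sharding_filter()        # marks where elements get sharded across workers
pipe = pipe.map(lambda x: x * 2)     # element-wise transform

for item in pipe:                    # DataPipes are also ordinary iterables
    print(item)
```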
## Lab for DataLoader and DataPipe
Go to N1222094 for the Data Lab.