simplify01 DesignNotes - GateNLP/gate-lf-python-data GitHub Wiki

Instance representation

Instances should be dictionaries instead of lists!
- can add new features from additional transforms (e.g. add char embedding)
- use feature name conventions, e.g. if we pre-compute the mask for "f_1", it is stored as "f_1_mask"
AllenNLP uses classes for everything (Field, Instance)
- Advantage: we have everything we need to know ABOUT the data directly available
- Advantage: we can have class-specific methods, yay OO!
- Disadvantage: not Python-y, need to remember lots of API
- Alternative: "need-to-know" basis only - most of the time we simply do not care what something is. Instances are just maps; how we use or process them is handled by convention and may differ depending on the situation. The metadata is stored elsewhere, connected only through the feature name (??)
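
A minimal sketch of the dictionary-based alternative, assuming token indices with 0 as the padding value. The helper `add_mask` and the `metadata` layout are hypothetical names used only for illustration, not part of any existing API:

```python
def add_mask(instance, feature, pad_value=0):
    """Return a copy of the instance with a '<feature>_mask' entry added.

    Hypothetical transform: follows the naming convention that the mask
    for feature "f_1" is stored under "f_1_mask".
    """
    out = dict(instance)  # shallow copy, the original instance is untouched
    out[feature + "_mask"] = [0 if v == pad_value else 1 for v in instance[feature]]
    return out


# Assumed layout: metadata lives outside the instances and is connected
# to them only through the feature name.
metadata = {"f_1": {"pad_value": 0}}

instance = {"f_1": [4, 17, 9, 0, 0], "target": 1}
instance = add_mask(instance, "f_1", pad_value=metadata["f_1"]["pad_value"])
```

Because the instance is a plain dict, later transforms (e.g. one that adds character embeddings) can add further keys in the same way without changing any class definition.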

Batch representation should be what the PyTorch DataLoader does by default with instances that are dictionaries: a dictionary with one batch per original key. We can avoid creating tensors if we provide our own collate function.
PADDING/MASKING: this needs to be done by our own collate function in the DataLoader!
- NOTE: we need to pad even if we use the PyTorch padded sequence stuff, but we can already pass on the mask or the sequence lengths to make that easier to do in the Module!
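
A rough sketch of such a collate function, in pure Python so that no tensors are created. The name `collate_pad` and the `_mask` / `_lengths` key suffixes are assumptions made for this example:

```python
def collate_pad(instances, pad_value=0):
    """Turn a list of dict instances into a dict of batches, one per key.

    Sequence-valued features are padded to the longest sequence in the
    batch; a mask and the original lengths are added under
    '<key>_mask' and '<key>_lengths' (hypothetical naming convention).
    """
    batch = {}
    for key in instances[0]:
        values = [inst[key] for inst in instances]
        if isinstance(values[0], (list, tuple)):
            lengths = [len(v) for v in values]
            max_len = max(lengths)
            batch[key] = [list(v) + [pad_value] * (max_len - len(v)) for v in values]
            batch[key + "_mask"] = [[1] * n + [0] * (max_len - n) for n in lengths]
            batch[key + "_lengths"] = lengths
        else:
            # scalar features (e.g. targets) are simply collected into a list
            batch[key] = values
    return batch


batch = collate_pad([
    {"f_1": [3, 5, 2], "target": 1},
    {"f_1": [7], "target": 0},
])
```

A function like this could be passed as the `collate_fn` argument of `torch.utils.data.DataLoader`; the Module can then convert the lists to tensors and use the lengths or the mask as needed.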

Try to follow common conventions and make stuff from other libraries (GenSim, Sklearn, ..) at least usable, e.g. dictionary filtering, then filtering / replacing based on the reduced dictionary.
Maybe make use of https://pytorchnlp.readthedocs.io/en/latest/source/torchnlp.text_encoders.html (though this goes directly to torch tensors, so it is not usable with the TF backend). See also http://anie.me/On-Torchtext/
torchtext has vocabs and related stuff as well: https://github.com/pytorch/text#data
Note: the replacing/filtering can be done as a transform, since we can easily add features to a dict.
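
As an illustration of filtering/replacing as a transform, here is a small sketch using only the standard library. The names `build_vocab` and `replace_oov`, the `_filtered` suffix, and the `<<OOV>>` symbol are all hypothetical choices for this example:

```python
from collections import Counter


def build_vocab(instances, feature, min_freq=2):
    """Reduced dictionary: keep only tokens seen at least min_freq times."""
    counts = Counter(tok for inst in instances for tok in inst[feature])
    return {tok for tok, n in counts.items() if n >= min_freq}


def replace_oov(instance, feature, vocab, oov="<<OOV>>"):
    """Transform: add '<feature>_filtered' with rare tokens replaced.

    The original feature is kept, the filtered version is added as a
    new key of the instance dict.
    """
    out = dict(instance)
    out[feature + "_filtered"] = [t if t in vocab else oov for t in instance[feature]]
    return out


instances = [
    {"tokens": ["the", "cat", "sat"]},
    {"tokens": ["the", "dog", "sat"]},
]
vocab = build_vocab(instances, "tokens", min_freq=2)
filtered = [replace_oov(inst, "tokens", vocab) for inst in instances]
```

The same transform shape would work with a dictionary reduced by another library, as long as it supports a membership test on tokens.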