simplify01 DesignNotes - GateNLP/gate-lf-python-data GitHub Wiki
Instance representation
Instances should be dictionaries instead of lists!
New features can be added by additional transforms (e.g. adding a char embedding).
Use feature name conventions, e.g. if we pre-compute the mask for "f_1", it is stored as "f_1_mask".
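A minimal sketch of such a transform, assuming a sequence feature whose padding value is 0. The function name add_mask and the pad_value parameter are made up for illustration; only the "_mask" suffix comes from the convention above.

```python
def add_mask(instance, name, pad_value=0):
    """Return a copy of the instance with a mask for `name` stored as `<name>_mask`."""
    values = instance[name]
    out = dict(instance)  # shallow copy, so the original instance is left untouched
    # 1 marks a real value, 0 marks a padding position
    out[name + "_mask"] = [0 if v == pad_value else 1 for v in values]
    return out


instance = {"f_1": [4, 7, 0, 0], "target": "PER"}
instance = add_mask(instance, "f_1")
```

Because the instance is a plain dictionary, the transform only adds a key; nothing else in the instance has to know about it.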
AllenNLP uses classes for everything (Field, Instance)
Advantage: we have everything we need to know ABOUT the data directly available
Advantage: we can have class-specific methods, yay OO!
Disadvantage: not Python-y, need to remember lots of API
Alternative: "need-to-know" basis only. Most of the time we simply do not care what something is.
Instances are just maps; how we use or process them is decided by convention, and possibly differently
depending on the situation. The metadata is stored elsewhere, connected only through the feature name (??)
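One way this could look, as a sketch only: the metadata lives in a separate dictionary keyed by feature name and is consulted only when some code actually needs it. FEATURE_META, its fields, and describe are hypothetical names, not part of any existing API.

```python
# Hypothetical metadata, kept apart from the instances and keyed by feature name.
FEATURE_META = {
    "f_1": {"kind": "sequence", "pad": 0},
    "f_2": {"kind": "scalar"},
}


def describe(instance, meta=FEATURE_META):
    """Look up the kind of each feature; features without metadata are 'unknown'."""
    return {name: meta.get(name, {}).get("kind", "unknown") for name in instance}


instance = {"f_1": [3, 9, 2], "f_2": 0.5, "extra": "abc"}
kinds = describe(instance)
```

The instance itself carries no type information; code that does not care about the kind of a feature never touches FEATURE_META.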
Batch representation
Batch representation should be what the PyTorch DataLoader does by default with instances that are dictionaries: a dictionary that maps each original key to a batch of the values for that key. Creating tensors can be avoided if we provide our own collate function.
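A sketch of a collate function that keeps this dictionary-of-batches layout but returns plain Python lists instead of tensors. The name collate_as_dict is an assumption; it also assumes all instances in a batch share the same keys.

```python
def collate_as_dict(instances):
    """Turn a list of instance dicts into one dict holding a list of values per key."""
    keys = instances[0].keys()  # assumes every instance has the same keys
    return {key: [inst[key] for inst in instances] for key in keys}


batch = collate_as_dict([
    {"f_1": [4, 7], "target": 1},
    {"f_1": [5, 2, 9], "target": 0},
])
```

A function like this could be passed to torch.utils.data.DataLoader through its collate_fn argument.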
PADDING/MASKING: this needs to be done by our own collate function in the dataloader!
NOTE: we need to pad even if we use the PyTorch padded sequence utilities, but we can already
pass on the mask or the sequence lengths to make it easier to do that in the Module!
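A sketch of a collate function that pads and also passes on the mask and the sequence lengths, again with plain lists. The names pad_collate, seq_keys and the "_lengths" suffix are assumptions; the "_mask" suffix follows the feature name convention from the instance notes.

```python
def pad_collate(instances, seq_keys=("f_1",), pad_value=0):
    """Collate instance dicts into a dict of batches, padding the sequence features."""
    batch = {key: [inst[key] for inst in instances] for key in instances[0]}
    for key in seq_keys:
        seqs = batch[key]
        lengths = [len(s) for s in seqs]
        max_len = max(lengths)
        # pad every sequence on the right up to the longest one in this batch
        batch[key] = [list(s) + [pad_value] * (max_len - len(s)) for s in seqs]
        # 1 for real positions, 0 for padding
        batch[key + "_mask"] = [[1] * n + [0] * (max_len - n) for n in lengths]
        batch[key + "_lengths"] = lengths
    return batch


padded = pad_collate([
    {"f_1": [4, 7], "target": 1},
    {"f_1": [5, 2, 9], "target": 0},
])
```

The Module then receives the padded batch together with the mask and the lengths and does not have to recompute either.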
Dictionaries/vocab/embeddings
Try to follow common conventions, and make objects from other libraries (Gensim, scikit-learn, ...) at least usable.
E.g.: dictionary filtering, then filtering / replacing based on reduced dictionary
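The two steps in the example could be sketched as follows in plain Python: first reduce the dictionary by a minimum count, then replace everything outside the reduced dictionary. The names build_vocab and replace_rare, the min_count threshold, and the "<UNK>" token are assumptions for illustration.

```python
from collections import Counter


def build_vocab(token_lists, min_count=2):
    """Dictionary filtering: keep only tokens seen at least `min_count` times."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    return {tok for tok, n in counts.items() if n >= min_count}


def replace_rare(tokens, vocab, unk="<UNK>"):
    """Replace every token that is not in the reduced dictionary."""
    return [tok if tok in vocab else unk for tok in tokens]


corpus = [["the", "cat", "sat"], ["the", "dog", "sat"], ["a", "cat"]]
vocab = build_vocab(corpus, min_count=2)
filtered = [replace_rare(toks, vocab) for toks in corpus]
```

A dictionary object from another library could take the place of build_vocab as long as it supports a membership test.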