simplify01 DesignNotes - GateNLP/gate-lf-python-data GitHub Wiki

Instance representation

  • Instances should be dictionaries instead of lists!
    • can add new features from additional transforms (e.g. add char embedding)
    • use feature name conventions, e.g. if we pre-compute the mask for "f_1", it is stored as "f_1_mask"
  • AllenNLP uses classes for everything (Field, Instance)
    • Advantage: we have everything we need to know ABOUT the data directly available
    • Advantage: we can have class-specific methods, yay OO!
    • Disadvantage: not Pythonic, need to remember a lot of API
    • Alternative: "need-to-know" basis only - most of the time we simply do not care what something is. Instances are just maps; how we use or process them is handled by convention, possibly differently depending on the situation. The metadata is stored elsewhere, connected only through the feature name (??)
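
A minimal sketch of the dictionary-based idea above, in plain Python. The function name `add_mask`, the feature values, and the `metadata` dictionary are hypothetical and only illustrate the "f_1" / "f_1_mask" naming convention; they are not part of the library.

```python
def add_mask(instance, name):
    """Return a copy of the instance with a mask feature added for `name`.

    The new key follows the naming convention "<name>_mask".
    """
    out = dict(instance)  # shallow copy, the original instance is left untouched
    out[name + "_mask"] = [1] * len(instance[name])
    return out


# Hypothetical metadata kept separately from the instances,
# linked to the data only through the feature name.
metadata = {"f_1": {"type": "index_sequence"}}

# An instance is just a map from feature name to value.
instance = {"f_1": [12, 7, 3], "target": "PER"}

# A transform adds a new feature without changing the existing ones.
instance = add_mask(instance, "f_1")
```

Because the instance is a plain dictionary, later transforms can add further keys without any change to code that only reads "f_1" or "target".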

Batch representation

  • Batch representation should be what the PyTorch DataLoader does by default with instances that are dictionaries: a dictionary with one batch per original key. Tensor creation can be avoided if we provide our own collate function.
  • PADDING/MASKING: this needs to be done by our own collate function in the dataloader!
    • NOTE: we need to pad even if we use the PyTorch padded sequence utilities, but we can already pass on the mask or sequence lengths to make that easier to do in the Module!
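
The collate step described above could look roughly like the following sketch. It is written in plain Python and returns nested lists rather than tensors. The name `pad_collate`, the `pad_value` parameter, and the "_lengths" key suffix are assumptions made for illustration; only the "_mask" suffix comes from the naming convention in these notes. The sketch also assumes that masks are not already pre-computed in the instances.

```python
def pad_collate(instances, pad_value=0):
    """Collate a list of dict instances into a dict of batches.

    List-valued features are padded to the longest sequence in the batch,
    and a mask plus the original lengths are added under derived keys.
    All other features are simply gathered into a list.
    """
    batch = {}
    for key in instances[0]:
        values = [inst[key] for inst in instances]
        if isinstance(values[0], list):
            lengths = [len(v) for v in values]
            max_len = max(lengths)
            # Pad every sequence on the right up to max_len.
            batch[key] = [v + [pad_value] * (max_len - len(v)) for v in values]
            # 1 marks a real element, 0 marks padding.
            batch[key + "_mask"] = [[1] * n + [0] * (max_len - n) for n in lengths]
            batch[key + "_lengths"] = lengths
        else:
            batch[key] = values
    return batch


batch = pad_collate([
    {"f_1": [1, 2, 3], "target": "A"},
    {"f_1": [4], "target": "B"},
])
```

A function of this shape can be passed as the `collate_fn` argument of `torch.utils.data.DataLoader`. Converting the padded lists to tensors is then left to the caller or to the Module, which can use the mask or the lengths directly.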

Dictionaries/vocab/embeddings

Hyperparameter search, experiment logging, experiment planning etc.

  • check if we should use/support some AutoML tool (which one?)

Misc Notes