simplify01 Requirements - GateNLP/gate-lf-python-data GitHub Wiki

Keep track of requirements for the simplify01 branch

  • The dataset should support transforms (the term used in computer vision libraries; use a class that implements __call__) that can do arbitrary processing on the original data we get from the data source (e.g. the line corpus or the original file corpus). Note that on-the-fly caching can keep the transformed data in a file-per-instance cache or another fast cache (possibly by specifying a cache instance!)
  • The basic dataset we get is a line corpus, but we should possibly also support a document/file corpus (a nested directory tree with one file per instance at the leaves).
    • One possibility is that the dataset optionally caches instances to files automatically, using a pair of functions: one saves the returned object to a stream/file and the other reads it back.
    • On-the-fly caching would then also speed up data loading.
  • Should support server-based application
  • Ideally: support a Java-only application somehow? (TF? DL4J?)
  • Support using separate initial train/develop/test files, or creating random (stratified?) samples from a single file
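
To make the transform and caching requirements more concrete, here is a minimal sketch of how they might fit together. It is not code from the branch: the names LineDataset and Tokenize, the cache_dir, save and load parameters, and the choice of pickle as the default save/load pair are all assumptions made for illustration.

```python
import os
import pickle


class LineDataset:
    """Hypothetical line-corpus dataset: one instance per line of a text file."""

    def __init__(self, path, transform=None, cache_dir=None,
                 save=pickle.dump, load=pickle.load):
        # Read the whole line corpus up front; fine for a sketch,
        # a real implementation might index line offsets instead.
        with open(path, encoding="utf-8") as f:
            self.lines = [line.rstrip("\n") for line in f]
        self.transform = transform
        self.cache_dir = cache_dir
        # Pair of functions: save(obj, stream) and load(stream) -> obj.
        self.save = save
        self.load = load
        if cache_dir is not None:
            os.makedirs(cache_dir, exist_ok=True)

    def __len__(self):
        return len(self.lines)

    def __getitem__(self, idx):
        cache_file = None
        if self.cache_dir is not None:
            # One cache file per instance (assumed naming scheme).
            cache_file = os.path.join(self.cache_dir, f"{idx}.cache")
            if os.path.exists(cache_file):
                with open(cache_file, "rb") as f:
                    return self.load(f)
        item = self.lines[idx]
        if self.transform is not None:
            item = self.transform(item)
        if cache_file is not None:
            # Cache the transformed instance on the fly.
            with open(cache_file, "wb") as f:
                self.save(item, f)
        return item


class Tokenize:
    """Example transform: a class that implements __call__."""

    def __init__(self):
        self.calls = 0  # counts how often the transform actually ran

    def __call__(self, line):
        self.calls += 1
        return line.lower().split()
```

With a cache directory set, asking for the same index twice runs the transform only once; the second access reads the saved object back from its cache file. Passing a different save/load pair (for example JSON-based functions) would change the on-disk format without touching the dataset class.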