
Implementation / Development Notes

Overview of classification/regression and sequence tagging

The main abstractions in the code are the Engine, the corpus representation obtained from it, and the feature specification / feature extraction classes that convert annotations into features; the sections below describe how they interact during training and application.

Training

Needed:

  • Feature specification and data directory
  • Instance annotations and, optionally, additional annotations specified in the feature specification
  • If the algorithm is a sequence tagging algorithm, the sequence annotations

Main steps:

  • if we have a known sequence tagging algorithm (currently only Mallet sequence tagging), check that the SequenceSpan type is specified; otherwise check that it is not specified
  • Read the feature specification
  • create the engine for the selected algorithm using `Engine.createEngine(algorithm,parms,featureInfo,targetType,dataDirectory)` (see the condensed sketch after this list)
  • get the corpus representation from the engine
  • for each document
    • (add the internal class feature, if we have classification and if necessary)
    • send all instance (and sequence) annotations to the corpus representation (corpusRepresentation.add)
    • TODO: this should change so that the annotations get sent to the engine instead in order to allow for more complex learning strategies
  • finish processing of the data (call the corpus representation's finish method) for any re-scaling etc.
    • however, for dense OOM (out-of-memory) representations, the scaling is delegated to the wrapper that converts the original representation to a numeric representation
    • TODO: put this on the engine so that the processing can be properly delegated
    • gather the information to be saved in the info file (the info data is part of the engine)
    • call engine.saveEngine(datadir) (TODO: the engine already knows its directory, so this could be made parameter-less)
  • call the engine.trainModel method
    • for some engines, this means that the in-memory representation first gets exported and an external command is run for training; other engines use the prepared in-memory representation or the prepared on-disk file
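
To make the flow easier to follow, here is a condensed sketch of the training steps above. It is an illustration only: the method names come from these notes, but the helper types, annotation set names and exact parameter lists are assumptions, not the actual plugin signatures.

// Condensed training sketch; types and parameter lists are assumed, not actual signatures.
void trainSketch(Algorithm algorithm, String parms, FeatureInfo featureInfo,
                 TargetType targetType, URL dataDirectory, Corpus corpus) {
  Engine engine = Engine.createEngine(algorithm, parms, featureInfo, targetType, dataDirectory);
  CorpusRepresentation crm = engine.getCorpusRepresentation();
  for (Document doc : corpus) {
    AnnotationSet instanceAS = doc.getAnnotations().get("Token");    // instance annotations
    AnnotationSet sequenceAS = doc.getAnnotations().get("Sentence"); // only for sequence tagging
    AnnotationSet inputAS    = doc.getAnnotations();                 // additional annotations
    crm.add(instanceAS, sequenceAS, inputAS, parms);                 // assumed parameter list
  }
  crm.finish();                     // re-scaling etc.
  engine.saveEngine(dataDirectory); // also writes the info file
  engine.trainModel(parms);         // assumed parameter list
}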

The main steps are almost the same for sequence tagging, classification or regression:

  • if the problem is chunking, a seq-encoder must be specified
  • if an algorithm is a sequence-tagging algorithm:
    • a sequence span annotation type is needed so sequences can be created
    • the target must be nominal, i.e. the problem must be a classification problem

Application

Main steps:

  • if a server URL is specified, just use that; for this we still need the info file
    • TODO: make it possible to have different ways to interact with the server, may need another parm for that
  • create the Engine using Engine.loadEngine(dataDir,algParms) (see the condensed sketch after this list)
    • As part of this, also load the feature info, target type, etc., and recreate the corpus representation (in most cases this is a Mallet corpus representation, which includes our own subclass of Pipe; this allows us to preserve everything needed to convert annotations to features/attributes)
  • for each document, call engine.applyModel(instanceAS,inputAS,sequenceAS,parms). This creates a sequence of ModelApplication objects which are used to actually modify the document (either by creating new annotations or by putting the class on the existing instance annotation)
  • for some engines, engine.applyModel actually sends a representation of the instances to an external process or server and gets the classifications back from there
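
A similarly condensed sketch of the application steps (again an illustration only; the return type of applyModel, the annotation set names and the parameter lists are assumptions):

// Condensed application sketch; types and parameter lists are assumed, not actual signatures.
void applySketch(URL dataDirectory, String algParms, Corpus corpus) {
  Engine engine = Engine.loadEngine(dataDirectory, algParms); // also restores feature info, pipe etc.
  for (Document doc : corpus) {
    AnnotationSet instanceAS = doc.getAnnotations().get("Token");
    AnnotationSet inputAS    = doc.getAnnotations();
    AnnotationSet sequenceAS = doc.getAnnotations().get("Sentence"); // only for sequence tagging
    List<ModelApplication> mas = engine.applyModel(instanceAS, inputAS, sequenceAS, algParms);
    for (ModelApplication ma : mas) {
      // each ModelApplication either creates a new output annotation or puts the
      // predicted class/value on the existing instance annotation
    }
  }
}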

(Mallet) Sparse Feature Vectors

For classification and regression, the independent features are implemented as a Mallet FeatureVector object. Attribute names as generated by the FeatureExtraction class are mapped to indices in the feature vector using the data alphabet of the pipe.

FeatureVector instances always use a sparse, non-binary representation. This means that values which are zero are not actually stored in the instance; instead, the instance keeps track of how many locations are actually used and maps location numbers to indices.
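
For illustration, a minimal sketch of how such a sparse vector can be constructed against a data alphabet (the attribute names and values here are made up; they only mimic what FeatureExtraction produces):

import cc.mallet.types.Alphabet;
import cc.mallet.types.FeatureVector;

Alphabet dataAlphabet = new Alphabet();
// attribute names are mapped to indices by the data alphabet
int i1 = dataAlphabet.lookupIndex("Token.string=house");
int i2 = dataAlphabet.lookupIndex("Token.length");
// only the non-zero entries are passed in; all other dimensions are implicitly 0.0
FeatureVector fv = new FeatureVector(dataAlphabet, new int[]{i1, i2}, new double[]{1.0, 5.0});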

To get all the non-zero values of a feature vector and their indices (sparse representation):

FeatureVector fv = (FeatureVector)instance.getData();
for(int loc=0; loc<fv.numLocations(); loc++) {
  int index = fv.indexAtLocation(loc);     // index into the data alphabet
  double value = fv.valueAtLocation(loc);  // stored (non-zero) value
}

To get all values of the vector:

int nrFeatures = pipe.getDataAlphabet().size();  // the data alphabet defines the full dimensionality
FeatureVector fv = (FeatureVector)instance.getData();
for(int index=0; index<nrFeatures; index++) {
  double value = fv.value(index);  // 0.0 for non-stored (zero) entries
}

Notes:

  • Sparse FeatureVector objects do not know about the "true" size of the sparse vector.
  • FeatureVector.location(index) returns the location of the index-th dimension if that dimension is stored (non-zero), and -1 for zero (non-stored) dimensions.
  • FeatureVector.value(index) returns the value at that index, or 0.0 for any non-stored location (irrespective of the true size).
  • FeatureVector.valueAtLocation(location) returns the value at that location, or throws an exception if the location does not exist (see the sketch below).
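
A small self-contained sketch illustrating these three methods (the alphabet entries are made up):

import cc.mallet.types.Alphabet;
import cc.mallet.types.FeatureVector;

Alphabet dict = new Alphabet();
int a = dict.lookupIndex("a");
int b = dict.lookupIndex("b");
int c = dict.lookupIndex("c");
// only the dimensions for "a" and "c" are stored, "b" is implicitly zero
FeatureVector fv = new FeatureVector(dict, new int[]{a, c}, new double[]{1.5, 2.0});
fv.numLocations();     // 2: number of stored (non-zero) entries
fv.location(b);        // -1: index b is not stored
fv.value(b);           // 0.0: non-stored indices yield 0.0
fv.valueAtLocation(1); // 2.0: value at the second stored location (index c)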

(Mallet) Sparse Instance Targets

We distinguish two tasks, classification and regression: for classification, the target alphabet will be an instance of LabelAlphabet; for regression it will be null.

The target of each instance is one of the following (a construction sketch follows this list):

  • a String for ordinary classification
  • an instance of NominalTargetWithCosts for classification where we have a per-instance cost vector
  • a Double for regression
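
For illustration, this is roughly what the three variants look like when Mallet Instance objects are created. This is only a sketch: the feature vector fv, the instance names and the NominalTargetWithCosts constructor arguments are assumptions; the real instances are built by the corpus representation's pipe.

import cc.mallet.types.Instance;

// ordinary classification: the target is the class label String
Instance inst1 = new Instance(fv, "PERSON", "instName1", null);
// classification with per-instance costs: the target is a NominalTargetWithCosts
// (the constructor arguments shown here are assumed)
Instance inst2 = new Instance(fv,
    new NominalTargetWithCosts("PERSON", new double[]{0.0, 1.0, 4.0}), "instName2", null);
// regression: the target is a Double
Instance inst3 = new Instance(fv, 2.7, "instName3", null);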

For classification, to get the actual String label of an instance:

LabelAlphabet la = (LabelAlphabet)pipe.getTargetAlphabet();
Object target = instance.getTarget();
Label l = la.lookupLabel(target);
Object entry = l.getEntry();
if (entry instanceof NominalTargetWithCosts) {
  // classification with per-instance cost vectors
  NominalTargetWithCosts ntwc = (NominalTargetWithCosts)entry;
  String targetString = ntwc.getClassLabel();
  double[] costs = ntwc.getCosts();
} else {
  // ordinary classification: the label entry is the class label String
  String targetString = (String)entry;
}
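
For regression there is no label alphabet (the target alphabet is null) and the target can be read directly; a minimal sketch:

// regression: the target is stored as a Double, no alphabet lookup needed
double targetValue = (Double)instance.getTarget();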

Dense Feature Vectors and Instances

TBD

Problems/TODO:

  • Instances are always stored in memory, which is not feasible for very large corpora (except, at the moment, for the wrappers for dense NN learning)
  • To support OOM (out-of-memory) export for e.g. Weka, we would need to know the header of the ARFF file first, which is not possible. For this, we would need to export the data to a temporary file first, then write the header, and then append the data to the header (unless Weka supports some other format where the header/metadata can be kept separate from the data)
  • If we always separate exporting from training, even for internal algorithms, we may get a cleaner implementation for experimenting.