CorpusRepresentation - GateNLP/gateplugin-LearningFramework GitHub Wiki

Implementation: CorpusRepresentation

Main functions:

  • Creating: this is currently done by calling the appropriate constructor
  • Add instances: this is done using the method
    • add(instanceAS, sequenceAS, inputAS, classAS, targetFeatureName, targetType, instanceWeightFeature, nameFeature, seqEncoder)
  • Finalizing: the method finalize() must be called after all instances have been added and before the corpus can be used for training. It performs any additional steps that need the whole training set.
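The order of these three steps can be illustrated with a minimal sketch. SketchCorpus below is a hypothetical stand-in, not the plugin's class: its add takes a plain string instead of annotation sets, its finalizing step is named finish, and the exception thrown before finalizing is an assumption made only to make the ordering visible.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in used only to show the call order; this is NOT the
// plugin's CorpusRepresentation class and its method names are assumptions.
final class SketchCorpus {
    private final List<String> instances = new ArrayList<>();
    private boolean finished = false;

    // Stand-in for add(...); the real method takes annotation sets and more.
    void add(String instance) {
        instances.add(instance);
    }

    // Stand-in for the finalizing step that runs over the whole training set.
    void finish() {
        finished = true;
    }

    // Refuses to hand out training data until the finalizing step has run.
    List<String> trainingData() {
        if (!finished) {
            throw new IllegalStateException("finalizing step has not been run");
        }
        return Collections.unmodifiableList(instances);
    }

    public static void main(String[] args) {
        SketchCorpus corpus = new SketchCorpus();   // 1. create
        corpus.add("instance-1");                   // 2. add instances
        corpus.add("instance-2");
        corpus.finish();                            // 3. finalize
        System.out.println(corpus.trainingData().size() + " instances ready");
    }
}
```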

Other methods are only available for the Mallet representations and can only be used from Engine instances which use the Mallet corpus representation.

Overview of how the parameters of corpusRepresentation.add are set, depending on the kind of algorithm and the kind of problem:

  • All: inputAS is where any additional annotations specified in the feature specification are taken from
  • All: instanceAS is the set of annotations representing an instance for the algorithm (so for chunking it would be Token, rather than e.g. Person)
  • Classification/Regression problem: needs a targetFeatureName and type, must not have a classAS (and no sequenceAS unless a sequence algorithm is used)
  • Sequence tagging problem: needs a classAS, must not have a targetFeatureName; the target type is NOMINAL
  • Sequence algorithm: must have sequenceAS
  • Classification problem / classification algorithm:
    • must have targetFeatureName and type
    • must not have classAS
    • must not have sequenceAS
  • Classification problem / sequence algorithm:
    • must have sequenceAS
    • must have targetFeatureName and type
    • must not have classAS
  • Sequence problem / classification algorithm:
    • must have seq-encoder
    • must have classAS
    • must not have targetFeatureName but must have type NOMINAL
    • must not have sequenceAS
  • Sequence problem / sequence algorithm:
    • must have seq-encoder
    • must have classAS
    • must have sequenceAS
    • must not have targetFeatureName but must have type NOMINAL

The TrainClassification and TrainChunking PRs already disallow some of these combinations, but the Export PR handles both classification/regression and chunking, so at the moment it allows invalid combinations to be specified.

In other words, we have the following kinds of implications:

  • Sequence problem - implies classAS and seq-encoder are required
  • Classification problem - implies targetFeatureName is required, type=NOMINAL
  • Sequence algorithm - implies sequenceAS is required, type=NOMINAL
  • Classification algorithm - implies no sequenceAS

NOTE: for classification, the feature specification file allows a "WITHIN" annotation type to be specified. Although this is often similar to a sequence annotation (e.g. sentence), it is only used to restrict features to stay within a certain context. For example, for "ATTRIBUTELIST" the "-2" feature will have a missing value at the beginning of the WITHIN annotation. The sequence annotation has no influence on this.
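The effect of WITHIN on a relative feature such as "-2" can be illustrated with a simplified model. The sketch assumes instances are positions in a token list and the WITHIN annotation is a half-open index range. The class and method names are hypothetical, and null stands in for a missing value, which is not necessarily how the plugin represents it.

```java
import java.util.Arrays;
import java.util.List;

// Simplified, hypothetical model of a WITHIN-restricted relative feature.
final class WithinWindowSketch {
    // Returns the token at (index + offset), or null (missing value) when that
    // position lies outside the WITHIN range [withinStart, withinEnd).
    static String relativeValue(List<String> tokens, int index, int offset,
                                int withinStart, int withinEnd) {
        int pos = index + offset;
        if (pos < withinStart || pos >= withinEnd) {
            return null;
        }
        return tokens.get(pos);
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("It", "rains", ".", "We", "stay", "in", ".");
        // The second sentence covers token indices 3..6, i.e. the range [3, 7).
        // At its first token, position -2 lies in the previous sentence: missing.
        System.out.println(relativeValue(tokens, 3, -2, 3, 7));
        // Two tokens further on, position -2 is inside the sentence: "We".
        System.out.println(relativeValue(tokens, 5, -2, 3, 7));
    }
}
```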