Tagger_CoreNLP - GateNLP/gateplugin-Tagger_CoreNLP GitHub Wiki

Tagger_CoreNLP Processing Resource

This processing resource annotates documents by connecting to a CoreNLP server, sending the raw text of the document (or of sections of the document) and creating annotations based on the information the server sends back.

Runtime Parameters

  • containingAnnotationType (String, no default): if this is specified, annotations of this type from the input annotation set identify the spans of the document that should get annotated. The PR sends one request to the server for each span. This can be used, for example, to annotate only the text without the boilerplate, or only the text of a specific language in a mixed-language document.
  • inputAnnotationSet (String, default is empty for the default annotation set): only relevant if the containingAnnotationType parameter is specified, in which case it names the annotation set that holds the containing annotations.
  • outputAnnotationSet (String, default is empty for the default annotation set): the annotation set where the new annotations will be added. See below for which annotations and features are created.
  • properties (String, no default): the properties that represent CoreNLP configuration settings to send to the CoreNLP server. See "CoreNLP properties" below.
  • serverUrl (String, default is http://127.0.0.1:9000/): the URL of the server endpoint where the CoreNLP server is running
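
The one-request-per-span behaviour can be sketched as follows. This is a minimal Python illustration, not the plugin's actual code: the helper name build_requests and the (start, end) span tuples are hypothetical, and the sketch assumes the settings are passed the way the CoreNLP server accepts them, as JSON in the `properties` URL parameter with the raw text as the POST body.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_requests(text, spans, server_url="http://127.0.0.1:9000/", properties=None):
    """Build one POST request per containing span (hypothetical helper).

    spans is a list of (start, end) character offsets. If it is empty,
    the whole document text is sent as a single request.
    """
    props = dict(properties or {})
    # Assumption: ask the server for JSON so the response is easy to parse.
    props.setdefault("outputFormat", "json")
    query = urlencode({"properties": json.dumps(props)})

    if not spans:
        spans = [(0, len(text))]

    requests = []
    for start, end in spans:
        body = text[start:end].encode("utf-8")
        request = Request(server_url + "?" + query, data=body, method="POST")
        # Keep the span start so that offsets in the response can later be
        # shifted back into document coordinates.
        requests.append((start, request))
    return requests
```

Each returned request could then be sent with `urllib.request.urlopen`. Keeping the span start alongside the request matters because the server only sees the text of the span, so the offsets it reports are relative to that span.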

Annotations created

  • Sentence: for each sentence detected by CoreNLP one annotation is created.
  • Token: for each token detected by CoreNLP one annotation is created. It can have the following features; a feature is only present if CoreNLP returned a value for it:
    • category: this contains the POS-tag, the value of the CoreNLP field "pos"
    • root: this contains the lemma, the value of the CoreNLP field "lemma". If a lemmatizer is not available for a language, this appears to just contain the original word form.
    • index: the token number within the sentence
    • ner: the named entity type, or "O" if no named entity
  • [TYPE]: for each possible named entity type an annotation is created that spans all tokens which have the same value of the ner feature. The concrete annotation types depend on the model.
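
The mapping from a server response to these annotations can be sketched in Python. This is an illustration under stated assumptions rather than the plugin's implementation: it assumes the JSON output format of the CoreNLP server (a "sentences" list whose "tokens" carry "characterOffsetBegin", "characterOffsetEnd", "pos", "lemma", "index" and "ner"), it derives the Sentence span from the first and last token, and it treats a named entity as a run of consecutive tokens with the same ner value. The function name to_annotations and the sample response are made up for the example.

```python
def to_annotations(response, offset=0):
    """Convert a CoreNLP JSON response into (type, start, end, features) tuples.

    offset is the document position of the text span that was sent, so
    that the resulting annotations are in document coordinates.
    """
    # (annotation feature name, CoreNLP field name)
    feature_fields = (("category", "pos"), ("root", "lemma"),
                      ("index", "index"), ("ner", "ner"))
    annotations = []
    for sentence in response.get("sentences", []):
        tokens = sentence.get("tokens", [])
        if not tokens:
            continue
        annotations.append(("Sentence",
                            offset + tokens[0]["characterOffsetBegin"],
                            offset + tokens[-1]["characterOffsetEnd"], {}))
        run_type, run_start, run_end = None, None, None
        for tok in tokens:
            start = offset + tok["characterOffsetBegin"]
            end = offset + tok["characterOffsetEnd"]
            # Only copy the fields that CoreNLP actually returned.
            features = {name: tok[field] for name, field in feature_fields
                        if field in tok}
            annotations.append(("Token", start, end, features))
            ner = tok.get("ner", "O")
            if ner == run_type:
                run_end = end  # extend the current entity run
            else:
                if run_type not in (None, "O"):
                    annotations.append((run_type, run_start, run_end, {}))
                run_type, run_start, run_end = ner, start, end
        # Close an entity run that reaches the end of the sentence.
        if run_type not in (None, "O"):
            annotations.append((run_type, run_start, run_end, {}))
    return annotations


# Made-up response for the text "Angela Merkel visited Paris."
sample = {"sentences": [{"index": 0, "tokens": [
    {"index": 1, "characterOffsetBegin": 0, "characterOffsetEnd": 6,
     "pos": "NNP", "lemma": "Angela", "ner": "PERSON"},
    {"index": 2, "characterOffsetBegin": 7, "characterOffsetEnd": 13,
     "pos": "NNP", "lemma": "Merkel", "ner": "PERSON"},
    {"index": 3, "characterOffsetBegin": 14, "characterOffsetEnd": 21,
     "pos": "VBD", "lemma": "visit", "ner": "O"},
    {"index": 4, "characterOffsetBegin": 22, "characterOffsetEnd": 27,
     "pos": "NNP", "ner": "LOCATION"},
    {"index": 5, "characterOffsetBegin": 27, "characterOffsetEnd": 28,
     "pos": ".", "ner": "O"},
]}]}

annotations = to_annotations(sample, offset=100)
```

In this sample the fourth token has no "lemma" field, so its Token annotation has no root feature, which mirrors the rule that only features with a value are present.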

CoreNLP properties

The properties parameter can be used to send settings to the CoreNLP server. See the CoreNLP documentation for authoritative information about the available properties. Commonly used ones are:

  • annotators: a comma-separated list of annotator names, e.g. "tokenize, ssplit, pos, ner".
  • tokenize.language: the language code for tokenization, e.g. "de"
  • pos.model: the path to the model to use for doing POS tagging. E.g. for using the German model, this could be set to "edu/stanford/nlp/models/pos-tagger/german/german-hgc.tagger" (assuming the model has been installed at the default location)
  • ner.model: the path to the model to use for NER. E.g. for using the German model, this could be set to "edu/stanford/nlp/models/ner/german.hgc_175m_600.crf.ser.gz"
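
As an illustration, the German settings listed above can be combined into a single set of properties. The Python sketch below serializes them as JSON, which is the form the CoreNLP server accepts in its `properties` URL parameter. Whether the plugin's properties parameter expects this JSON form or another syntax is not stated here, so treat the format as an assumption and check the plugin before copying it.

```python
import json

# Illustrative German configuration. The keys and model paths are the ones
# quoted in the list above; the models must be available to the server.
german_properties = {
    "annotators": "tokenize, ssplit, pos, ner",
    "tokenize.language": "de",
    "pos.model": "edu/stanford/nlp/models/pos-tagger/german/german-hgc.tagger",
    "ner.model": "edu/stanford/nlp/models/ner/german.hgc_175m_600.crf.ser.gz",
}

# Serialize to a JSON string (assumed format, see the note above).
properties_json = json.dumps(german_properties, indent=2)
print(properties_json)
```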