Processing resources and modules - gsi-upm/SAGA GitHub Wiki

First of all, we need to clarify what a processing resource and a module are.

A processing resource performs a single task over a corpus, such as tokenizing text, tagging words, deleting tags, or counting the words in a text.

A module is a chain of processing resources that are executed one after another, each one using the output of the previous one.
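
The relationship between the two can be illustrated with a minimal Java sketch. It does not use the GATE API: the class `ModuleSketch`, its methods, and the idea of modelling a document as a plain `String` are all assumptions made for illustration only.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.UnaryOperator;

// Hypothetical sketch: none of these names come from GATE or SAGA.
final class ModuleSketch {

    // Processing resource: removes anything that looks like a <tag>.
    static String deleteTags(String text) {
        return text.replaceAll("<[^>]*>", "");
    }

    // Processing resource: lower-cases the text.
    static String lowerCase(String text) {
        return text.toLowerCase(Locale.ROOT);
    }

    // A module runs its processing resources in order; each one
    // receives the output of the previous one.
    static String runModule(List<UnaryOperator<String>> resources, String document) {
        String current = document;
        for (UnaryOperator<String> resource : resources) {
            current = resource.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<UnaryOperator<String>> module =
                Arrays.asList(ModuleSketch::deleteTags, ModuleSketch::lowerCase);
        System.out.println(runModule(module, "<b>Good</b> Day")); // prints "good day"
    }
}
```

In GATE itself, the processing resources operate on annotated documents rather than on raw strings, so this only shows the chaining idea.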

Modules

Dictionary Based Information Extractor

This is the base module used for dictionary-based sentiment analysis. Its Java constructor lets us set the dictionaries used by the gazetteer processing resource it includes.

It includes the following ANNIE processing resources:

  1. ANNIE Annotation Delete PR: used to clean up the documents in the corpus in case they were processed with another module.
  2. ANNIE Tokeniser PR: sets and applies the tokenization rules in order to split the text into the words that form it.
  3. ANNIE Gazetteer PR, set with the dictionaries we choose: used to identify the different words in the text and add a lookup annotation to them. At a minimum, the following dictionary files are required, structured as:
    • /src/resources/gazetteer/topic/negative.lts
    • /src/resources/gazetteer/topic/positive.lts
    • /src/resources/gazetteer/topic/lists.def
    Where topic is the topic of the dictionaries we are using: finances, sports, music...
  4. ANNIE Transducer PR: transforms the lookup annotations into its own annotations using JAPE rules (/src/resources/jape/main.jape).

It is essential that the dictionaries include at least two called positive and negative, because later processing resources rely on the annotations with those names.

Dictionary Based Sentiment Analyzer

This module extends the Dictionary Based Information Extractor by adding two more processing resources to it. Its purpose is to determine the sentiment value and the polarity of the documents in a corpus.

The processing resources added are:

  1. Text Value And Polarity Generator: used when the module is executed in GATE Developer or in a web service.
  2. Word Value And Polarity Generator: used when the module is executed in a web service.

Machine Learning Based Analyzer

This is the base module used for machine-learning-based sentiment analysis. Its Java constructor lets us set, among other things, the paum.xml file and the run mode used by the Batch Learning processing resource it includes. paum.xml is the file that configures the learner. The run mode can be TRAINING (trains the learner to generate a model), APPLICATION (applies the model to a corpus) or EVALUATION (runs cross-validation).

It includes the following ANNIE processing resources:

  1. ANNIE Annotation Delete PR: used to clean up the documents in the corpus in case they were processed with another module.
  2. ANNIE Annotation Set Transfer PR: used in training or evaluation mode; transfers the annotations we want to learn from one annotation set to another.
  3. ANNIE Transducer PR: used in application mode to prepare the lookup annotations of a previously analyzed corpus.
    • /src/resources/machineLearning/topic/copy_comment_spans.jape_
  4. ANNIE Tokeniser PR: sets and applies the tokenization rules in order to split the text into the words that form it.
  5. ANNIE Sentence Splitter PR: sets and applies the sentence-splitting rules in order to split the text into sentences.
  6. ANNIE POS Tagger PR: prepares annotations.
  7. GATE Morphological Analyser PR: prepares annotations.
  8. GATE Batch PR, set with the paum.xml file and the RunMode we choose: used to train a learner, apply a model, or run cross-validation over a corpus. This is a Machine Learning PR.
    • /src/resources/machineLearning/topic/paum.xml
    • /src/resources/machineLearning/topic/corpora

Where topic is the topic of the corpora we are using: reviews, finances, sports, music...

Processing resources

Text Value and Polarity Generator

Counts the positive and negative annotations in a given document and returns the following information:

  1. A value between -1 (negative) and 1 (positive), using the following trivial algorithm: sentiment_value = (positiveAnnotations - negativeAnnotations)/(positiveAnnotations + negativeAnnotations)
  2. The polarity of the text: positive, negative or neutral.
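
The formula above can be sketched in Java as follows. This is not the code in TextValueAndPolarityGenerator.java: the class and method names are hypothetical, and two behaviours are assumptions that the text does not specify, namely returning 0 when a document has no positive or negative annotations (to avoid dividing by zero) and mapping a value of exactly 0 to neutral.

```java
// Hypothetical sketch of the sentiment value formula; not the SAGA implementation.
final class SentimentSketch {

    // sentiment_value = (positive - negative) / (positive + negative)
    static double sentimentValue(int positiveAnnotations, int negativeAnnotations) {
        int total = positiveAnnotations + negativeAnnotations;
        if (total == 0) {
            return 0.0; // assumption: no annotations means a neutral value
        }
        return (double) (positiveAnnotations - negativeAnnotations) / total;
    }

    // Assumption: the sign of the value decides the polarity.
    static String polarity(double value) {
        if (value > 0) {
            return "positive";
        }
        if (value < 0) {
            return "negative";
        }
        return "neutral";
    }

    public static void main(String[] args) {
        double value = sentimentValue(3, 1);
        System.out.println(value + " " + polarity(value)); // prints "0.5 positive"
    }
}
```

With 3 positive and 1 negative annotations the value is (3 - 1)/(3 + 1) = 0.5, which lies inside the [-1, 1] range described above.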

The implementation of this processing resource can be found at gate.sa.modules/src/processingResources/TextValueAndPolarityGenerator.java

Word Value And Polarity Generator

Stores in an array every positive or negative annotation with the following structure:

  1. Annotation, usually a word.
  2. Value: -1 if it is negative or 1 if it is positive.
  3. Polarity: positive or negative.
  4. Initial position in the text.
  5. Final position in the text.

This array is used when the processing resource is called from a web service, in order to perform operations on it such as Marl generation.
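
The five-field structure described above could look like the following Java sketch. It is not the code in WordValueAndPolarityGenerator.java: the names `WordSentimentSketch`, `Entry` and `collect` are hypothetical, and plain word matching against two sets stands in for the positive and negative annotations that GATE would provide.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; word matching replaces real GATE annotations.
final class WordSentimentSketch {

    // One entry per positive or negative annotation.
    static final class Entry {
        final String annotation; // usually a word
        final int value;         // -1 if negative, 1 if positive
        final String polarity;   // "positive" or "negative"
        final int start;         // initial position in the text
        final int end;           // final position in the text (exclusive)

        Entry(String annotation, int value, String polarity, int start, int end) {
            this.annotation = annotation;
            this.value = value;
            this.polarity = polarity;
            this.start = start;
            this.end = end;
        }
    }

    // Scans the text word by word and records every positive or negative hit.
    static List<Entry> collect(String text, Set<String> positives, Set<String> negatives) {
        List<Entry> entries = new ArrayList<>();
        Matcher matcher = Pattern.compile("\\w+").matcher(text);
        while (matcher.find()) {
            String word = matcher.group().toLowerCase(Locale.ROOT);
            if (positives.contains(word)) {
                entries.add(new Entry(matcher.group(), 1, "positive", matcher.start(), matcher.end()));
            } else if (negatives.contains(word)) {
                entries.add(new Entry(matcher.group(), -1, "negative", matcher.start(), matcher.end()));
            }
        }
        return entries;
    }

    public static void main(String[] args) {
        Set<String> positives = new HashSet<>(Arrays.asList("good"));
        Set<String> negatives = new HashSet<>(Arrays.asList("bad"));
        for (Entry e : collect("good but bad", positives, negatives)) {
            System.out.println(e.annotation + " " + e.value + " " + e.polarity + " " + e.start + " " + e.end);
        }
    }
}
```

A web service could then serialize each entry into whatever output format it needs.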

The implementation of this processing resource can be found at gate.sa.modules/src/webProcessingResources/WordValueAndPolarityGenerator.java