Analytics Pipeline - GeeUnit/hw5-team07 GitHub Wiki
There are two different analytics pipelines used in the project:
edu.cmu.lti.deiis.hw5.runner.SimpleRunCPE.java
contains an executable that runs the document preparation portion of the pipeline.
Document Annotation Pipeline
The following are some of the components we developed in our Annotation Pipeline:
POS Tagging
POS tagging was performed so that we understand what type of answer a given question is seeking. For example, the question "How many students attend CMU?" should be answered by a number, not a verb. POS tagging was accomplished using Stanford's CoreNLP library.
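As a minimal sketch of how a POS tag can constrain the answer type, the class below maps a question prefix to an expected Penn Treebank tag (`CD` is the cardinal-number tag, `VB` a base-form verb). The class name `AnswerTypeSketch`, its methods, and the prefix rules are hypothetical and only illustrate the idea; they are not taken from the project's code, and the tags are assumed to have been produced upstream by the tagger.

```java
import java.util.Locale;

class AnswerTypeSketch {
    // Hypothetical rule: quantity questions expect a cardinal number (Penn Treebank "CD").
    // Returns null when the question places no constraint on the answer's tag.
    static String expectedTag(String question) {
        String q = question.trim().toLowerCase(Locale.ROOT);
        if (q.startsWith("how many") || q.startsWith("how much")) {
            return "CD";
        }
        return null;
    }

    // A candidate is compatible if there is no constraint or its tag matches the expected one.
    static boolean isCompatible(String question, String candidateTag) {
        String expected = expectedTag(question);
        return expected == null || expected.equals(candidateTag);
    }
}
```

With these rules, `isCompatible("How many students attend CMU?", "CD")` accepts a numeric candidate, while the same question with a candidate tagged `VB` is rejected.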
Named Entity Extraction
The ABNER library was used to find named entities in documents. As "A Biomedical Named Entity Recognizer", ABNER excels at detecting protein and gene names, making it well suited to named entity detection in Alzheimer's journals.
Noise Filter
The Noise Filter removes unnecessary sentences from the document. For example, figure and table contents that were OCR'd out of context are removed.
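The page does not describe the filtering rules, so the following is only a sketch of one plausible heuristic: drop sentences that begin like a figure or table caption, or that consist mostly of non-letter characters (as OCR'd table cells often do). The class name `NoiseFilterSketch`, the caption pattern, and the 0.5 letter-ratio threshold are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class NoiseFilterSketch {
    // Assumed caption pattern: text starting with "Fig. 2", "Figure 3", "Table 1", etc.
    private static final Pattern CAPTION =
        Pattern.compile("^(Fig\\.?|Figure|Table)\\s*\\d+", Pattern.CASE_INSENSITIVE);

    // Assumed threshold: sentences with fewer than half letters are treated as noise.
    private static final double MIN_LETTER_RATIO = 0.5;

    static boolean isNoise(String sentence) {
        String s = sentence.trim();
        if (s.isEmpty()) {
            return true;
        }
        if (CAPTION.matcher(s).find()) {
            return true;
        }
        long letters = s.chars().filter(Character::isLetter).count();
        return (double) letters / s.length() < MIN_LETTER_RATIO;
    }

    // Returns the sentences that are not flagged as noise, in their original order.
    static List<String> filter(List<String> sentences) {
        List<String> kept = new ArrayList<>();
        for (String s : sentences) {
            if (!isNoise(s)) {
                kept.add(s);
            }
        }
        return kept;
    }
}
```

A heuristic this simple also discards legitimate sentences that start with "Table 1 shows ...", so a real filter would need more careful rules.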
Noun Phrases
Noun phrases are used in order to locate the most relevant artifacts in a sentence (for example, the subject and target of each sentence).
Sentence Expansion
The baseline code only scored answers based on their relevance to an individual sentence. However, in many cases the answer to a question is spread across multiple sentences. Therefore, in our system, we attempted to include contextual named entities and noun phrases from surrounding sentences.
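The idea of pulling in context from surrounding sentences can be sketched as a sliding window over per-sentence term sets. In this illustration each sentence is represented by the set of its named entities and noun phrases, and each set is merged with those of its neighbours. The class name `SentenceExpansionSketch` and the `window` parameter are hypothetical; the page does not say how many neighbouring sentences the system actually used.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

class SentenceExpansionSketch {
    // For sentence i, merge its own terms with the terms of sentences
    // i - window .. i + window (clamped to the document boundaries).
    static List<Set<String>> expand(List<Set<String>> termsPerSentence, int window) {
        List<Set<String>> expanded = new ArrayList<>();
        int last = termsPerSentence.size() - 1;
        for (int i = 0; i <= last; i++) {
            // Start with the sentence's own terms so they come first in iteration order.
            Set<String> merged = new LinkedHashSet<>(termsPerSentence.get(i));
            int from = Math.max(0, i - window);
            int to = Math.min(last, i + window);
            for (int j = from; j <= to; j++) {
                merged.addAll(termsPerSentence.get(j));
            }
            expanded.add(merged);
        }
        return expanded;
    }
}
```

With a window of 1, a sentence that only mentions "it" or "this protein" inherits the entities named in the sentences immediately before and after it.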
Synonym Expansion
We attempted synonym expansion so that noun phrases and named entities additionally include terms that may not have been mentioned directly in the literature. In Synonym Expansion, we added words that could constitute synonyms to the noun phrases and named entities already annotated in our document CAS.
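A minimal sketch of this kind of expansion, assuming the synonyms come from a simple in-memory dictionary keyed by lower-cased term. The page does not name the synonym resource the system used, so the class `SynonymExpansionSketch` and the `synonyms` map are hypothetical stand-ins.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

class SynonymExpansionSketch {
    // Returns the original terms plus any synonyms found for them.
    // Lookup is case-insensitive; the original term is always kept as written.
    static Set<String> expand(Collection<String> terms, Map<String, List<String>> synonyms) {
        Set<String> expanded = new LinkedHashSet<>();
        for (String term : terms) {
            expanded.add(term);
            String key = term.toLowerCase(Locale.ROOT);
            expanded.addAll(synonyms.getOrDefault(key, Collections.emptyList()));
        }
        return expanded;
    }
}
```

For example, with a dictionary entry mapping "amyloid beta" to "abeta" and "beta-amyloid", a sentence annotated with "Amyloid beta" would also match questions that use either alternative name.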
Ngram Annotator
This annotator creates NGram tokens (N = 1, 2, 3) for each sentence. Since each token has a POS tag, we also obtain NGram POS tags. The intuition behind this annotator is to calculate similarity based on the NGram model. However, the NGram annotator did not improve our system's performance because the annotations were too general to be applied to the current matching algorithms. More investigation into which NGrams could boost our system's performance is required, but due to time limitations our group views this as a low-priority task. In addition, given the format of answers in the data set, we believed there would be little to no overlap of NGrams, so NGrams were not included in the final pipeline.
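NGram generation for N = 1, 2, 3 can be sketched as below. The same method works on a list of tokens or on the parallel list of POS tags, which is how token NGrams and POS NGrams can both be produced. The class name `NGramSketch` and the choice to join items with a single space are assumptions for illustration, not the annotator's actual representation.

```java
import java.util.ArrayList;
import java.util.List;

class NGramSketch {
    // All contiguous n-grams of the given size, each joined by a single space.
    static List<String> ngrams(List<String> items, int n) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + n <= items.size(); i++) {
            out.add(String.join(" ", items.subList(i, i + n)));
        }
        return out;
    }

    // All n-grams for n = 1 .. maxN, shortest first.
    static List<String> upTo(List<String> items, int maxN) {
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            out.addAll(ngrams(items, n));
        }
        return out;
    }
}
```

Calling `ngrams` on the tokens `How`, `many`, `students` with `n = 2` yields `How many` and `many students`; calling it on the corresponding tag list yields the POS bigrams.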
Dependency Parsing
Dependency parsing was also accomplished via Stanford CoreNLP. However, we found that Solr was not a good medium for storing dependency parse trees. Dependency parsing would be better served by a graph database, such as Neo4j.
Scoring Pipeline
edu.cmu.lti.deiis.hw5.runner.SimpleQuestionRunCPE.java
contains an executable that runs the question answer portion of our system.
For more information on how answers are selected, please see the following pages:
- Answer Pruning
- [The Candidate Sentence Retriever](https://github.com/GeeUnit/hw5-team07/wiki/Candidate-Answer-Retrieval)
- The Candidate Answer Scoring Strategy
- The Candidate Answer Selection Strategy