Analysis Engine: Document - 11791-04/project-team04 GitHub Wiki
Work Flow:
- Reads the question as a raw String from the Question Reader.
- Uses a unigram model to filter out common terms that carry little meaning.
- Queries the PubMed API to obtain candidate documents.
- Processes each document (title and abstract) and the query by removing punctuation and stop words and applying Krovetz stemming.
- Ranks documents by their similarity to the query.
- Several rankers were tried, including Okapi BM25, Indri, Dirichlet and XQL.
- The results are written to the JCas by Document.
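The preprocessing and ranking steps above can be illustrated with a minimal sketch. It is not the project's code: the class name `Bm25Sketch`, the tiny stop list, and the parameter values `k1 = 1.2` and `b = 0.75` are assumptions chosen for illustration, and the stemmer is passed in as a plain function because the Krovetz stemmer implementation is not shown on this page. Only Okapi BM25 is sketched; the other rankers mentioned are not covered.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.UnaryOperator;

public class Bm25Sketch {
    // Illustrative stop list only; the project's actual list is an unknown here.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "the", "of", "in", "is", "are", "what"));

    // Lowercase, replace punctuation with spaces, drop stop words, then stem each token.
    public static List<String> preprocess(String text, UnaryOperator<String> stemmer) {
        List<String> tokens = new ArrayList<>();
        String cleaned = text.toLowerCase().replaceAll("[^a-z0-9\\s]", " ").trim();
        for (String tok : cleaned.split("\\s+")) {
            if (!tok.isEmpty() && !STOP_WORDS.contains(tok)) {
                tokens.add(stemmer.apply(tok));
            }
        }
        return tokens;
    }

    // Okapi BM25 score of every document against the query (assumed k1 = 1.2, b = 0.75).
    public static double[] score(List<String> query, List<List<String>> docs) {
        final double k1 = 1.2;
        final double b = 0.75;
        int n = docs.size();
        double avgLen = docs.stream().mapToInt(List::size).average().orElse(0.0);

        // Document frequency: number of documents containing each term.
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                df.merge(term, 1, Integer::sum);
            }
        }

        double[] scores = new double[n];
        for (int i = 0; i < n; i++) {
            List<String> doc = docs.get(i);
            Map<String, Integer> tf = new HashMap<>();
            for (String term : doc) {
                tf.merge(term, 1, Integer::sum);
            }
            for (String term : new HashSet<>(query)) {
                int f = tf.getOrDefault(term, 0);
                if (f == 0) {
                    continue; // term absent from this document contributes nothing
                }
                int dfTerm = df.get(term);
                double idf = Math.log(1.0 + (n - dfTerm + 0.5) / (dfTerm + 0.5));
                double lengthNorm = k1 * (1 - b + b * doc.size() / avgLen);
                scores[i] += idf * f * (k1 + 1) / (f + lengthNorm);
            }
        }
        return scores;
    }
}
```

In this sketch the query and every candidate document go through the same `preprocess` call, so both sides are tokenized identically before scoring. A real stemmer can be supplied in place of `UnaryOperator.identity()`.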
Outline:
descriptorimpl.DocumentRetrieval_AE
- service : WebAPIServiceProxy
- stemmer : KrovetzStemmer
- outQuestions : PrintWriter
- baseline : boolean
- conceptSet : Set<String>
- initialize(UimaContext)
- qeWithConcept(String)
- process(JCas)
- collectionProcessComplete()
descriptorimpl.DocumentRetrieval_AE.DocScoreComparator
- compare(Pair<DocInfo, Double>, Pair<DocInfo, Double>)
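
As a rough illustration of how a comparator with this signature could order scored documents, here is a minimal sketch. The `DocInfo` and `Pair` classes below are hypothetical stand-ins, since the project's own definitions are not shown on this page, and the descending sort order (highest score first) is an assumption rather than confirmed behavior.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class DocScoreComparatorSketch {
    // Hypothetical stand-in for the project's DocInfo type.
    public static final class DocInfo {
        public final String pmid;
        public DocInfo(String pmid) { this.pmid = pmid; }
    }

    // Hypothetical stand-in for the project's Pair type.
    public static final class Pair<K, V> {
        public final K key;
        public final V value;
        public Pair(K key, V value) { this.key = key; this.value = value; }
    }

    // Assumed ordering: higher similarity score sorts first.
    public static final class DocScoreComparator
            implements Comparator<Pair<DocInfo, Double>> {
        @Override
        public int compare(Pair<DocInfo, Double> a, Pair<DocInfo, Double> b) {
            return Double.compare(b.value, a.value);
        }
    }

    // Returns document ids ordered from best to worst score, leaving the input untouched.
    public static List<String> rankedIds(List<Pair<DocInfo, Double>> scored) {
        List<Pair<DocInfo, Double>> copy = new ArrayList<>(scored);
        copy.sort(new DocScoreComparator());
        List<String> ids = new ArrayList<>();
        for (Pair<DocInfo, Double> p : copy) {
            ids.add(p.key.pmid);
        }
        return ids;
    }
}
```

Using `Double.compare` rather than subtracting the scores avoids rounding a small difference to zero and handles NaN consistently.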