Project Overview - wtepfenhart/Link-Analysis GitHub Wiki
This page gives a more detailed overview of the project as a whole: its current status, the modifications and advances we plan to make, and how those changes fit into the project's main goals.
Overview
Link-Analysis is a project for mining Wikipedia for relationships between people, locations, events, and organizations. Our research uses several open-source applications: XOWA to download and retrieve articles from Wikipedia, Stanford CoreNLP to analyze the text of those articles, and Neo4j to store the data extracted from the analyzed text.
Two of our applications, Six Degree Search and Element Scraper, retrieve Wikipedia articles and split them into paragraphs, which are then analyzed by CoreNLP. We are currently building applications that parse CoreNLP's output files and prepare CSV files for importing the data into Neo4j.
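The paragraph-splitting step can be sketched as follows. This is a generic illustration, not the actual Six Degree Search or Element Scraper code; the function name and the blank-line heuristic are assumptions.

```python
import re

def split_paragraphs(article_text):
    """Split article text on blank lines, dropping empty chunks,
    so each paragraph can be handed to CoreNLP as one unit.
    (Hypothetical helper; the real applications may use other rules.)"""
    return [p.strip() for p in re.split(r"\n\s*\n", article_text) if p.strip()]

article = "First paragraph.\n\nSecond paragraph,\nstill the same one.\n\n\nThird."
print(split_paragraphs(article))
# → ['First paragraph.', 'Second paragraph,\nstill the same one.', 'Third.']
```

Splitting on blank lines matches how Wikipedia article text typically separates paragraphs, and keeping each paragraph as a single unit keeps CoreNLP's sentence and coreference analysis within a coherent context.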
CoreNLP produces an XML file for each text file it analyzes; we run these XML files through our program XML-to-CSV. This program parses the data provided by CoreNLP and formats portions of it into CSV files, which can then be used to import nodes and relationships into Neo4j. Currently, the program can parse and combine tokens that the CoreNLP NER annotator flagged as named entities (i.e., proper nouns classified as PERSON, ORGANIZATION, or LOCATION). From these flagged token groups we build three separate CSV files, one per entity classification, formatted for import as nodes in Neo4j. We are working on functions that build relationships between these named entities.
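The token-combining step above can be sketched like this: walk each sentence's tokens, merge consecutive tokens that share a NER label into one entity name, and write one CSV per classification. This is a minimal illustration, not the XML-to-CSV source; the element names (`sentence`, `token`, `word`, `NER`) follow CoreNLP's XML output with the NER annotator enabled, and the tiny inline sample stands in for a real output file.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Abbreviated stand-in for a CoreNLP XML output file (assumed structure).
SAMPLE = """<root><document><sentences><sentence id="1"><tokens>
<token id="1"><word>Barack</word><NER>PERSON</NER></token>
<token id="2"><word>Obama</word><NER>PERSON</NER></token>
<token id="3"><word>visited</word><NER>O</NER></token>
<token id="4"><word>Chicago</word><NER>LOCATION</NER></token>
</tokens></sentence></sentences></document></root>"""

WANTED = {"PERSON", "ORGANIZATION", "LOCATION"}

def extract_entities(xml_text):
    """Merge runs of consecutive tokens with the same NER label
    into multi-word entity names, grouped by classification."""
    entities = {label: [] for label in WANTED}
    for sentence in ET.fromstring(xml_text).iter("sentence"):
        words, current = [], None
        for token in sentence.iter("token"):
            label = token.findtext("NER")
            if label == current and label in WANTED:
                words.append(token.findtext("word"))  # extend the run
            else:
                if current in WANTED:                 # close the previous run
                    entities[current].append(" ".join(words))
                words, current = [token.findtext("word")], label
        if current in WANTED:                         # flush at sentence end
            entities[current].append(" ".join(words))
    return entities

def entity_csvs(entities):
    """One CSV string per classification, with a header row for Neo4j import."""
    out = {}
    for label, names in entities.items():
        buf = io.StringIO()
        writer = csv.writer(buf)
        writer.writerow(["name"])
        for name in sorted(set(names)):
            writer.writerow([name])
        out[label] = buf.getvalue()
    return out

entities = extract_entities(SAMPLE)
print(entities["PERSON"])    # → ['Barack Obama']
print(entities["LOCATION"])  # → ['Chicago']
```

Merging adjacent same-label tokens is what turns "Barack" + "Obama" into a single PERSON node rather than two, and deduplicating names before writing the CSV avoids creating duplicate nodes on import.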