Introduction - wtepfenhart/Link-Analysis GitHub Wiki
Project Overview
This is a program that attempts to extract new knowledge from a large collection of text. Imagine that information is spread across many different pages of a book. On one page in the middle of the book, there is a statement that a man was carrying an umbrella to work on a cloudy day. Later in the book, there's a discussion about clouds threatening rain. Then it rains. On another page, there's a snippet about a guy running from one overhang to the next in an unsuccessful attempt to remain dry. He gets soaked. At the same time, our umbrella-carrying man is walking down the rainy street with his umbrella open and is staying dry. The story ends when the guy who didn't have an umbrella becomes ill as a result of exposure to cold and damp.
There's a lot of knowledge buried in that story. If you don't use an umbrella in the rain, you will get wet. If you get wet, you will get sick. If you use an umbrella in the rain, you will stay dry. If you stay dry, you will remain healthy. Conclusion: on very cloudy mornings, take an umbrella to work. It seems simple, but the keys to this logical argument can be distributed across hundreds of pages. The trick is to find them, recognize what you've found, and then extract new knowledge from them.
Our project attempts to extract data from multiple pages and build a connected knowledge base from it. Achieving this goal requires several steps, which are described briefly below to give anyone attempting to use our program an idea of all the moving parts.
XOWA
XOWA is an open-source application that lets a user download Wikipedia for offline use. Once a wiki has been downloaded, XOWA can be run in "HTTP server mode," which allows other programs to query it over HTTP. Our program, the Six Degree Search, takes advantage of this functionality.
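For example, a downloaded article can be fetched from the local XOWA server with a plain HTTP request. The sketch below is illustrative only: the port (8080) and the `en.wikipedia.org/wiki/<Title>` path layout are assumptions about a default XOWA setup, so adjust both to match your own configuration.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class XowaFetchExample {
    // Assumed defaults: XOWA's HTTP server listening on localhost:8080 and
    // serving English Wikipedia pages under /en.wikipedia.org/wiki/<Title>.
    private static final String XOWA_BASE = "http://localhost:8080/en.wikipedia.org/wiki/";

    public static String fetchArticle(String title) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create(XOWA_BASE + title.replace(' ', '_')))
                .GET()
                .build();
        HttpResponse<String> response =
                client.send(request, HttpResponse.BodyHandlers.ofString());
        return response.body(); // raw HTML of the rendered article
    }

    public static void main(String[] args) throws Exception {
        System.out.println(fetchArticle("Umbrella").length() + " characters retrieved");
    }
}
```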
Six Degree Search
The Six Degree Search acts as an enhanced search bar for Wikipedia articles. A user supplies a text file that is simply a comma-separated list of people, places, organizations, and events they want to incorporate into a knowledge base. The program uses this file to query the entirety of Wikipedia for any articles that contain those keywords. Once all the articles for a set of keywords have been retrieved, the user can select which ones to keep and which ones to discard. The Six Degree Search outputs a list of article addresses that will be used by the Element Scraper.
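As a rough illustration of the input side, the sketch below reads a comma-separated keyword file of the kind described above. The file name and the trimming of whitespace around each keyword are assumptions for the example, not a description of the actual implementation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class KeywordFileExample {
    // Reads a keyword file of the form used by the Six Degree Search:
    // a comma-separated list of people, places, organizations, and events.
    public static List<String> readKeywords(Path file) throws IOException {
        List<String> keywords = new ArrayList<>();
        for (String line : Files.readAllLines(file)) {
            for (String field : line.split(",")) {
                String keyword = field.trim();
                if (!keyword.isEmpty()) {
                    keywords.add(keyword);
                }
            }
        }
        return keywords;
    }

    public static void main(String[] args) throws IOException {
        // "keywords.txt" is a placeholder name for the user-supplied file.
        for (String kw : readKeywords(Path.of("keywords.txt"))) {
            System.out.println("Would query Wikipedia for articles mentioning: " + kw);
        }
    }
}
```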
Element Scraper
Using the addresses retrieved by the Six Degree Search, the Element Scraper obtains HTML documents from XOWA. These documents contain a lot of information we wish to discard (e.g. links to technical pages, contents tables, and search bars). The Element Scraper removes all of this extraneous data and keeps only what we need: primarily the main text of the document. Additionally, the text is separated into paragraphs, which will be analyzed by Stanford CoreNLP.
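A minimal sketch of this kind of cleanup is shown below. It uses the jsoup HTML parser, which is an assumption on our part, and the CSS selectors for the elements to discard are placeholders; the real Element Scraper may identify the main text differently.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.ArrayList;
import java.util.List;

public class ParagraphScraperExample {
    // Strips tables, contents boxes, and other page chrome from a XOWA article
    // and keeps only the paragraph text. The selectors below are illustrative.
    public static List<String> extractParagraphs(String html) {
        Document doc = Jsoup.parse(html);
        // Drop obvious non-content regions before collecting paragraphs.
        doc.select("table, #toc, .navbox, .infobox, script, style").remove();
        List<String> paragraphs = new ArrayList<>();
        for (Element p : doc.select("p")) {
            String text = p.text().trim();
            if (!text.isEmpty()) {
                paragraphs.add(text);
            }
        }
        return paragraphs;
    }
}
```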
Stanford CoreNLP
We use this open-source tool to analyze the text of the Wikipedia articles we have retrieved. We chose to have CoreNLP output its annotations in XML format. The data we extract centers on named entities (i.e., proper nouns) and the relationships among them.
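The snippet below sketches how named entities can be pulled out of a CoreNLP XML annotation file using the standard Java DOM parser. The element names (`token`, `word`, `NER`) match the XML we have seen from CoreNLP's XML outputter with the `ner` annotator enabled; if your CoreNLP version emits a different structure, adjust accordingly.

```java
import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class NerXmlExample {
    // Walks a CoreNLP XML annotation file and prints each token carrying a
    // named-entity label. Assumes the ner annotator was run, so every <token>
    // has <word> and <NER> children.
    public static void printNamedEntities(File xmlFile) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(xmlFile);
        NodeList tokens = doc.getElementsByTagName("token");
        for (int i = 0; i < tokens.getLength(); i++) {
            Element token = (Element) tokens.item(i);
            String word = token.getElementsByTagName("word").item(0).getTextContent();
            String ner = token.getElementsByTagName("NER").item(0).getTextContent();
            if (!"O".equals(ner)) { // "O" marks tokens that are not named entities
                System.out.println(word + " -> " + ner);
            }
        }
    }
}
```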
Importing the Data
We are currently writing routines that use the CoreNLP XML output to import entities and relationships into a Neo4j database.
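A hypothetical import step might look like the sketch below, which uses the official Neo4j Java driver (4.x API) to merge two entity nodes and a relationship between them. The bolt URL, credentials, node label, and relationship type are all placeholders, not the project's actual schema.

```java
import org.neo4j.driver.AuthTokens;
import org.neo4j.driver.Driver;
import org.neo4j.driver.GraphDatabase;
import org.neo4j.driver.Session;

import static org.neo4j.driver.Values.parameters;

public class Neo4jImportExample {
    // Merges two entity nodes and a relationship between them so repeated
    // imports do not create duplicates. Connection details are placeholders.
    public static void linkEntities(String source, String target, String relation) {
        try (Driver driver = GraphDatabase.driver(
                "bolt://localhost:7687", AuthTokens.basic("neo4j", "password"));
             Session session = driver.session()) {
            session.run(
                "MERGE (a:Entity {name: $source}) " +
                "MERGE (b:Entity {name: $target}) " +
                "MERGE (a)-[r:RELATED_TO {type: $relation}]->(b)",
                parameters("source", source, "target", target, "relation", relation));
        }
    }

    public static void main(String[] args) {
        linkEntities("Umbrella Man", "Rainy Street", "walked_down");
    }
}
```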
Instructions
Once you've figured out what the program is supposed to do and you're still convinced that you want to use it, you can read the page on installing it.