Installation Instructions - wtepfenhart/Link-Analysis GitHub Wiki

NOTE THIS WIKI AND THE CODING PROJECT IT DESCRIBES ARE STILL UNDER DEVELOPMENT

Requirements

This project requires three open-source applications to run: XOWA, Stanford CoreNLP, and Neo4j. All three are available for Windows, Linux, and Mac OS X, although the authors of this project use the Linux-compatible distributions and Bash scripts for certain tasks. Finally, Java 8+ is required to run both our applications and the open-source software. Links to the download pages and Git repositories are provided below:

XOWA

Download

Git Repository

CoreNLP

Download

Git Repository

Neo4j

Download

Git Repository

Installation

Clone this repository into a new folder (we called ours Research, so we will refer to any content from our repository as Research/Link-Analysis). For ease of use, we will assume that all required software is also saved into the Research folder in appropriate sub-directories. For example, we assume that all content from the XOWA download is in the folder Research/XOWA, content from CoreNLP is in the folder Research/CoreNLP, and so on.
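For example, assuming the repository is cloned over HTTPS, the initial setup might look like this:

mkdir Research
cd Research
git clone https://github.com/wtepfenhart/Link-Analysis.git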

After you have installed all the required software, follow the instructions to download the English Wikipedia through the XOWA app. Only the articles need to be downloaded, not the images.

Once the English Wikipedia is installed, run XOWA in server mode. This is done with the following command:

java -jar xowa_linux_64.jar --app_mode http_server

XOWA can be launched in server mode on Mac and Windows as well, although the name of the JAR file should be changed accordingly. Server mode can also be launched on a different HTTP port by adding an additional flag:

java -jar xowa_linux_64.jar --app_mode http_server --http_server_port 8008
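To confirm that the server is accepting connections, you can, for example, request the root page with curl (assuming the default port 8080):

curl -s http://localhost:8080/ | head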

While the XOWA server is running, launch a new terminal window and navigate to Research/Link-Analysis. Create a new text file containing a comma-separated list of the people, places, organizations, and events that you would like to retrieve Wikipedia articles for. Save the file as input.txt and run the following command:

java -jar SixDegSearch/SixDeg_10_5_2018.jar --file input.txt

Note: there are no default input or output file names. If you do not specify the input with the --file flag, you will be prompted for the input file on the command line. Similarly, if you do not specify the output with the --output flag, you will be prompted for the path of the output file. If the output file does not already exist, it will be created. We recommend calling this output file search_results.txt.
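For example, input.txt might contain a short list such as the following (the entries are purely illustrative):

Abraham Lincoln, Gettysburg, American Civil War

A full invocation that specifies both files (assuming --output takes the output path directly) would then be:

java -jar SixDegSearch/SixDeg_10_5_2018.jar --file input.txt --output search_results.txt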

By default, our program expects the XOWA server on localhost port 8080; a user can change this with the --http_server_port flag, as shown above for xowa_linux_64.jar. You can also specify a different wiki to search with the --wiki flag, assuming that you have downloaded that wiki through XOWA. Our program currently only works for articles in English. The output file of the Six Degree Search will be used as the input file for the next program, the Element Scraper.
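For example, if XOWA was started on port 8008 as shown earlier, the search would be pointed at that port like this:

java -jar SixDegSearch/SixDeg_10_5_2018.jar --file input.txt --output search_results.txt --http_server_port 8008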

Using the Element Scraper is simple. Make sure XOWA is still running in server mode and type the following at the command line:

java -jar ElementScraper/ElementScraper.jar

The program will then prompt you for two inputs. The first is the search_results.txt file. The second is a directory. Following the convention described above, we name this directory Research/Workspace; if it does not already exist, our program will create it.

The Workspace folder features two major sub-directories: Workspace/Lists and Workspace/Paragraphs. The Lists directory is initially empty, but will later be populated with a script. The Paragraphs directory will contain a directory for each article that was searched. Each of these directories will have two sub-directories: <article>/text_paragraphs and <article>/xml_paragraphs, where <article> is an arbitrary article selected in the Six Degree Search.

The Element Scraper will generate text paragraphs from the Wikipedia articles that were selected by the Six Degree Search and save them in each article's <article>/text_paragraphs folder.
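Concretely, assuming an article named George_Washington was selected by the search, the Workspace layout would look something like this (the individual file names are illustrative):

Research/Workspace/
    Lists/                   (initially empty; populated later by a script)
    Paragraphs/
        George_Washington/
            text_paragraphs/     (plain-text paragraphs from the Element Scraper)
                paragraph_01.txt
                ...
            xml_paragraphs/      (XML output, populated in the CoreNLP step below)
                paragraph_01.txt.xml
                ...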

The next step is to analyze each of the text files with CoreNLP. We have written a Bash script that runs CoreNLP over every text file for every article in the Research/Workspace directory and places the XML output in the corresponding <article>/xml_paragraphs folder.
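The script shipped with this repository should be preferred, but a minimal sketch of this step might look like the following, assuming the directory layout shown above and that the CoreNLP distribution lives in Research/CoreNLP:

#!/bin/bash
# Run Stanford CoreNLP over every paragraph file and write XML output into the
# matching xml_paragraphs folder. Paths, annotators, and heap size are assumptions;
# adjust them to your own layout.
for article in Research/Workspace/Paragraphs/*/; do
    # Collect all text paragraphs for this article into a file list
    find "${article}text_paragraphs" -type f > filelist.txt
    java -cp "Research/CoreNLP/*" -Xmx4g edu.stanford.nlp.pipeline.StanfordCoreNLP \
        -annotators tokenize,ssplit,pos,lemma,ner,parse \
        -filelist filelist.txt \
        -outputFormat xml \
        -outputDirectory "${article}xml_paragraphs"
done
rm -f filelist.txt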

Stanford CoreNLP & Neo4j Setup

Prerequisites

A link between CoreNLP and the Neo4j graph database can be established using the GraphAware framework. GraphAware provides runtime environments for deployment as well as several modules that enhance the Neo4j experience. The following GraphAware products are necessary to build the pipeline. These products are .jar files that will be used as plugins:

nlp: graphaware-nlp-3.4.9.52.15.jar
nlp-stanfordnlp: nlp-stanfordnlp-3.4.9.52.15.jar
framework-server-community: graphaware-server-community-all-3.4.9.52.jar

You will also need the English models jar file obtained from the CoreNLP download page: https://stanfordnlp.github.io/CoreNLP/#download

stanford-english-corenlp-2018-10-05-models.jar

Setting up the Graph Database

Once you have downloaded Neo4j Desktop, open the application and create a new project. After naming the project, create a new local graph, give it a name and a password, and set the version number to 3.4.9.

Once the graph is created, we need to configure its settings to be compatible with the .jar files we previously downloaded. Select 'Manage' on the graph you just created and navigate to the 'Settings' tab. This is essentially the configuration file for your database. Add the following lines to the configuration file:

dbms.unmanaged_extension_classes=com.graphaware.server=/graphaware
com.graphaware.runtime.enabled=true
com.graphaware.module.NLP.1=com.graphaware.nlp.module.NLPBootstrapper
dbms.security.procedures.whitelist=ga.nlp.*

You will also have to increase the heap size as the number of nodes in the graph grows. Find the line in the configuration file that looks like:

dbms.memory.heap.max_size

We set ours to 3 GB for now.
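For example, the setting would then read something like the following (Neo4j expects a size suffix such as m or g):

dbms.memory.heap.max_size=3g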

Next, copy the four previously mentioned .jar files into the plugins folder for your database. Under the project name, select the 'Open Folder' dropdown menu and choose 'Plugins.' Copy the four .jar files into this folder and close the directory. Note: make sure you have read and write privileges on all the files.
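From the folder containing the downloaded .jar files, the copy might look like this (the destination path is illustrative; use the plugins folder that Neo4j Desktop opens for you):

cp graphaware-nlp-3.4.9.52.15.jar \
   nlp-stanfordnlp-3.4.9.52.15.jar \
   graphaware-server-community-all-3.4.9.52.jar \
   stanford-english-corenlp-2018-10-05-models.jar \
   /path/to/your/graph/plugins/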

To run the graph, select the 'Run' button underneath your project name. Once your graph is running, select the 'Open Browser' button to begin interacting with your database.

Create Schema

A schema is not required for Neo4j databases; however, we will implement one here. We will add some constraints to make sure the incoming data contains no duplicates. Specifically, we will constrain the annotated text, tags, and sentences to be unique, which ensures that no duplicate "articles" are uploaded into the database. Next, we will create an index on tags so that searching by tag (a word in an article) is more efficient. Below are the Cypher statements that create this schema:


CREATE CONSTRAINT ON (n:AnnotatedText) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:Tag) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:Sentence) ASSERT n.id IS UNIQUE;
CREATE INDEX ON :Tag(value);

Documentation

To get a quick overview of the available capabilities, enter the following Cypher command to view the procedure documentation:
CALL dbms.procedures() YIELD name, signature, description
WHERE name =~ 'ga.nlp.*'
RETURN name, signature, description ORDER BY name asc;
