Installation Instructions - wtepfenhart/Link-Analysis GitHub Wiki
NOTE: THIS WIKI AND THE CODING PROJECT IT DESCRIBES ARE STILL UNDER DEVELOPMENT.
This project requires three open-source applications to run: XOWA, Stanford CoreNLP, and Neo4j. All three applications are available for Windows, Linux, and Mac OS X, although the authors of this project use the Linux-compatible distributions and Bash scripts for certain tasks. Finally, Java 8+ is required to run both our applications and the open-source software. Links to the download pages and Git repositories are provided below:
Clone this repository into a new folder (we called ours Research, so we will refer to any content from our repository as Research/Link-Analysis). For ease of use, we will assume that all required software is also saved into the Research folder in appropriate sub-directories. For example, we assume that all content from the XOWA download is in the folder Research/XOWA, content from CoreNLP is in the folder Research/CoreNLP, and so on.
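As a rough sketch, the setup might look like the following (the folder layout is only our convention, and the clone URL assumes the main wtepfenhart/Link-Analysis repository; adjust paths to your own setup):

# Create the Research folder and clone the Link-Analysis repository into it
mkdir -p Research && cd Research
git clone https://github.com/wtepfenhart/Link-Analysis.git
# Place the other tools in sibling sub-directories, e.g.:
#   Research/XOWA     <- contents of the XOWA download
#   Research/CoreNLP  <- contents of the Stanford CoreNLP download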
After you have installed all the required software, follow the instructions to download the English Wikipedia through the XOWA app. We only require the articles to be downloaded, not the images.
Once the English Wikipedia is installed, run XOWA in server mode. This is done with the following command:
java -jar xowa_linux_64.jar --app_mode http_server
XOWA can be launched in server mode in Mac and Windows environments as well, although the name of the JAR file the user runs should be changed accordingly. Additionally, server mode can be launched on a different HTTP port by adding an additional tag:
java -jar xowa_linux_64.jar --app_mode http_server --http_server_port 8008
While XOWA server mode is running, launch a new terminal window and navigate to Research/Link-Analysis. Create a new text file with a comma-separated list of people, places, organizations, and events that you would like to retrieve Wikipedia articles for. Save the file as input.txt and run the following command:
java -jar SixDegSearch/SixDeg_10_5_2018.jar --file input.txt
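For example (the entries below are purely illustrative), input.txt could contain a single comma-separated line such as:

Abraham Lincoln, Gettysburg, Union Army, Battle of Gettysburg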
Note: there are no default input or output file names. If you do not specify the input with the --file tag, you will be prompted for the input file on the command line. Similarly, if you do not specify the output with the --output tag, you will be prompted for the path of the output file. If the path to the output file does not currently exist, it will be created. We recommend calling this output file search_results.txt.
By default, our program listens on localhost port 8080; a user can change this with the --http_server_port tag, as shown above with xowa_linux_64.jar. You can also specify a different wiki to search with the --wiki tag, assuming you have that wiki downloaded through XOWA; our program currently only works for articles in English. The output file of the Six Degree Search will be used as the input file for the next program, the Element Scraper.
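Putting the tags together, a full invocation with our suggested file names and the alternate port from above might look like this (the exact combination is only an illustration; use whichever tags you need):

java -jar SixDegSearch/SixDeg_10_5_2018.jar --file input.txt --output search_results.txt --http_server_port 8008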
Using the Element Scraper is simple. Make sure you still have XOWA running in server mode, and simply type the following into a command line:
java -jar ElementScraper/ElementScraper.jar
The program will then prompt you for two inputs. The first input should be a search_results.txt file. The second is a directory; following the convention described above, we name this directory Research/Workspace. If it does not already exist, our program will create it.
The Workspace folder features two major sub-directories: Workspace/Lists and Workspace/Paragraphs. The Lists directory is initially empty, but will later be populated with a script. The Paragraphs directory will contain a directory for each article that was searched. Each of these directories will have two sub-directories: <article>/text_paragraphs and <article>/xml_paragraphs, where <article> is an arbitrary article selected in the Six Degree Search.
The Element Scraper will generate text paragraphs from the Wikipedia articles that were selected by the Six Degree Search and save them, article by article, in the corresponding <article>/text_paragraphs folder.
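For a hypothetical article named <article>, the resulting layout looks roughly like this:

Research/Workspace/
    Lists/                    (initially empty; populated later by a script)
    Paragraphs/
        <article>/
            text_paragraphs/  (plain-text paragraphs from the Element Scraper)
            xml_paragraphs/   (CoreNLP XML output, added in the next step)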
The next step is to analyze each of the text files with CoreNLP. We have a Bash script written to easily examine each text file for every article in the Research/Workspace directory and place the XML output in the corresponding <article>/xml_paragraphs folder.
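Our script lives in the repository; as a minimal sketch of the idea (assuming CoreNLP was unpacked into Research/CoreNLP and that one XML file per text paragraph is wanted), the loop might look like this:

#!/bin/bash
# For every article folder, run CoreNLP on each scraped paragraph and
# write the XML output into that article's xml_paragraphs directory.
for article in Research/Workspace/Paragraphs/*/ ; do
    for txt in "$article"text_paragraphs/*.txt ; do
        java -cp "Research/CoreNLP/*" edu.stanford.nlp.pipeline.StanfordCoreNLP \
            -annotators tokenize,ssplit,pos,lemma,ner \
            -file "$txt" \
            -outputFormat xml \
            -outputDirectory "$article"xml_paragraphs
    done
done

Starting a new JVM per file is slow; CoreNLP's -filelist option can batch many files in one run if that becomes a problem.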
A possible link from CoreNLP to the Neo4j graph database can be established using the GraphAware framework. GraphAware provides runtime environments for deployment as well as several modules that enhance the Neo4j experience. The following GraphAware products are necessary to build the pipeline. These products are .jar files that will be used as plugins:
nlp: graphaware-nlp-3.4.9.52.15.jar
nlp-stanfordnlp: nlp-stanfordnlp-3.4.9.52.15.jar
framework-server-community: graphaware-server-community-all-3.4.9.52.jar
You will also need the models jar file obtained from the CoreNLP site: https://stanfordnlp.github.io/CoreNLP/#download
stanford-english-corenlp-2018-10-05-models.jar
Once you have downloaded Neo4j Desktop, open the application and create a new project. After giving the project a name, create a new local graph, give it a name and a password, and set the version number to 3.4.9.
Once the graph is created, we need to configure the settings to be compatible with the .jar files we previously downloaded. Select 'Manage' on the graph you just created and navigate to the 'Settings' tab. This is essentially the configuration file for your database. Add the following lines to the configuration file:
dbms.unmanaged_extension_classes=com.graphaware.server=/graphaware
com.graphaware.runtime.enabled=true
com.graphaware.module.NLP.1=com.graphaware.nlp.module.NLPBootstrapper
dbms.security.procedures.whitelist=ga.nlp.*
Also, you will have to increase the heap size as you increase the number of nodes in the graph. Find the line in the configuration file that looks like:
dbms.memory.heap.max_size
We set ours to '3GB' for now.
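For example, with the value mentioned above, the edited line would look like this (adjust the size to your machine and graph):

dbms.memory.heap.max_size=3GB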
Next you have to put the four previously mentioned .jar files into the plugins folder for your database. Under the project name, select the 'Open Folder' dropdown menu and select 'Plugins.' Copy the four .jar files in and close the directory. Note: make sure you have read and write privileges on all files.
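If you prefer the command line, the copy can be sketched roughly as follows (the source and destination paths are assumptions; substitute wherever you saved the jars and your database's actual plugins folder):

cp graphaware-nlp-3.4.9.52.15.jar \
   nlp-stanfordnlp-3.4.9.52.15.jar \
   graphaware-server-community-all-3.4.9.52.jar \
   stanford-english-corenlp-2018-10-05-models.jar \
   /path/to/your/database/plugins/
chmod u+rw /path/to/your/database/plugins/*.jar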
To run the graph, select the 'Run' button underneath your project name. Once your graph is running, select the 'Open Browser' button to begin interacting with your database.
A schema is not required for Neo4j databases; however, we will implement our own here. We will add some constraints in order to make sure the incoming data contains no duplicates. Therefore, we will make constraints on the annotated text, tags, and sentences such that they are unique. This asserts that no duplicate "articles" will be uploaded into the database. Next, we will create an index on tags so that searching by tag (a word in an article) will be more efficient. Below are the Cypher statements that create this schema:
CREATE CONSTRAINT ON (n:AnnotatedText) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:Tag) ASSERT n.id IS UNIQUE;
CREATE CONSTRAINT ON (n:Sentence) ASSERT n.id IS UNIQUE;
CREATE INDEX ON :Tag(value);
To get a quick overview of the capabilities, enter the following Cypher command to view the documentation:
CALL dbms.procedures() YIELD name, signature, description
WHERE name =~ 'ga.nlp.*'
RETURN name, signature, description ORDER BY name asc;