Parser Indexer Pipeline

Overview

This page describes the following:

  • How to bootstrap the MTE system, including instructions for setting up the requirements.
  • How to use the MTE system to parse documents, including automated NER annotation using CoreNLP.
  • How to index the parsed content into a Solr index.

Requirements

The following are the requirements for the MTE parser-indexer pipeline.

Infrastructure

  • Java Development Kit (JDK) 1.8 or newer
  • Python 2.7 or 3.x
  • Apache Maven 3.3 or newer
  • git
  • python pip

Note: Setup instructions for the above requirements are not included here; installation is expected to be handled by your infrastructure/sys-admin team.

Supporting modules

  • parser-indexer-py (the parsing and indexing scripts)
  • Apache Solr
  • Apache Tika (the parser server)
  • Grobid
  • Stanford CoreNLP
  • jSRE

Note: Setup instructions for these modules are included on this page.

Required TCP/IP ports

  • 8983 - Apache Solr
  • 9998 - Apache Tika (parser server)
  • 8080 - Grobid service
  • 9000 - Stanford CoreNLP

Note: These are the default ports. They can be mapped to different ports, but customization is not covered on this page.
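
To confirm that these ports are free before installing the services, you can check each one with lsof (a quick sketch; any tool that lists listening sockets will do):

for p in 8983 9998 8080 9000; do
    lsof -i :$p    # no output means the port is free
done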

Setting up services

Infrastructure setup

Depending on your operating system, install the following:

  • Java Development Kit (JDK) 1.8 or newer
  • Python 2.7 or 3.x
  • git
  • Apache Maven 3.3 or newer
  • Python pip

Downloading the code and packages

First, create a directory for MTE in which to place all the modules:

export MTE_HOME=/proj/mte
mkdir $MTE_HOME

Get Parser-indexer-py

cd $MTE_HOME
git clone --recursive https://github.com/USCDataScience/parser-indexer-py.git
# Note: --recursive also clones all the submodules
# This project must be built; see "Building the source code" below

Get Apache Solr and create a Solr core

cd $MTE_HOME
wget http://archive.apache.org/dist/lucene/solr/6.1.0/solr-6.1.0.tgz
tar xvzf solr-6.1.0.tgz
mv solr-6.1.0 solr 
cd solr
PORT=8983
bin/solr start -p $PORT
# 'docs' core is for production
bin/solr create_core -c docs -d $MTE_HOME/parser-indexer-py/conf/solr/docs -p $PORT
# 'docsdev' core is for development
bin/solr create_core -c docsdev -d $MTE_HOME/parser-indexer-py/conf/solr/docs -p $PORT
bin/solr stop -all
# No build step is needed; Solr is ready to run
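
To verify that the cores were created, you can query Solr's core admin API while it is running (an optional check, assuming the default port):

curl "http://localhost:8983/solr/admin/cores?action=STATUS&core=docs"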

Get Stanford CoreNLP

cd $MTE_HOME
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2017-06-09.zip
unzip stanford-corenlp-full-2017-06-09.zip
mv stanford-corenlp-full-2017-06-09 stanford-corenlp-mte-3.8.0
# TODO: here we have to update the default model of CoreNLP
# No build step is needed; CoreNLP is ready to run

Then, inside the new directory:

  • Add the StanfordCoreNLP.properties file, which is customized for MTE.
  • Add the corenlpserver.sh script, which starts the server and points to the properties file.
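
Since corenlpserver.sh is specific to this deployment, the sketch below shows roughly what it contains, based on the standard CoreNLP server invocation (the memory setting is an assumption):

#!/bin/bash
# Sketch of corenlpserver.sh: start CoreNLP on port 9000 with the
# MTE-customized properties file
java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer \
    -port 9000 -serverProperties ./StanfordCoreNLP.properties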

Get jSRE

Create a directory, export its path as $JSRE_HOME, and install jSRE there.
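
For example (a sketch; the directory name is illustrative, and the download steps follow jSRE's own documentation):

export JSRE_HOME=$MTE_HOME/jsre
mkdir -p $JSRE_HOME
# download and unpack jSRE into $JSRE_HOME per its installation instructions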

Building the source code

Build Grobid module

cd $MTE_HOME/parser-indexer-py/parser-server/grobid
mvn install -DskipTests

Build Tika NER CoreNLP addon

cd $MTE_HOME/parser-indexer-py/parser-server/tika-ner-corenlp
mvn clean compile && mvn install -DskipTests

Build Parser Server

cd $MTE_HOME/parser-indexer-py/parser-server
mvn clean compile assembly:single
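# The build produces target/parser-server-1.0-SNAPSHOT-jar-with-dependencies.jar,
# which run.sh launches (see the jps listing under Maintenance below)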

Starting the services

Start Apache Solr

$MTE_HOME/solr/bin/solr start
# if Solr is already running, use the 'restart' command instead

Start Stanford CoreNLP

cd $MTE_HOME/stanford-corenlp-mte-3.8.0
nohup ./corenlpserver.sh &

Start Grobid Server

cd $MTE_HOME/parser-indexer-py/parser-server/grobid/grobid-service
nohup mvn -DskipTests jetty:run-war &

Start Parser Server

cd $MTE_HOME/parser-indexer-py/parser-server
nohup ./run.sh &

Maintenance

Check whether the services are running

Verify that the processes are active. In the listing below, start.jar is Solr, the Launcher running jetty:run-war is Grobid, StanfordCoreNLPServer is CoreNLP, and the parser-server jar is the Tika-based parser server:

[mteuser@buffalo parser-server]$ jps -ml
68178 org.codehaus.plexus.classworlds.launcher.Launcher -DskipTests jetty:run-war
68850 sun.tools.jps.Jps -ml
63915 edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties ./StanfordCoreNLP.properties
62734 start.jar --module=http
68734 /proj/mte/parser-indexer-py/parser-server/target/parser-server-1.0-SNAPSHOT-jar-with-dependencies.jar -c /proj/mte/parser-indexer-py/parser-server/src/main/resources/tika-config.xml

Check whether the services are listening on their allocated ports. Note that lsof reports the service names from /etc/services: cslistener is port 9000 (CoreNLP), webcache is 8080 (Grobid), and distinct32 is 9998 (the parser server):

[mteuser@buffalo parser-server]$ lsof -i :8080 -i :9998 -i :8983 -i :9000
COMMAND   PID    USER   FD   TYPE    DEVICE SIZE/OFF NODE NAME
java    62734 mteuser  100u  IPv4 387696260      0t0  TCP *:8983 (LISTEN)
java    63915 mteuser   21u  IPv4 387735931      0t0  TCP *:cslistener (LISTEN)
java    68178 mteuser  191u  IPv4 387769620      0t0  TCP *:webcache (LISTEN)
java    68734 mteuser   19u  IPv4 387769693      0t0  TCP localhost:distinct32 (LISTEN)

Stopping the services

Extra care must be taken when stopping Solr. Always use the command below to stop Solr gracefully.

$MTE_HOME/solr/bin/solr stop

CAUTION: Never use kill -9 on Solr; it might corrupt the index.

To stop services other than Solr, use kill -2 PID (or kill -9 PID if the process does not respond), using the PIDs from the jps listing above.

Restarting the services

To restart Solr:

$MTE_HOME/solr/bin/solr restart

For services other than Solr, refer to the stopping and starting instructions above.

Using the system

Next, create a working directory:

export MTE_WORK=$MTE_HOME/work
mkdir $MTE_WORK

Also install the Python modules below in your virtual environment.
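
If you do not yet have a virtual environment, one way to create and activate it (the path is illustrative; on Python 3, python3 -m venv works as well):

virtualenv $MTE_HOME/venv
source $MTE_HOME/venv/bin/activate

Then install the modules: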

pip install tika
pip install pycorenlp

Parse PDFs

find <pdf-dir> -name '*.pdf' > input-pdfs.list  # create a list of PDFs to be indexed
# The models are available in this git repo (hereafter referred to as $MTE_REPO)
NER_MODEL=$MTE_REPO/trained_models/ner_model_train_62r15v3_emt_gazette.ser.gz
JSRE_MODEL=$MTE_REPO/trained_models/jSRE-lpsc15-merged-binary.model
# Parse them
PARSER=$MTE_HOME/parser-indexer-py/src/parserindexer/parse_all.py
python $PARSER -n $NER_MODEL -j $JSRE_HOME -m $JSRE_MODEL -li input-pdfs.list -o out.jl

The above parser works only if all of the services (Tika server, Grobid, CoreNLP) are running as expected. It sends PDFs to Tika and Grobid for extraction, then uses CoreNLP to detect named entities per the specified model $NER_MODEL and jSRE to identify relations using $JSRE_MODEL (passing your jSRE installation directory, $JSRE_HOME, via -j).
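
Before a long parsing run, it can help to confirm that each service responds (a quick sanity check; the exact endpoints are assumptions based on the services' defaults):

curl -s http://localhost:9998/tika     # Tika server greeting
curl -s -o /dev/null -w "CoreNLP: %{http_code}\n" http://localhost:9000/
curl -s -o /dev/null -w "Grobid: %{http_code}\n" http://localhost:8080/

The full description of the arguments to the script is as follows: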

[mteuser@buffalo work]$ python $MTE_HOME/parser-indexer-py/src/parserindexer/parse_all.py -h
usage: ParseAll [-h] [-v] (-i IN | -li LIST) -o OUT [-p TIKA_URL]
                [-c CORENLP_URL] [-n NER_MODEL] -j JSRE -m JSRE_MODEL

This tool can parse files.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -i IN, --in IN        Path to Input File. (default: None)
  -li LIST, --list LIST
                        Path to a text file which contains list of input file
                        paths (default: None)
  -o OUT, --out OUT     Path to output file. (default: None)
  -p TIKA_URL, --tika-url TIKA_URL
                        URL of Tika Server. (default: None)
  -c CORENLP_URL, --corenlp-url CORENLP_URL
                        CoreNLP Server URL (default: http://localhost:9000)
  -n NER_MODEL, --ner-model NER_MODEL
                        Path (on Server side) to NER model (default: None)
  -j JSRE, --jsre JSRE  Path to jSRE installation directory. (default: None)
  -m JSRE_MODEL, --jsre-model JSRE_MODEL
                        Base path to jSRE models. (default: None)

Index to Solr

Once out.jl has been obtained from the step above, index it into Solr using the indexer.py script.

Note: use SOLR_URL=http://localhost:8983/solr/docs for production and SOLR_URL=http://localhost:8983/solr/docsdev for development.

Example:

SOLR_URL=http://localhost:8983/solr/docsdev
python $MTE_HOME/parser-indexer-py/src/parserindexer/indexer.py -i out.jl -s $SOLR_URL

Note:

  • The indexer.py script uses the file path to detect the venue and document ID. This does not work for other venues; it should be updated to populate venue, year, and URL for newer journals.
  • The pattern it looks for in the path is lpsc, followed by the year number, followed by the LPSC abstract ID. Example: lpsc-14/1024.pdf

Description of arguments to indexer.py

[mteuser@buffalo work]$ python $MTE_HOME/parser-indexer-py/src/parserindexer/indexer.py -h
usage: indexer.py [-h] [-v] -i IN [-s SOLR_URL] [-sc SCHEMA]

This tool can read JSON line dump and index to solr.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -i IN, --in IN        Path to Input JSON line file. (default: None)
  -s SOLR_URL, --solr-url SOLR_URL
                        URL of Solr core. (default:
                        http://localhost:8983/solr/docs)
  -sc SCHEMA, --schema SCHEMA
                        Schema Mapping to be used. Options: ['journal',
                        'basic'] (default: journal)

Quick index refresh

Delete the docs (WARNING! This is deliberately NOT a clickable link, because it is destructive. Be SURE that this is what you want to do before sending this request.)

http://localhost:8983/solr/docs/update?stream.body=<delete><query>*:*</query></delete>&commit=true
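
The same request can be sent from the command line with curl (the same warning applies):

curl "http://localhost:8983/solr/docs/update?commit=true" \
    -H "Content-Type: text/xml" \
    --data-binary "<delete><query>*:*</query></delete>"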

Then index again:

python $MTE_HOME/parser-indexer-py/src/parserindexer/indexer.py \
   -s http://localhost:8983/solr/docs \
   -i out.jl