Parser Indexer Pipeline - wkiri/MTE GitHub Wiki
This page describes the following:
- How to bootstrap the MTE system, including instructions for setting up the requirements.
- How to use the MTE system to parse documents, including automated NER annotation using CoreNLP.
- How to index the parsed content into a Solr index.
The following are the requirements for the MTE parser indexer pipeline.
Infrastructure
- Java Development Kit (JDK) 1.8 or newer
- Python 2.7 or 3.x
- Apache Maven 3.3 or newer
- git
- python pip
Note: Setup instructions for the above requirements are not included here; these tools are expected to be installed by the infra/sys-admin team.
Supporting modules
- Apache Solr 6.x or newer
- Apache Tika 1.13 or newer
- Grobid Parser
- Stanford CoreNLP
- Tika-CoreNLP NER addon
- jSRE for relation extraction
- Parser-indexer-py
Note: The setup instructions for the above are included on this page.
Required TCP/IP ports
- 8983 - Apache Solr
- 9998 - Apache Tika
- 8080 - Grobid service
- 9000 - Stanford CoreNLP
Note: These are the default ports. Although it is possible to map them to different ports, this page does not cover how to customize them.
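Before installing anything, you may want to confirm that these default ports are free on the target host. A minimal sketch using only the Python standard library (not part of the MTE codebase):

```python
import socket

# Default ports used by the MTE services (see the list above)
DEFAULT_PORTS = {
    8983: "Apache Solr",
    9998: "Apache Tika",
    8080: "Grobid service",
    9000: "Stanford CoreNLP",
}

def port_in_use(port, host="localhost"):
    """Return True if something is already listening on host:port."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.settimeout(1.0)
    try:
        return s.connect_ex((host, port)) == 0
    finally:
        s.close()

if __name__ == "__main__":
    for port, service in sorted(DEFAULT_PORTS.items()):
        status = "IN USE" if port_in_use(port) else "free"
        print("%5d  %-20s %s" % (port, service, status))
```

A port reported as "IN USE" before you have started any MTE service means another process holds it, and the corresponding service will fail to bind.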
Depending on your operating system, install:
- Java Development Kit (JDK) 1.8 or newer
- Python 2.7 or 3.x
- git
- Apache Maven 3.3 or newer
- Python pip
Let us create a directory for MTE in which to place all the modules:
export MTE_HOME=/proj/mte
mkdir $MTE_HOME
Get Parser-indexer-py
cd $MTE_HOME
git clone --recursive https://github.com/USCDataScience/parser-indexer-py.git
# Note: --recursive will clone all the sub modules
# This project must be built; the build steps are shown in a later section
Get Apache Solr and create a Solr core
cd $MTE_HOME
wget http://archive.apache.org/dist/lucene/solr/6.1.0/solr-6.1.0.tgz
tar xvzf solr-6.1.0.tgz
mv solr-6.1.0 solr
cd solr
PORT=8983
bin/solr start -p $PORT
# 'docs' core is for production
bin/solr create_core -c docs -d $MTE_HOME/parser-indexer-py/conf/solr/docs -p $PORT
# 'docsdev' core is for development
bin/solr create_core -c docsdev -d $MTE_HOME/parser-indexer-py/conf/solr/docs -p $PORT
bin/solr stop -all
# We don't have to build this; it is ready to run
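To confirm that both cores were created, one option is Solr's CoreAdmin STATUS API. A small sketch that builds the request URL (the URL layout assumes the default Solr 6 admin API; the example call requires a running Solr):

```python
SOLR_BASE = "http://localhost:8983/solr"

def core_status_url(core, base=SOLR_BASE):
    """Build a CoreAdmin STATUS request URL for the given core."""
    return "%s/admin/cores?action=STATUS&core=%s&wt=json" % (base, core)

# Example (requires a running Solr; Python 3 shown):
# from urllib.request import urlopen
# print(urlopen(core_status_url("docs")).read())
# print(urlopen(core_status_url("docsdev")).read())
```

A core that exists appears under the "status" key of the JSON response; a missing core returns an empty entry there.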
Get Stanford CoreNLP
cd $MTE_HOME
wget http://nlp.stanford.edu/software/stanford-corenlp-full-2017-06-09.zip
unzip stanford-corenlp-full-2017-06-09.zip
mv stanford-corenlp-full-2017-06-09 stanford-corenlp-mte-3.8.0
# TODO: here we have to update the default model of corenlp
Add the StanfordCoreNLP.properties file, which is customized for us
Add the corenlpserver.sh file, which starts the server and points to the properties file
# We don't have to build this; it is ready to run
Get jSRE
Create a directory, export its path as $JSRE_HOME, and install jSRE there.
Build Grobid module
cd $MTE_HOME/parser-indexer-py/parser-server/grobid
mvn install -DskipTests
Build Tika NER CoreNLP addon
cd $MTE_HOME/parser-indexer-py/parser-server/tika-ner-corenlp
mvn clean compile && mvn install -DskipTests
Build Parser Server
cd $MTE_HOME/parser-indexer-py/parser-server
mvn clean compile assembly:single
Start Apache Solr
$MTE_HOME/solr/bin/solr start
# if the service is already running, use the 'restart' command
Start Stanford CoreNLP
cd $MTE_HOME/stanford-corenlp-mte-3.8.0
nohup ./corenlpserver.sh &
Start Grobid Server
cd $MTE_HOME/parser-indexer-py/parser-server/grobid/grobid-service
nohup mvn -DskipTests jetty:run-war &
Start Parser Server
cd $MTE_HOME/parser-indexer-py/parser-server
nohup ./run.sh &
Check that the services are running
Check if the processes are active:
[mteuser@buffalo parser-server]$ jps -ml
68178 org.codehaus.plexus.classworlds.launcher.Launcher -DskipTests jetty:run-war
68850 sun.tools.jps.Jps -ml
63915 edu.stanford.nlp.pipeline.StanfordCoreNLPServer -serverProperties ./StanfordCoreNLP.properties
62734 start.jar --module=http
68734 /proj/mte/parser-indexer-py/parser-server/target/parser-server-1.0-SNAPSHOT-jar-with-dependencies.jar -c /proj/mte/parser-indexer-py/parser-server/src/main/resources/tika-config.xml
Check if the services are listening on their allocated ports:
[mteuser@buffalo parser-server]$ lsof -i :8080 -i :9998 -i :8983 -i :9000
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
java 62734 mteuser 100u IPv4 387696260 0t0 TCP *:8983 (LISTEN)
java 63915 mteuser 21u IPv4 387735931 0t0 TCP *:cslistener (LISTEN)
java 68178 mteuser 191u IPv4 387769620 0t0 TCP *:webcache (LISTEN)
java 68734 mteuser 19u IPv4 387769693 0t0 TCP localhost:distinct32 (LISTEN)
Stopping the services
Take extra care when stopping Solr. Always use the command below to stop Solr gracefully:
$MTE_HOME/solr/bin/solr stop
CAUTION: Never use kill -9 on Solr; it might corrupt the index.
To stop services other than Solr, use kill -2 PID or kill -9 PID.
Restarting the services
To restart Solr:
$MTE_HOME/solr/bin/solr restart
For services other than solr, refer to stopping and starting instructions above.
Let us create a working directory:
export MTE_WORK=$MTE_HOME/work
mkdir $MTE_WORK
Also, install the following Python modules in your virtual environment:
pip install tika
pip install pycorenlp
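Before moving on, a quick stdlib-only check (a sketch, not part of the pipeline) can confirm that both modules are importable from your virtual environment:

```python
import importlib

# Python modules installed above via pip
REQUIRED_MODULES = ["tika", "pycorenlp"]

def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing

if __name__ == "__main__":
    gone = missing_modules(REQUIRED_MODULES)
    if gone:
        print("Missing modules: %s" % ", ".join(gone))
    else:
        print("All required modules are importable.")
```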
find <pdf-dir> -name '*.pdf' > input-pdfs.list # create a list of PDFs to be indexed (quote the glob so the shell does not expand it)
# The models are available in this git repo (hereafter referred to as $MTE_REPO)
NER_MODEL=$MTE_REPO/trained_models/ner_model_train_62r15v3_emt_gazette.ser.gz
JSRE_MODEL=$MTE_REPO/trained_models/jSRE-lpsc15-merged-binary.model
# Parse them.
PARSER=$MTE_HOME/parser-indexer-py/src/parserindexer/parse_all.py
python $PARSER -n $NER_MODEL -j $JSRE_HOME -m $JSRE_MODEL -li input-pdfs.list -o out.jl
The above parser works only if all of the services (Tika Server, Grobid, CoreNLP) are running as expected. It sends PDFs to Tika and Grobid for text extraction. It then uses CoreNLP to detect named entities with the specified model $NER_MODEL, and jSRE to identify relations using $JSRE_MODEL (the -j argument must point to your jSRE installation directory, $JSRE_HOME from the setup step above). The full description of the arguments to the script is as follows:
[mteuser@buffalo work]$ python $MTE_HOME/parser-indexer-py/src/parserindexer/parse_all.py -h
usage: ParseAll [-h] [-v] (-i IN | -li LIST) -o OUT [-p TIKA_URL]
[-c CORENLP_URL] [-n NER_MODEL] -j JSRE -m JSRE_MODEL
This tool can parse files.
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-i IN, --in IN Path to Input File. (default: None)
-li LIST, --list LIST
Path to a text file which contains list of input file
paths (default: None)
-o OUT, --out OUT Path to output file. (default: None)
-p TIKA_URL, --tika-url TIKA_URL
URL of Tika Server. (default: None)
-c CORENLP_URL, --corenlp-url CORENLP_URL
CoreNLP Server URL (default: http://localhost:9000)
-n NER_MODEL, --ner-model NER_MODEL
Path (on Server side) to NER model (default: None)
-j JSRE, --jsre JSRE Path to jSRE installation directory. (default: None)
-m JSRE_MODEL, --jsre-model JSRE_MODEL
Base path to jSRE models. (default: None)
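Before running the full pipeline over many PDFs, it can help to sanity-check the CoreNLP server with your NER model. The sketch below only builds the annotation properties; the ner.model key and the example model path are assumptions to adapt to your setup, and the commented pycorenlp call requires a running server on port 9000:

```python
# Build CoreNLP annotation properties for an NER request. The "ner.model"
# key and the model path in the example are illustrative; as the
# -n/--ner-model help text above notes, the path must exist on the server.
def ner_properties(ner_model=None):
    """Annotation properties for an NER request to the CoreNLP server."""
    props = {
        "annotators": "tokenize,ssplit,pos,lemma,ner",
        "outputFormat": "json",
    }
    if ner_model:
        props["ner.model"] = ner_model
    return props

# Example usage with pycorenlp (requires a running CoreNLP server):
# from pycorenlp import StanfordCoreNLP
# nlp = StanfordCoreNLP("http://localhost:9000")
# out = nlp.annotate("Curiosity landed in Gale Crater.",
#                    properties=ner_properties("/path/to/ner_model.ser.gz"))
```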
Once we have obtained out.jl from the above step, we need to index it to Solr using the indexer.py script.
Note:
- Use SOLR_URL=http://localhost:8983/solr/docs for production
- Use SOLR_URL=http://localhost:8983/solr/docsdev for development
Example:
SOLR_URL=http://localhost:8983/solr/docsdev
python $MTE_HOME/parser-indexer-py/src/parserindexer/indexer.py out.jl -s $SOLR_URL
Note:
- The indexer.py script uses the file path to detect the venue and document ID. This might not work for other venues; it should be updated to populate venue, year, and URL for newer journals.
- The pattern it looks for is lpsc, followed by the year number, followed by the LPSC ID, in the path. Example: lpsc-14/1024.pdf
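The path convention described above can be sketched as a regular expression. This is only an illustration of the lpsc-<year>/<id>.pdf convention; the exact pattern inside indexer.py may differ:

```python
import re

# Illustrative pattern for paths like 'lpsc-14/1024.pdf':
# 'lpsc', an optional hyphen, the year number, then the LPSC document ID.
LPSC_PATTERN = re.compile(r"lpsc-?(\d+)/(\d+)\.pdf$")

def parse_lpsc_path(path):
    """Return (venue, year, doc_id) extracted from an LPSC-style path."""
    m = LPSC_PATTERN.search(path)
    if not m:
        return None
    return ("lpsc", m.group(1), m.group(2))

# Example:
# parse_lpsc_path("lpsc-14/1024.pdf")  ->  ("lpsc", "14", "1024")
```

Paths that do not follow this convention yield None here, which mirrors the note above that other venues need the script to be updated.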
Description of arguments to indexer.py
[mteuser@buffalo work]$ python $MTE_HOME/parser-indexer-py/src/parserindexer/indexer.py -h
usage: indexer.py [-h] [-v] -i IN [-s SOLR_URL] [-sc SCHEMA]
This tool can read JSON line dump and index to solr.
optional arguments:
-h, --help show this help message and exit
-v, --version show program's version number and exit
-i IN, --in IN Path to Input JSON line file. (default: None)
-s SOLR_URL, --solr-url SOLR_URL
URL of Solr core. (default:
http://localhost:8983/solr/docs)
-sc SCHEMA, --schema SCHEMA
Schema Mapping to be used. Options: ['journal',
'basic'] (default: journal)
Delete the docs (WARNING! This is deliberately NOT a clickable link, because it is destructive. Be SURE that this is what you want to do before sending this request.)
http://localhost:8983/solr/docs/update?stream.body=<delete><query>*:*</query></delete>&commit=true
Then index again:
python $MTE_HOME/parser-indexer-py/src/parserindexer/indexer.py \
-s http://localhost:8983/solr/docs \
-i out.jl