Add new documents to Solr index - wkiri/MTE GitHub Wiki
Write the paths of all the PDFs to a list file:
$ find <dir> -name '*.pdf' > pdfpaths.list
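Where `find` is not available, the same list file can be produced from Python. The following is a minimal sketch, not part of the MTE tools: the function name `list_pdfs` is hypothetical, and writing absolute paths (rather than the relative paths `find` prints) is an assumption made so the list does not depend on the working directory.

```python
from pathlib import Path


def list_pdfs(root, out_file="pdfpaths.list"):
    """Recursively collect *.pdf files under root and write one path per line.

    Hypothetical helper: mirrors `find <dir> -name '*.pdf'`, but writes
    absolute paths (an assumption, not what `find` itself prints).
    """
    paths = sorted(
        str(p.resolve()) for p in Path(root).rglob("*.pdf") if p.is_file()
    )
    # One path per line, matching the list-file format expected by -li.
    Path(out_file).write_text("".join(f"{p}\n" for p in paths), encoding="utf-8")
    return paths
```

Calling `list_pdfs("<dir>")` writes `pdfpaths.list` in the current directory and returns the collected paths.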
Parse PDFs:
$ python corenlpparser.py -h
usage: CoreNLPParser [-h] [-v] (-i IN | -li LIST) -o OUT [-p TIKA_URL]
                     [-c CORENLP_URL] [-n NER_MODEL]

This tool can parse files.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -i IN, --in IN        Path to Input File. (default: None)
  -li LIST, --list LIST
                        Path to a text file which contains list of input file
                        paths (default: None)
  -o OUT, --out OUT     Path to output file. (default: None)
  -p TIKA_URL, --tika-url TIKA_URL
                        URL of Tika Server. (default: None)
  -c CORENLP_URL, --corenlp-url CORENLP_URL
                        CoreNLP Server URL (default: http://localhost:9000)
  -n NER_MODEL, --ner-model NER_MODEL
                        Path (on Server side) to NER model (default: None)
NOTE: This expects three services to be running on their default ports: CoreNLP Server (:9000), Parser-Server (aka Tika Server, :9998), and Grobid Server (:8080).
Example:
$ python corenlpparser.py -li pdfpaths.list \
-o parsed-tika-grobid-corenlp.jl \
-n <abspath>/ner-model-jpl-chemistry.ser.gz
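Since the parser depends on the three services listed in the note, it can help to confirm they are reachable before starting a long run. The sketch below only checks that something accepts TCP connections on each port; it does not verify that the right service is answering. The host `localhost` and the names `SERVICES`, `is_listening`, and `check_services` are assumptions for illustration, not part of the MTE tools.

```python
import socket

# Default ports taken from the note above; the host is an assumption.
SERVICES = {
    "CoreNLP Server": 9000,
    "Tika Server": 9998,
    "Grobid Server": 8080,
}


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unresolvable hosts.
        return False


def check_services(host="localhost", services=SERVICES):
    """Map each service name to whether its port accepts connections."""
    return {name: is_listening(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'up' if up else 'NOT reachable'}")
```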
In this step, we update the Solr index with the JSON Lines dump produced in the previous step.
Usage:
$ python indexer.py -h
usage: indexer.py [-h] [-v] -i IN [-s SOLR_URL] [-sc SCHEMA] [-u]

This tool can read JSON line dump and index to solr.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -i IN, --in IN        Path to Input JSON line file. (default: None)
  -s SOLR_URL, --solr-url SOLR_URL
                        URL of Solr core. (default:
                        http://localhost:8983/solr/docs)
  -sc SCHEMA, --schema SCHEMA
                        Schema Mapping to be used. Options: ['journal',
                        'basic'] (default: journal)
  -u, --update          Update documents in the index (default: False)
Example:
$ python indexer.py -i parsed-tika-grobid-corenlp.jl -u
NOTE: The -u option updates the existing documents, which preserves any brat annotations that were previously added. Without -u, the documents are overwritten.
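How indexer.py implements the update is not shown on this page. One mechanism Solr provides for this kind of partial update is the atomic update syntax, in which each field value is wrapped in a modifier such as `set`; fields not mentioned in the request (for example, previously added annotations) are left as they are, whereas posting a plain document replaces the whole stored document. The sketch below only builds such a payload. The function name `to_atomic_update`, the `id` key, and the sample field names are hypothetical.

```python
import json


def to_atomic_update(doc, id_field="id"):
    """Wrap every non-id field in a Solr atomic-update "set" modifier.

    Hypothetical helper; the real indexer.py may work differently.
    """
    if id_field not in doc:
        raise ValueError(f"document has no {id_field!r} field")
    update = {id_field: doc[id_field]}
    for name, value in doc.items():
        if name != id_field:
            # "set" replaces only this field; other stored fields are kept.
            update[name] = {"set": value}
    return update


# Hypothetical record, standing in for one line of the JSON Lines dump.
parsed = {"id": "paper-001.pdf", "content": "Re-parsed text"}

# Body for a POST to <solr-url>/update (Solr expects a JSON list of documents).
payload = json.dumps([to_atomic_update(parsed)])
```

Atomic updates require the affected fields to be stored (or to have docValues) in the Solr schema, so whether this approach applies depends on the schema in use.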