Add new documents to Solr index - wkiri/MTE GitHub Wiki

Add new documents to Solr

Step 1: Parse the PDFs using Tika, Grobid, and CoreNLP

Write the paths of all the PDFs to a list file:

$ find <dir> -name '*.pdf' > pdfpaths.list
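
Where `find` is not available (for example on Windows), the same list can be built from Python. This is a minimal sketch using only the standard library. The function name `write_pdf_list` is illustrative and not part of the MTE tooling, and writing absolute paths is an assumption made so that the list works from any working directory.

```python
from pathlib import Path


def write_pdf_list(root, list_file="pdfpaths.list"):
    """Write the absolute path of every *.pdf file under root, one per line."""
    # rglob walks the tree recursively, much like `find <dir> -name '*.pdf'`.
    paths = sorted(
        str(p.resolve()) for p in Path(root).rglob("*.pdf") if p.is_file()
    )
    Path(list_file).write_text(
        "".join(path + "\n" for path in paths), encoding="utf-8"
    )
    return paths
```

Calling `write_pdf_list("<dir>")` produces a `pdfpaths.list` that can be passed to the parser with `-li`.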

Parse PDFs:

$ python corenlpparser.py -h
usage: CoreNLPParser [-h] [-v] (-i IN | -li LIST) -o OUT [-p TIKA_URL]
                     [-c CORENLP_URL] [-n NER_MODEL]

This tool can parse files.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -i IN, --in IN        Path to Input File. (default: None)
  -li LIST, --list LIST
                        Path to a text file which contains list of input file
                        paths (default: None)
  -o OUT, --out OUT     Path to output file. (default: None)
  -p TIKA_URL, --tika-url TIKA_URL
                        URL of Tika Server. (default: None)
  -c CORENLP_URL, --corenlp-url CORENLP_URL
                        CoreNLP Server URL (default: http://localhost:9000)
  -n NER_MODEL, --ner-model NER_MODEL
                        Path (on Server side) to NER model (default: None)

NOTE: This expects three services to be running on their default ports: CoreNLP Server (:9000), Parser Server (aka Tika Server, :9998), and Grobid Server (:8080).

Example:

$ python corenlpparser.py -li  pdfpaths.list \
  -o parsed-tika-grobid-corenlp.jl \
  -n <abspath>/ner-model-jpl-chemistry.ser.gz
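
Since the parser depends on the three services listed in the note above, it can help to confirm that their ports accept connections before starting a long run. The following is a rough sketch, not part of the MTE tooling: the ports come from the note, while the `localhost` host and the helper names `is_listening` and `check_services` are assumptions. A successful TCP connection only shows that something is listening on the port, not that it is the expected service.

```python
import socket

# Ports are taken from the note above; the host is an assumption
# (all three services running on the local machine).
SERVICES = {
    "CoreNLP Server": ("localhost", 9000),
    "Tika Server": ("localhost", 9998),
    "Grobid Server": ("localhost", 8080),
}


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_services(services=SERVICES):
    """Map each service name to whether its port accepts connections."""
    return {
        name: is_listening(host, port)
        for name, (host, port) in services.items()
    }


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name}: {'up' if up else 'NOT reachable'}")
```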

Step 2: Index the parsed documents

In this step, we update the Solr index with the JSON Lines dump produced in the previous step.

Usage:

$ python indexer.py -h
usage: indexer.py [-h] [-v] -i IN [-s SOLR_URL] [-sc SCHEMA] [-u]

This tool can read JSON line dump and index to solr.

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -i IN, --in IN        Path to Input JSON line file. (default: None)
  -s SOLR_URL, --solr-url SOLR_URL
                        URL of Solr core. (default:
                        http://localhost:8983/solr/docs)
  -sc SCHEMA, --schema SCHEMA
                        Schema Mapping to be used. Options: ['journal',
                        'basic'] (default: journal)
  -u, --update          Update documents in the index (default: False)

Example:

$ python indexer.py -i parsed-tika-grobid-corenlp.jl -u

NOTE: With -u, the indexer updates the existing documents, which preserves any brat annotations that were added previously. Without -u, the existing documents are overwritten.
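
To confirm that indexing had an effect, one option is to compare the number of documents in the core before and after running the indexer. The sketch below queries Solr's standard `select` handler with `rows=0`, so only the match count is returned. The core URL is the default shown in the `indexer.py` help text; the helper names `count_url` and `count_docs` are illustrative and not part of the MTE tooling.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Default core URL from the indexer.py help text; adjust if -s was used.
SOLR_URL = "http://localhost:8983/solr/docs"


def count_url(solr_url=SOLR_URL, query="*:*"):
    """Build a select URL that returns only the match count (rows=0)."""
    params = urlencode({"q": query, "rows": 0, "wt": "json"})
    return f"{solr_url.rstrip('/')}/select?{params}"


def count_docs(solr_url=SOLR_URL, query="*:*"):
    """Return numFound for the query; requires a running Solr core."""
    with urlopen(count_url(solr_url, query), timeout=10) as resp:
        return json.load(resp)["response"]["numFound"]
```

With Solr running, `count_docs()` returns the total number of indexed documents. When -u is used on documents that already exist, the count is expected to stay the same even though their contents change.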
