Creating Solr Index from Scratch - wkiri/MTE GitHub Wiki


Set up Solr

Download Solr

mkdir workspace && cd workspace
wget http://archive.apache.org/dist/lucene/solr/6.1.0/solr-6.1.0.tgz
tar xvzf solr-6.1.0.tgz
cd solr-6.1.0

Start and Create a Core

PORT=8983
bin/solr start -p $PORT
bin/solr create_core -c docs -d $YOUR_PATH/conf/solr/docs -p $PORT

To confirm that Solr is set up correctly, visit http://<host>:8983/solr/
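The same check can be scripted; a minimal sketch, assuming the port and core name used above (adjust if yours differ):

```shell
# Query Solr's CoreAdmin STATUS endpoint to confirm the "docs" core exists.
PORT=8983
CORE=docs
STATUS_URL="http://localhost:${PORT}/solr/admin/cores?action=STATUS&core=${CORE}&wt=json"
# A JSON response containing "name":"docs" means the core was created.
curl -s "$STATUS_URL" || echo "Solr is not reachable on port ${PORT}"
```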

Set up the parser server, Tika, and Grobid

Refer to the README in the parser-server directory for setting up the parser server (or see https://github.com/USCDataScience/parser-indexer-py/tree/master/parser-server).
Once the parser server is running on http://localhost:9998/, follow the steps below:
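To verify the parser server is up before continuing, a quick reachability check can help. This is a sketch that assumes the server exposes Tika server's plain-text /version endpoint on port 9998:

```shell
# Ping the parser server; prints a version string if it is up.
PARSER_URL="http://localhost:9998"
curl -s "${PARSER_URL}/version" || echo "parser server is not reachable at ${PARSER_URL}"
```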

Set up CoreNLP

Download and start the Stanford CoreNLP server on port 9000

Step 1: Download CoreNLP

Visit http://stanfordnlp.github.io/CoreNLP/download.html, download the zip, and extract it. Note: CoreNLP requires Java 8.

Step 2: Install Python dependencies

Follow the instructions at https://github.com/smilli/py-corenlp:

pip install pycorenlp

Step 3: Start the CoreNLP server

Go to the extracted CoreNLP directory and run:

java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer

You can test it by visiting http://localhost:9000/ in a browser.
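You can also test it from the command line by POSTing a sentence to the server's HTTP API. A sketch, assuming the default port 9000 (the annotator list here is just an example):

```shell
# Send a sentence to CoreNLP and get JSON annotations back.
# -g disables curl's brace globbing so the JSON in the URL is sent as-is.
CORENLP_URL="http://localhost:9000"
curl -s -g --data 'Stanford University is located in California.' \
  "${CORENLP_URL}"'/?properties={"annotators":"tokenize,ssplit,pos","outputFormat":"json"}' \
  || echo "CoreNLP server is not reachable at ${CORENLP_URL}"
```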

Tips:

  • To restart, kill the running server and start it again with `nohup corenlpserver.sh &`
  • To change the NER model, edit `ner.model` in `$CORENLP_HOME/StanfordCoreNLP.properties`
  • If needed, select "English" in the web interface for the language.
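The restart tip above assumes a small wrapper script named corenlpserver.sh; the source does not show its contents, but a hypothetical minimal version could be created like this (CORENLP_HOME is assumed to point at the extracted CoreNLP directory):

```shell
# Write a minimal wrapper script that launches the CoreNLP server on port 9000.
cat > corenlpserver.sh <<'EOF'
#!/bin/bash
# Fail with a message if CORENLP_HOME is unset.
cd "${CORENLP_HOME:?set CORENLP_HOME to the extracted CoreNLP directory}"
exec java -mx4g -cp "*" edu.stanford.nlp.pipeline.StanfordCoreNLPServer -port 9000
EOF
chmod +x corenlpserver.sh
```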

Parse and add the PDF documents to Solr

Add new documents to Solr index
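After indexing (see the parser-indexer-py repository for the indexing scripts), a quick query confirms documents landed in the core. This is a sketch assuming the port and core name used above; numFound in the JSON response should match the number of documents you added:

```shell
# Count all documents in the "docs" core (rows=0 returns only the count).
PORT=8983
CORE=docs
QUERY_URL="http://localhost:${PORT}/solr/${CORE}/select?q=*:*&rows=0&wt=json"
curl -s "$QUERY_URL" || echo "Solr is not reachable on port ${PORT}"
```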
