# Solr indexing in EC2 (npmajisha/bag-the-job GitHub Wiki)
- EC2 instance with a security group that allows inbound TCP traffic on port 8983
- JDK and JRE installed on the EC2 instance
- [pysolr](https://pypi.python.org/pypi/pysolr) installed (`pip install pysolr`)
- Download the latest version of Lucene Solr [here](http://lucene.apache.org/solr/mirrors-solr-latest-redir.html) and install it on the EC2 instance
- Start the Jetty server. Instructions can be found [here](http://lucene.apache.org/solr/quickstart.html)
- Make sure that Solr is running by accessing `http://<EC2 public IP>:8983/solr` from a local browser
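As an alternative to the browser check, a short Python snippet can probe the Solr admin page over HTTP. This is only a convenience sketch; the host passed in is a placeholder for your instance's public IP or DNS name.

```python
# Probe the Solr admin page to confirm the server answers on port 8983.
from urllib.request import urlopen
from urllib.error import URLError

def solr_ping_url(host, port=8983):
    """Build the URL of the Solr admin landing page."""
    return f"http://{host}:{port}/solr/"

def solr_is_up(host, port=8983, timeout=5):
    """Return True if the Solr admin page responds with HTTP 200."""
    try:
        with urlopen(solr_ping_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except URLError:
        return False

# Example (substitute your EC2 instance's public address):
# solr_is_up("<EC2 public IP>")
```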
#### Indexing files using the Indexer
Note: Currently the indexer accepts only files in JSON format
- Create a folder for log files (e.g. `logs`) and a folder for job output (e.g. `jobs`)
- Add configuration details to a JSON file (e.g. `jobIndex.json`):

        {
            "sourceS3Bucket": "(name of the S3 bucket where your files are located)",
            "sourceS3Folder": "(name of the folder within the bucket)",
            "solrServerPort": "(EC2 IP address and port where the Solr server is running, e.g. '54.86.116.103:8983')",
            "solrCoreName": "(name of the Solr core to index the files into)",
            "assumedRole": "(assumed role, if accessing a shared S3 bucket)"
        }
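The configuration file can be loaded and turned into a Solr core URL along these lines. The field values below are placeholders, and the `/solr/<core>` layout is the standard core path that `pysolr.Solr` expects.

```python
import json

# Sample config matching the jobIndex.json template above (placeholder values).
raw = """{
    "sourceS3Bucket": "my-jobs-bucket",
    "sourceS3Folder": "jobs",
    "solrServerPort": "54.86.116.103:8983",
    "solrCoreName": "jobs"
}"""
cfg = json.loads(raw)

def solr_url_from_config(cfg):
    """Combine solrServerPort and solrCoreName into a full core URL."""
    return f"http://{cfg['solrServerPort']}/solr/{cfg['solrCoreName']}"

print(solr_url_from_config(cfg))  # http://54.86.116.103:8983/solr/jobs

# With pysolr installed, a client for that core would then be created as:
# import pysolr
# solr = pysolr.Solr(solr_url_from_config(cfg), timeout=10)
```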
- Run `nohup python3 indexer.py -c <config file> -lf logs/<logfile> > jobs/std.out 2> jobs/std.err < /dev/null &` to start the indexing process in the background
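The indexer's source is not shown on this page; as a rough sketch, its main loop might list the JSON files in the configured S3 folder and push their contents to Solr. This assumes boto3 for S3 access and pysolr for indexing; the AWS-dependent calls are commented out so that only the pure file-selection helper runs standalone.

```python
# Hypothetical sketch of the indexer's main loop.
def json_keys(keys, folder):
    """Keep only .json objects under the configured source folder."""
    prefix = folder.rstrip("/") + "/"
    return [k for k in keys if k.startswith(prefix) and k.endswith(".json")]

# import boto3, json, pysolr
# s3 = boto3.client("s3")
# listing = s3.list_objects_v2(Bucket=cfg["sourceS3Bucket"],
#                              Prefix=cfg["sourceS3Folder"])
# keys = [obj["Key"] for obj in listing.get("Contents", [])]
# for key in json_keys(keys, cfg["sourceS3Folder"]):
#     body = s3.get_object(Bucket=cfg["sourceS3Bucket"], Key=key)["Body"].read()
#     solr.add(json.loads(body))  # index the documents in this file

print(json_keys(["jobs/a.json", "jobs/b.txt", "other/c.json"], "jobs"))
# ['jobs/a.json']
```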
- Install the Solr client package (`solr-client`, from the solr-node-client project) into the Express project using npm
- Documentation for access can be found [here](http://lbdremy.github.io/solr-node-client/)
- Make sure that `solr.createClient` is passed the `host` (the EC2 instance's address), `port` (`'8983'`), and `core` (the Solr core name) parameters