Solr indexing in EC2 - npmajisha/bag-the-job GitHub Wiki

Prerequisites

  1. EC2 instance with a security group that permits inbound TCP traffic on port 8983
  2. JDK and JRE installed on the EC2 instance
  3. [pysolr](https://pypi.python.org/pypi/pysolr)

Installation

  1. Download the latest version of Lucene Solr [here](http://lucene.apache.org/solr/mirrors-solr-latest-redir.html) and install it on the EC2 instance
  2. Start the Jetty server. Instructions can be found [here](http://lucene.apache.org/solr/quickstart.html)
  3. Make sure that Solr is running by opening `<EC2 public IP>:8983` in a local browser
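The browser check can also be scripted, which is handy when the instance is provisioned automatically. The following is a minimal sketch, assuming Solr exposes its system-info endpoint at `/solr/admin/info/system`; the function names and the placeholder address are illustrative and not part of this project.

```python
import http.client
import json
import urllib.error
import urllib.request


def solr_status_url(host, port=8983):
    """Build the URL of Solr's system-info endpoint for the given host."""
    return "http://{}:{}/solr/admin/info/system?wt=json".format(host, port)


def solr_is_running(host, port=8983, timeout=5):
    """Return True if Solr answers with HTTP 200 and a JSON body."""
    try:
        with urllib.request.urlopen(solr_status_url(host, port), timeout=timeout) as resp:
            json.loads(resp.read().decode("utf-8"))  # raises ValueError if not JSON
            return resp.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError, ValueError):
        # Connection refused, timeout, blocked port, or an unexpected response.
        return False


if __name__ == "__main__":
    # Placeholder address (assumption); replace with the EC2 public IP.
    print(solr_is_running("127.0.0.1"))
```

A `False` result from outside the instance, together with a `True` result on the instance itself, usually points at the security group rule for port 8983 rather than at Solr.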

#### Indexing files using Indexer
Note: Currently the indexer accepts only files in JSON format
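For orientation, here is a rough sketch of what indexing one JSON file into Solr can look like. It is not the code of `indexer.py`, which is not shown on this page. The helper names `load_documents` and `index_file` are hypothetical, and the `solr` argument is assumed to be any client object with `add` and `commit` methods, such as a `pysolr.Solr` instance.

```python
import json


def load_documents(path):
    """Read a JSON file and return its content as a list of documents."""
    with open(path, encoding="utf-8") as handle:
        data = json.load(handle)
    # A file may hold a single object or an array of objects.
    return data if isinstance(data, list) else [data]


def index_file(path, solr):
    """Send every document in the file to Solr, commit, and return the count."""
    docs = load_documents(path)
    solr.add(docs)   # queue the documents on the Solr core
    solr.commit()    # make them visible to searches
    return len(docs)
```

With pysolr installed, a call such as `index_file("job.json", pysolr.Solr("http://<host>:8983/solr/<core>", timeout=10))` would index the file into the given core; the file name is only an example.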

  1. Create a folder for log files (e.g. `logs`) and a folder for job output named `jobs`
  2. Add the configuration details to a JSON file (e.g. `jobIndex.json`)
{  
  "sourceS3Bucket": "(name of the s3 bucket where your files are located)",  
  "sourceS3Folder": "(name of the folder within the bucket)",  
  "solrServerPort": "(ec2 ip address and port where the solr server is running, e.g. '54.86.116.103:8983')",  
  "solrCoreName": "(core name in solr to index the files)",  
  "assumedRole": "(assumed role if accessing a shared s3 bucket)"  
}  
  3. Run the command `nohup python3 indexer.py -c <config file> -lf logs/<logfile> > jobs/std.out 2> jobs/std.err < /dev/null &` to start the indexing process in the background
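Because the indexer runs in the background, a mistake in the config file only shows up in `jobs/std.err`. It can help to validate the file before starting the job. The sketch below is a hypothetical pre-flight check, not part of `indexer.py`. It assumes the four S3 and Solr keys are required and that `assumedRole` is optional, and it assumes the core is reachable at `http://<solrServerPort>/solr/<solrCoreName>`.

```python
import json

# Keys assumed to be mandatory; "assumedRole" is treated as optional here.
REQUIRED_KEYS = ("sourceS3Bucket", "sourceS3Folder", "solrServerPort", "solrCoreName")


def load_config(path):
    """Load the indexer config and fail early if a required key is missing."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)  # raises json.JSONDecodeError on bad syntax
    missing = [key for key in REQUIRED_KEYS if not config.get(key)]
    if missing:
        raise ValueError("missing config keys: " + ", ".join(missing))
    return config


def solr_core_url(config):
    """Build the core URL from the host:port and core name in the config."""
    return "http://{}/solr/{}".format(config["solrServerPort"], config["solrCoreName"])
```

Calling `load_config("jobIndex.json")` before the `nohup` command turns a syntax error, such as a missing comma between two entries, into an immediate `json.JSONDecodeError` instead of a failed background job.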

Solr access from Node.js server

  1. Install the Solr client package for Node.js in the Express project using npm (the client documented in the next step, solr-node-client, is published on npm as `solr-client`)
  2. Documentation for access can be found [here](http://lbdremy.github.io/solr-node-client/)
  3. Make sure that `solr.createClient` is given the `host` (the EC2 address), `port` (`'8983'`) and `core` (the Solr core name) parameters