# Solr indexing in EC2 (npmajisha/bag-the-job GitHub Wiki)
- EC2 instance with a security group that allows inbound TCP traffic on port 8983
- JDK and JRE installed on the EC2 instance
- [pysolr](https://pypi.python.org/pypi/pysolr) installed (`pip install pysolr`)
- Download the latest version of Lucene Solr [here](http://lucene.apache.org/solr/mirrors-solr-latest-redir.html) and install it on the EC2 instance
- Start the Jetty server. Instructions can be found [here](http://lucene.apache.org/solr/quickstart.html)
- Make sure that Solr is running by accessing `http://<EC2 public IP>:8983/solr` from a local browser
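As an alternative to the browser check, a short Python snippet can probe the Solr admin page over HTTP. This is only a convenience sketch; the host passed in is a placeholder for your instance's public IP or DNS name.

```python
# Probe the Solr admin page to confirm the server answers on port 8983.
from urllib.request import urlopen
from urllib.error import URLError

def solr_ping_url(host, port=8983):
    """Build the URL of the Solr admin landing page."""
    return f"http://{host}:{port}/solr/"

def solr_is_up(host, port=8983, timeout=5):
    """Return True if the Solr admin page responds with HTTP 200."""
    try:
        with urlopen(solr_ping_url(host, port), timeout=timeout) as resp:
            return resp.status == 200
    except URLError:
        return False

# Example (substitute your EC2 instance's public address):
# solr_is_up("<EC2 public IP>")
```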
#### Indexing files using the Indexer
Note: Currently the indexer accepts only files in JSON format
- Create a folder for log files (e.g. `logs`) and a folder for job output (e.g. `jobs`)
- Add configuration details to a JSON file (e.g. `jobIndex.json`):

        {
            "sourceS3Bucket": "(name of the S3 bucket where your files are located)",
            "sourceS3Folder": "(name of the folder within the bucket)",
            "solrServerPort": "(EC2 IP address and port where the Solr server is running, e.g. '54.86.116.103:8983')",
            "solrCoreName": "(name of the Solr core to index the files into)",
            "assumedRole": "(assumed role, if accessing a shared S3 bucket)"
        }
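The configuration file can be loaded and turned into a Solr core URL along these lines. The field values below are placeholders, and the `/solr/<core>` layout is the standard core path that `pysolr.Solr` expects.

```python
import json

# Sample config matching the jobIndex.json template above (placeholder values).
raw = """{
    "sourceS3Bucket": "my-jobs-bucket",
    "sourceS3Folder": "jobs",
    "solrServerPort": "54.86.116.103:8983",
    "solrCoreName": "jobs"
}"""
cfg = json.loads(raw)

def solr_url_from_config(cfg):
    """Combine solrServerPort and solrCoreName into a full core URL."""
    return f"http://{cfg['solrServerPort']}/solr/{cfg['solrCoreName']}"

print(solr_url_from_config(cfg))  # http://54.86.116.103:8983/solr/jobs

# With pysolr installed, a client for that core would then be created as:
# import pysolr
# solr = pysolr.Solr(solr_url_from_config(cfg), timeout=10)
```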
- Run `nohup python3 indexer.py -c <config file> -lf logs/<logfile> > jobs/std.out 2> jobs/std.err < /dev/null &` to start the indexing process in the background
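The indexer's source is not shown on this page; as a rough sketch, its main loop might list the JSON files in the configured S3 folder and push their contents to Solr. This assumes boto3 for S3 access and pysolr for indexing; the AWS-dependent calls are commented out so that only the pure file-selection helper runs standalone.

```python
# Hypothetical sketch of the indexer's main loop.
def json_keys(keys, folder):
    """Keep only .json objects under the configured source folder."""
    prefix = folder.rstrip("/") + "/"
    return [k for k in keys if k.startswith(prefix) and k.endswith(".json")]

# import boto3, json, pysolr
# s3 = boto3.client("s3")
# listing = s3.list_objects_v2(Bucket=cfg["sourceS3Bucket"],
#                              Prefix=cfg["sourceS3Folder"])
# keys = [obj["Key"] for obj in listing.get("Contents", [])]
# for key in json_keys(keys, cfg["sourceS3Folder"]):
#     body = s3.get_object(Bucket=cfg["sourceS3Bucket"], Key=key)["Body"].read()
#     solr.add(json.loads(body))  # index the documents in this file

print(json_keys(["jobs/a.json", "jobs/b.txt", "other/c.json"], "jobs"))
# ['jobs/a.json']
```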
- Install the Solr client package (`solr-client`, from the solr-node-client project) into the Express project using npm
- Documentation for access can be found [here](http://lbdremy.github.io/solr-node-client/)
- Make sure that `solr.createClient` is passed the `host` (the EC2 instance's address), `port` (`'8983'`), and `core` (the Solr core name) parameters