Module 1: ICP #7 - SnehaMishra28/BigData_Programming_Summer2018 GitHub Wiki
Team: 12
Professor: Yugyung Lee
Name: Sneha Mishra
Class ID: 11
Email: [email protected]
MyGitHub
Technical Partner:
Name: Aditya Soman
Class ID: 19
Email: [email protected]
GitHub
Objective
Introduction to Lucene and Solr. Solr, the search platform built on the Lucene library, is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.
Properties of Lucene include:
- Fast, high-performance, scalable search/information retrieval (IR) library
- Open source
- Provides advanced search options such as synonyms, stopwords, similarity-based (fuzzy) matching and proximity search.
Features
- Create a Solr collection (index) from each dataset
- Execute queries on the given datasets
- Perform partial-match, fuzzy and proximity searches
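The three search types listed above map onto standard Lucene query syntax: a trailing `*` for a partial (wildcard) match, `~N` after a single term for a fuzzy match, and `~N` after a quoted phrase for a proximity match. The helper functions below are a minimal sketch of that syntax. The function names, the field names (`name`, `author`) and the search terms are illustrative assumptions, not the queries actually run in this ICP.

```python
# Minimal sketch: build Lucene/Solr query strings for the three search types.
# Field names and terms are hypothetical examples, not the ICP's own queries.

def partial_match(field, prefix):
    """Wildcard query: terms in `field` that start with `prefix`."""
    return f"{field}:{prefix}*"


def fuzzy_match(field, term, max_edits=1):
    """Fuzzy query: terms within `max_edits` character edits of `term`."""
    return f"{field}:{term}~{max_edits}"


def proximity_match(field, words, slop):
    """Proximity query: the words of the phrase occur within `slop` positions."""
    phrase = " ".join(words)
    return f'{field}:"{phrase}"~{slop}'


if __name__ == "__main__":
    print(partial_match("name", "gam"))
    print(fuzzy_match("author", "martn"))
    print(proximity_match("name", ["game", "thrones"], 2))
```

Each returned string can be pasted into the `q` box of the Solr admin query screen.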
Steps:
Step 1: Install Solr (or Cloudera)



Step 2: Create config files


Create Instance Dir
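On a Cloudera installation, the config files and instance directory are typically created with the `solrctl` tool. The sketch below only assembles and prints the commands rather than running them. It assumes `solrctl` is available on the machine, and the collection name `books` and the directory `/home/cloudera/solr_configs` are hypothetical placeholders for whatever was used in the screenshots.

```python
import shlex

# Minimal sketch: assemble the solrctl commands for a new collection.
# The collection name and config directory are hypothetical placeholders.

def solrctl_commands(collection, config_dir, shards=1):
    """Return the argv lists for generating configs, uploading them, and creating the collection."""
    return [
        # Generate a template config directory (contains conf/schema.xml).
        ["solrctl", "instancedir", "--generate", config_dir],
        # Upload the (possibly edited) config directory as an instance dir.
        ["solrctl", "instancedir", "--create", collection, config_dir],
        # Create the collection with the given number of shards.
        ["solrctl", "collection", "--create", collection, "-s", str(shards)],
    ]


if __name__ == "__main__":
    for argv in solrctl_commands("books", "/home/cloudera/solr_configs"):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

Any schema changes would normally be made in the generated `conf/schema.xml` before the instance directory is created.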

Part 1: Book Dataset
Document Created:


Query 1:

Query 2:

Query 3:

Query 4:

Query 5:

Query 6:

Part 2: Film Dataset
Update the schema.xml file:
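The schema needs a `<field>` entry for each column of the film dataset that is not already defined. The sketch below generates such entries with Python's standard XML library. The field names are taken from the header of the films.csv file listed in the references; the field types (`text_general`, `string`, `date`) and the choice of which fields are multi-valued are assumptions, and may differ from the schema shown in the original screenshot.

```python
import xml.etree.ElementTree as ET

# Minimal sketch: generate <field> entries for the film dataset.
# (name, type, multi_valued); types and multi-valued flags are assumptions.
FILM_FIELDS = [
    ("name", "text_general", False),
    ("directed_by", "string", True),
    ("genre", "string", True),
    ("type", "string", False),
    ("initial_release_date", "date", False),
]


def field_element(name, field_type, multi_valued):
    """Build one schema.xml <field> element that is indexed and stored."""
    return ET.Element("field", {
        "name": name,
        "type": field_type,
        "indexed": "true",
        "stored": "true",
        "multiValued": "true" if multi_valued else "false",
    })


if __name__ == "__main__":
    for spec in FILM_FIELDS:
        print(ET.tostring(field_element(*spec), encoding="unicode"))
```

The printed lines can be pasted inside the `<fields>` section of schema.xml. The `id` field is omitted here on the assumption that the generated template schema already defines it.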

Document Created:
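One way to load films.csv is Solr's CSV update handler, which can split the pipe-separated `genre` and `directed_by` columns into multiple values at index time. The sketch below only builds the request URL. The base URL `http://localhost:8983/solr` and the collection name `films` are assumptions, and the documents in this ICP may instead have been added through the admin UI.

```python
from urllib.parse import urlencode

# Minimal sketch: build the CSV update URL for the film dataset.
# Base URL and collection name are hypothetical placeholders.

def csv_update_url(base_url, collection, split_fields, separator="|"):
    """Return an /update URL that commits and splits the given multi-valued CSV columns."""
    params = [("commit", "true")]
    for field in split_fields:
        params.append((f"f.{field}.split", "true"))
        params.append((f"f.{field}.separator", separator))
    return f"{base_url}/{collection}/update?{urlencode(params)}"


if __name__ == "__main__":
    print(csv_update_url("http://localhost:8983/solr", "films",
                         ["genre", "directed_by"]))
```

The CSV file would then be sent as the body of a POST request to this URL with a CSV content type.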

Query 1:

Query 2:

Query 3:

Query 4:

Query 5:

Bonus Query: Proximity search on Film dataset
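A proximity query can also be sent to Solr's `/select` handler over HTTP instead of through the admin UI. The sketch below builds such a request URL. The base URL, the collection name `films`, and the example phrase are hypothetical and are not the query from the original screenshot.

```python
from urllib.parse import urlencode

# Minimal sketch: build a /select URL for a proximity query.
# Base URL, collection name and query phrase are hypothetical examples.

def select_url(base_url, collection, query, rows=10):
    """Return a /select URL that asks for JSON results for `query`."""
    params = {"q": query, "wt": "json", "rows": rows}
    return f"{base_url}/{collection}/select?{urlencode(params)}"


if __name__ == "__main__":
    # Match film names where "star" and "wars" occur within 3 positions.
    print(select_url("http://localhost:8983/solr", "films",
                     'name:"star wars"~3'))
```

Proximity matching relies on term positions, so this only works as intended if the queried field is a tokenized text type rather than a plain `string` field.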

References:
Films Dataset
https://github.com/apache/lucene-solr/blob/master/solr/example/films/films.csv
Books Dataset
https://github.com/apache/lucene-solr/blob/master/solr/example/exampledocs/books.csv