Module 1: ICP #7 - SnehaMishra28/BigData_Programming_Summer2018 GitHub Wiki

Team: 12
Professor: Yugyung Lee

Name: Sneha Mishra
Class ID: 11
Email: [email protected]
MyGitHub

Technical Partner:
Name: Aditya Soman
Class ID: 19
Email: [email protected]
GitHub

Objective

This exercise is an introduction to Lucene and Solr. Solr, a search server built on the Lucene library, is reliable, scalable, and fault tolerant. It provides distributed indexing, replication, load-balanced querying, automated failover and recovery, and centralized configuration. Solr powers the search and navigation features of many of the world's largest internet sites.
Properties of Lucene include:

  1. Fast, high performance, scalable search/IR library
  2. Open source
  3. Provides advanced search options such as synonyms, stopwords, similarity-based (fuzzy) matching, and proximity search.

Features

  1. Create a Solr collection (the searchable index) from each dataset
  2. Execute queries on the given dataset
  3. Perform partial match, fuzzy search and proximity search
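The three search styles listed above map onto standard Lucene query syntax. The following is a minimal sketch of how such queries could be assembled into Solr `/select` URLs. The host, the collection name `books`, the field name `name`, and the search terms are all assumptions made for illustration, not the queries actually run in this ICP.

```python
from urllib.parse import urlencode

# Assumed values: adjust to the actual Solr host and collection.
SOLR_BASE = "http://localhost:8983/solr"
COLLECTION = "books"  # hypothetical collection name


def select_url(q, rows=10):
    """Build a /select URL for the given Lucene query string."""
    params = urlencode({"q": q, "rows": rows, "wt": "json"})
    return f"{SOLR_BASE}/{COLLECTION}/select?{params}"


# Partial match: a trailing wildcard matches any term that starts with "gam".
partial = "name:gam*"
# Fuzzy search: "~1" tolerates up to one character edit in the term.
fuzzy = "name:clash~1"
# Proximity search: both words must occur within 3 positions of each other.
proximity = 'name:"game thrones"~3'

for q in (partial, fuzzy, proximity):
    print(select_url(q))
```

Each printed URL can be pasted into a browser or passed to any HTTP client; the same query strings can also be typed into the `q` box of the Solr admin query screen.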

Steps:

Step 1: Install Solr (or use a Cloudera distribution that includes it)

Step 2: Create config files

Create Instance Dir
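On a Cloudera installation, the config files and the instance directory are typically handled with the `solrctl` tool. The sketch below only assembles the command lines so they can be reviewed before running. The collection name, the config path, and the shard count are hypothetical, and the assumption that this ICP used `solrctl` is not confirmed by the text above.

```python
import shlex


def solrctl_commands(name, config_dir, shards=1):
    """Return the solrctl invocations for one collection as argument lists."""
    return [
        # Generate a template configuration directory (schema.xml, solrconfig.xml, ...).
        ["solrctl", "instancedir", "--generate", config_dir],
        # Upload the (possibly edited) configuration as a named instance directory.
        ["solrctl", "instancedir", "--create", name, config_dir],
        # Create a collection that uses the instance directory of the same name.
        ["solrctl", "collection", "--create", name, "-s", str(shards)],
    ]


# Hypothetical name and path, for illustration only.
for cmd in solrctl_commands("books", "/home/cloudera/books_config"):
    print(" ".join(shlex.quote(part) for part in cmd))
```

To execute a command rather than print it, each argument list can be passed to `subprocess.run(cmd, check=True)` on a machine where `solrctl` is installed.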

Part 1: Book Dataset

Document Created:
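Documents can be created by posting a CSV file to the collection's update handler. The sketch below builds such a request without sending it. The host, the collection name, and the tiny sample payload are assumptions; the real input would be the contents of books.csv.

```python
from urllib.request import Request


def csv_update_request(csv_bytes, collection="books",
                       base="http://localhost:8983/solr"):
    """Build a POST request that indexes CSV rows and commits them."""
    url = f"{base}/{collection}/update?commit=true"
    return Request(
        url,
        data=csv_bytes,
        headers={"Content-Type": "application/csv"},
        method="POST",
    )


# Made-up two-line payload for illustration; not taken from books.csv.
sample = b"id,name,author\nbook-1,Example Title,Example Author\n"
req = csv_update_request(sample)
print(req.get_method(), req.full_url)
```

Passing `req` to `urllib.request.urlopen` would send it to a running Solr instance; the first CSV line is treated as the list of field names.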

Query 1:

Query 2:

Query 3:

Query 4:

Query 5:

Query 6:

Part 2: Film Dataset

Update the schema.xml file:
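The film dataset has columns that the default schema does not define, so matching `<field>` entries need to be added to schema.xml. The sketch below generates such entries. The column names are assumed from the films.csv header, and the field types and `multiValued` settings are illustrative guesses rather than the schema actually used here.

```python
import xml.etree.ElementTree as ET

# (field name, field type, multiValued): an assumed mapping, adjust as needed.
FILM_FIELDS = [
    ("name", "text_general", False),
    ("directed_by", "text_general", True),
    ("genre", "text_general", True),
    ("initial_release_date", "date", False),
]


def field_xml(name, ftype, multi):
    """Render one <field> element for schema.xml as a string."""
    element = ET.Element("field", {
        "name": name,
        "type": ftype,
        "indexed": "true",
        "stored": "true",
        "multiValued": "true" if multi else "false",
    })
    return ET.tostring(element, encoding="unicode")


for field in FILM_FIELDS:
    print(field_xml(*field))
```

The printed elements can be pasted next to the existing `<field>` entries in schema.xml; each `type` value must match a `<fieldType>` that the schema already declares.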

Document Created:

Query 1:

Query 2:

Query 3:

Query 4:

Query 5:

Bonus Query: Proximity search on Film dataset

References:

Films Dataset

https://github.com/apache/lucene-solr/blob/master/solr/example/films/films.csv

Books Dataset

https://github.com/apache/lucene-solr/blob/master/solr/example/exampledocs/books.csv