ICP 6 - PavankumarManchala/BigDataProgrammingICPs GitHub Wiki

Submitted by:

Pavankumar Manchala

Apache Lucene:

Fast, high-performance, scalable search/IR (information retrieval) library
Open source
Initially developed by Doug Cutting (also the co-creator of Hadoop)
Supports both indexing and searching
Stores documents in an inverted index
Provides advanced search options such as synonyms, stop words, similarity-based matching, and proximity search
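
To make the inverted index idea concrete, here is a minimal Python sketch. It is not Lucene's implementation: it only maps each lowercased term to the documents that contain it, with no analysis, scoring, or on-disk format. The function names and the three sample titles are chosen purely for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return ids of documents that contain every term in the query (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Illustrative sample documents (id -> title).
docs = {
    1: "A Game of Thrones",
    2: "A Clash of Kings",
    3: "The Black Company",
}
index = build_inverted_index(docs)
```

A lookup such as `search(index, "clash of")` then touches only the posting lists for `clash` and `of` instead of scanning every document, which is the property that makes this structure suitable for search.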

Lucene Architecture:

Apache Solr:

Created by Yonik Seeley at CNET
Enterprise search platform built on Apache Lucene
Open source
Highly reliable, scalable, and fault-tolerant
Supports distributed indexing (SolrCloud), replication, and load-balanced querying

Solr Architecture:

In Class Tasks:

Datasets:

Films dataset: https://github.com/apache/lucene-solr/blob/master/solr/example/films/films.csv
Books dataset: https://github.com/apache/lucene/Solr/blob/master/solr/example/exampledocs/books.csv

Execute queries of your choice on the given datasets.

Queries performed on Books dataset:

Query to get the books with a specific price
Query to get the books with "Ice" in the series_t field
Query to get the books in the scifi genre
Query to get the books with inStock set to false, sorted by ascending price, displaying only the series_t, author_s, and inStock fields
Query to get the books whose name contains "of"
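
The queries listed above could be written as Solr select URLs roughly as follows. This is a sketch under assumptions, not the exact queries that were run: the host, port, and core name `books` are hypothetical, the price value `7.99` is a placeholder, and the field names `price`, `genre_s`, and `name` are assumed from the usual layout of books.csv (only `series_t`, `author_s`, and `inStock` are named in the text). The code only builds the URLs and does not contact a server.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance on the default port with a core named "books".
BASE = "http://localhost:8983/solr/books/select"


def select_url(**params):
    """Return a select URL with the given query parameters URL-encoded."""
    return BASE + "?" + urlencode(params)


book_queries = {
    # Books with a specific price (7.99 is a placeholder value).
    "specific_price": select_url(q="price:7.99"),
    # Books with "Ice" in the series_t field.
    "ice_in_series": select_url(q="series_t:Ice"),
    # Books in the scifi genre (field name genre_s is assumed).
    "scifi_genre": select_url(q="genre_s:scifi"),
    # Out-of-stock books, cheapest first, returning only three fields.
    "out_of_stock_sorted": select_url(
        q="inStock:false",
        sort="price asc",
        fl="series_t,author_s,inStock",
    ),
    # Books whose name contains the word "of" (field name name is assumed).
    "name_contains_of": select_url(q="name:of"),
}

for label, url in book_queries.items():
    print(label, url)
```

Here `q` carries the field query, `sort` the ordering, and `fl` the list of fields to return; the resulting URLs can be pasted into a browser or entered as separate parameters in the Solr Admin query screen.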

Queries performed on Films dataset:

Copying the Films dataset:

Displaying the Films dataset:

Query to display the films whose genre contains "Black"
Query to display the films with a specific initial release date
Query to display the films with "Gary" in the directed_by field
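
The three Films queries above can be sketched as field queries like the ones below. The field names `genre`, `initial_release_date`, and `directed_by` are assumed from the column headers of films.csv, the date value is a placeholder, and `field_query` is a hypothetical helper written for this example. Values containing characters the query parser treats specially (such as the colons in a timestamp) are wrapped in double quotes.

```python
from urllib.parse import urlencode


def field_query(field, value):
    """Build a field:value clause, quoting the value if it has special characters."""
    needs_quotes = any(ch in value for ch in ' :-"')
    if needs_quotes:
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'{field}:"{escaped}"'
    return f"{field}:{value}"


film_queries = [
    # Films whose genre contains "Black".
    field_query("genre", "Black"),
    # Films released on a specific date (placeholder timestamp).
    field_query("initial_release_date", "2003-07-25T00:00:00Z"),
    # Films with "Gary" in the directed_by field.
    field_query("directed_by", "Gary"),
]

# URL-encoded parameter strings, each capped at 10 rows.
query_strings = [urlencode({"q": q, "rows": 10}) for q in film_queries]

for q, qs in zip(film_queries, query_strings):
    print(q, "->", qs)
```

Each string in `film_queries` can be typed into the `q` box of the Solr Admin query screen, while `query_strings` holds the same queries in the form appended to a `/select?` URL.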

Bonus: Query to display the documents that contain "comedy" in any field

Bonus: Query to display "comedy" with a limited number of words

Bonus: Query to display the documents within a specific range of versions
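
Two of the bonus queries can be sketched in the same way. This is one possible reading, not the exact queries used: the version bounds are placeholder numbers, `range_clause` is a hypothetical helper, and the all-fields search assumes a catch-all field named `_text_` that the other fields are copied into, which depends on how the schema is configured.

```python
from urllib.parse import urlencode


def range_clause(field, low, high, inclusive=True):
    """Build a range clause: [low TO high] is inclusive, {low TO high} is exclusive."""
    open_b, close_b = ("[", "]") if inclusive else ("{", "}")
    return f"{field}:{open_b}{low} TO {high}{close_b}"


# Documents whose _version_ falls in a range (bounds are placeholders).
version_params = {"q": range_clause("_version_", 1000, 2000)}

# "comedy" searched against an assumed catch-all default field.
all_fields_params = {"q": "comedy", "df": "_text_"}

print(urlencode(version_params))
print(urlencode(all_fields_params))
```

Either bound of a range can be replaced with `*` to leave that side open, for example `range_clause("_version_", 1000, "*")`.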