ICP 6 - PavankumarManchala/BigDataProgrammingICPs GitHub Wiki
Submitted by:
Pavankumar Manchala
Apache Lucene:
- Fast, high performance, scalable search/IR library
- Open source
- Initially developed by Doug Cutting (also the creator of Hadoop)
- Indexing and Searching
- Inverted Index of documents
- Provides advanced search options such as synonyms, stop words, similarity-based matching, and proximity search
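To make the inverted index idea concrete, here is a minimal sketch in Python. It is not how Lucene is implemented; it only illustrates the mapping from terms to the documents that contain them. The sample documents and the function names `build_inverted_index` and `search` are made up for illustration.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Illustrative documents (hypothetical sample data).
docs = {1: "A Game of Thrones", 2: "A Clash of Kings", 3: "Foundation"}
index = build_inverted_index(docs)

print(sorted(search(index, "of")))       # [1, 2]
print(sorted(search(index, "game of")))  # [1]
```

A lookup touches only the entries for the query terms, not every document, which is the main reason an inverted index makes search fast.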
Lucene Architecture:
Apache Solr:
- Created by Yonik Seeley for CNET
- Enterprise search platform built on Apache Lucene
- Open source
- Highly reliable, scalable, fault tolerant
- Supports distributed indexing (SolrCloud), replication, and load-balanced querying
Solr Architecture:
In Class Tasks:
Datasets:
- Films Dataset: https://github.com/apache/lucene-solr/blob/master/solr/example/films/films.csv
- Books Dataset: https://github.com/apache/lucene/Solr/blob/master/solr/example/exampledocs/books.csv
Execute any queries on the given dataset.
Queries performed on Books dataset:
- Query to get books with a specific price
- Query to get books with "Ice" in the series_t field
- Query to get books in the Scifi genre
- Query to get books where inStock is false, sorted by price in ascending order, displaying only the series_t, author_s, and inStock fields
- Query to get books whose name contains "of"
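The Books queries listed above could be expressed as Solr select requests roughly as follows. This is a sketch under stated assumptions: the address `http://localhost:8983/solr`, the core name `books`, the field name `genre_s`, and the price value `7.99` are not given in this write-up and are used only for illustration. The code builds the URLs without sending them.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SOLR_BASE = "http://localhost:8983/solr"  # assumption: default local Solr address
BOOKS_CORE = "books"                      # hypothetical core name


def select_url(core, **params):
    """Build a Solr /select URL; nothing is sent over the network."""
    return f"{SOLR_BASE}/{core}/select?{urlencode(params)}"


def decode_params(url):
    """Return the query parameters as Solr would receive them."""
    return {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}


book_queries = {
    # The price value is illustrative.
    "specific price": select_url(BOOKS_CORE, q="price:7.99"),
    "Ice in series_t": select_url(BOOKS_CORE, q="series_t:Ice"),
    # genre_s is an assumed field name for the genre.
    "Scifi genre": select_url(BOOKS_CORE, q="genre_s:scifi"),
    # sort orders by price ascending; fl restricts the returned fields.
    "not in stock, cheapest first": select_url(
        BOOKS_CORE,
        q="inStock:false",
        sort="price asc",
        fl="series_t,author_s,inStock",
    ),
    "name contains of": select_url(BOOKS_CORE, q="name:of"),
}

for label, url in book_queries.items():
    print(label, "->", decode_params(url))
```

Pasting any of the generated URLs into a browser, or entering the same `q`, `sort`, and `fl` values in the Query screen of the Solr Admin UI, should give the same request.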
Queries performed on Films dataset:
Copying the Films dataset:
Displaying the Films dataset:
- Query to display films whose genre contains "Black"
- Query to display films with a specific initial release date
- Query to display films with "Gary" in the directed_by field
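A similar sketch for the Films queries above. The core name `films`, the field names `genre` and `initial_release_date`, and the date value are assumptions for illustration. The exact date format depends on how the schema typed that field. No request is sent.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SOLR_BASE = "http://localhost:8983/solr"  # assumption: default local Solr address
FILMS_CORE = "films"                      # hypothetical core name


def select_url(core, **params):
    """Build a Solr /select URL; nothing is sent over the network."""
    return f"{SOLR_BASE}/{core}/select?{urlencode(params)}"


def decode_params(url):
    """Return the query parameters as Solr would receive them."""
    return {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}


film_queries = {
    "genre contains Black": select_url(FILMS_CORE, q="genre:Black"),
    # Hypothetical date; it is quoted so the colons inside the timestamp
    # are not parsed as field separators.
    "specific release date": select_url(
        FILMS_CORE, q='initial_release_date:"2005-01-01T00:00:00Z"'
    ),
    "directed by Gary": select_url(FILMS_CORE, q="directed_by:Gary"),
}

for label, url in film_queries.items():
    print(label, "->", decode_params(url))
```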
Bonus: Query to search for "comedy" across all fields
Bonus: Query to display "comedy" results with a limited number of words
Bonus: Query to display documents within a specific range of versions
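Two of the bonus queries could be sketched as follows, again only building the request URLs. The core name `films` and the version bounds are hypothetical. The all-fields search assumes the core's default search field is a catch-all field that every other field is copied into. If that is not configured, a bare term will not match across all fields.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

SOLR_BASE = "http://localhost:8983/solr"  # assumption: default local Solr address
FILMS_CORE = "films"                      # hypothetical core name

# Hypothetical bounds; real values come from the _version_ field of indexed documents.
VERSION_LOW = 1700000000000000000
VERSION_HIGH = 1800000000000000000


def select_url(core, **params):
    """Build a Solr /select URL; nothing is sent over the network."""
    return f"{SOLR_BASE}/{core}/select?{urlencode(params)}"


def decode_params(url):
    """Return the query parameters as Solr would receive them."""
    return {k: v[0] for k, v in parse_qs(urlsplit(url).query).items()}


bonus_queries = {
    # A bare term is matched against the default search field (assumed catch-all).
    "comedy in all fields": select_url(FILMS_CORE, q="comedy"),
    # Square brackets make both ends of the range inclusive.
    "version range": select_url(
        FILMS_CORE, q=f"_version_:[{VERSION_LOW} TO {VERSION_HIGH}]"
    ),
}

for label, url in bonus_queries.items():
    print(label, "->", decode_params(url))
```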