ICP 6 - PavankumarManchala/BigDataProgrammingICPs GitHub Wiki

Submitted by:

Pavankumar Manchala

Apache Lucene:

  • Fast, high-performance, scalable search and information retrieval (IR) library
  • Open source
  • Initially developed by Doug Cutting (also the creator of Hadoop)
  • Supports indexing and searching
  • Stores documents in an inverted index
  • Provides advanced search options such as synonyms, stop words, similarity-based matching, and proximity search
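
To make the inverted index idea concrete, here is a minimal sketch in plain Python. It is not Lucene's implementation: the function names `build_inverted_index` and `search`, the sample titles, and the whitespace tokenizer are all assumptions chosen for illustration. It only shows the core mapping from terms to the documents that contain them, with optional stop-word removal.

```python
from collections import defaultdict


def build_inverted_index(docs, stop_words=frozenset()):
    """Map each lowercase term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            if term not in stop_words:
                index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return ids of documents that contain every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "A Game of Thrones",
    2: "A Clash of Kings",
    3: "The Black Company",
}
index = build_inverted_index(docs, stop_words={"a", "of", "the"})
print(search(index, "game thrones"))
```

Because lookups go from term to documents instead of scanning every document, a query touches only the posting lists of its own terms.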

Lucene Architecture:

Apache Solr:

  • Created by Yonik Seeley for CNET
  • Enterprise search platform built on Apache Lucene
  • Open source
  • Highly reliable, scalable, fault tolerant
  • Supports distributed indexing (SolrCloud), replication, and load-balanced querying

Solr Architecture:

In Class Tasks:

Datasets:

  1. Films dataset: https://github.com/apache/lucene-solr/blob/master/solr/example/films/films.csv

  2. Books dataset: https://github.com/apache/lucene-solr/blob/master/solr/example/exampledocs/books.csv

Execute any queries on the given datasets.

Queries performed on Books dataset:

  1. Query for books with a specific price

  2. Query for books whose series_t field contains "Ice"

  3. Query for books in the scifi genre

  4. Query for books with inStock set to false, sorted by price in ascending order, displaying only the series_t, author_s, and inStock fields

  5. Query for books whose name contains the word "of"
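
The five queries above can be sketched as Solr `/select` URLs. This is a hedged illustration, not the exact requests used in class: the host, the core name `books`, the field names `price`, `genre_s`, and `name` (taken from the Solr example `books.csv` header), and the price value `7.99` are assumptions. Only `series_t`, `author_s`, and `inStock` appear in the task description itself.

```python
from urllib.parse import urlencode

# Assumption: a local Solr instance with a core named "books" (hypothetical).
SOLR_SELECT = "http://localhost:8983/solr/books/select"


def select_url(q, **params):
    """Build a /select URL; extra keyword arguments become Solr parameters."""
    query = {"q": q}
    query.update(params)
    return SOLR_SELECT + "?" + urlencode(query)


book_queries = {
    # 7.99 is a placeholder price, not a value taken from the report.
    "specific price": select_url("price:7.99"),
    "Ice in series_t": select_url("series_t:Ice"),
    # Assumed field name genre_s and lowercase value "scifi".
    "scifi genre": select_url("genre_s:scifi"),
    # sort orders the results; fl restricts the returned fields.
    "out of stock by price": select_url(
        "inStock:false", sort="price asc", fl="series_t,author_s,inStock"
    ),
    "name contains of": select_url("name:of"),
}

for label, url in book_queries.items():
    print(label, "->", url)
```

The `sort` and `fl` parameters are standard Solr query parameters; `urlencode` takes care of escaping the colon, spaces, and commas so the URL can be pasted into a browser or passed to an HTTP client.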

Queries performed on Films dataset:

Copying the Films dataset:

Displaying the Films dataset:

  1. Query for films whose genre contains "Black"

  2. Query for films with a specific initial release date

  3. Query for films whose directed by field contains "Gary"
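
As a rough sketch, the three Films queries can be written as field queries. The field names `genre`, `initial_release_date`, and `directed_by` are assumed from the header of the Solr example `films.csv`, the helpers `solr_date` and `field_query` are hypothetical, and the date is a placeholder. Quoting the value matters for the date query, because an unquoted timestamp contains colons that the query parser would treat as field separators.

```python
from datetime import date


def solr_date(d):
    """Format a date as the UTC timestamp string used by Solr date fields."""
    return d.isoformat() + "T00:00:00Z"


def field_query(field, value):
    """Build field:"value", escaping backslashes and double quotes in the value."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


film_queries = [
    field_query("genre", "Black"),
    # 2000-01-01 is a placeholder date, not a value taken from the report.
    field_query("initial_release_date", solr_date(date(2000, 1, 1))),
    field_query("directed_by", "Gary"),
]

for q in film_queries:
    print(q)
```

Each string is meant to be passed as the `q` parameter of a `/select` request. Whether `genre:"Black"` matches values such as "Black comedy" depends on the field being indexed as tokenized text, which is an assumption about the schema.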

Bonus: Query to search for "comedy" across all fields

Bonus: Query to display comedy with a limited number of words

Bonus: Query to display a specific range of versions
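
A minimal sketch of two of the bonus queries follows. It assumes that "versions" refers to Solr's `_version_` field and that a catch-all field named `_text_` exists for searching all fields; both are assumptions, as are the helper name `range_query` and the numeric bounds. The range syntax itself (`[a TO b]` inclusive, `{a TO b}` exclusive, `*` for an open end) is standard Solr query syntax.

```python
def range_query(field, low=None, high=None, inclusive=True):
    """Build a Solr range query; None on either side means an open bound (*)."""
    lo = "*" if low is None else str(low)
    hi = "*" if high is None else str(high)
    open_b, close_b = ("[", "]") if inclusive else ("{", "}")
    return f"{field}:{open_b}{lo} TO {hi}{close_b}"


# Search for "comedy" in all fields, assuming a catch-all field named _text_
# is configured; df sets the default field for unqualified terms.
all_fields_params = {"q": "comedy", "df": "_text_"}

# Placeholder bounds; real _version_ values are large integers assigned by Solr.
version_query = range_query("_version_", 100, 200)

print(all_fields_params)
print(version_query)
```

An open-ended call such as `range_query("_version_", low=100)` produces `_version_:[100 TO *]`, which selects everything at or above the lower bound.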