ICP 06 : Solr and Lucene - acikgozmehmet/BigDataProgramming GitHub Wiki

ICP 6: Parallel Indexing : Solr and Lucene

Overview

Solr is highly reliable, scalable, and fault-tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration, and more. Solr powers the search and navigation features of many of the world's largest internet sites.

Objectives

Film Dataset

Dataset: https://github.com/apache/lucene-solr/blob/master/solr/example/films/films.csv

ICP question: Execute any 5 queries on the given dataset

Books Dataset

Dataset: https://github.com/apache/lucene-solr/blob/master/solr/example/exampledocs/books.csv

ICP question: Execute any 5 queries on the given dataset

1. Film Dataset

solrctl instancedir --generate /tmp/films
gedit /tmp/films/conf/schema.xml&

Add the following fields to the schema.

<field name="id" type="string" indexed="true" stored="true" required="true" />

<field name="genre" type="text_general" indexed="true" stored="true" />

<field name="directed_by" type="text_general" indexed="true" stored="true" />

<field name="initial_release_date" type="text_general" indexed="true" stored="true" />

<field name="type" type="text_general" indexed="true" stored="true" />

solrctl instancedir --create films /tmp/films
solrctl collection --create films

How to load data using Solr:

Click on Solr in the browser.
Select the collection from the dropdown menu on the left side of the page.
On the next page, click the Documents tab on the left side and select CSV as the document type.
Copy and paste the contents of films.csv into the Document(s) box.
Click Submit Document.
Click Query on the left side of the page to run queries on the data.
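
The same upload can also be scripted instead of pasted into the UI. The sketch below only builds the HTTP request and does not send it. The host, port, and the helper name `build_csv_upload` are assumptions for illustration, and the single data row is a made-up placeholder, not a row from films.csv.

```python
from urllib.request import Request


def build_csv_upload(collection, csv_bytes, base="http://localhost:8983/solr"):
    """Build (but do not send) a POST that uploads CSV rows and commits them."""
    return Request(
        f"{base}/{collection}/update?commit=true",  # assumed host and port
        data=csv_bytes,
        headers={"Content-Type": "application/csv"},
        method="POST",
    )


# Placeholder payload: the films.csv header plus one made-up row.
sample = (
    b"name,directed_by,genre,type,id,initial_release_date\n"
    b"Example Film,Jane Doe,Drama,,/en/example_film,2001-01-01\n"
)
request = build_csv_upload("films", sample)
print(request.get_method(), request.full_url)
# Sending it needs a running Solr: urllib.request.urlopen(request)
```

Passing "books" as the collection name would build the equivalent request for the Books Dataset.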

Queries on Films Dataset

Query 1:

To print the movie name, director, and release date of films whose genre is Bollywood, limited to the top 200 records and sorted in ascending order of movie name.
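
One way this query might be expressed is sketched below. The endpoint in `SELECT_URL` is an assumption, and so is reading "top 200 records" as `rows=200`; the field names follow the columns of films.csv. Sorting on `name` assumes that field is single-valued in the schema.

```python
from urllib.parse import urlencode

# Assumed endpoint: a local Solr with a collection named "films".
SELECT_URL = "http://localhost:8983/solr/films/select"

params = {
    "q": "genre:Bollywood",                       # filter on genre
    "fl": "name,directed_by,initial_release_date",  # fields to return
    "sort": "name asc",                           # ascending by movie name
    "rows": 200,                                  # at most 200 results
    "wt": "json",
}
query_url = SELECT_URL + "?" + urlencode(params)
print(query_url)
```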

Query 2:

Range queries

To print the movie name, director, genre, and release date of the movies released before 2010-01-01.
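
A possible form of this range query is sketched below, with an assumed local endpoint. The open lower bound `*` and the exclusive upper bound `}` are one reading of "before". Because `initial_release_date` is declared as `text_general` above, the comparison is on indexed text terms rather than on real dates, so the results may differ from a true date range.

```python
from urllib.parse import urlencode

# Assumed endpoint: a local Solr with a collection named "films".
SELECT_URL = "http://localhost:8983/solr/films/select"

params = {
    # [* TO x} = anything up to, but not including, x
    "q": "initial_release_date:[* TO 2010-01-01}",
    "fl": "name,directed_by,genre,initial_release_date",
    "wt": "json",
}
query_url = SELECT_URL + "?" + urlencode(params)
print(query_url)
```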

Query 3:

To do a fuzzy search on the word "roam".

To print the same details for the movies whose name matches "roam" with about 75% similarity.
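
A fuzzy query along these lines might look as follows. The endpoint is assumed, and `~0.75` is a guess at how the 75% figure was written in the query. Newer Lucene versions treat the number after `~` as a maximum edit distance, where `roam~1` would be the closest equivalent for a four-letter term.

```python
from urllib.parse import urlencode

# Assumed endpoint: a local Solr with a collection named "films".
SELECT_URL = "http://localhost:8983/solr/films/select"

params = {
    "q": "name:roam~0.75",  # "~" marks the term as fuzzy
    "fl": "name,directed_by,genre,initial_release_date",
    "wt": "json",
}
query_url = SELECT_URL + "?" + urlencode(params)
print(query_url)
```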

Query 4:

Using OR operator with boosting

To print the movie name, director, genre, and release date of the movies whose genre is "comedy" or whose release date is in 2005.
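
A sketch of an OR query with boosting is given below. The boost factor `^2` on the genre clause and the wildcard `2005*` for "released in 2005" are illustrative choices, not taken from the original query, and the endpoint is assumed.

```python
from urllib.parse import urlencode

# Assumed endpoint: a local Solr with a collection named "films".
SELECT_URL = "http://localhost:8983/solr/films/select"

params = {
    # Either clause may match; genre matches are weighted twice as much.
    "q": "genre:Comedy^2 OR initial_release_date:2005*",
    "fl": "name,directed_by,genre,initial_release_date",
    "wt": "json",
}
query_url = SELECT_URL + "?" + urlencode(params)
print(query_url)
```

Boosting changes only the relevance score, so its effect shows up in the default score-based ordering of the results.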

Query 5:

Proximity search

To print the movie name, director, genre, and release date of the movies whose name contains the words "Heaven" and "Trip" within 3 positions of each other.
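
In Lucene query syntax a proximity search is a quoted phrase followed by `~` and a distance. A minimal sketch, assuming a local endpoint:

```python
from urllib.parse import urlencode

# Assumed endpoint: a local Solr with a collection named "films".
SELECT_URL = "http://localhost:8983/solr/films/select"

params = {
    # Both words must occur in "name", at most 3 positions apart.
    "q": 'name:"Heaven Trip"~3',
    "fl": "name,directed_by,genre,initial_release_date",
    "wt": "json",
}
query_url = SELECT_URL + "?" + urlencode(params)
print(query_url)
```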

2. Books Dataset

solrctl instancedir --generate /tmp/books
gedit /tmp/books/conf/schema.xml&

Add the following fields to the schema.

<field name="series_t" type="text_general" indexed="true" stored="true"/>

<field name="sequence_i" type="text_general" indexed="true" stored="true"/>

<field name="genre_s" type="text_general" indexed="true" stored="true"/>

solrctl instancedir --create books /tmp/books
solrctl collection --create books

How to load data using Solr:

Please check out the previous section on Films Dataset.

Queries on Books Dataset

Query 1:

Range query with AND operator

To print id, author, and price of the books whose genre is "fantasy" AND whose price is between 5 and 7, in ascending order.
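
This combination might be written as below. The endpoint is assumed, and since the text does not say which field the ascending order applies to, sorting by `price` is a guess. The field names follow the columns of books.csv.

```python
from urllib.parse import urlencode

# Assumed endpoint: a local Solr with a collection named "books".
SELECT_URL = "http://localhost:8983/solr/books/select"

params = {
    # Square brackets make both ends of the range inclusive.
    "q": "genre_s:fantasy AND price:[5 TO 7]",
    "fl": "id,author,price",
    "sort": "price asc",  # assumed sort field
    "wt": "json",
}
query_url = SELECT_URL + "?" + urlencode(params)
print(query_url)
```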

Query 2:

Query with AND operator

To print id, author, and price of the books whose price is higher than 6 AND which are not in stock, in ascending order.
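
A sketch of this query, under the assumptions that the stock flag is the `inStock` column of books.csv, that "higher than 6" means an exclusive lower bound, and that the ascending order is by `price`:

```python
from urllib.parse import urlencode

# Assumed endpoint: a local Solr with a collection named "books".
SELECT_URL = "http://localhost:8983/solr/books/select"

params = {
    # "{6" excludes 6 itself; "*]" leaves the upper end open.
    "q": "price:{6 TO *] AND inStock:false",
    "fl": "id,author,price",
    "sort": "price asc",  # assumed sort field
    "wt": "json",
}
query_url = SELECT_URL + "?" + urlencode(params)
print(query_url)
```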

Query 3:

Query with boosting and OR operator

To print id, author, and price of the books whose author name is "George" OR title contains the word "company" in ascending order
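
One possible form of this query is shown below. The boost `^2` on the author clause is an illustrative value, "title" is taken to mean the `name` column of books.csv, and the sort field and endpoint are assumptions.

```python
from urllib.parse import urlencode

# Assumed endpoint: a local Solr with a collection named "books".
SELECT_URL = "http://localhost:8983/solr/books/select"

params = {
    # Author matches count twice as much toward the score as title matches.
    "q": "author:George^2 OR name:company",
    "fl": "id,author,price",
    "sort": "price asc",  # assumed sort field
    "wt": "json",
}
query_url = SELECT_URL + "?" + urlencode(params)
print(query_url)
```

With an explicit `sort`, the boost no longer affects the order of the results; dropping `sort` would show the score-based ranking instead.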

Query 4:

Proximity Query:

To print id, author, and price of the books whose series contains the words "Song" and "Fire" within a proximity of 10, in ascending order.
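
The proximity condition could be sketched as follows, searching the `series_t` field added to the schema above. The endpoint and the choice of `price` as the sort field are assumptions.

```python
from urllib.parse import urlencode

# Assumed endpoint: a local Solr with a collection named "books".
SELECT_URL = "http://localhost:8983/solr/books/select"

params = {
    # Both words must occur in "series_t", at most 10 positions apart.
    "q": 'series_t:"Song Fire"~10',
    "fl": "id,author,price",
    "sort": "price asc",  # assumed sort field
    "wt": "json",
}
query_url = SELECT_URL + "?" + urlencode(params)
print(query_url)
```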

Query 5:

Fuzzy Query:

To print name, price and author of the books whose name contains at least 80 % of the letters of the word 'Georeg'
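
A fuzzy query of this kind might look as below. Searching the `author` field is an assumption (the description could also mean the book's `name`), as are the endpoint and the spelling of the 80% threshold as `~0.8`.

```python
from urllib.parse import urlencode

# Assumed endpoint: a local Solr with a collection named "books".
SELECT_URL = "http://localhost:8983/solr/books/select"

params = {
    "q": "author:Georeg~0.8",  # misspelled on purpose; "~" makes it fuzzy
    "fl": "name,price,author",
    "wt": "json",
}
query_url = SELECT_URL + "?" + urlencode(params)
print(query_url)
```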
