ICP 06 : Solr and Lucene - acikgozmehmet/BigDataProgramming GitHub Wiki
ICP 6: Parallel Indexing : Solr and Lucene
Overview
Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more. Solr powers the search and navigation features of many of the world's largest internet sites.
Objectives
- Film Dataset
  - Dataset: https://github.com/apache/lucene-solr/blob/master/solr/example/films/films.csv
  - ICP question: Execute any 5 queries on the given dataset
- Books Dataset
  - Dataset: https://github.com/apache/lucene-solr/blob/master/solr/example/exampledocs/books.csv
  - ICP question: Execute any 5 queries on the given dataset
1. Film Dataset
- solrctl instancedir --generate /tmp/films
- gedit /tmp/films/conf/schema.xml&
Add the following field definitions to the schema.
`<field name="id" type="string" indexed="true" stored="true" required="true" />`
`<field name="genre" type="text_general" indexed="true" stored="true" />`
`<field name="directed_by" type="text_general" indexed="true" stored="true" />`
`<field name="initial_release_date" type="text_general" indexed="true" stored="true" />`
`<field name="type" type="text_general" indexed="true" stored="true" />`
- solrctl instancedir --create films /tmp/films
- solrctl collection --create films
How to load data using Solr:
- Open the Solr Admin UI in the browser.
- Select the collection (here, films) from the dropdown menu on the left side of the page.
- On the next page, click the Documents tab on the left side and select CSV as the document type.
- Copy the contents of the films CSV file and paste them into the Document(s) box.
- Click Submit Document.
- Select the Query tab on the left side of the page to run queries on the data.
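The same upload can also be scripted instead of pasted into the Admin UI. The following is a minimal sketch, not the method used in this ICP: it assumes Solr is reachable at `http://localhost:8983/solr` and that the collection's `/update` handler accepts CSV by content type. The helper name `build_csv_update` and the sample row are hypothetical.

```python
from urllib.request import Request

# Assumed base URL of the Solr server; adjust to the local setup.
SOLR_BASE = "http://localhost:8983/solr"


def build_csv_update(collection: str, csv_bytes: bytes) -> Request:
    """Prepare (but do not send) a POST that indexes CSV rows and commits."""
    url = f"{SOLR_BASE}/{collection}/update?commit=true"
    headers = {"Content-Type": "text/csv; charset=utf-8"}
    return Request(url, data=csv_bytes, headers=headers, method="POST")


# Placeholder payload; in practice this would be the bytes of films.csv.
sample = b"id,name\nexample-1,Example Film\n"
request = build_csv_update("films", sample)
print(request.get_method(), request.full_url)
```

Passing `request` to `urllib.request.urlopen` would send it; the sketch stops short of that so it can be inspected without a running server.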
Queries on Films Dataset
Query 1:
To print the movie name, director, and release date of movies whose genre is Bollywood, limited to the first 200 records and sorted by movie name in ascending order.
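One way this query could be expressed as request parameters is sketched below. It assumes a local Solr at port 8983, a `name` field holding the movie title (it is not among the fields added to the schema above), and that "top 200 records" maps to the `rows` limit. Sorting on `name` only works if that field is sortable in the schema in use.

```python
from urllib.parse import urlencode

# Assumed endpoint of the films collection.
SELECT_URL = "http://localhost:8983/solr/films/select"

params = {
    "q": "genre:Bollywood",                         # filter on genre
    "fl": "name,directed_by,initial_release_date",  # fields to return
    "sort": "name asc",                             # ascending by movie name
    "rows": 200,                                    # at most 200 results
    "wt": "json",
}
url = f"{SELECT_URL}?{urlencode(params)}"
print(url)
```

The `q`, `fl`, `sort`, and `rows` values can equally be typed into the matching boxes of the Admin UI query form.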
Query 2:
Range query
To print the movie name, director, genre, and release date of movies released before 2010-01-01.
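A range clause for this query might look like the sketch below. The helper `before` is hypothetical, and the `name` field and local URL are assumptions. Because the schema above declares `initial_release_date` as `text_general`, Solr compares the bounds as strings rather than as dates, which is a caveat of this sketch.

```python
from urllib.parse import urlencode


def before(field: str, upper: str) -> str:
    """Range clause with an open lower bound and an exclusive upper bound."""
    return f"{field}:[* TO {upper}}}"


params = {
    "q": before("initial_release_date", "2010-01-01"),
    "fl": "name,directed_by,genre,initial_release_date",
    "wt": "json",
}
# Assumed endpoint of the films collection.
url = "http://localhost:8983/solr/films/select?" + urlencode(params)
print(params["q"])
```

Square brackets make a bound inclusive and curly braces make it exclusive, so the closing `}` keeps 2010-01-01 itself out of the results.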
Query 3:
To do a fuzzy search on the word "roam".
To print the same details for movies whose names match "roam" with about 75% similarity.
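The fuzzy search can be sketched as follows. This is an illustration under assumptions: the title is taken to live in a `name` field, and 75% similarity on a four-letter word is approximated as one allowed edit (`roam~1`). Older Lucene syntax accepted a fractional similarity such as `roam~0.75` instead. The helper `fuzzy` is hypothetical.

```python
from urllib.parse import urlencode


def fuzzy(field: str, term: str, max_edits: int) -> str:
    """Build a fuzzy clause such as name:roam~1 (Lucene allows 0 to 2 edits)."""
    if max_edits not in (0, 1, 2):
        raise ValueError("max_edits must be 0, 1, or 2")
    return f"{field}:{term}~{max_edits}"


params = {
    "q": fuzzy("name", "roam", 1),
    "fl": "name,directed_by,genre,initial_release_date",
    "wt": "json",
}
# Assumed endpoint of the films collection.
url = "http://localhost:8983/solr/films/select?" + urlencode(params)
print(params["q"])
```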
Query 4:
Using OR operator with boosting
To print the movie name, director, genre, and release date of movies whose category is "comedy" or whose release date is in 2005.
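A possible form of this boosted OR query is sketched below. The boost factor of 2 on the genre clause, the use of the `genre` field for "category", and the prefix match `2005*` for the release year are all assumptions made for illustration; the helper `boost` is hypothetical.

```python
from urllib.parse import urlencode


def boost(clause: str, weight: int) -> str:
    """Append a boost so that matches on this clause score higher."""
    return f"{clause}^{weight}"


clauses = [
    boost("genre:comedy", 2),      # boosted clause
    "initial_release_date:2005*",  # any release date starting with 2005
]
params = {
    "q": " OR ".join(clauses),
    "fl": "name,directed_by,genre,initial_release_date",
    "wt": "json",
}
# Assumed endpoint of the films collection.
url = "http://localhost:8983/solr/films/select?" + urlencode(params)
print(params["q"])
```

With the default relevance sort, documents matching the boosted clause tend to rank ahead of those matching only the unboosted one.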
Query 5:
Proximity search
To print the movie name, director, genre, and release date of movies whose names contain the words "Heaven" and "Trip" within 3 positions of each other.
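The proximity search can be written as a quoted phrase followed by `~` and the allowed distance. The sketch below assumes the title is stored in a `name` field and that Solr runs locally; the helper `proximity` is hypothetical.

```python
from urllib.parse import urlencode


def proximity(field: str, words: list, slop: int) -> str:
    """Build a proximity clause such as name:"Heaven Trip"~3."""
    phrase = " ".join(words)
    return f'{field}:"{phrase}"~{slop}'


params = {
    "q": proximity("name", ["Heaven", "Trip"], 3),
    "fl": "name,directed_by,genre,initial_release_date",
    "wt": "json",
}
# Assumed endpoint of the films collection.
url = "http://localhost:8983/solr/films/select?" + urlencode(params)
print(params["q"])
```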
2. Books Dataset
- solrctl instancedir --generate /tmp/books
- gedit /tmp/books/conf/schema.xml&
Add the following field definitions to the schema.
`<field name="series_t" type="text_general" indexed="true" stored="true"/>`
`<field name="sequence_i" type="text_general" indexed="true" stored="true"/>`
`<field name="genre_s" type="text_general" indexed="true" stored="true"/>`
- solrctl instancedir --create books /tmp/books
- solrctl collection --create books
How to load data using Solr:
Follow the same steps as in the previous section on the Films Dataset.
Queries on Books Dataset
Query 1:
Range query with AND operator
To print the id, author, and price of books whose genre is "fantasy" AND whose price is between 5 and 7, in ascending order.
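As a sketch, this query combines a term clause and an inclusive range with AND. The description does not say which field the ascending order applies to, so sorting by `price` is an assumption, as are the local URL and a numeric `price` field; the helper `between` is hypothetical.

```python
from urllib.parse import urlencode


def between(field: str, low, high) -> str:
    """Inclusive range clause such as price:[5 TO 7]."""
    return f"{field}:[{low} TO {high}]"


params = {
    "q": "genre_s:fantasy AND " + between("price", 5, 7),
    "fl": "id,author,price",
    "sort": "price asc",  # assumed sort field
    "wt": "json",
}
# Assumed endpoint of the books collection.
url = "http://localhost:8983/solr/books/select?" + urlencode(params)
print(params["q"])
```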
Query 2:
Query with AND operator
To print the id, author, and price of books whose price is higher than 6 AND that are not in stock, in ascending order.
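One hedged way to write this query uses an exclusive lower bound for "higher than 6" and a boolean clause for availability. The `inStock` field name, the numeric `price` field, and sorting by `price` are assumptions; the helper `greater_than` is hypothetical.

```python
from urllib.parse import urlencode


def greater_than(field: str, lower) -> str:
    """Range clause with an exclusive lower bound and no upper bound."""
    return f"{field}:{{{lower} TO *}}"


params = {
    "q": greater_than("price", 6) + " AND inStock:false",
    "fl": "id,author,price",
    "sort": "price asc",  # assumed sort field
    "wt": "json",
}
# Assumed endpoint of the books collection.
url = "http://localhost:8983/solr/books/select?" + urlencode(params)
print(params["q"])
```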
Query 3:
Query with boosting and OR operator
To print the id, author, and price of books whose author name is "George" OR whose title contains the word "company", in ascending order.
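The boosted OR query might be sketched as below. Which clause is boosted and by how much is not stated, so the factor of 2 on the author clause is an assumption, as are the `name` field for the title and sorting by `price`.

```python
from urllib.parse import urlencode

# Assumed boost: author matches weigh twice as much as title matches.
author_clause = "author:George^2"
title_clause = "name:company"  # "name" is assumed to hold the book title

params = {
    "q": f"{author_clause} OR {title_clause}",
    "fl": "id,author,price",
    "sort": "price asc",  # assumed sort field
    "wt": "json",
}
# Assumed endpoint of the books collection.
url = "http://localhost:8983/solr/books/select?" + urlencode(params)
print(params["q"])
```

An explicit `sort` overrides relevance ordering, so the boost only affects the order of results when the `sort` parameter is left out.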
Query 4:
Proximity query
To print the id, author, and price of books whose series contains the words "Song" and "Fire" within 10 positions of each other, in ascending order.
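A sketch of this proximity query against the `series_t` field defined above follows. Sorting by `price` and the local URL are assumptions; the helper `proximity` is hypothetical.

```python
from urllib.parse import urlencode


def proximity(field: str, words: list, slop: int) -> str:
    """Build a proximity clause such as series_t:"Song Fire"~10."""
    phrase = " ".join(words)
    return f'{field}:"{phrase}"~{slop}'


params = {
    "q": proximity("series_t", ["Song", "Fire"], 10),
    "fl": "id,author,price",
    "sort": "price asc",  # assumed sort field
    "wt": "json",
}
# Assumed endpoint of the books collection.
url = "http://localhost:8983/solr/books/select?" + urlencode(params)
print(params["q"])
```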
Query 5:
Fuzzy query
To print the name, price, and author of books whose name matches the word 'Georeg' with at least 80% similarity.
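The fuzzy query could be sketched as below. Two assumptions are made: the search runs against the `author` field, since 'Georeg' resembles an author's first name, and 80% similarity on a six-letter word is approximated as one allowed edit. Change `FIELD` if the book title in `name` is intended instead.

```python
from urllib.parse import urlencode

FIELD = "author"  # assumed target field; could be "name" instead
MAX_EDITS = 1     # assumed: roughly 80% of six letters must match

params = {
    "q": f"{FIELD}:Georeg~{MAX_EDITS}",
    "fl": "name,price,author",
    "wt": "json",
}
# Assumed endpoint of the books collection.
url = "http://localhost:8983/solr/books/select?" + urlencode(params)
print(params["q"])
```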