ICP 6 - Murarishetti-Shiva-Kumar/Big-Data-Programming GitHub Wiki

Lesson Plan 6: Parallel Indexing: Solr and Lucene

1. Film Dataset

  • Keyword matching
  • Wildcard matching
  • Proximity matching
  • Range searches
  • Fuzzy logic

Creation/generation of instance & Collection

1.Generate instance for films

"solrctl instancedir --generate /tmp/films"

image

2.Edit the schema.xml that was generated inside the instance's conf folder so that its fields match the attributes of the given dataset. The field types follow the usual conventions of a query-based system; here the overall attribute is defined as a float. Set required to true only for the unique key (reviewerID in this case).

image

"gedit /tmp/films/conf/schema.xml"

image

3.Set the unique key of the schema to the required attribute.

image

4.Upload the contents of the instance directory to ZooKeeper.

"solrctl instancedir --create films /tmp/films"

image

5.Create a new collection.

"solrctl collection --create films"

image

6.Open the Solr admin page in a web browser, select the created collection from the dropdown on the left side, and go to the documents page of the collection.

image

7.Set the document type to CSV, then paste all the data from the dataset into the documents field and submit the document.

image

image
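The setup steps above can be sketched in Python. This is only an illustration: the three solrctl command lines are the ones quoted in steps 1, 4 and 5, while the host, port 8983, the sample CSV row, and the helper names are assumptions that do not come from the page. The upload helper builds a request for Solr's update handler as an alternative to pasting the CSV into the admin page, and it does not send anything.

```python
from urllib.request import Request

# Assumed base URL; adjust to the actual Solr host and port.
SOLR_BASE = "http://localhost:8983/solr"


def solrctl_commands(name, path):
    """Return the solrctl invocations from steps 1, 4 and 5 as argument lists.

    The schema.xml edit (steps 2 and 3) happens between the first and
    second command and is not automated here.
    """
    return [
        ["solrctl", "instancedir", "--generate", path],
        ["solrctl", "instancedir", "--create", name, path],
        ["solrctl", "collection", "--create", name],
    ]


def csv_update_request(collection, csv_text):
    """Build (but do not send) a POST that indexes CSV text and commits it."""
    url = f"{SOLR_BASE}/{collection}/update?commit=true"
    return Request(
        url,
        data=csv_text.encode("utf-8"),
        headers={"Content-Type": "application/csv; charset=utf-8"},
        method="POST",
    )


for cmd in solrctl_commands("films", "/tmp/films"):
    print(" ".join(cmd))

# Hypothetical two-column sample; the real dataset has more fields.
req = csv_update_request("films", "reviewerID,overall\nA1,4.0\n")
print(req.get_method(), req.full_url)
```

Passing the request to urllib.request.urlopen would perform the upload against a running Solr instance.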

Keyword Matching

It returns a record only if the query term exactly matches the data in the field.

1.Searching for the keyword “Michael” in the dataset.

image

2.Searching a phrase in the dataset.

image

3.Searching for two different values at once using the AND operator.

image

4.Combining the AND and OR operators in one query.

image
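Since the actual queries are only visible in the screenshots, the following is a sketch of what the four keyword queries might look like when sent to the collection's select handler. The host and port, the field names reviewerName and summary, and every search value other than “Michael” are assumptions.

```python
from urllib.parse import urlencode

# Assumed endpoint of the films collection.
SELECT_URL = "http://localhost:8983/solr/films/select"


def select_url(q):
    """Return a select URL with the query string q percent-encoded."""
    return SELECT_URL + "?" + urlencode({"q": q})


# Illustrative queries; field names and most values are hypothetical.
queries = {
    "keyword": "reviewerName:Michael",
    "phrase": 'summary:"good product"',
    "and": "reviewerName:Michael AND overall:5",
    "and_or": "(reviewerName:Michael OR reviewerName:John) AND overall:5",
}

for label, q in queries.items():
    print(label, select_url(q))
```

The parentheses in the last query make the grouping of OR and AND explicit instead of relying on operator precedence.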

Wildcard matching

It returns every record whose queried field contains a value matching the wildcard pattern given in the query.

1.Searching all instances of the keyword Michael in the name attribute

image

2.Returns all the instances where the reviewText string starts with “Not” and ends with “vibration”.

image

Inference

Here it can be observed that the results of keyword matching are a subset of the results of wildcard matching.
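The inference can be illustrated outside Solr with a small sketch. It treats each field value as one untokenized string and uses Python's fnmatch as a stand-in for Solr's wildcard matching, which is a simplification. The sample names and the field names in the two query strings are assumptions.

```python
from fnmatch import fnmatchcase

# Hypothetical reviewer names; the real dataset values are not shown here.
names = ["Michael", "Michael J.", "J. Michael Smith", "Michelle"]

# Keyword matching as described above: the whole value must equal the term.
keyword_hits = {n for n in names if n == "Michael"}

# Wildcard matching: the value only has to contain the term.
wildcard_hits = {n for n in names if fnmatchcase(n, "*Michael*")}

print(sorted(keyword_hits))
print(sorted(wildcard_hits))
print(keyword_hits <= wildcard_hits)  # subset check

# Possible Solr forms of the two wildcard searches (field names assumed).
contains_michael = "reviewerName:*Michael*"
starts_and_ends = "reviewText:Not*vibration"
```

Every exact match also satisfies the pattern with a wildcard on both sides, which is why the first set can never contain a value that the second set lacks.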

Proximity matching

It matches records in which the keywords of the query occur in the queried attribute within the given proximity (distance in words) of each other.

1.Returns all records whose summary contains “good” within a proximity of 4 words.

image
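A proximity search is written in Lucene syntax as a quoted phrase followed by a tilde and a distance. The sketch below builds such a query and adds a rough Python approximation of the idea. The second word “product” and the helper names are assumptions, and the approximation simply counts the words between the two terms, which is simpler than Lucene's actual slop calculation.

```python
def proximity_query(field, phrase, distance):
    """Return a Lucene-style proximity query such as field:"a b"~4."""
    return f'{field}:"{phrase}"~{distance}'


def within(text, first, second, distance):
    """Roughly check that at most `distance` words separate the two terms."""
    words = text.lower().split()
    pos_first = [i for i, w in enumerate(words) if w == first]
    pos_second = [i for i, w in enumerate(words) if w == second]
    return any(
        abs(i - j) - 1 <= distance for i in pos_first for j in pos_second
    )


# "product" is a hypothetical second term; the page only mentions "good".
print(proximity_query("summary", "good product", 4))
print(within("good sturdy little product", "good", "product", 4))
print(within("good a b c d e product", "good", "product", 4))
```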

Boost

1.It gives more weight to the parts of the query that carry a boosting factor. The boosting factor used in this case is 1.5, but it is adjustable and could also be 2, 3, 5, etc.

image
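In Lucene query syntax a boost is written as a caret and a factor after a clause. As a minimal sketch, assuming the fields summary and reviewText and the term “good” (none of which are confirmed by the page), a boosted query could be assembled like this:

```python
def boosted(clause, factor):
    """Append a Lucene boost factor to a query clause."""
    return f"{clause}^{factor}"


# Matches in summary count 1.5 times as much as matches in reviewText.
query = boosted("summary:good", 1.5) + " OR reviewText:good"
print(query)
```

Changing the factor affects only the ranking of the matching records, not which records match.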

Range searches

It returns the results whose attribute value falls within the specified range.

1.Finds all the data whose overall rating is between 3 and 4, inclusive.

image

2.Finds all the data whose overall rating is less than or equal to 4

image

3.Finds all the data whose overall rating is greater than or equal to 4

image

4.Finds all the data whose overall rating is not equal to 5

image

5.Returns all the data that contains the overall field

image
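The five searches above map onto Solr's range syntax, in which square brackets denote inclusive bounds and an asterisk an open bound. The query strings below are a sketch of how they might be written; the exact queries used are only visible in the screenshots, and the helper name is made up for illustration.

```python
def closed_range(field, low, high):
    """Return an inclusive range query; "*" leaves a bound open."""
    return f"{field}:[{low} TO {high}]"


range_queries = {
    "between 3 and 4 inclusive": closed_range("overall", 3, 4),
    "less than or equal to 4": closed_range("overall", "*", 4),
    "greater than or equal to 4": closed_range("overall", 4, "*"),
    # Exclusion: match everything, then remove records rated 5.
    "not equal to 5": "*:* -overall:5",
    "has the overall field": closed_range("overall", "*", "*"),
}

for description, q in range_queries.items():
    print(f"{description}: {q}")
```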

Fuzzy logic

It returns the results whose attribute value approximately matches the term given in the query.

1.It gives all the results with a reviewer name similar to “daze”.

image
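Fuzzy matching is based on edit distance: a term followed by a tilde matches indexed terms that are a small number of single-character edits away (Lucene allows at most 2). The sketch below uses plain Levenshtein distance as an approximation of that measure. The candidate names and the field name reviewerName are assumptions.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (row-by-row DP)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = cur
    return prev[-1]


# Hypothetical indexed terms to compare against the query term "daze".
candidates = ["daze", "dave", "dazed", "days", "michael"]
fuzzy_hits = [c for c in candidates if edit_distance("daze", c) <= 2]

print("reviewerName:daze~")  # possible Solr form of the search
print(fuzzy_hits)
```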

2. Books Dataset

  • Execute any 5 queries on the given dataset

1.Generate instance for Books

"solrctl instancedir --generate /tmp/books"

image

2.Edit the schema.xml that was generated inside the instance's conf folder so that its fields match the attributes of the given dataset. The field types follow the usual conventions of a query-based system. Set required to true only for the unique key (id in this case).

image

"gedit /tmp/books/conf/schema.xml"

image

  • Here pricebook is defined as a float and bookinStock as a Boolean.

3.Set the unique key of the schema to the required attribute.

image

4.Upload the contents of the instance directory to ZooKeeper.

"solrctl instancedir --create books /tmp/books"

image

5.Create a new collection.

"solrctl collection --create books"

image

6.Open the Solr admin page in a web browser, select the created collection from the dropdown on the left side, and go to the documents page of the collection.

image

7.Set the document type to CSV, then paste all the data from the dataset into the documents field and submit the document.

image

Sort

It orders the results by the attribute named in the sort parameter, in the direction (asc or desc) given there.

1.Returns all the records with category book, ordered by pricebook in descending order: q => catbook:"book" ; sort => pricebook desc

image
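As a sketch, the q and sort values quoted above can be combined into a single request URL for the select handler. The host and port are assumptions; the field names catbook and pricebook come from the page.

```python
from urllib.parse import urlencode

# Assumed endpoint of the books collection.
SELECT_URL = "http://localhost:8983/solr/books/select"

params = {
    "q": 'catbook:"book"',     # records in the book category
    "sort": "pricebook desc",  # highest price first
}

sorted_url = SELECT_URL + "?" + urlencode(params)
print(sorted_url)
```

Replacing desc with asc would return the cheapest books first.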

Filter

It narrows the results of the main query with an additional condition, which helps in combining queries.

1.Returns the data with catbook matching “book” and pricebook equal to 7.99

image
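Assuming the filter was entered as Solr's fq (filter query) parameter, the request might be built as follows. The host and port are assumptions, and the split of the two conditions between q and fq is a guess, since the screenshot is the only record of the actual query.

```python
from urllib.parse import urlencode

# Assumed endpoint of the books collection.
SELECT_URL = "http://localhost:8983/solr/books/select"

params = {
    "q": "catbook:book",     # main query
    "fq": "pricebook:7.99",  # filter applied on top of the main query
}

filtered_url = SELECT_URL + "?" + urlencode(params)
print(filtered_url)
```

A filter query restricts which records are returned without contributing to their relevance score.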

Fuzzy Logic

1.Returns the data that matches the genre_s: fantasy

image

2.Identifies if there are any books in stock

image

3.Books with author name george

image

4.series_t exactly matches "A Song of Ice and Fire"

image

Boost

1.Books whose genre is “scifi” or “fantasy”, with a boosting factor, and whose pricebook is 7 or 8.

image

2.Books with price 5.99 and genre fantasy

image

Range

1.Books with a price from 3 to 8

image

The schemas and commands can be found in file: https://github.com/Murarishetti-Shiva-Kumar/Big-Data-Programming/blob/main/ICP%206/Commands.txt
