Big_Data_Programming_ICP_6 - kusamdinesh/Big-Data-and-Hadoop GitHub Wiki

Title: Lucene and Solr

Solr is built on top of Lucene. In this exercise we upload documents to Solr and run queries against them.

1. Create the directory and collection for Solr from the terminal:

Upload the data to Apache Solr:

In order to upload the documents, we first need to add the fields present in the dataset to the schema.
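As a rough sketch of this step, the snippet below builds (but does not send) a Solr Schema API request that adds one field. The host and port, the collection name `films`, and the field `name` of type `text_general` are assumptions chosen for illustration; the original does not list the exact fields that were added.

```python
import json
from urllib.request import Request

# Assumptions (not from the original write-up): a local Solr on port 8983
# and a collection called "films".
SOLR_URL = "http://localhost:8983/solr"
COLLECTION = "films"


def add_field_request(name, field_type, multi_valued=False, stored=True):
    """Build (without sending) a Schema API request that adds one field."""
    payload = {
        "add-field": {
            "name": name,
            "type": field_type,
            "multiValued": multi_valued,
            "stored": stored,
        }
    }
    return Request(
        url=f"{SOLR_URL}/{COLLECTION}/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = add_field_request("name", "text_general")
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# To apply the change against a running Solr: urllib.request.urlopen(req)
```

The same change can also be made through the Schema screen of the Solr Admin UI.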

Once the documents are uploaded to Solr, we can run queries against them, such as searching for particular words. Here we perform a simple search on the films dataset for the keyword "Park Avenue".
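Outside the Admin UI, the same search can be expressed as a request to the collection's `/select` handler. The sketch below only builds the URL; the host, port and collection name `films` are assumptions, and without a field prefix the query is run against whatever default search field the collection is configured with.

```python
from urllib.parse import urlencode


def select_url(query, collection="films", base="http://localhost:8983/solr"):
    """Return a /select URL for the given query string (nothing is sent)."""
    return f"{base}/{collection}/select?{urlencode({'q': query})}"


# The double quotes make this a phrase search for the two words together.
url = select_url('"Park Avenue"')
print(url)
```

Prefixing the query with a field name, for example `name:"Park Avenue"`, would restrict the search to that field (here `name` is a hypothetical field name).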

Next we perform a range query on the films dataset. This time, instead of a numeric range, we use a range of words (string values).

Wildcard Searches: Solr’s standard query parser supports single and multiple character wildcard searches within single terms. Wildcard characters can be applied to single terms, but not to search phrases.

Here we search for Ada*ion. The result contains the films whose names match the starting letters "Ada" and the ending letters "ion" of the search query.

Single-character wildcard search: the '?' character matches exactly one character, so the search string te?t would match both test and text.
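To make the two wildcard rules concrete, here is a small sketch that mimics them with Python's `fnmatch` module. It illustrates only the matching rule on single terms, not how Solr evaluates wildcards internally, and the sample terms are made up for the example. Note that `fnmatch` also gives `[...]` a special meaning, which Solr's wildcard syntax does not.

```python
from fnmatch import fnmatchcase


def wildcard_match(term, pattern):
    """True if `term` matches `pattern`, where * is any run of characters
    (including none) and ? is exactly one character."""
    return fnmatchcase(term, pattern)


# Hypothetical sample terms, not taken from the films dataset.
for term in ["test", "text", "tempest"]:
    print(term, wildcard_match(term, "te?t"))

for term in ["Adaptation", "Adaptations", "Adoration"]:
    print(term, wildcard_match(term, "Ada*ion"))
```

In this sketch `te?t` accepts `test` and `text` but rejects `tempest`, and `Ada*ion` accepts `Adaptation` but rejects the other two terms.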

Fuzzy search

2. Now we use the books dataset and upload its documents to Solr.

Pattern match:

Performing a Boolean NOT query on the books dataset, which requires that the term following NOT not be present:

Performing a Boolean OR query on the books dataset, which requires that either term (or both) be present in a matching document:
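The effect of the two operators can be sketched with plain Python sets. The titles and search terms below are hypothetical stand-ins, not the actual contents of the books dataset, and the query strings in the comments assume a field called `name`.

```python
# Hypothetical mini-collection: document id -> title.
docs = {
    1: "game of thrones",
    2: "clash of kings",
    3: "game of kings",
    4: "storm of swords",
}


def containing(term):
    """Ids of the documents whose title contains `term` as a whole word."""
    return {doc_id for doc_id, title in docs.items() if term in title.split()}


# name:game OR name:kings  -> either term, or both, may be present
or_hits = containing("game") | containing("kings")

# name:game NOT name:kings -> "game" must be present, "kings" must not
not_hits = containing("game") - containing("kings")

print(sorted(or_hits))
print(sorted(not_hits))
```

Document 3 contains both terms, so it is kept by the OR query and excluded by the NOT query.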

Fuzzy Searches: Solr’s standard query parser supports fuzzy searches based on the Damerau-Levenshtein Distance or Edit Distance algorithm. Fuzzy searches discover terms that are similar to a specified term without necessarily being an exact match. To perform a fuzzy search, use the tilde ~ symbol at the end of a single-word term. For example, to search for a term similar in spelling to "roam," use the fuzzy search:

roam~

This search will match terms like roams, foam, and foams. It will also match the word "roam" itself.

An optional distance parameter specifies the maximum number of edits allowed, between 0 and 2, defaulting to 2. For example:

roam~1

This will match terms like roams and foam, but not foams, since it has an edit distance of 2.
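The distances quoted above can be checked with a short edit-distance routine. The sketch below uses the optimal string alignment variant of Damerau-Levenshtein distance (insertions, deletions, substitutions and adjacent transpositions each cost one edit). It is meant to illustrate the idea, not to reproduce Lucene's automaton-based implementation, and the function names are made up for this example.

```python
def edit_distance(a, b):
    """Optimal string alignment distance between strings a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # adjacent transposition, e.g. "raom" -> "roam"
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def fuzzy_matches(term, candidates, max_edits=2):
    """Candidates within max_edits of term, mimicking term~max_edits."""
    return [c for c in candidates if edit_distance(term, c) <= max_edits]


terms = ["roam", "roams", "foam", "foams"]
print(fuzzy_matches("roam", terms))               # like roam~
print(fuzzy_matches("roam", terms, max_edits=1))  # like roam~1
```

With `max_edits=1` the term `foams` drops out, because it needs one substitution and one insertion to become `roam`.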

To demonstrate this query, we edited a movie name in the dataset from "Game of Kings" to "Fame of Kings".

When we search for fame with distance 1 (fame~1), it returns the following output:

Proximity Searches: A proximity search looks for terms that are within a specific distance from one another.
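In the standard query parser, a proximity search is written as a quoted phrase followed by a tilde and the maximum distance. The helper below is a minimal sketch that only assembles such a query string; the function name, the field `name` and the example terms are hypothetical.

```python
def proximity_query(terms, distance, field=None):
    """Build a query of the form "term1 term2"~N, optionally field-qualified."""
    phrase = " ".join(terms)
    query = f'"{phrase}"~{distance}'
    return f"{field}:{query}" if field else query


print(proximity_query(["game", "kings"], 3))
print(proximity_query(["game", "kings"], 3, field="name"))
```

The resulting strings can be pasted into the `q` box of the Solr Admin UI query screen.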

Range Searches: A range search specifies a range of values for a field (a range with an upper bound and a lower bound). The query matches documents whose values for the specified field or fields fall within the range. Range queries can be inclusive or exclusive of the upper and lower bounds. Sorting is done lexicographically, except on numeric fields.

Here we use the ID field as the search criterion. The search retrieves the documents whose IDs fall in the range 0380014300 TO 0553573403.
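The following sketch assembles the range query for these two bounds and mimics the lexicographic comparison with Python string comparison. It assumes the field is named `id` and holds string values; the helper names and the test values other than the two bounds are made up for illustration.

```python
def range_query(field, lower, upper, inclusive=True):
    """Build field:[lower TO upper] (inclusive) or field:{lower TO upper}."""
    open_b, close_b = ("[", "]") if inclusive else ("{", "}")
    return f"{field}:{open_b}{lower} TO {upper}{close_b}"


def in_string_range(value, lower, upper, inclusive=True):
    """Lexicographic range check, as applies to non-numeric fields."""
    if inclusive:
        return lower <= value <= upper
    return lower < value < upper


LOWER, UPPER = "0380014300", "0553573403"
print(range_query("id", LOWER, UPPER))

# Hypothetical ids: the first falls inside the range, the second does not.
print(in_string_range("0441385532", LOWER, UPPER))
print(in_string_range("0812521390", LOWER, UPPER))

# Strings compare character by character, so "10" sorts before "9".
print("10" < "9")
```

Because the comparison is character by character, zero-padded ids of equal length behave like numbers, while unpadded values such as `10` and `9` do not.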