
PROJECT BASED EXAM -- SOLR USE CASE


TEAM MEMBERS AND CONTRIBUTION


OBJECTIVE:

Implementation of queries in Apache Solr, which is built on Apache Lucene.

INTRODUCTION:

  • Solr is a search platform built on Apache Lucene, which makes it highly compatible and highly scalable.
  • Solr indexes and stores the documents.
  • As soon as we search for a particular keyword, the query hits Solr, which returns every relevant result found in the indexed documents.

IDEA OF THE PROJECT

The idea of the project is to apply the concepts learnt so far in Big-Data-Programming to a real-time use case. Hadoop is an open-source framework used to store immense amounts of data and to run numerous applications on clusters of commodity hardware. In this project we apply concepts such as MapReduce, Hive and Solr to work on that massive data and make these tasks simpler.


USAGE OF PROJECT IN REAL-TIME

Apache Solr is one such open-source platform, providing full-text search and index replication. It is built on Apache Lucene, which makes it scalable. It works across multiple sites because it can index all of the data, search the information held in several websites, and return every relevant recommendation for the searched content. Solr helps you effortlessly create search engines that search websites, databases and documents, so it is an effective way to work on large data in real time.


IMPLEMENTATION:

  • Open a new terminal in Cloudera.

  • Create schema configuration for the selected dataset.

    solrctl instancedir --generate /tmp/hero4

    gedit /tmp/hero4/conf/schema.xml

  • A schema page pops up, which has to be edited. Replace the default fields with the fields (columns) present in the dataset and set the unique id to the first id column of the dataset, as sketched below.
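
    For reference, the edited portion of schema.xml might look roughly like the sketch below. The field names follow the dataset columns used in the later queries; the types and the choice of NAME_I as the unique key are assumptions rather than the exact configuration:

    <!-- Hypothetical excerpt of /tmp/hero4/conf/schema.xml -->
    <field name="NAME_I" type="string" indexed="true" stored="true"/>
    <field name="Eye_color" type="string" indexed="true" stored="true"/>
    <field name="Hair_color" type="string" indexed="true" stored="true"/>
    <field name="Publisher_I" type="string" indexed="true" stored="true"/>
    <field name="Weight_I" type="float" indexed="true" stored="true"/>

    <uniqueKey>NAME_I</uniqueKey>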

  • Now we have to create a collection for this dataset in Solr. Type the following commands, which push the dataset configuration to Solr (a quick verification step is sketched after them):

    Create instance directory:

    solrctl instancedir --create hero4 /tmp/hero4

    solrctl collection --create hero4
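
    Once created, the instance directory and collection can be verified from the same terminal (a quick sanity check, assuming the listing options of Cloudera's solrctl are available):

    solrctl instancedir --list

    solrctl collection --list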

  • Similarly, apply the same steps to another dataset if required. The commands are as follows:

    Create schema configuration:

    solrctl instancedir --generate /tmp/sh

    gedit /tmp/sh/conf/schema.xml

    Create instance directory:

    solrctl instancedir --create sh /tmp/sh

    solrctl collection --create sh

  • Now go to the Solr interface and select the collection that was created.

  • Click on Documents --> change the document type to CSV --> copy the entire data in the dataset and paste it under the document(s) field --> submit the document. (An equivalent command-line import is sketched after this list.)

  • If the document import is successful, you can observe a success response after submitting the document.

  • Now you are ready to run the queries.
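
As an alternative to pasting the CSV into the Documents tab, the same data could be imported from the terminal by posting the file to the collection's update handler. This is only a sketch: the host and port assume a default local Solr instance and superheroes.csv is a placeholder file name.

    curl 'http://localhost:8983/solr/hero4/update?commit=true' \
         -H 'Content-Type: application/csv' \
         --data-binary @superheroes.csv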

QUERIES:

Query 1:

* **Keyword matching**: Searches for a particular term in a given field and displays the matching data from the entire dataset.

  Eye_color: blue
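
  Each query shown here is typed into the q box of the Solr query UI. Equivalently, it can be issued directly against the collection's /select handler; the host and port below are assumptions for a default local setup:

  http://localhost:8983/solr/hero4/select?q=Eye_color:blue&wt=json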

Query 2:

* **AND**: Requires both phrases to match.

  Hair_color: "No Hair" AND Publisher_I: "DC Comics"

  Searches for 'No Hair' in the Hair_color index and 'DC Comics' in the Publisher_I index simultaneously.

Query 3:

* **OR**: Searches for either of the phrases.

  Hair_color:"No Hair" AND Publisher_I:"DC Comics" OR Eye_color:blue

  Searches for 'No Hair' in the Hair_color index and 'DC Comics' in the Publisher_I index, OR 'blue' in the Eye_color index.

Query 4:

* **NOT**: Searches for a keyword in an index with a NOT restriction.

  NAME_I:Angel -Hair_color:black

  Searches for 'Angel' in the NAME_I index, excluding rows where Hair_color is 'black'.

Query 5:

* **Wildcard**: Matches any term that starts with the given prefix.

  Eye_color:yellow*

  Searches for any word starting with 'yellow' in the Eye_color index.

Query 6:

* **Proximity**: Searches for terms that occur within a given distance of each other.

  Hair_color:"Blond"~1

  Searches Hair_color for the phrase 'Blond' with a proximity (slop) of 1 word.

Query 7:

* **Boolean multiple clauses:** A Boolean query combines multiple clauses into one query.

  (+agility_I:TRUE +agility_I:(FALSE faceting)) OR (+Lantern_Power_Ring:TRUE +Lantern_Power_Ring:(TRUE faceting))

  With the OR operation, each parenthesised clause is optional; a record matches if either clause matches.

Query 8:

* **Boosted**: Raises the importance of a particular search term. It prioritizes the matched content according to the boosting factor.

  Dimensional_Awareness:(TRUE^10 TRUE)

  It boosts the term TRUE by a factor of 10 in Dimensional_Awareness; the '^' operator sets the boost factor.

Query 9:

* **Range**: Matches field values that fall within the specified lower and upper limits.

  Weight_I: [95.0 TO *]

  The above query prints the records whose Weight_I is 95.0 or above.
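
  If this range query is issued as a URL rather than through the query UI, the brackets and spaces must be URL-encoded; a sketch, with the host and port assumed as before:

  http://localhost:8983/solr/hero4/select?q=Weight_I:%5B95.0%20TO%20*%5D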

Query 10:

* **Facet**: Determines unique terms in a specified field.

  &facet=true
  &facet.field=Skin_color

  Returns the count of documents for each unique value of the Skin_color field.

  &facet=true
  &facet.field=Height_I

  Returns the count of documents for each unique value of the Height_I field.
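
  Putting the pieces together, a complete faceting request against the collection might look like the following sketch (host and port assumed as before; q=*:* matches all documents):

  http://localhost:8983/solr/hero4/select?q=*:*&facet=true&facet.field=Skin_color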

  • Time of execution:

Since only one dataset is queried at a time, execution takes very little time. The execution time mainly depends on how many index entries each query has to match and return.

| Query | Execution time |
|-------|----------------|
| Query 1 | 1.83 ms |
| Query 2 | 2.45 ms |
| Query 3 | 3.02 ms |
| Query 4 | 2.55 ms |
| Query 5 | 2.85 ms |
| Query 6 | 3.38 ms |
| Query 7 | 4.61 ms |
| Query 8 | 4.26 ms |
| Query 9 | 4.91 ms |
| Query 10 | 5.61 ms |

Execution of all these queries completes in well under a minute, so this can be an effective word-search method.


CHALLENGES FACED:

I faced almost no challenges during execution; the entire procedure went smoothly without obstacles. The only issue I faced was loading the datasets into the Solr interface. Although the issue is not very problematic, it costs time if the schema is not edited properly. Paying close attention while configuring the schema for a large dataset helps.


MILESTONES AND INTEGRATION OF THE PROJECT

As each member's share of the work had no dependencies on the others', the work proceeded smoothly without any issues and we successfully accomplished the tasks we took on.


Video Link:

https://www.youtube.com/watch?v=3hVNnasr6UQ