ICP Assignment 6 - MadhuriSarode/BDP GitHub Wiki

Student ID : 24 : Madhuri Sarode

Student ID : 4 : Bhargavi

Student ID : 16 : Bhavana

Parallel Indexing : Solr and Lucene

Solr is an open-source enterprise search platform, written in Java, from the Apache Lucene project. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features, and rich document (e.g., Word, PDF) handling. It uses the Lucene Java search library at its core for full-text indexing and search.

The Solr Admin UI has the following options:

1) Dashboard - Gives general system information.
2) Logging - Displays messages from the Solr log files. The Level option lists the classpaths and class names of the running instance, and the logging level of each class can be set anywhere from FINEST to OFF.
3) Core Admin - Helps manage cores.
4) Java Properties - Shows the properties of the JVM running the core.
5) Thread Dump - Shows the threads currently active on the server, along with each thread's state (such as waiting or runnable).

The core selector shows details about the selected core. The following options appear below it:

1) Overview - Shows the statistics and the instance information.
2) Analysis - This screen lets you inspect how data will be handled at index time or query time, according to the field, field type and dynamic rule configurations found in schema.xml.
3) Data Import - This screen shows the Data Import Handler, which imports data from XML files or databases.
4) Documents - This screen lets you execute Solr indexing commands directly from the browser. Documents can also be constructed by defining their fields.
5) Files - Shows the existing Solr configuration files.
6) Plugins/Stats - Shows the statistics for plugins and installed components.
7) Query - Structured queries can be executed and their results analysed.

A core is a running instance of a Lucene index together with all of the Solr configuration needed to use it. Each core is a complete Lucene index with its own schema and configuration.

Query screen fields

1. Request handler - Specifies the request handler (the standard one is select)
1. q - The query
1. fq - Filter query
1. sort - Sorts the results in ascending or descending order
1. start, rows - start is the offset of the first returned result; rows controls how many results are returned
1. fl - Specifies which fields are returned in the response
1. wt - Response writer (xml, json, etc.)
1. debugQuery - Adds debugging information to the response
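The fields on the Query screen end up as parameters in the query URL. The following is a minimal sketch of that mapping, assuming a Solr instance on localhost at the default port 8983 and a core named `films`; the helper `build_select_url` and the parameter values are illustrative, not taken from the assignment.

```python
from urllib.parse import urlencode


def build_select_url(core, params, host="localhost", port=8983, handler="select"):
    """Assemble a Solr query URL from host, port, core, handler and parameters.

    host, port and the core name are assumptions for illustration.
    """
    return f"http://{host}:{port}/solr/{core}/{handler}?{urlencode(params)}"


# One entry per Query screen field (values are illustrative).
params = {
    "q": "*:*",                   # the query: match all documents
    "fq": "genre_s:Musical",      # filter query
    "sort": "id asc",             # sort ascending by id
    "start": 0,                   # offset of the first returned result
    "rows": 10,                   # number of results to return
    "fl": "id,author_t,genre_s",  # fields to include in the response
    "wt": "json",                 # response writer
}

url = build_select_url("films", params)
print(url)
```

Pasting the printed URL into a browser should give the same result as filling in the Query screen with the same values, provided a core with that name exists.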

The default request handler is select. Results can be returned as JSON, CSV, XML, etc. The URL for executing a query is made up of the hostname, the port number (8983), the request handler for the query, and the query itself. Viewed as XML, the response has the following components:

a) Response header - A status of 0 means the query executed with no errors. QTime is the query time in milliseconds. params lists the parameters passed with the query.

b) Response - Contains the `docs`, each of which is a set of fields. The number of documents found is given at the start by the attribute numFound=x.
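To make the two components concrete, here is a small sketch that reads them out of an XML response with Python's standard library. The response text is a hand-written sample in the shape Solr returns for `wt=xml`; the documents and values in it are made up for illustration and are not the assignment's actual output.

```python
import xml.etree.ElementTree as ET

# Hand-written sample response (illustrative data, not real query output).
SAMPLE_RESPONSE = """
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">3</int>
    <lst name="params">
      <str name="q">*:*</str>
      <str name="wt">xml</str>
    </lst>
  </lst>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="id">doc-1</str>
      <str name="genre_s">Musical</str>
    </doc>
    <doc>
      <str name="id">doc-2</str>
      <str name="genre_s">Adventure</str>
    </doc>
  </result>
</response>
""".strip()

root = ET.fromstring(SAMPLE_RESPONSE)

# a) Response header: status, QTime and the echoed parameters.
header = root.find("lst[@name='responseHeader']")
status = int(header.find("int[@name='status']").text)
qtime = int(header.find("int[@name='QTime']").text)
params = {p.get("name"): p.text for p in header.find("lst[@name='params']")}

# b) Response: numFound plus one dict of fields per document.
result = root.find("result[@name='response']")
num_found = int(result.get("numFound"))
docs = [{f.get("name"): f.text for f in doc} for doc in result.findall("doc")]

print("status:", status, "QTime:", qtime, "params:", params)
print("numFound:", num_found)
for doc in docs:
    print(doc)
```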

Core creation, data addition and querying the data for 2 datasets

1) Film dataset

Using the following commands, the Solr core is created and the log configuration is defined.

A new instance directory is created for each dataset.

The schema file describes the document fields and how they are treated when documents are added to or queried from the index.

name : the field name
type : the field type (string, int, etc.)
indexed : true means the field can be searched and used for sorting
stored : true means the field value can be retrieved with the results; false means it cannot
multiValued : true means the field can hold more than one value per document
Dynamic fields - Fields that are not individually predefined in the schema; they are created at index time when a field name matches a dynamic field pattern
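The attributes above correspond to XML attributes on `<field>` and `<dynamicField>` elements in schema.xml. The sketch below builds two such elements with Python's standard library to show the shape. The type names `string` and `text_general` and the `*_t` pattern are assumptions; a pattern like `*_t` would match field names such as `author_t`, but the actual schema used in this assignment is not shown here.

```python
import xml.etree.ElementTree as ET


def schema_field(tag, name, type_, indexed=True, stored=True, multi_valued=False):
    """Build a schema.xml-style element; tag is 'field' or 'dynamicField'."""
    return ET.Element(tag, {
        "name": name,
        "type": type_,
        "indexed": str(indexed).lower(),
        "stored": str(stored).lower(),
        "multiValued": str(multi_valued).lower(),
    })


# An explicitly declared field (type name is an assumption).
id_field = schema_field("field", "id", "string")

# A dynamic field: any field whose name ends in _t is accepted at index time.
text_dynamic = schema_field("dynamicField", "*_t", "text_general", multi_valued=True)

for element in (id_field, text_dynamic):
    print(ET.tostring(element, encoding="unicode"))
```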

The document is uploaded in CSV format: the whole document is pasted into the text box, and the update request handler is used to add it.
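The same upload can also be done outside the browser by posting the CSV text to the update handler. The sketch below only builds the request and does not send it; the host, port, core name `films`, and the two-line CSV sample are assumptions for illustration, and `build_csv_update_request` is a hypothetical helper.

```python
import urllib.request


def build_csv_update_request(core, csv_text, host="localhost", port=8983):
    """Build (but do not send) a POST of CSV data to a core's update handler."""
    url = f"http://{host}:{port}/solr/{core}/update?commit=true"
    return urllib.request.Request(
        url,
        data=csv_text.encode("utf-8"),
        headers={"Content-Type": "application/csv; charset=utf-8"},
        method="POST",
    )


# Illustrative CSV: a header row naming the fields, then one document per row.
csv_text = "id,name,genre_s\ndoc-1,Example Film,Musical\n"

req = build_csv_update_request("films", csv_text)
print(req.get_method(), req.full_url)

# To send it to a running Solr instance, uncomment:
# with urllib.request.urlopen(req) as resp:
#     print(resp.status)
```

The `commit=true` parameter asks Solr to commit right after the update so that the new documents become visible to queries.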

The films.csv file data is shown below. Fields are separated by commas (‘,’).

The fields of the file can be viewed in the Schema Browser.

The schema.xml file can be viewed under the Files tab.

Query 1

Select id,author_t,genre_s from films where author_t:Gary Lennon and genre_s:Musical; is executed to get the following result. The result is in CSV format.
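The SQL-like notation used for these queries maps onto Solr query parameters: the select list becomes `fl`, the where clause becomes `q`, and the output format becomes `wt`. The sketch below shows one possible translation of Query 1. The phrase quotes around the author name and the uppercase `AND` are assumptions about how the condition would be typed for Solr's standard query parser, where boolean operators are uppercase and an unquoted second word would be searched in the default field; `sql_like_to_params` is a hypothetical helper.

```python
from urllib.parse import urlencode


def sql_like_to_params(fields, where, wt="csv"):
    """Map a 'Select <fields> ... where <condition>' description to Solr parameters."""
    return {"q": where, "fl": ",".join(fields), "wt": wt}


params = sql_like_to_params(
    ["id", "author_t", "genre_s"],
    'author_t:"Gary Lennon" AND genre_s:Musical',  # quoting and AND are assumptions
)
query_string = urlencode(params)

print(params)
print(query_string)
```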

Query 2

Select name,author_t from films where features_t:[2004-07-04 to *]; The result is in CSV format.

Query 3

Select * from films where author_t:David* and genre_s:Animation*;

Query 4

Select name,author_t,genre_s from films where name:Harry* and features:[2002 to 2004];

Query 5

Select * from films where name:[300] or genre_s:Adventure;

2) Books dataset

The books CSV file is shown below.

The data is copied into the text box on the document upload screen and the document is uploaded. The success message indicates that the document has been uploaded into Solr.