Search Engine - JulianThijssen/TelegraafES GitHub Wiki

The Search Engine

The search engine is built on Elasticsearch and works on the Telegraaf collection. It supports full-text search across the documents and shows the best-matching results on the front page.

Indexing

The collection is indexed by a Java program that takes a list of XML documents, groups them into bulk JSON requests, and sends those requests to Elasticsearch one at a time. This makes indexing fully automatic, apart from the user having to supply the list of documents to index. The indexer could have been written in any programming language; I chose Java because I am most comfortable with it.
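To make the bulk step concrete, here is a minimal sketch of how such a request body could be assembled. It is not the actual indexer: the class `BulkBodySketch`, the `Doc` holder, and the index name `telegraaf` are assumptions made for illustration, and only the field names `title` and `text` are taken from the query code on this page. The Elasticsearch bulk API expects newline-delimited JSON in which each document is preceded by an action line and the body ends with a newline; older Elasticsearch versions also expect a `_type` in the action line, which is left out here.

```java
import java.util.List;

// Sketch only: builds the newline-delimited body of an Elasticsearch _bulk request.
final class BulkBodySketch {

    // Hypothetical document holder; a real indexer would fill this from the XML files.
    static final class Doc {
        final String title;
        final String text;

        Doc(String title, String text) {
            this.title = title;
            this.text = text;
        }
    }

    // Escapes the characters that may not appear raw inside a JSON string.
    static String escapeJson(String s) {
        StringBuilder out = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"': out.append("\\\""); break;
                case '\\': out.append("\\\\"); break;
                case '\n': out.append("\\n"); break;
                case '\r': out.append("\\r"); break;
                case '\t': out.append("\\t"); break;
                default:
                    if (c < 0x20) {
                        out.append(String.format("\\u%04x", (int) c));
                    } else {
                        out.append(c);
                    }
            }
        }
        return out.toString();
    }

    // Each document becomes two lines: an action line and a source line.
    // The index name is passed in; "telegraaf" below is an assumed name.
    static String buildBulkBody(String index, List<Doc> docs) {
        StringBuilder body = new StringBuilder();
        for (Doc d : docs) {
            body.append("{\"index\":{\"_index\":\"").append(escapeJson(index)).append("\"}}\n");
            body.append("{\"title\":\"").append(escapeJson(d.title))
                .append("\",\"text\":\"").append(escapeJson(d.text)).append("\"}\n");
        }
        return body.toString();
    }

    public static void main(String[] args) {
        List<Doc> docs = List.of(new Doc("Voorpagina", "Eerste \"regel\""));
        System.out.print(buildBulkBody("telegraaf", docs));
    }
}
```

The resulting string would be sent as the body of a POST request to the `_bulk` endpoint. Splitting a long document list into several such bodies keeps each request at a manageable size.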

Querying

Querying is handled by elasticsearch-php, which makes it easy to send requests to Elasticsearch. It also keeps the layer of abstraction thin: the site back-end receives the user's request and passes it straight on to Elasticsearch. The back-end adds one match clause for each search field that is not empty, and title matches are boosted over text matches.

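// $client is an elasticsearch-php client; $title and $text come from the search form.
// Start with an empty "should" list so array_push() has an array to append to.
// The index name 'telegraaf' is an assumption; use the name chosen when indexing.
$params = array('index' => 'telegraaf', 'body' => array('query' => array('bool' => array('should' => array()))));
if (!empty($title)) {
    // 'operator' => 'and' requires every query term to occur; boost 2 makes title hits count double.
    array_push($params['body']['query']['bool']['should'], array('match' => array('title' => array('query' => $title, 'operator' => 'and', 'boost' => 2) )));
}
if (!empty($text)) {
    array_push($params['body']['query']['bool']['should'], array('match' => array('text' => array('query' => $text, 'operator' => 'and', 'boost' => 1) )));
}
$response = $client->search($params);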
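For readers who want to see what actually goes over the wire, the sketch below assembles the JSON body that the PHP array above serializes to. It is written in Java, like the indexer, and is an illustration only: the class `QueryBodySketch` and its methods are hypothetical, and the escaping covers just quotes and backslashes.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: assembles the JSON query body equivalent to the PHP array above.
final class QueryBodySketch {

    // Minimal escaping for quotes and backslashes in user input (illustrative, not complete).
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // One match clause; operator "and" requires every term of the query to occur in the field.
    static String matchClause(String field, String query, int boost) {
        return "{\"match\":{\"" + field + "\":{\"query\":\"" + escape(query)
                + "\",\"operator\":\"and\",\"boost\":" + boost + "}}}";
    }

    // Null or empty fields contribute no clause, roughly mirroring the !empty() checks.
    static String buildQuery(String title, String text) {
        List<String> should = new ArrayList<>();
        if (title != null && !title.isEmpty()) {
            should.add(matchClause("title", title, 2));
        }
        if (text != null && !text.isEmpty()) {
            should.add(matchClause("text", text, 1));
        }
        return "{\"query\":{\"bool\":{\"should\":[" + String.join(",", should) + "]}}}";
    }

    public static void main(String[] args) {
        System.out.println(buildQuery("oorlog", ""));
    }
}
```

Because both clauses sit under `should`, a document that matches only one of the two fields can still be returned; a document that matches both scores higher.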

The front page

The front page of the site contains two search fields (title and text) with which the user can build more complex queries. Once a query has been submitted, the top 10 hits are shown. Each result shows a title that links to the original document or, if the document has no title, the type of document. Below the title are the first few lines of the document, which serve as a short description.
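The display rules for a single result can be sketched as two small helpers. This is an assumption-laden illustration rather than the site's code: the class `ResultViewSketch`, the example type `artikel`, and the cutoff of three lines are all hypothetical, since the page only says "the first few lines".

```java
// Sketch only: picks the heading and short description shown for one search hit.
final class ResultViewSketch {

    // Falls back to the document type when the title is missing or blank.
    static String heading(String title, String type) {
        return (title == null || title.trim().isEmpty()) ? type : title;
    }

    // Keeps at most maxLines lines of the document text as a short description.
    static String snippet(String text, int maxLines) {
        String[] lines = text.split("\\R"); // \R matches any line break
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < lines.length && i < maxLines; i++) {
            if (i > 0) {
                out.append('\n');
            }
            out.append(lines[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(heading("", "artikel"));
        System.out.println(snippet("regel 1\nregel 2\nregel 3\nregel 4", 3));
    }
}
```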