Faceted Search - JulianThijssen/TelegraafES GitHub Wiki

Faceted Search

To exclude certain kinds of documents from the search (such as advertisements), it is important to have faceted search. This means that we have options through which we can filter the full result set and narrow it down to only the results of the type that we are after.

In the case of the Telegraaf collection, there are a lot of 1-word advertisements and non-informative documents. We can easily filter these out if we add a facet for the type of document (article, advertisement, image with subscript).

We could change our query to involve the selected facet like this:

{
    'query': {
        'bool': {
            'should': [
                {
                    'match': {
                        'title': {
                            'query': $title,
                            'operator': 'and'
                        }
                    }
                },
                {
                    'match': {
                        'text': {
                            'query': $text,
                            'operator': 'and'
                        }
                    }
                }
            ],
            'must': {
                'match': {
                    'subject': $subject
                }
            }
        }
    }
}
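As a rough illustration of how the placeholders in the query above might be filled in, here is a minimal Python sketch. The function name `build_query` and the example values are assumptions made for this illustration and are not part of the project code; only the query structure is taken from the example above.

```python
def build_query(title, text, subject):
    """Build the bool query shown above as a plain Python dict.

    `build_query` is a hypothetical helper; the field names
    ('title', 'text', 'subject') follow the query in this page.
    """
    return {
        "query": {
            "bool": {
                # Two separate match clauses must be list items:
                # a dict cannot hold the key 'match' twice.
                "should": [
                    {"match": {"title": {"query": title, "operator": "and"}}},
                    {"match": {"text": {"query": text, "operator": "and"}}},
                ],
                # The selected facet value is required for every hit.
                "must": {"match": {"subject": subject}},
            }
        }
    }


# Example values are made up for illustration.
query = build_query("oorlog", "amsterdam", "article")
```

The resulting dict can be serialised with `json.dumps(query)` and sent as the body of a search request.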

However, there is a problem: Elasticsearch returns only a limited number of hits (default: 10), even though many more documents usually match. Suppose we want to filter 3 documents out of our set of 10. As soon as the facet is added to the query, Elasticsearch does not return the remaining 7 documents but 10 documents that match the facet, some of which were not in the original set. This is not the expected behaviour in faceted search.

One option to combat this would be to return all results for a given query, but this is problematic for performance. Instead, we chose to return 500 hits and to whittle down these results programmatically with the selected facet. This is done by going through all 500 results and keeping only those that match the selected facet as our result set.
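The filtering step described above could look roughly like the following Python sketch. It assumes the standard Elasticsearch response layout (`hits.hits[]._source`); the helper name `filter_by_facet`, the use of `subject` as the facet field, and the sample documents are assumptions for illustration only.

```python
FETCH_SIZE = 500  # sent as the 'size' parameter of the search request


def filter_by_facet(response, field, value):
    """Keep only the hits whose `field` equals `value`.

    The hits stay in the order Elasticsearch returned them,
    so the relevance ranking is preserved.
    """
    hits = response["hits"]["hits"]
    return [hit for hit in hits if hit["_source"].get(field) == value]


# A toy response with made-up documents, shaped like a real one.
response = {
    "hits": {
        "hits": [
            {"_id": "1", "_source": {"title": "A", "subject": "article"}},
            {"_id": "2", "_source": {"title": "B", "subject": "advertisement"}},
            {"_id": "3", "_source": {"title": "C", "subject": "article"}},
        ]
    }
}

articles = filter_by_facet(response, "subject", "article")
print([hit["_id"] for hit in articles])  # prints ['1', '3']
```

Because the filter runs on the client after the search, deselecting a facet only requires re-filtering the same 500 hits, not a new query.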

We feel that this solution is fair to the end user, as the top 500 results are most likely the most relevant results anyway. If the user wants more precision and has run out of results when using facets, the query should probably be altered. Still, a long-term solution needs to be implemented that performs the faceting in a less biased manner, that is, without being limited to the top 500 hits.