THMapper: Approaches to map tail queries to head queries - nramrakhiyani/THMapper GitHub Wiki

THMapper
Major Project
CSE474 - Information Retrieval and Extraction

Team Members:
Anuj Goyal (201501129),
Mainak Bhattacharjee (20162101),
Nitin Ramrakhiyani (20172091),
Pratyusha Musunuru (201464090)

Problem:
Tail-Query to Head-Query Mapping for Search Engines One of the biggest challenges for commercial search engines is handling tail queries or queries which occur very infrequently. Frequent queries, also known as head queries, are easier to handle largely because their intents are evidenced by abundant click-through data (query logs). Tail queries have little historical data to rely on, which makes them difficult to be learned by ranking algorithms.

The goal of this project is to develop techniques to map tail queries to head queries, such that the performance of the search engine on such queries can be improved.

Project Contributions:
Refer to the Project Report in the repo at https://github.com/nramrakhiyani/THMapper for detailed understanding of the project. The following is a brief overview of the contributions.

  1. Creation of a Tail Query to Head Query mapping dataset
    The AOL queries dataset (available at http://jeffhuang.com/search_query_logs.html) is used for creating the query pairs. In this dataset, we have a list of queries followed by user clicked link, rank of the link clicked and timestamp for every query. Queries in the dataset with frequency greater than 150 times are considered as Head queries and similarly, queries with frequency less than 5 are considered as Tail queries. One of the baseline approaches has these figures as 199 and 2 respectively. But on careful observation of the query log we find 150 and 5 as better thresholds.

  2. Implementation of the baseline paper, Bringing head closer to the tail with entity linking by Verma et al.
    This paper describes a entity linking based approach to map tail queries to head queries. The authors use an entity linking tool - Dexter to discover the set of entities from a query. Dexter identifies chunks in the query (spots) and links them to corresponding Wikipedia Titles (entities) associating a link probability. To solve the problem of mapping, entities of a tail query t and each head query hi are identified using Dexter and a Jaccard similarity is computed between the two entity sets to score each hi for the current tail query t. Top k head queries based on this entity linking score obtained as a rank list for a given tail query.

  3. THMapNet - A Neural Model for establishing tail and head query mapping
    The dataset created provides us with pairs of relevant tail and head queries (referred to as positive pairs) and pairs of non-relevant tail and head queries (referred to as negative pairs). Based on the DSSM idea of feature learning through neural networks, we propose THMapNet, a set of deep neural network architectures which learn relevance between tail queries and corresponding head queries.

  4. User Study
    As the dataset created is purely on the basis of heuristics, it may considered as pseudo-gold benchmark rather than a true gold standard. Hence, it is important to validate the performance of our approach and baseline approaches through human evaluation. For this purpose, we propose to conduct a user study which would help us evaluate performance of systems used for tail to head query mapping. Also for the user study we compare one of the baselines we implemented (as described in Point 2 above) with the THMapNet_v1 architecture.