Proposal - Kelvin-Zhong/Click-Through-Rate-Prediction GitHub Wiki

Project Proposal

Team Name: Observer

Member:

  1. XIANG ZHONG 204412666
  2. Yang Pei 304434922
  3. Hongbo Zhao 604426609
  4. Qianwen Zhang 004401414
  5. Jing Zhao 404426610
  6. Zhe Sun 604435430

Background:
In online advertising, click-through rate (CTR) is a very important metric for evaluating ad performance. As a result, click prediction systems are essential and widely used for sponsored search and real-time bidding.

Problem Definition:
Eleven days' worth of Avazu data is provided to build and test prediction models: the first 10 days form the training set, and the final day serves as the test set for evaluating our model.

In addition, each record contains the following features:

  * id: ad identifier
  * click: 0/1 for non-click/click
  * hour: format is YYMMDDHH, so 14091123 means 23:00 on Sept. 11, 2014 UTC
  * C1: anonymized categorical variable
  * banner_pos
  * site_id, site_domain, site_category
  * app_id, app_domain, app_category
  * device_id, device_ip, device_model, device_type, device_conn_type
  * C14-C21: anonymized categorical variables
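Since nearly all of these fields are high-cardinality categorical variables, one common way to turn them into model inputs is the hashing trick, which maps each field=value pair to an index in a fixed-size feature space. The sketch below is only an illustration of that idea, not a committed design: the bucket count (2**20) and the sample record values are assumptions.

```python
# Sketch of the hashing trick for the Avazu categorical features.
# Field names follow the schema above; the bucket count is an
# assumed hyperparameter, not part of the dataset.

HASH_BITS = 20  # 2**20 feature buckets (assumption)

def hash_features(record: dict) -> list[int]:
    """Map each categorical field=value pair to a sparse feature index."""
    indices = []
    for field, value in record.items():
        if field in ("id", "click"):  # identifier and label are not features
            continue
        indices.append(hash(f"{field}={value}") % (1 << HASH_BITS))
    return sorted(indices)

# Hypothetical record with a subset of the fields above.
sample = {"id": "10000174058809263569", "click": "0",
          "hour": "14091123", "C1": "1005", "banner_pos": "0"}
print(hash_features(sample))  # three sparse indices: hour, C1, banner_pos
```

Collisions are possible but rare at this bucket count, and the fixed dimensionality keeps memory bounded no matter how many distinct device IDs or site IDs appear.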

Our goal is to build a model that best fits the data and therefore yields accurate predictions of the click-through rate of the ads. Performance will be measured by the logarithmic loss of the predicted click probabilities across all ads.

Survey of Related Work:
First of all, based on common sense, whether an ad will be clicked depends heavily on the connection between the content of the ad and the interests of the user, so we need to survey how these connections may influence the click rate. Secondly, regardless of content, the placement, time, and device may also influence whether a user clicks an ad, which is another survey point. Last but not least, in order to build a useful model that correctly represents the data and its trends, we need to survey which kinds of modeling approaches are suitable for our data.

Outline of Approach:
Our team will first read related surveys to find out which basic approaches and algorithms can be applied to this data, and we will implement some of them to measure their performance. Once we have a benchmark, or baseline, we will continuously try to improve on it with different ideas drawn either from approaches in other material or from our own intuition about the data.
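One plausible baseline of the kind described above is logistic regression trained by stochastic gradient descent on sparse feature indices (such as those produced by a hashing trick). The sketch below is an assumption about what such a baseline might look like, not a committed design; the learning rate, feature dimension, and toy indices are illustrative.

```python
import math

DIM = 2 ** 20          # assumed feature-space size
LR = 0.1               # assumed (untuned) learning rate
weights = [0.0] * DIM  # one weight per hashed feature bucket

def predict(indices):
    """Predicted click probability for a record's sparse feature indices."""
    z = sum(weights[i] for i in indices)
    return 1.0 / (1.0 + math.exp(-z))

def sgd_update(indices, label):
    """One SGD step on the log loss for a single (features, label) record."""
    g = predict(indices) - label  # gradient of log loss w.r.t. the logit
    for i in indices:
        weights[i] -= LR * g

# Toy usage: feature 7 co-occurs with clicks, feature 3 with non-clicks.
for _ in range(200):
    sgd_update([7], 1)
    sgd_update([3], 0)
print(predict([7]) > 0.9, predict([3]) < 0.1)  # True True
```

Because each update touches only the weights of the features present in a record, this kind of model trains in a single streaming pass, which matters for a dataset with tens of millions of rows.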