Weekly Schedule of Observer - Kelvin-Zhong/Click-Through-Rate-Prediction GitHub Wiki

Week 7 Feb. 22 Meeting

Getting hands on the dataset
Preprocessing of the data

A. Use the first 6 days as the training dataset and the 7th day as the testing dataset.
B. Keep the label click and the following 6 attributes: banner_pos, site_category, app_category, device_ip, device_model, device_type.
(The IP address should be parsed into a location code.)
Applying algorithms to the data
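Preprocessing steps A and B above could be sketched roughly as follows. This is only a minimal sketch, not the team's actual script. It assumes each record carries an `hour` field formatted as YYMMDDHH (as in the Kaggle data description) and that records arrive in chronological order; the function name `split_by_day` and the constant `KEEP` are made up for illustration. The IP-to-location step is left out because the notes do not say how it would be done.

```python
# Columns to keep (step B): the label plus six attributes.
KEEP = ["click", "banner_pos", "site_category", "app_category",
        "device_ip", "device_model", "device_type"]


def split_by_day(rows, keep=KEEP, n_train_days=6):
    """Yield ("train" | "test", record) pairs, one per kept row.

    Assumptions (not stated in the notes): every row is a dict with an
    `hour` field formatted YYMMDDHH, and rows are in chronological order.
    The first `n_train_days` distinct days go to training, the next day
    goes to testing, and any later days are skipped.
    """
    days = []  # distinct days, in order of first appearance
    for row in rows:
        day = row["hour"][:6]  # YYMMDD prefix identifies the day
        if day not in days:
            days.append(day)
        idx = days.index(day)
        record = {name: row[name] for name in keep}
        if idx < n_train_days:
            yield "train", record
        elif idx == n_train_days:
            yield "test", record
```

Because the function is a generator, it can be fed from `csv.DictReader` over the large "train" file and the records written out one at a time, without loading the whole file into memory.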

Week 4 Feb. 1 Meeting

Proposal and Presentation
Check how the proposal and presentation are scheduled and what is expected (time limit, summary format, etc.).
Paper
Since the paper has a lot of content and the presentation and summary cannot be very long, we will select only a few parts of the paper to present and to write up in the summary.
Week 5: read through the paper, then meet to assign the related work and the other parts to different teammates.
Week 6: all teammates work on the paper presentation and write the summary.
Project
After finishing the paper, work on the project.
Idea:
(1) First, do dimension reduction by using frequent pattern mining to filter out useless features. (2) Then apply some other classification algorithm to the filtered features.
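One possible reading of step (1) is sketched below. It is an assumption, not a method the notes specify: a feature is kept only if at least one of its values co-occurs with a click often enough, that is, if the 2-itemset {feature=value, click=1} reaches a minimum support. The function name `frequent_click_features`, the `min_support` threshold, and the string label value `"1"` are all hypothetical choices for illustration.

```python
from collections import Counter


def frequent_click_features(records, features, min_support=0.1, label="click"):
    """Return the features that take part in a frequent pattern with a click.

    A feature f is kept if some value v satisfies
    support({f=v, click=1}) >= min_support, where support is the fraction
    of all records containing both items. Labels are assumed to be the
    strings "0" and "1", as read from a CSV file.
    """
    n = 0
    counts = Counter()  # (feature, value) -> number of clicked records
    for rec in records:
        n += 1
        if rec[label] == "1":
            for f in features:
                counts[(f, rec[f])] += 1
    if n == 0:
        return []
    kept = []
    for f in features:
        supports = (c / n for (g, _), c in counts.items() if g == f)
        if any(s >= min_support for s in supports):
            kept.append(f)
    return kept
```

Under this reading, a feature whose values are almost all unique (for example a raw IP address) never forms a frequent pattern and is filtered out, while a feature with a few common values is kept for the classifier in step (2).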

Week 4 Jan. 27

Check and get familiar with the project and paper:
Project: Predict whether a mobile ad will be clicked. (https://www.kaggle.com/c/avazu-ctr-prediction)
Paper: Dynamics of News Events and Social Media Reaction. (http://disi.unitn.it/~themis/publications/kdd14.pdf)
Download the "test" part of the dataset and have a look. (The "train" part of the dataset is very large, so handle it after the meeting.)
Meeting:
The meeting is on Sunday (Jan. 31) evening. Agenda: discuss the division of work, write the proposal, and exchange opinions on the project and the paper.
Tools for implementation:
If nothing unexpected comes up, we will use Python as our primary programming language. If you don't know Python, spend 10 minutes on the "Python 101" part of this webpage (https://course.ie.cuhk.edu.hk/~engg4030/tutorial/tutorial3/).
Also, IPython and the IPython Notebook are enhanced interactive environments for Python, which are more convenient for development.

Comment:
Kelvin: The dataset has more than 10 million records, so we may run into some difficulties handling data of this size. As for the paper: it has too much content, and if the presentation is short we don't need to present everything. As for the project: we can start from the "All 0.5 Benchmark 0.6931472" (sample submission). About team work assignment: 3 people for the paper and 3 people for the project; after finishing the paper work, the paper group joins the project work.
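The benchmark score of 0.6931472 can be reproduced by hand, assuming the competition is scored with logarithmic loss: predicting 0.5 for every ad gives a loss of -ln(0.5) = ln 2 ≈ 0.6931472 regardless of the true labels. The small sketch below checks this; the function `log_loss` is written here for illustration and is not taken from any library.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Mean logarithmic loss for binary labels (0 or 1) and predicted
    click probabilities. Predictions are clipped to [eps, 1 - eps] so
    that log(0) is never evaluated."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


# Made-up labels: with a constant prediction of 0.5 the result does not
# depend on them.
labels = [0, 1, 0, 0, 1]
all_half = [0.5] * len(labels)
print(round(log_loss(labels, all_half), 7))
```

Any model we submit should score below this value, since a lower logarithmic loss is better.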