Weekly Schedule of Observer - Kelvin-Zhong/Click-Through-Rate-Prediction GitHub Wiki
Week 7 Feb. 22 Meeting
- Getting hands-on with the dataset
- Preprocessing the data
  A. Use the first 6 days as the training dataset and the 7th day as the testing dataset.
  B. Keep the following 7 attributes: click, banner_pos, site_category, app_category, device_ip, device_model, device_type.
  (The IP address should be parsed into a location code.)
- Applying algorithms to the data
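The preprocessing steps above could be sketched roughly as follows, using only the standard library so the large file can be streamed row by row. This is a minimal sketch under assumptions not stated in these notes: the function name `assign_split` and the constant `KEEP` are hypothetical, the data is assumed to have an `hour` column in `YYMMDDHH` form, and rows are assumed to be in chronological order. Parsing the IP address into a location code is left out because the notes do not name a lookup source.

```python
import csv
import io

# The attributes listed in step B.
KEEP = ["click", "banner_pos", "site_category", "app_category",
        "device_ip", "device_model", "device_type"]


def assign_split(rows, n_train_days=6):
    """Yield (split, record) pairs, streaming so the full file never sits in memory.

    Assumes each row has an 'hour' field shaped like YYMMDDHH and that rows
    arrive in chronological order (both are assumptions, not from the notes).
    """
    seen_days = []  # distinct days, in order of first appearance
    for row in rows:
        day = row["hour"][:6]  # YYMMDD prefix
        if day not in seen_days:
            seen_days.append(day)
        position = seen_days.index(day)
        if position > n_train_days:
            break  # past the test day; later rows are not needed
        record = {name: row[name] for name in KEEP}
        yield ("train" if position < n_train_days else "test"), record


# Tiny synthetic demo: eight days with one row each.
header = ["id", "hour"] + KEEP
lines = [",".join(header)]
for d in range(21, 29):
    lines.append(f"{d},1410{d}00,0,0,cat_a,cat_b,ip_x,model_y,1")

reader = csv.DictReader(io.StringIO("\n".join(lines)))
splits = list(assign_split(reader))
n_train = sum(1 for split, _ in splits if split == "train")
n_test = sum(1 for split, _ in splits if split == "test")
print(n_train, n_test)
```

For the real data, the `io.StringIO` object would be replaced by an open file handle, and each record would be written to a train or test output file instead of being collected in a list.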
Week 4 Feb. 1 Meeting
- Proposal and Presentation
  Check the schedule and requirements for the proposal and presentation (time limit, summary format, etc.).
- Paper
  Since the paper has a lot of content and the presentation and summary must be short, we will select only a few parts of the paper to present and to cover in the summary.
  Week 5: read through the paper, then meet to assign the related work and other parts to different teammates.
  Week 6: all teammates work on the paper presentation and write the summary.
- Project
  After finishing the paper, work on the project.
  Idea:
  (1) First reduce dimensionality by filtering out uninformative features using frequent pattern mining.
  (2) Then apply other classification algorithms to the filtered features.
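One possible reading of step (1) is sketched below. It is only an assumption about what the filtering might look like: it counts the support of single (feature, value) items, which is the first pass of Apriori-style frequent pattern mining, and keeps a feature only if at least one of its values is frequent. The function name `frequent_features` and the `min_support` threshold are hypothetical.

```python
from collections import Counter


def frequent_features(records, min_support=0.3):
    """Return the features that have at least one value with support >= min_support.

    Each (feature, value) pair is treated as one item; support is the fraction
    of records that contain the item.
    """
    counts = Counter()
    n = 0
    for rec in records:
        n += 1
        counts.update(rec.items())  # count every (feature, value) pair once per record
    keep = set()
    for (feature, _value), count in counts.items():
        if count / n >= min_support:
            keep.add(feature)
    return keep


# Toy records: banner_pos has a dominant value, device_ip is unique per row.
records = [
    {"banner_pos": "0", "device_ip": "a"},
    {"banner_pos": "0", "device_ip": "b"},
    {"banner_pos": "1", "device_ip": "c"},
    {"banner_pos": "0", "device_ip": "d"},
]
kept = frequent_features(records, min_support=0.5)
print(kept)
```

In this toy data `device_ip` is dropped because no single address reaches the threshold. Whether that is desirable for the real dataset would need to be checked against the classifier's performance in step (2).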
Week 4 Jan. 27
- Check and get familiar with the project and paper:
  Project: Predict whether a mobile ad will be clicked. (https://www.kaggle.com/c/avazu-ctr-prediction)
  Paper: Dynamics of News Events and Social Media Reaction. (http://disi.unitn.it/~themis/publications/kdd14.pdf)
  Download the "test" part of the dataset and have a look. (The "train" part of the dataset is very large, so handle it after the meeting.)
- Meeting:
  The meeting is on Sunday evening (Jan. 31). Agenda: discuss the division of work, write the proposal, and exchange opinions on the project and paper.
- Tools for implementation:
  Unless something unexpected comes up, we will use Python as our primary programming language. If you don't know Python, spend 10 minutes on the "Python 101" part of this webpage (https://course.ie.cuhk.edu.hk/~engg4030/tutorial/tutorial3/).
  Also, IPython and IPython Notebook are enhanced interactive environments for Python that make development more convenient.
Comment:
Kelvin: The dataset has more than 10 million records, so processing data of this size may be difficult.
As for the paper: it has too much content; if the presentation is short, we don't need to present everything.
As for the project: we can start from the "All 0.5 Benchmark 0.6931472" (sample submission).
About team work assignment: 3 people for the paper and 3 people for the project; after finishing the paper work, the paper group joins the project work.
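The "All 0.5" benchmark score can be reproduced by hand, assuming the score is logarithmic loss: predicting 0.5 for every record gives a loss of ln 2 ≈ 0.6931472 regardless of the true labels. The sketch below shows this with a hypothetical `log_loss` helper and made-up labels.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


labels = [0, 1, 0, 0, 1]  # made-up click labels
baseline = log_loss(labels, [0.5] * len(labels))
print(round(baseline, 7))  # 0.6931472
```

Any model we build should score below this value to be better than the constant prediction.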