Dataset London's Real Estate Market - Rostlab/DM_CS_WS_2016-17 GitHub Wiki

Dataset 14 / Carola Boettcher

  • Proposer: Carola Boettcher - @carockets

  • Team members:
    1. Carola Boettcher - @carockets
    2. Charles Lee - @lonz
    3. Muhammad Asad - @muhammadasad1

  • Project-Repo: The repository for our code etc can be found here

  • Slack: We have a slack chat for organisational issues which can be found here

  • Votes 🗳:

Summary

London's real estate market is famous for its ever-increasing prices and is one of the most expensive housing markets in the world. Data mining techniques could be used to take a deeper look at what is going on and to identify interesting facts about the London housing market.

Prediction Goals

  • Which words (n-grams) are used most often to describe the objects?
  • How do the descriptions differ according to the object's price/location/...?
  • Can one predict the price of an object by looking at the description?
  • Is it possible to identify scam ads?
  • (other)
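
As an illustration of the first goal, the following is a minimal sketch of counting n-grams over listing descriptions using only the Python standard library. The function names `tokenize` and `ngram_counts` and the sample descriptions are made up for this example and are not taken from the project code or the dataset.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(descriptions, n=2):
    """Count n-grams (tuples of n consecutive tokens) over all descriptions."""
    counts = Counter()
    for text in descriptions:
        tokens = tokenize(text)
        # Zipping n shifted copies of the token list yields consecutive n-grams.
        counts.update(zip(*(tokens[i:] for i in range(n))))
    return counts


# Hypothetical descriptions, invented for illustration only.
sample = [
    "Spacious two bedroom flat close to the station",
    "Bright two bedroom flat with private garden",
]

bigrams = ngram_counts(sample, n=2)
trigrams = ngram_counts(sample, n=3)
print(bigrams.most_common(3))
```

Keeping the counts in a `Counter` makes it easy to compare the most frequent n-grams between groups of listings, for example by running the same function separately on cheap and expensive properties.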

Weekly Progress

  • Week 01 (W46-Nov16) London RE -- Main findings:
    • We should switch from rightmove.co.uk to zoopla.co.uk, as they provide an easy-to-use API for fetching the data
    • Fetched the first ~10k property listings via the Zoopla API and set up an SQLite DB
    • Inspected the data (which attributes are missing or unnecessary)
  • Week 02 (W47-Nov16) London RE -- Main findings:
    • There are some minor issues with the Zoopla API which need to be fixed
    • We inspected and cleaned the data (removed HTML tags, ...)
    • Made first simple plots for visualising the data
  • Week 03 (W48-Nov16) London RE -- Main findings:
    • Calculated the number of different property types
    • Calculated average prices per region (postcode), number of rooms, agents
    • Started compiling bigrams using the Natural Language Toolkit
  • Week 04/05 (W49/50 Dec7/Dec14) London RE -- Main findings:
    • Developed new features. We are facing some problems because the descriptions are entered by humans
    • Looked at correlations between length of description and price of objects
    • Further analysis of bigrams and a first look at trigrams
  • Week 06 (W51 Dec21) London RE -- Main findings:
    • We are beginning our look into predictive tasks.
  • Week 08 (W2 Jan11) London RE -- Main findings:
    • Over these weeks, we have developed some more vital descriptive features for our dataset and are now looking into developing predictive features using all of the available data attributes.
  • Week 09 (W3 Jan18) London RE -- Main findings:
    • Created estimation of flat areas.
    • Started doing price predictions using regression techniques.
    • Mined additional data for use as test data.
  • Week 10 (W4 Jan25) London RE -- Main findings:
    • Applied last week's price predictions to sale prices.
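
Two of the steps in the progress log, removing HTML tags from the descriptions and checking how description length relates to price, can be sketched as follows. This is only an illustration under stated assumptions: the helper names `strip_html` and `pearson` are hypothetical, the three listings are invented, and the regular expression is a rough cleaner rather than a full HTML parser. It says nothing about the results the team actually obtained.

```python
import html
import math
import re


def strip_html(text):
    """Remove HTML tags, decode entities, and collapse whitespace."""
    no_tags = re.sub(r"<[^>]+>", " ", text)
    return " ".join(html.unescape(no_tags).split())


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical (description, price) pairs, invented for illustration only.
listings = [
    ("<p>Studio flat</p>", 900),
    ("<p>Two bedroom flat with <b>garden</b></p>", 1600),
    ("<p>Large family house with garden, garage &amp; conservatory</p>", 2800),
]

# Description length measured in whitespace-separated words after cleaning.
lengths = [len(strip_html(description).split()) for description, _ in listings]
prices = [price for _, price in listings]
r = pearson(lengths, prices)
print(lengths, round(r, 3))
```

A value of `r` close to 1 or -1 would indicate a strong linear relationship between description length and price, while a value near 0 would indicate none. With only three invented listings the number itself is meaningless.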

Long Description

The dataset would provide information about an object's (not final):

  • location
  • price
  • number of rooms
  • heating type
  • year of construction
  • facilities
  • floor
  • balcony
  • distance to public transport

Links / Data / Other

  • The dataset would be generated by crawling the ads of rightmove.co.uk. It should not be a problem according to their terms of service. There are various scripts available (e.g. in Python) which could be used to achieve this task. After parsing the sites and extracting their content, one would set up a database to store the retrieved data.

  • Edit 19/11/2016: The final dataset can be found here as CSV and SQLite: https://drive.google.com/open?id=0B2WhEEEx5z8zcmt6dXViTS11Sk0
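
Storing retrieved listings in a database, as described above, could look roughly like the following sketch, which uses Python's built-in `sqlite3` module. The table name `listings`, its columns, and the sample rows are assumptions made for this example. They do not reflect the schema of the published dataset.

```python
import sqlite3


def create_db(path=":memory:"):
    """Open (or create) the database and make sure the listings table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS listings (
            listing_id  INTEGER PRIMARY KEY,
            postcode    TEXT,
            price       INTEGER,
            num_rooms   INTEGER,
            description TEXT
        )
        """
    )
    return conn


def store_listings(conn, rows):
    """Insert rows; a listing fetched twice overwrites its earlier version."""
    conn.executemany(
        "INSERT OR REPLACE INTO listings VALUES (?, ?, ?, ?, ?)", rows
    )
    conn.commit()


# Hypothetical rows, invented for illustration only.
conn = create_db()
store_listings(
    conn,
    [
        (1, "E1", 450000, 2, "Two bedroom flat"),
        (2, "SW3", 1200000, 3, "Three bedroom house"),
        (1, "E1", 455000, 2, "Two bedroom flat, price updated"),
    ],
)

row_count = conn.execute("SELECT COUNT(*) FROM listings").fetchone()[0]
avg_by_postcode = dict(
    conn.execute("SELECT postcode, AVG(price) FROM listings GROUP BY postcode")
)
print(row_count, avg_by_postcode)
```

Using the listing ID as primary key together with `INSERT OR REPLACE` means that crawling the same ad again updates the stored row instead of duplicating it.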

Final Presentation

The final presentation can be found here: https://github.com/carockets/DataMiningLondonRE/blob/master/London%20RE%20Final%20Presentation.pdf