Week 02 (W47 Nov16) London RE - Rostlab/DM_CS_WS_2016-17 GitHub Wiki

Zoopla API issues - workaround for requests

When requesting data from the Zoopla API, we found that page numbers greater than 100 return status code 400 and cannot be processed. This means we could only fetch data for the first 10k properties. As a workaround, we included the postcode parameter in the URL and now request results per postcode. The list of all London postcodes can be found on Wikipedia. This will hopefully allow us to get all the data we need.
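
The per-postcode workaround described above can be sketched roughly as follows. This is a minimal illustration, not the code we actually ran. The endpoint URL, the parameter names (`api_key`, `postcode`, `listing_status`, `page_size`, `page_number`), and the `listing_id` field are assumptions about the Zoopla API. `fetch_page` is a hypothetical callable that takes a URL and returns the list of listings on that page. The de-duplication step is our own precaution in case a listing is returned for more than one postcode.

```python
from urllib.parse import urlencode

# Assumed endpoint; check the Zoopla API documentation before use.
BASE_URL = "http://api.zoopla.co.uk/api/v1/property_listings.json"
PAGE_SIZE = 100  # listings requested per page
MAX_PAGE = 100   # page numbers above this returned status code 400


def build_url(api_key, postcode, page_number):
    """Build the request URL for one page of rental listings in one postcode."""
    params = {
        "api_key": api_key,
        "postcode": postcode,
        "listing_status": "rent",
        "page_size": PAGE_SIZE,
        "page_number": page_number,
    }
    return BASE_URL + "?" + urlencode(params)


def fetch_postcode(postcode, fetch_page, api_key="YOUR_API_KEY"):
    """Collect all listings for one postcode, stopping at the page limit."""
    listings = []
    for page_number in range(1, MAX_PAGE + 1):
        batch = fetch_page(build_url(api_key, postcode, page_number))
        listings.extend(batch)
        if len(batch) < PAGE_SIZE:  # a short page means there are no more results
            break
    return listings


def fetch_all(postcodes, fetch_page, api_key="YOUR_API_KEY"):
    """Fetch listings for every postcode, keeping one entry per listing_id."""
    unique = {}
    for postcode in postcodes:
        for listing in fetch_postcode(postcode, fetch_page, api_key):
            unique[listing["listing_id"]] = listing  # assumed identifier field
    return list(unique.values())
```

In practice `fetch_page` would issue the HTTP request and extract the listings from the JSON response. Keeping it as a parameter lets the paging logic be tested without network access.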

Final dataset

The final dataset can be found here: https://drive.google.com/open?id=0B2WhEEEx5z8zcmt6dXViTS11Sk0

The dataset contains 55,313 property instances with 35 attributes per entry (almost no cells in the dataset are empty). The attributes are the following:

rental agent

agent_address
agent_logo
agent_name
agent_phone

property location

category (note: all properties are 'residential', none are 'commercial')
country (note: all 'England')
country_code (note: all 'gb')
county (note: all 'London')
latitude
longitude
outcode
post_town (note: all 'London')
street_name

Locations of listings:

[Figure: location of listings plotted on Google Maps]

From this map it is fairly easy to see that the majority of the listings are located in the center of London. This in turn suggests why the average price of these listings is higher than the general average.

[Figure: agent locations]

Apart from a few outliers, most of the agents are located in and around the center of London.

property features

available_from_date
description
short_description
details_url
displayable_address
letting_fees
listing_status (note: all 'rent', none 'sale')
num_bathrooms
num_bedrooms
num_floors
num_recepts
price (price per week)
rental_prices_accurate
rental_prices_per_month
rental_prices_per_week
property_type
status

listings metadata

first_published_date
last_published_date
property_report_url

Dataset cleanup

The textual descriptions were in HTML format. Since the tags serve no purpose for the data mining, the data was cleaned up and all HTML tags were removed. This leaves less content to go over, improves readability, and lowers the chance of errors caused by irrelevant text.
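
The tag removal described above could look roughly like the sketch below, which uses only Python's standard `html.parser` module. It is an illustration under our own assumptions rather than the exact cleanup script. The `strip_html` helper name and the set of tags treated as block-level are hypothetical choices.

```python
from html.parser import HTMLParser

# Tags after which a space is inserted so neighbouring sentences do not merge
# (an assumed, non-exhaustive set).
BLOCK_TAGS = {"p", "br", "div", "li", "ul", "ol"}


class _TextExtractor(HTMLParser):
    """Collects the text content of an HTML fragment and drops the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # also decodes entities like &nbsp;
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Return text with HTML tags removed and whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any text still buffered by the parser
    return " ".join("".join(parser.parts).split())
```

Applied to the description and short_description columns, a helper like this keeps only the readable text of each listing.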

Around 1,700 agents/offices in total were responsible for all of the listings in the dataset.