Week 02 (W47 Nov16) London RE - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
Zoopla API issues - workaround for requests
When requesting data from the Zoopla API, we found that page numbers above 100 return status code 400 and cannot be processed. With that limit we could only fetch data for the first 10k properties. As a workaround, we added the postcode parameter to the request URL and now request results per postcode. The list of all London postcodes can be found on Wikipedia. This should allow us to get all the data we need.
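The per-postcode workaround can be sketched roughly as follows. This is a minimal illustration, not the code we actually ran: the endpoint URL, the query parameter names other than the postcode, the page size of 100, and the `fetch_page` callable are all assumptions made for the example. Listings are de-duplicated by `details_url`, since the same property may show up under more than one postcode query.

```python
from typing import Callable, Dict, Iterable, List
from urllib.parse import urlencode

# Assumed values, not confirmed by our notes: check them against the API docs.
API_URL = "https://api.zoopla.co.uk/api/v1/property_listings.json"
PAGE_SIZE = 100  # assumed listings per page (100 pages x 100 = 10k)
MAX_PAGE = 100   # page numbers above this returned status code 400

Listing = Dict[str, str]


def build_url(api_key: str, postcode: str, page: int) -> str:
    """Build a request URL for one page of rental listings in one postcode."""
    query = {
        "api_key": api_key,
        "postcode": postcode,
        "listing_status": "rent",
        "page_size": PAGE_SIZE,
        "page_number": page,
    }
    return API_URL + "?" + urlencode(query)


def fetch_all_listings(
    postcodes: Iterable[str],
    fetch_page: Callable[[str, int], List[Listing]],
) -> List[Listing]:
    """Collect listings postcode by postcode, staying within the page limit.

    fetch_page(postcode, page) is a hypothetical helper that performs the
    HTTP request and returns the listings of that page as dictionaries.
    """
    seen: Dict[str, Listing] = {}
    for postcode in postcodes:
        for page in range(1, MAX_PAGE + 1):
            listings = fetch_page(postcode, page)
            for listing in listings:
                # Keep one copy per listing, keyed by its details URL.
                seen[listing["details_url"]] = listing
            if len(listings) < PAGE_SIZE:
                break  # a short page means this postcode is exhausted
    return list(seen.values())
```

Passing the page fetcher in as a function keeps the paging logic testable without network access. A postcode with more than 10k listings would still be truncated by the page limit.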
Final dataset
The final dataset can be found here: https://drive.google.com/open?id=0B2WhEEEx5z8zcmt6dXViTS11Sk0
The dataset contains 55,313 property instances with 35 attributes for each entry (almost no cells in the dataset are empty). The attributes are the following:
rental agent
- agent_address
- agent_logo
- agent_name
- agent_phone
property location
- category (note: all properties are 'residential' and no 'commercial')
- country (note: all 'England')
- country_code (note: all 'gb')
- county (note: all 'London')
- latitude
- longitude
- outcode
- post_town (note: all 'London')
- street_name
Locations of listings:
From the map of the listing locations, it is fairly easy to see that the majority of the listings are located in the center of London. This in turn suggests why the average price of these listings is higher than the general average.
Apart from a few outliers, most of the agents are located in and around the center of London.
property features
- available_from_date
- description
- short_description
- details_url
- displayable_address
- letting_fees
- listing_status (note: all 'rent' and no 'sale')
- num_bathrooms
- num_bedrooms
- num_floors
- num_recepts
- price (price per week)
- rental_prices_accurate
- rental_prices_per_month
- rental_prices_per_week
- property_type
- status
listings metadata
- first_published_date
- last_published_date
- property_report_url
Dataset cleanup
- The textual descriptions were in HTML format. Since the tags serve no purpose for the data mining, the data was cleaned up and all HTML tags were removed. This leaves less content to go over, improves readability, and lowers the chance of errors caused by irrelevant text.
- Around 1,700 agents/offices in total were responsible for all of the listings in the dataset.
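The tag removal described above could be done along these lines. This is only a sketch using Python's standard-library `html.parser`, not necessarily the tool we used. The names `strip_html` and `_TextExtractor` are made up for the example.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment, dropping all tags."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)  # also decodes entities like &amp;
        self._chunks = []

    def handle_data(self, data: str) -> None:
        self._chunks.append(data)

    def text(self) -> str:
        # Join the chunks with spaces, then collapse runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def strip_html(raw: str) -> str:
    """Return the plain text of an HTML description."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return parser.text()


if __name__ == "__main__":
    print(strip_html("<p>Bright <b>two-bed</b> flat.</p><p>Near the park.</p>"))
```

Joining the text chunks with a space keeps adjacent paragraphs from running together. The cost is an occasional extra space where a tag sits directly before punctuation.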