Week 10 (W4 Jan25) London RE - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
Summary
Local distribution of sale vs. rent properties
After downloading the latest 10k sale listings, we wanted to find out whether they are distributed across London in the same way as the rent listings. For each listing type we looked at the 10 districts that had the most properties listed and obtained the top 10 postcodes shown in the details.
Prediction of area measurements using a regression predictor
Because the classifier we used last week gave extremely inaccurate results, we tried a regression model this time. The results were better, but only by a small margin.
Details of work done
Local distribution of sale vs. rent properties
| sale | rent |
|---|---|
| SW8 | NW8 |
| W2 | E14 |
| SE1 | NW1 |
| NW3 | NW3 |
| NW8 | NW6 |
| SW6 | SW3 |
| NW1 | W8 |
| SW7 | W2 |
| W14 | E1 |
| SW3 | SE1 |
London's postcode districts are prefixed with compass points: north (N), east (E), south (S) and west (W), so NW means north-west. As the table shows, most of the top sale and rent districts lie in the west of London. The sale districts are mostly in the south-west (four SW postcodes), while the rent districts are mostly in the north-west (four NW postcodes).
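A ranking like the one in the table could be produced by counting listings per postcode district. The following is only a minimal sketch: the function names and the sample postcodes are made up for illustration, and it assumes each postcode is written with a space between the district and the rest (for example `SW8 2LX`).

```python
from collections import Counter


def outward_code(postcode: str) -> str:
    """Return the district part of a postcode, e.g. 'SW8 2LX' -> 'SW8'.

    Assumes the district and the remainder are separated by whitespace.
    """
    return postcode.strip().upper().split()[0]


def top_districts(postcodes, n=10):
    """Return the n districts with the most listings, most frequent first."""
    counts = Counter(outward_code(p) for p in postcodes)
    return [district for district, _ in counts.most_common(n)]


# Illustrative sample only, not taken from the mined data.
sample = ["SW8 2LX", "sw8 1aa", "W2 3BB", "NW3 5CC", "SW8 4DD", "W2 6EE"]
print(top_districts(sample, n=2))  # ['SW8', 'W2']
```

Running the same function once on the sale listings and once on the rent listings would give two columns that can be compared side by side.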
Further progression of price predictions
We are looking into the data for more factors to take into account for the regression.
Addition of Postcode in regression
We noticed that prices differ between postcodes, so we added the postcode to the regression analysis as a simple numeric feature. To do this, we computed the average price for each postcode and gave every listing an additional variable holding the average price of its postcode.
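The postcode feature described above can be sketched as follows. This is an illustration under assumptions, not our actual pipeline: the field names `postcode`, `price` and `postcode_avg_price`, the helper names, and the sample listings are all hypothetical.

```python
from collections import defaultdict


def postcode_mean_prices(listings):
    """Map each postcode to the average price of its listings."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for listing in listings:
        totals[listing["postcode"]] += listing["price"]
        counts[listing["postcode"]] += 1
    return {pc: totals[pc] / counts[pc] for pc in totals}


def add_postcode_feature(listings, means, default):
    """Return copies of the listings with the postcode average attached.

    Postcodes that are missing from `means` fall back to `default`.
    """
    return [
        dict(listing, postcode_avg_price=means.get(listing["postcode"], default))
        for listing in listings
    ]


# Illustrative listings only.
listings = [
    {"postcode": "SW8", "price": 2000.0},
    {"postcode": "SW8", "price": 3000.0},
    {"postcode": "NW3", "price": 4000.0},
]
means = postcode_mean_prices(listings)
overall = sum(l["price"] for l in listings) / len(listings)
encoded = add_postcode_feature(listings, means, default=overall)
print(encoded[0]["postcode_avg_price"])  # 2500.0
```

One design point worth keeping in mind: if the averages are computed from the training listings only and then applied to the test listings, the test prices do not leak into the feature.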
When we ran the regression analysis (similar to last week's) with the newly added postcode values, the predictions got closer. Our mean squared error decreased to 220916, which corresponds to a root mean squared error of about 470 pounds. In addition, we checked the percent error of each prediction compared to the actual listing price. It ranged from 0.01% to 95%, with an average of 23%. Being off by almost a quarter on average is still rather high.
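For reference, the error measures used here can be computed as in the sketch below. The function name and the example numbers are hypothetical; the percent error is taken as the absolute difference divided by the actual price, which is an assumption about how the figures above were derived.

```python
import math


def regression_metrics(actual, predicted):
    """Compute MSE, RMSE and percent-error statistics for paired values."""
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    mse = sum(squared) / len(squared)
    # Percent error as a fraction of the actual price (0.23 means 23%).
    pct = [abs(a - p) / abs(a) for a, p in zip(actual, predicted)]
    return {
        "mse": mse,
        "rmse": math.sqrt(mse),
        "mean_pct_error": sum(pct) / len(pct),
        "min_pct_error": min(pct),
        "max_pct_error": max(pct),
    }


# The RMSE is the square root of the MSE, in the same unit as the price.
print(round(math.sqrt(220916)))  # 470

# Illustrative values only.
metrics = regression_metrics([1000.0, 2000.0], [1100.0, 1800.0])
print(metrics["mse"])  # 25000.0
```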
Other types of regression
We are also looking into other types of regression, such as Gaussian Process regression. Fitting the Gaussian Process took much more memory than expected, so we had to reduce the training set size.
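A rough way to see why memory becomes a problem, assuming an exact Gaussian Process that stores a dense n x n kernel matrix of 8-byte floats: the matrix alone grows quadratically with the number of training rows. The sketch below estimates that size and draws a random subsample; the function names and the row limit are hypothetical, and the actual memory use of a given library will be higher than this lower bound.

```python
import random


def kernel_matrix_gib(n_samples, bytes_per_value=8):
    """Size in GiB of one dense n x n matrix of floating point values."""
    return n_samples ** 2 * bytes_per_value / 2 ** 30


def subsample(rows, max_rows, seed=0):
    """Return at most max_rows rows, chosen uniformly at random."""
    rows = list(rows)
    if len(rows) <= max_rows:
        return rows
    return random.Random(seed).sample(rows, max_rows)


for n in (1_000, 10_000, 100_000):
    print(n, "rows ->", round(kernel_matrix_gib(n), 2), "GiB")

# Hypothetical limit, chosen only for illustration.
reduced = subsample(range(100_000), max_rows=5_000)
print(len(reduced))  # 5000
```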
Results of the Gaussian Process regression:
- MSE 78483
- RMSE 280
- Average percent difference: 0.24
Regression on the sale price
Since we mined some sale prices last week, we also decided to analyse how well sale prices can be predicted. Note that the data set for this is significantly smaller (10,000 entries). Since the sale prices were all mined at the same time, we split the data 80/20 into training and test sets.
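An 80/20 split of this kind can be sketched as below. It assumes a plain random split with a fixed seed, which seems reasonable when all entries were mined at the same time; the function name and the seed are hypothetical.

```python
import random


def random_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and split them into (train, test) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the 10,000 sale entries: just their row indices.
train_rows, test_rows = random_split(range(10_000))
print(len(train_rows), len(test_rows))  # 8000 2000
```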
Results of linear regression:
- MSE 1.11e+13
- RMSE 3338481
- Average percent difference 0.43
Results of Random Forest regression:
- MSE 1.00e+13
- RMSE 3174406
- Average percent difference 0.30
Results of Gaussian Process regression:
- MSE 4.44e+12
- RMSE 2106261
- Average percent difference: 0.39