Week 03 - Summary:

Numerical Derivations/Visualizations: This week we focused on deriving useful numerical summaries from our dataset. These feed visualizations that give deeper insight into what our data offers and what further analysis can be done.
Data cleanup: Further pre-processing of the textual descriptions was done and appropriate visualizations were developed.
Feature derivation: We also discovered that some important features are not explicitly available in our dataset but can be derived for the listings from the descriptions.
Description analysis: We have started analysing the descriptions in the form of n-grams.

Details of work in week 03:

More information about the listings

We have decided to analyze the type of housing for each ad.

Type	Number
Flat	41227
Studio	4089
Terraced house	2331
Maisonette	1069
Semi-detached house	1004
Detached house	665
Town house	365
End terrace house	306
Mews house	238
Parking/garage	91
Cottage	88
Block of flats	38
Bungalow	33
Link-detached house	21
Barn conversion	18
Land	12
Detached bungalow	8
Houseboat	6
Lodge	6
Retail premises	5
Office	2
Semi-detached bungalow	2
Farmhouse	2
Equestrian property	2
Restaurant/cafe	1
Terraced bungalow	1

As one can see above, the vast majority of the ads are for flats (41227 of the 51630 ads, roughly 80%). This is not surprising, as most of the listings are in apartment buildings in the centre of London. It is interesting to note that there is a variety of types of houses and bungalows. The data also includes a very small number of listings for commercial spaces (Retail premises - 5, Office - 2, Restaurant/cafe - 1) as well as parking spaces (91).
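A count like the table above can be sketched with a simple frequency counter. This is only an illustration: the sample list is made up, and the name `type_shares` is hypothetical rather than taken from our code.

```python
from collections import Counter

# Made-up sample; in the real dataset there is one property type per listing.
listing_types = ["Flat", "Flat", "Studio", "Flat", "Terraced house"]

def type_shares(types):
    """Return (type, count, share) tuples sorted by descending count."""
    counts = Counter(types)
    total = sum(counts.values())
    return [(name, n, n / total) for name, n in counts.most_common()]

for name, count, share in type_shares(listing_types):
    print(f"{name}\t{count}\t{share:.1%}")
```

Reporting the share next to the raw count makes it easier to see how strongly flats dominate.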

Numerical Derivations/Visualisations:

This week we focused on numerical derivations that let us draw concrete conclusions and gain a better understanding of the data at hand. These derivations will help us with the future goals of our project. For this purpose, the following values were defined:

Average price of listings by each agent: The analysis showed that some agents offer pricier properties than others. The average price per agent ranged between 25.241£ and 100£. Overall there were 2618 different agents.

Avg rental price per agent
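The per-agent average can be sketched as a group-and-average step. The sample pairs and the helper name `average_by_key` are assumptions for illustration; the field layout of the real dataset may differ.

```python
from collections import defaultdict
from statistics import mean

# Made-up (agent, monthly price) pairs; the real field names are not shown here.
listings = [
    ("Agent A", 1200.0),
    ("Agent A", 1800.0),
    ("Agent B", 5000.0),
]

def average_by_key(pairs):
    """Group (key, value) pairs by key and return {key: mean of the values}."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: mean(values) for key, values in groups.items()}

avg_price = average_by_key(listings)
print(avg_price)       # average price per agent
print(len(avg_price))  # number of distinct agents
```

The same helper could be keyed on the number of bedrooms or on the postcode district instead of the agent.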

Average price of listings per number of bedrooms: The number of bedrooms ranged from 1 to 29. The majority of properties had 2, 1 or 3 bedrooms. There was one property with 29 bedrooms. If you are interested in what it looks like, you can see the ad here.

This image shows the distribution of the rental price per month according to the number of bedrooms.

avg rental price num bedrooms

The distribution of the number of bedrooms is shown in the following image.

dist number of bedrooms

Average price of listings: We also looked at how prices were influenced by the location of the property. The following image shows that some regions are pricier than others.

The most expensive part of London is the district 'W1', with an average letting fee of 8105£ per month, while the cheapest region is 'SE06', where one has to pay 'only' 1027£ per month. Overall, one can say that London is an expensive city to live in compared to Munich, no matter where.

avg rental prices per postcode

For a better visualisation, one can refer to the postcode map on Wikipedia.

Further Data pre-processing:

Last week the HTML tags were removed from our dataset to make the text easier to read. However, further processing was still required before the descriptions could be analysed properly. This week the following data cleaning was done:

Eliminating special characters: Certain special characters were just taking up space in the text and would have been a hurdle during the descriptive and analysis phases. For instance, some agents used "____" to separate paragraphs in the property descriptions that they wrote. The following characters were eliminated in the long and short property descriptions: a) _ b) ! c) * d) . e) , f) ( and ) g)
Removing stop words: To remove words that do not add substance to the description, stop words were removed from the property descriptions. Some further tweaking is, of course, still needed. A word cloud of the remaining words was built by converting all strings into a bag of words with term frequencies. A minimum-frequency threshold was used to keep only the most frequently occurring words, which were then used for the word cloud, giving the following result:

Word cloud
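The two cleaning steps and the frequency threshold can be sketched roughly as follows. This is a minimal illustration, not our actual pipeline: the stop-word set is a tiny made-up list, the function names are hypothetical, and the character class covers only the characters named in items a) to f) above.

```python
import re
from collections import Counter

# Characters a) to f) from the list above.
SPECIAL_CHARS = re.compile(r"[_!*.,()]")
# Tiny illustrative stop-word list (an assumption); a real run needs a fuller one.
STOP_WORDS = {"a", "an", "and", "the", "to", "of", "in", "is", "with", "this"}

def clean(text):
    """Replace the listed characters with spaces and collapse whitespace."""
    return re.sub(r"\s+", " ", SPECIAL_CHARS.sub(" ", text)).strip()

def tokens(text):
    """Lower-case, split on whitespace and drop stop words."""
    return [w for w in clean(text).lower().split() if w not in STOP_WORDS]

def frequent_terms(descriptions, min_count):
    """Bag-of-words term frequencies, keeping terms seen at least min_count times."""
    counts = Counter()
    for description in descriptions:
        counts.update(tokens(description))
    return {term: n for term, n in counts.items() if n >= min_count}

descriptions = [
    "Bright flat!____Close to the tube (5 min).",
    "Spacious flat with a fitted kitchen, close to transport.",
]
print(clean(descriptions[0]))
print(frequent_terms(descriptions, min_count=2))
```

The resulting term-to-frequency mapping is the kind of input a word cloud generator typically expects.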

Feature derivations: For properties, per-unit-area measurements are very important for drawing meaningful conclusions about a listing. Our data does not explicitly provide this information. However, many agents mention the area of the property in the property description. We decided to extract this data from the description and record it as a new feature for the properties in our dataset. There are, however, certain issues that we face in taking this approach.
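One possible way to pull an area figure out of a description is a regular expression over common unit spellings. The pattern, the function name `extract_area_sqft` and the choice to normalise to square feet are all assumptions for illustration, and the sketch does not try to address the issues mentioned above.

```python
import re

# Number (with optional thousands commas / decimals) followed by an area unit.
AREA_PATTERN = re.compile(
    r"(?P<value>\d[\d,]*(?:\.\d+)?)\s*"
    r"(?:(?P<ft>sq\.?\s*ft|square\s+f(?:ee|oo)t)"
    r"|(?P<m>sq\.?\s*m\b|square\s+met(?:re|er)s?))",
    re.IGNORECASE,
)
SQFT_PER_SQM = 10.7639  # square feet in one square metre

def extract_area_sqft(description):
    """Return the first area found in the text, in square feet, or None."""
    match = AREA_PATTERN.search(description)
    if match is None:
        return None
    value = float(match.group("value").replace(",", ""))
    # Convert metric figures so that all listings share one unit.
    return value * SQFT_PER_SQM if match.group("m") else value

print(extract_area_sqft("Approx. 1,250 sq ft of living space"))
print(extract_area_sqft("A bright 85 sq m flat"))
print(extract_area_sqft("No size given"))
```

Because figures such as "1,250" or "45.5" rely on "," and ".", an extraction like this would have to run on the descriptions before those characters are stripped during cleaning.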

Description Analysis:

Initial comparisons

We know that descriptions are often reused by agents to advertise apartments in the same building. We ran a quick value count of how many identical descriptions were present in our data set.

The highest counts of identical descriptions were (top 10): 63, 47, 38, 29, 18, 16, 13, 12, 12, 11.

However, many of the most repeated descriptions were also very similar to each other (the same description with slight changes such as the agent name). This might indicate scam adverts, or just really lazy agents. Further analysis of this will be done.
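The exact-duplicate count, and one possible starting point for the near-duplicate follow-up, can be sketched as below. The sample descriptions are made up, the function names are hypothetical, and `difflib.SequenceMatcher` is just one candidate similarity measure, not one we have settled on.

```python
from collections import Counter
from difflib import SequenceMatcher

def duplicate_counts(descriptions, top=10):
    """Sizes of the largest groups of identical descriptions (groups of 2+)."""
    counts = Counter(d.strip() for d in descriptions)
    return [n for _, n in counts.most_common(top) if n > 1]

def similarity(a, b):
    """Ratio in [0, 1]; values near 1 mean the texts are almost identical."""
    return SequenceMatcher(None, a, b).ratio()

descriptions = [
    "Lovely two bedroom flat offered by Agent A.",
    "Lovely two bedroom flat offered by Agent A.",
    "Lovely two bedroom flat offered by Agent B.",
    "Detached house with a large garden.",
]
print(duplicate_counts(descriptions))
print(similarity(descriptions[0], descriptions[2]))
```

Comparing every pair of descriptions grows quadratically with the number of listings, so on the full dataset it would probably be sensible to compare only within candidate groups, for example listings sharing a postcode.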

Bigram Analysis

Continuing with the analysis of the ad descriptions, we have started compiling a list of bigrams. In future weeks we may consider trigrams, 4-grams, or other n-grams. The bigrams were compiled using the [Natural Language Toolkit](https://www.nltk.org).

After removing stop words, the ten most common bigrams are as follows:

Bigram	Frequency
double bedroom	16779
reception room	14036
double bedrooms	13179
bedroom apartment	11561
open plan	11448
fitted kitchen	11244
fully fitted	8585
transport links	7329
bedroom flat	6383
ground floor	6248

As the table shows, bigrams mentioning bedrooms are the most common in the descriptions, ranging from 'double bedroom(s)' to '(n)-bedroom apartment'. This is consistent with the distribution of the number of bedrooms pictured above (in that most ads have either 1 or 2 bedrooms). The eighth most common bigram, 'transport links', refers to public transit and is the only term that describes the surroundings rather than the apartment itself. This indicates that an analysis of each listing's distance to transport links will have to be conducted later. The term 'open plan' at number 5 probably refers to the style of the apartment (we assume it means an apartment with a more spacious feeling). A (fully) fitted kitchen also appears to be important in the adverts.
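The bigram count can be illustrated with a short standard-library sketch. It stands in for the NLTK-based code (NLTK offers `nltk.bigrams` for the pairing step) so that it runs without extra dependencies. The stop-word set is a made-up minimal list, and removing stop words before pairing adjacent words is an assumption about the order of the steps.

```python
from collections import Counter

# Minimal illustrative stop-word list (an assumption).
STOP_WORDS = {"a", "an", "and", "the", "with", "to", "of"}

def bigram_counts(descriptions):
    """Count adjacent word pairs after lower-casing and stop-word removal."""
    counts = Counter()
    for text in descriptions:
        words = [w for w in text.lower().split() if w not in STOP_WORDS]
        # zip pairs each word with its successor, giving the bigrams.
        counts.update(zip(words, words[1:]))
    return counts

descriptions = [
    "double bedroom with fitted kitchen",
    "open plan reception room and double bedroom",
]
for bigram, frequency in bigram_counts(descriptions).most_common(3):
    print(" ".join(bigram), frequency, sep="\t")
```

Extending this to trigrams would only require zipping a third, further shifted copy of the word list.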

Week 03 (W48 Nov16) London RE - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
