Week 03 (W48 Nov16) London RE - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
- Numerical derivations/visualizations: This week we focused on deriving useful numerical values from our dataset. These feed visualizations that give deeper insight into what our data offers and what further processing can be done.
- Data cleanup: Further pre-processing of the textual descriptions was done and appropriate visualizations were developed.
- Feature derivation: We also discovered that some important features are not explicitly available in our dataset but can be derived for the listings from the description.
- Description analysis: We started analyzing the descriptions in the form of n-grams.
We have decided to analyze the type of housing for each ad.
Type | Number |
---|---|
Flat | 41227 |
Studio | 4089 |
Terraced house | 2331 |
Maisonette | 1069 |
Semi-detached house | 1004 |
Detached house | 665 |
Town house | 365 |
End terrace house | 306 |
Mews house | 238 |
Parking/garage | 91 |
Cottage | 88 |
Block of flats | 38 |
Bungalow | 33 |
Link-detached house | 21 |
Barn conversion | 18 |
Land | 12 |
Detached bungalow | 8 |
Houseboat | 6 |
Lodge | 6 |
Retail premises | 5 |
Office | 2 |
Semi-detached bungalow | 2 |
Farmhouse | 2 |
Equestrian property | 2 |
Restaurant/cafe | 1 |
Terraced bungalow | 1 |
As the table above shows, the vast majority of the ads are for flats. This is not surprising, as most of the listings are for apartment buildings in the center of London. It is interesting to note the variety of types of houses and bungalows. The data also includes a very small number of listings for commercial spaces (Retail premises: 5, Office: 2, Restaurant/cafe: 1) as well as parking spaces (91).
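A minimal sketch of how such a count per housing type could be produced. The sample list is hypothetical and only stands in for the property type column of the real dataset.

```python
from collections import Counter

# Hypothetical sample; in the real dataset there is one property type per listing.
listing_types = ["Flat", "Studio", "Flat", "Terraced house", "Flat", "Studio"]

type_counts = Counter(listing_types)

# most_common() orders the types by descending count, like the table above.
for property_type, count in type_counts.most_common():
    print(f"{property_type} | {count} |")
```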
This week we focused on numerical derivations that let us draw concrete conclusions and gain a better understanding of the data at hand. These derivations will help us with the future goals of our project. For this purpose, the following values were defined:
- Average price of listings by each agent: The analysis showed that some agents offer pricier properties than others. The average price per agent ranged between 100£ and 25.241£. Overall there were 2618 different agents.
- Average price of listings per number of bedrooms: The number of bedrooms ranged from 1 to 29. The majority of properties had 2, 1 or 3 bedrooms. There was one listing with 29 bedrooms. If you are interested in what that property looks like, you can see the ad here.
This image shows the distribution of the rental price per month according to the number of bedrooms.
The distribution of the number of bedrooms is shown in the following image.
- Average price of listings by location: We also looked at how prices are influenced by the location of the property. The following image shows that some regions are pricier than others.
The most expensive part of London is the district 'W1', with an average rental price of 8105£ per month, while the cheapest region is 'SE06', where one has to pay 'only' 1027£ per month. Overall, London is an expensive city to live in compared to Munich, no matter where.
For a better visualisation, see the postcode map on Wikipedia.
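The three averages above (per agent, per number of bedrooms, per postcode district) are all the same group-and-average operation. A minimal sketch follows; the records and the field names `agent`, `bedrooms`, `postcode` and `price` are assumptions for illustration, not the actual column names of the dataset.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records; field names and values are illustrative assumptions.
listings = [
    {"agent": "Agent A", "bedrooms": 1, "postcode": "W1", "price": 2000},
    {"agent": "Agent A", "bedrooms": 2, "postcode": "W1", "price": 3000},
    {"agent": "Agent B", "bedrooms": 1, "postcode": "SE6", "price": 1000},
]

def average_price_by(records, key):
    """Return {group value: mean monthly price} for the given field."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["price"])
    return {group: mean(prices) for group, prices in groups.items()}

by_agent = average_price_by(listings, "agent")
by_bedrooms = average_price_by(listings, "bedrooms")
by_postcode = average_price_by(listings, "postcode")
```

Using one helper for all three groupings keeps the derivations consistent, so differences between the resulting numbers come from the data rather than from the way each average was computed.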
Last week the HTML tags were removed from our dataset to make the text easier to read. However, further processing was still required before the descriptions could be analyzed. This week the following data cleaning was done:
-
Eliminating special characters: Certain special characters only took up space in the text and would have gotten in the way during the description analysis phase. For instance, some agents used "____" to separate paragraphs in their property descriptions. The following characters were eliminated from the long and short property descriptions: a) _ b) ! c) * d) . e) , f) ( and ) g)
-
Removing stop words: To remove words that add no substance to the description, stop words were removed from the property descriptions. Some further tweaking is, of course, still needed. A word cloud of the remaining words was built by converting all strings into a bag-of-words representation with term frequencies. A minimum-frequency threshold was used to keep only the most frequent words, which were then used for the word cloud, giving the following result:
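The cleaning and term-frequency steps in this list can be sketched roughly as follows. The stop-word list is a tiny illustrative assumption, as are the lowercasing and the choice to replace special characters with spaces rather than delete them; the character set is the one listed above.

```python
import re
from collections import Counter

# Characters listed above: _ ! * . , ( )
SPECIAL_CHARS = re.compile(r"[_!*.,()]")
# Tiny illustrative stop-word list (assumption); a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "in", "is", "of", "to", "with"}

def clean_description(text):
    """Lowercase, replace special characters with spaces, drop stop words."""
    text = SPECIAL_CHARS.sub(" ", text.lower())
    return [word for word in text.split() if word not in STOP_WORDS]

def frequent_terms(descriptions, min_count):
    """Count terms over all descriptions and keep those seen at least min_count times."""
    counts = Counter()
    for description in descriptions:
        counts.update(clean_description(description))
    return {term: n for term, n in counts.items() if n >= min_count}

# Hypothetical descriptions, for illustration only.
sample_descriptions = [
    "A bright flat with a garden!",
    "Bright flat, close to the station.",
    "____ Spacious studio (furnished)",
]
word_cloud_terms = frequent_terms(sample_descriptions, min_count=2)
```

The resulting term-to-frequency mapping is the kind of input a word cloud generator typically expects.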
- Feature derivations: For properties, per-unit-area measurements are very important for drawing meaningful conclusions about a listing. Our data does not explicitly provide this information. However, many agents mention the area of the property in the property description. We decided to extract this data from the description and record it as a new feature for the properties in our dataset. There are, however, certain issues that we face in taking this approach.
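One possible way to pull an area measurement out of a free-text description is a regular expression. This is only a sketch: the accepted unit spellings (`sq ft`, `sqft`, `sq. ft`, `sq m`, `sqm`, `square feet`, `square metres`) are assumptions about how agents write areas, and it has to run on the raw description, before `.` and `,` are stripped by the cleanup step.

```python
import re

# Number (optionally with thousands separators or decimals) followed by an area unit.
# The list of unit spellings is an assumption, not taken from the dataset.
AREA_PATTERN = re.compile(
    r"(\d[\d,]*(?:\.\d+)?)\s*"
    r"(sq\.?\s*(?:ft|m)|square\s+(?:feet|foot|metres|meters))\b",
    re.IGNORECASE,
)

def extract_area(description):
    """Return (value, unit) for the first area mention, or None if there is none."""
    match = AREA_PATTERN.search(description)
    if match is None:
        return None
    value = float(match.group(1).replace(",", ""))
    unit_text = match.group(2).lower()
    unit = "sq ft" if unit_text.endswith(("ft", "feet", "foot")) else "sq m"
    return value, unit
```

Returning `None` when nothing matches makes the missing-data case explicit, which matters because not every description mentions an area.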
We know that agents often reuse descriptions to advertise apartments in the same building. We ran a quick value count of how many identical descriptions are present in our dataset.
The highest counts of identical descriptions (top 10) were: 63, 47, 38, 29, 18, 16, 13, 12, 12, 11.
However, many of the most repeated descriptions were also very similar to each other (the same description with slight changes such as the agent name). This might indicate scam adverts or just really lazy agents. Further analysis on this will be done.
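A minimal sketch of the duplicate count, plus one possible way to flag near-identical descriptions. The similarity measure (`difflib.SequenceMatcher` from the standard library) is an assumption for illustration, not a method this report says was used, and the sample descriptions are hypothetical.

```python
from collections import Counter
from difflib import SequenceMatcher

# Hypothetical descriptions, for illustration only.
descriptions = [
    "Bright two bedroom flat offered by Agent A",
    "Bright two bedroom flat offered by Agent A",
    "Bright two bedroom flat offered by Agent B",
    "Studio close to the station",
]

# Exact duplicates: how often each identical description occurs (top 10 counts).
description_counts = Counter(descriptions)
top_repeats = [count for _, count in description_counts.most_common(10) if count > 1]

def similarity(a, b):
    """Return a ratio in [0, 1]; values near 1 mean almost identical text."""
    return SequenceMatcher(None, a, b).ratio()

# Near-duplicates: same text apart from a small change such as the agent name.
near_duplicate = similarity(descriptions[0], descriptions[2]) > 0.9
```

Comparing every pair of descriptions this way grows quadratically with the number of listings, so on the full dataset it would probably only be practical for the most repeated descriptions.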
Continuing with the analysis of the ad descriptions, we have started compiling a list of bigrams. In future weeks we may consider trigrams, quadgrams, or other n-grams. The bigrams were compiled using the [Natural Language Toolkit](https://www.nltk.org).
Taking out stop-words, the ten most common bigrams are as follows:
Bigram | Frequency |
---|---|
double bedroom | 16779 |
reception room | 14036 |
double bedrooms | 13179 |
bedroom apartment | 11561 |
open plan | 11448 |
fitted kitchen | 11244 |
fully fitted | 8585 |
transport links | 7329 |
bedroom flat | 6383 |
ground floor | 6248 |
As the table shows, the most common bigrams mention bedrooms, ranging from double bedroom(s) to (n)-bedroom apartment. This is consistent with the distribution of the number of bedrooms pictured above (most ads have either 1 or 2 bedrooms). The eighth most common bigram, 'transport links', refers to public transit and is the only term that describes the apartment's surroundings rather than the apartment itself. This indicates that the distance from each listing to transport links will have to be analyzed later. The term 'open plan' at number 5 probably refers to the style of the apartment (assuming it means an apartment with a more spacious feeling). A (fully) fitted kitchen is evidently also important in the adverts.
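The bigram count can be illustrated with a dependency-free sketch. It is a plain-Python stand-in, not the NLTK code used for the table above. The stop-word list is a small illustrative assumption, and so is the choice to drop stop words before pairing adjacent tokens.

```python
from collections import Counter

# Tiny illustrative stop-word list (assumption).
STOP_WORDS = {"a", "and", "the", "with", "in", "to"}

def bigram_counts(descriptions):
    """Count adjacent word pairs per description, after dropping stop words."""
    counts = Counter()
    for description in descriptions:
        tokens = [t for t in description.lower().split() if t not in STOP_WORDS]
        # zip(tokens, tokens[1:]) yields each pair of neighbouring tokens.
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Hypothetical descriptions, for illustration only.
sample_ads = [
    "Large double bedroom with open plan kitchen",
    "A double bedroom and a fitted kitchen",
]
top_bigrams = bigram_counts(sample_ads).most_common(10)
```

Pairs are formed within each description only, so the last word of one ad is never paired with the first word of the next.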