Week 04 05 (W49 50 Dec7 Dec14) London RE - Rostlab/DM_CS_WS_2016-17 GitHub Wiki
Week 04/05 Summary:
- Created new features, namely area measurements, that are usually essential for drawing any conclusion about property listings.
- Detected problems caused by the free-text descriptions that agents enter by hand. These need to be dealt with over the coming weeks to make listings more reliably comparable.
- Description Analysis: Bigrams and Trigrams, a first step towards description generation
Details of work done in Week 04/05:
New features and problems faced:
Our dataset did not explicitly provide area measurements for the listings. Usually the agents had stated the area of the advertised property in the description. Using string manipulation, we managed to extract these measurements and record them as separate features. We searched each description for the string patterns that denote the unit of measurement and extracted the numeric value preceding the pattern. As an example, consider this (pre-processed) description of a property: "...The apartment itself is a great size spread over 514 sqft or 48 sqm located on the fourth floor The apartment shows a high end finish throughout..." Here we searched for the pattern sqm, and the numerical value preceding it was recorded as a separate feature denoting the area in sqm.
Since most agents used only one of the two units of measurement, we created two separate features, sqm and sqft, to record the respective numerical values. Once this was done, for listings that had only one of the two features populated, we calculated and filled in the other using the relation 1 sqm = 10.7639 sqft.
However, only about 35% of our listings contained area measurements that we could extract from the descriptions. Because this information is vital, we still have to find a workaround for this problem. One likely cause is that agents can state the area in numerous ways that are not readily picked up as a pattern but are easily understood by a human reader. For instance, one agent gave the area as "five-hundred s ft". There are many such variants, and we are still looking for a way to handle them.
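The extraction and fill-in steps described above can be sketched roughly as follows. This is a minimal illustration rather than the code we actually ran: the regular expressions and the function name `extract_area` are assumptions, and the patterns only cover a number written in digits directly before `sqm` or `sqft`.

```python
import re

SQFT_PER_SQM = 10.7639  # conversion factor stated in the text above

# Assumed patterns: a number in digits, optional whitespace, then the unit token.
SQM_RE = re.compile(r"(\d+(?:\.\d+)?)\s*sqm\b", re.IGNORECASE)
SQFT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*sqft\b", re.IGNORECASE)

def extract_area(description):
    """Return (sqm, sqft); a missing unit is derived from the other one.

    Both values are None when neither pattern is found.
    """
    sqm_match = SQM_RE.search(description)
    sqft_match = SQFT_RE.search(description)
    sqm = float(sqm_match.group(1)) if sqm_match else None
    sqft = float(sqft_match.group(1)) if sqft_match else None
    if sqm is None and sqft is not None:
        sqm = sqft / SQFT_PER_SQM
    elif sqft is None and sqm is not None:
        sqft = sqm * SQFT_PER_SQM
    return sqm, sqft

example = "a great size spread over 514 sqft or 48 sqm located on the fourth floor"
print(extract_area(example))
```

A spelled-out area such as "five-hundred s ft" yields `(None, None)` with these patterns, which illustrates why a large share of listings ends up without an area feature.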
Description Analysis
Significance
We analyzed the significant content (i.e. everything that is not a stopword) of the individual descriptions and found that, on average, about 61.63% of each description is significant. The stopwords were taken from the list included in NLTK.
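A minimal sketch of such a significance measure is shown below. It assumes the share is computed over word tokens, which the report does not state explicitly. It also uses a tiny hard-coded stopword set so that it runs without downloads, whereas the analysis above used the NLTK stopword list. The function name `significant_share` is hypothetical.

```python
import re

# Tiny illustrative stopword set (an assumption for this sketch);
# the analysis above used the stopword list shipped with NLTK.
STOPWORDS = {"the", "is", "a", "of", "in", "and", "to", "with", "on", "over"}

def significant_share(description, stopwords=STOPWORDS):
    """Fraction of word tokens that are not stopwords (0.0 if there are no tokens)."""
    tokens = re.findall(r"[a-z0-9']+", description.lower())
    if not tokens:
        return 0.0
    significant = [t for t in tokens if t not in stopwords]
    return len(significant) / len(tokens)

print(significant_share("The apartment is in the heart of London"))
```

Averaging this value over all descriptions gives a single figure comparable in spirit to the one reported above, although the exact number depends on the tokenizer and the stopword list.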
Distance analysis
In the last report we noted that some of the regularly recurring postings have similar descriptions. We took the 20 most frequently occurring descriptions and analyzed the distance between them using NLTK. The distance is measured as the number of character edits needed to turn the first description into the second. This is then divided by the length of the first description to give the change as a percentage. The image below shows how similar the descriptions are.
Green marks the most similar pairs, with a difference of about 70% or less. Yellow marks a difference below 100% and red a difference above 100%. Because the distance is divided by the length of the first description, red generally occurs when the second description is much longer than the first. It can be noted that most of the similar descriptions are found among the most common identical postings.
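The relative distance described above can be sketched as follows. To stay self-contained, the sketch implements the character edit (Levenshtein) distance directly instead of calling NLTK. The function names are hypothetical, and treating substitutions as a single edit is an assumption about how the distance was configured.

```python
def levenshtein(a, b):
    """Number of single-character insertions, deletions, or substitutions turning a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute ca with cb
        previous = current
    return previous[-1]

def relative_change(first, second):
    """Edit distance as a percentage of the length of the first description."""
    if not first:
        raise ValueError("first description must not be empty")
    return 100.0 * levenshtein(first, second) / len(first)

print(relative_change("kitten", "sitting"))  # 3 edits over 6 characters
print(relative_change("ab", "abcdef"))       # second text is much longer
```

The second call shows why values above 100% are possible: four insertions are needed, but the first text is only two characters long.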
Bigrams
We are looking into generating property descriptions with bigrams. Using the conditional frequency distribution in NLTK, we can generate a block of text from a seed word by repeatedly appending the word that most commonly follows the current one. For a length of 20 words and the seed word "the":
The property is a large reception room , a large reception room , a large reception room , a large
using "avaliable"
avaliable now . The property is a large reception room , a large reception room , a large reception room
using "*" - a punctuation marker found in the ads
* free weekly cleaning service or make the heart of the heart of the heart of the heart of the
Unfortunately, always choosing the most common following word generally resulted in a loop, as seen above, mostly at "a large reception room" or "the heart of". Next, we want to pick the following word at random from the available candidates, but this choice will have to be weighted appropriately. At the moment we are deciding how to weight the candidate words and how to combine bigrams and/or trigrams so that the generated descriptions make sense.
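The difference between the greedy generation and a frequency-weighted random choice can be sketched with the standard library alone. This is an illustration under assumptions, not our NLTK-based code: `build_bigram_counts` and `generate` are hypothetical names, and weighting by raw bigram frequency is just one possible weighting scheme.

```python
import random
from collections import Counter, defaultdict

def build_bigram_counts(tokens):
    """Map each word to a Counter of the words that directly follow it."""
    followers = defaultdict(Counter)
    for current, nxt in zip(tokens, tokens[1:]):
        followers[current][nxt] += 1
    return followers

def generate(followers, seed, length=20, rng=None):
    """Generate up to `length` words starting from `seed`.

    With rng=None the most frequent follower is always taken (greedy);
    otherwise the follower is sampled with probability proportional to its count.
    """
    words = [seed]
    while len(words) < length and followers.get(words[-1]):
        options = followers[words[-1]]
        if rng is None:
            nxt = options.most_common(1)[0][0]
        else:
            nxt = rng.choices(list(options), weights=list(options.values()))[0]
        words.append(nxt)
    return " ".join(words)

tokens = "the heart of the heart of the city".split()
followers = build_bigram_counts(tokens)
print(generate(followers, "the", length=7))                        # greedy
print(generate(followers, "the", length=7, rng=random.Random(0)))  # weighted
```

On this toy corpus the greedy variant cycles through "the heart of", mirroring the loops reported above, while the weighted variant can leave the cycle whenever a less frequent follower is drawn.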
Trigrams
Continuing the N-gram analysis of the property descriptions, we have started to analyze trigrams. Because stopwords occur more commonly as the N-grams get larger, the frequency distribution was built with the stopwords included. The following table includes stopwords but does not include punctuation.
Trigram | Frequency |
---|---|
the heart of | 8234 |
in the heart | 7265 |
the property is | 5789 |
reception room with | 5400 |
fully fitted kitchen | 5154 |
a short walk | 5059 |
party or make | 4646 |
the same available | 4646 |
two double bedrooms | 4424 |
to offer this | 4239 |
As one can observe, the frequencies of the most common trigrams are much lower than those of the bigrams. It can also be noted that the common trigrams we predicted from analyzing the bigrams, such as "fully fitted kitchen" or "two double bedrooms", do indeed appear.
Comparing last week's bigrams (without stopwords) with this week's trigrams, more or less the same information is apparent. What is new, however, is that it seems important for a listing to be "in the heart of [London, or specific neighborhood]".
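A trigram frequency table of this kind can be produced along the following lines. The sketch keeps stopwords and drops punctuation, as described above, but the tokenizer (a simple regular expression) and the function name `trigram_counts` are assumptions, and the two example descriptions are made up for illustration.

```python
import re
from collections import Counter

def trigram_counts(descriptions):
    """Count word trigrams within each description, keeping stopwords and dropping punctuation."""
    counts = Counter()
    for text in descriptions:
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        # zip over three offset views yields consecutive word triples
        counts.update(zip(tokens, tokens[1:], tokens[2:]))
    return counts

# Made-up example descriptions, not taken from the dataset
counts = trigram_counts([
    "In the heart of London.",
    "Set in the heart of Camden!",
])
for trigram, frequency in counts.most_common(2):
    print(" ".join(trigram), frequency)
```

Trigrams are counted per description here, so no trigram spans the boundary between two listings.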
Length of property descriptions
The agents' descriptions ranged from 0 to 9960 characters in length (mean = 724; median = 597). We calculated the Pearson correlation coefficient between description length and price, because we hypothesized that the higher the price of a property, the longer its description: agents would put more effort into finding tenants for pricier properties than for less expensive ones.
The outcome was a coefficient of cor = 0.038, which indicates only a very slight positive correlation. Higher-priced properties tend to have marginally longer descriptions, but the correlation was much weaker than we expected.
Furthermore, we looked at the correlation between the price and the number of bedrooms, where we observed a value of cor = 0.046, which is slightly stronger than in the scenario above but still very weak. The number of bedrooms of a property seems to correlate more with its price than the length of its description does.
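For reference, the Pearson correlation coefficient used above can be computed as in the sketch below. The function name `pearson` and the toy numbers are assumptions for illustration only and are not taken from our dataset.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least 2 values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / sqrt(var_x * var_y)

# Toy values (made up): description lengths in characters and prices
lengths = [300, 600, 900, 1200]
prices = [1500, 1400, 2100, 1900]
print(pearson(lengths, prices))
```

The coefficient lies between -1 and 1, so values such as 0.038 and 0.046 are both very close to "no linear relationship".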
Outlook
We hope to be able to use these bigram and trigram analyses to generate some ideal property descriptions.