
House Prices

Team members:

Han Zhou
Gulnoza Khakimova
Yujing Wu

Source code

Project proposal

Goals and Objectives

Motivation

Big Data analysis is becoming more popular year by year. Trained models help make predictions in areas like medicine, the auto industry, education, and media. For our project we decided to implement house price prediction: we create a model that can be used in real estate investment.

Significance

The created model can be used to predict house prices, so it can be a good tool for real estate investors who want to look at price predictions before buying a house and making an investment, which may help them save or earn money. Individual home buyers can also use the model before purchasing a house, to check that they are not buying an overpriced house for their area. As we know, the real estate market is booming, which means that property prices are very high.

Objectives

With 79 variables that describe the features of a house, we try to predict the final price of each home. In other words, we create a prediction machine: we give it a dataset containing the price of each home as input to train it. Then, when there is a new buyer, we can input the house's features and get a predicted price from the machine we trained before.

System Features

The system features of this project include predicting a house's price. The prediction is based on the given dataset, which is the training set. After training, the machine can predict the final price of each home.

Increment 1

Dataset

File descriptions
We have two dataset files. One is train.csv, which stores the data for training our model. We also have test.csv, which collects the data for validating the accuracy of our model.

  • train.csv - the training set

  • test.csv - the test set

In the training data set file, we have 81 columns and 1460 rows: the Id column, 79 features, and the target. In the testing data file, we have 80 columns and 1459 rows. The problem here is that the testing data file does not have the prediction target, called "SalePrice". This means we cannot evaluate our results on it. So we give up the original testing data set and divide the original training data set into two parts: a training set and a validation set.

The new training data is 70 percent of the original. That is to say, we have 1022 rows and 81 columns in the training set.

The new validation data is the remaining 30 percent of the original. In other words, we have 438 rows and 81 columns in the validation set.
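As a minimal sketch (assuming the Kaggle CSV files sit in the working directory), the shapes can be verified with pandas:

```python
import pandas as pd

# Load the Kaggle files (assumed to be in the working directory).
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

print(train.shape)  # (1460, 81) -- Id, 79 features, and SalePrice
print(test.shape)   # (1459, 80) -- same columns but without SalePrice
```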

Detail design of Features

Here's a brief version of what you'll find in the data description file.

  1. Id: Unique identifier for each house
  2. SalePrice - the property's sale price in dollars. This is the target variable that you're trying to predict.
  3. MSSubClass: The building class
  4. MSZoning: The general zoning classification
  5. LotFrontage: Linear feet of street connected to property
  6. LotArea: Lot size in square feet
  7. Street: Type of road access
  8. Alley: Type of alley access
  9. LotShape: General shape of property
  10. LandContour: Flatness of the property
  11. Utilities: Type of utilities available
  12. LotConfig: Lot configuration
  13. LandSlope: Slope of property
  14. Neighborhood: Physical locations within Ames city limits
  15. Condition1: Proximity to main road or railroad
  16. Condition2: Proximity to main road or railroad (if a second is present)
  17. BldgType: Type of dwelling
  18. HouseStyle: Style of dwelling
  19. OverallQual: Overall material and finish quality
  20. OverallCond: Overall condition rating
  21. YearBuilt: Original construction date
  22. YearRemodAdd: Remodel date
  23. RoofStyle: Type of roof
  24. RoofMatl: Roof material
  25. Exterior1st: Exterior covering on house
  26. Exterior2nd: Exterior covering on house (if more than one material)
  27. MasVnrType: Masonry veneer type
  28. MasVnrArea: Masonry veneer area in square feet
  29. ExterQual: Exterior material quality
  30. ExterCond: Present condition of the material on the exterior
  31. Foundation: Type of foundation
  32. BsmtQual: Height of the basement
  33. BsmtCond: General condition of the basement
  34. BsmtExposure: Walkout or garden level basement walls
  35. BsmtFinType1: Quality of basement finished area
  36. BsmtFinSF1: Type 1 finished square feet
  37. BsmtFinType2: Quality of second finished area (if present)
  38. BsmtFinSF2: Type 2 finished square feet
  39. BsmtUnfSF: Unfinished square feet of basement area
  40. TotalBsmtSF: Total square feet of basement area
  41. HeatingQC: Heating quality and condition
  42. CentralAir: Central air conditioning
  43. Electrical: Electrical system
  44. 1stFlrSF: First Floor square feet
  45. 2ndFlrSF: Second floor square feet
  46. LowQualFinSF: Low quality finished square feet (all floors)
  47. GrLivArea: Above grade (ground) living area square feet
  48. BsmtFullBath: Basement full bathrooms
  49. BsmtHalfBath: Basement half bathrooms
  50. FullBath: Full bathrooms above grade
  51. HalfBath: Half baths above grade
  52. Bedroom: Number of bedrooms above basement level
  53. Kitchen: Number of kitchens
  54. KitchenQual: Kitchen quality
  55. TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
  56. Functional: Home functionality rating
  57. Fireplaces: Number of fireplaces
  58. FireplaceQu: Fireplace quality
  59. GarageType: Garage location
  60. GarageYrBlt: Year garage was built
  61. GarageFinish: Interior finish of the garage
  62. GarageCars: Size of garage in car capacity
  63. GarageArea: Size of garage in square feet
  64. GarageQual: Garage quality
  65. GarageCond: Garage condition
  66. PavedDrive: Paved driveway
  67. WoodDeckSF: Wood deck area in square feet
  68. OpenPorchSF: Open porch area in square feet
  69. EnclosedPorch: Enclosed porch area in square feet
  70. 3SsnPorch: Three season porch area in square feet
  71. ScreenPorch: Screen porch area in square feet
  72. PoolArea: Pool area in square feet
  73. PoolQC: Pool quality
  74. Fence: Fence quality
  75. MiscFeature: Miscellaneous feature not covered in other categories
  76. MiscVal: Dollar value of miscellaneous feature
  77. MoSold: Month sold
  78. YrSold: Year sold
  79. SaleType: Type of sale
  80. SaleCondition: Condition of sale
  81. Heating: Type of heating

Analysis

We set up the whole work in four parts; in each one we do related work on the data so that we can generate better results.

In the first place, we have to analyze our original data. Some features may have little relationship with the dependent variable, so we have to plot and calculate to find out the features that matter most for our further research.

Second, we need to do some preprocessing of our original data. Since there is a huge amount of data, null values and noise values are unavoidable. To solve this, we can take measures to remove or replace them. After cleaning the data, the model will be easier to fit and we will get a better result.

Next, we can apply the data to train a model. We use 79 features as independent variables and one feature, SalePrice, as the dependent variable. We use the training set to train the model, so that when we give the validation set as input the model returns a prediction of the dependent variable.

Finally, we can evaluate our results. In the last step, we compare the predicted values with the real values, which gives us the accuracy of the model; the higher the accuracy, the better our model.

Implementation

We load the data and do some preparation for preprocessing. First of all, we combine the test set and train set to do EDA of the features and check for missing values. The columns that contain missing values are shown below; there are 35 such columns (including SalePrice).

The code below shows how to check for missing data.
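A minimal sketch, assuming the combined frame is named all_data:

```python
# Stack train and test to inspect missing values across the full data.
all_data = pd.concat([train, test], sort=False)

# Count missing values per column, keeping only columns that have any.
missing = all_data.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)  # 35 columns contain missing values (including SalePrice)
```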

After identifying the missing data, we need to fill it in.

Based on the data description, we found that a missing value in several columns simply means 'None' (the feature is absent).

Also, some of these 'None' values need to be expressed as 0.0 in the corresponding numeric columns.

Then we fill the remaining columns with their most common value, since they contain only a few missing entries, and replace the remaining null values with 0. A sketch of these fills is below.
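A hedged sketch of the fills; the exact column lists here are illustrative, since the data description determines which columns belong in each group:

```python
# Categorical columns where NaN means the feature is absent -> 'None'.
# (Illustrative subset; the data description lists all such columns.)
none_cols = ["Alley", "Fence", "FireplaceQu", "PoolQC", "MiscFeature",
             "GarageType", "GarageFinish", "GarageQual", "GarageCond"]
for col in none_cols:
    all_data[col] = all_data[col].fillna("None")

# Numeric counterparts of absent features -> 0.0.
zero_cols = ["GarageArea", "GarageCars", "BsmtFinSF1", "BsmtFinSF2",
             "BsmtUnfSF", "TotalBsmtSF", "MasVnrArea"]
for col in zero_cols:
    all_data[col] = all_data[col].fillna(0.0)

# Columns with only a handful of gaps -> their most common value.
for col in ["Electrical", "MSZoning", "KitchenQual"]:
    all_data[col] = all_data[col].fillna(all_data[col].mode()[0])
```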

As mentioned before, the original test set does not have the target column, so we split the original training data ourselves. We use two ways to split the data; a sketch of both follows.
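One plausible sketch of the two splits, assuming a manual 70/30 slice and a call to scikit-learn's train_test_split (the exact code may differ):

```python
from sklearn.model_selection import train_test_split

# Way 1: shuffle, then slice manually at the 70% mark.
shuffled = train.sample(frac=1, random_state=42)
cut = int(len(shuffled) * 0.7)               # 1022 rows for training
manual_train, manual_valid = shuffled[:cut], shuffled[cut:]

# Way 2: let scikit-learn perform the same 70/30 split.
new_train, validation = train_test_split(train, test_size=0.3, random_state=42)
print(new_train.shape, validation.shape)     # (1022, 81) (438, 81)
```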

Next, we do some analysis on the data.
Let's display all the data features that we have in matrix form.
Below is a heatmap of the top 10 features, the ones that will actually be needed in predicting house prices.
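A sketch of how such a heatmap can be produced with seaborn, assuming the top 10 features are selected by their correlation with SalePrice:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

# Correlation matrix over all numeric features.
numeric = train.select_dtypes(include=[np.number])
corr = numeric.corr()

# The 10 features most correlated with SalePrice, shown as a heatmap.
top10 = corr["SalePrice"].abs().nlargest(10).index
sns.heatmap(numeric[top10].corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Top 10 features correlated with SalePrice")
plt.show()
```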


Now we will visualize each feature in order to clean and organize the data before training.

  • Overall Quality vs Sale Price

    Everything looks good: better house quality results in a higher price.

  • Living Area vs Sale Price
    Looks good except for two points that do not make sense; we can remove them and then check whether Pearson's correlation increases (see the sketch after this list).


    Yes, Pearson's correlation increased after cleaning the data.

  • Garage Area vs Sale Price

  • Removing outliers manually (more than 1,000 sq ft but less than $300k) and checking Pearson's correlation

  • Garage Cars vs Sale Price

  • Removing outliers manually (more than 4 cars but less than $300k)

  • Basement Area vs Sale Price

    Looks good; no need to clean it.

  • First Floor Area vs Sale Price

  • Total Rooms vs Sale Price
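
As referenced in the list above, here is a sketch of the outlier cleanup and the Pearson check. The garage thresholds follow the bullets; the GrLivArea cut-off of 4,000 sq ft for the two implausible points is an assumption:

```python
from scipy.stats import pearsonr

def pearson(df, col):
    """Pearson correlation between one feature and SalePrice."""
    r, _ = pearsonr(df[col], df["SalePrice"])
    return r

print(pearson(train, "GrLivArea"))

# Drop the two implausible points: huge living area but very low price.
train = train[~((train["GrLivArea"] > 4000) & (train["SalePrice"] < 300000))]
print(pearson(train, "GrLivArea"))  # the correlation should increase

# Garage outliers, using the thresholds from the bullets above.
train = train[~((train["GarageArea"] > 1000) & (train["SalePrice"] < 300000))]
train = train[~((train["GarageCars"] > 4) & (train["SalePrice"] < 300000))]
```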

After analysis and preprocessing of the original data, we finally get a valid input set for fitting the model. We use two approaches to apply our ideas. The first is to call RandomForestRegressor() from the scikit-learn library. The second is to define the function ourselves. For each, we fit the model with the training set and evaluate it with the validation set.

  • Call RandomForestRegressor() implementation

The Pearson correlation is recalculated after cleaning the data (as in the sketch above); the model is then fit as follows.

In this function, we use drop to clean the data. We choose the three most important features to analyze and remove the rows with null values. Then we use the train_test_split function to split our data into a train set and a validation set. Next, we fit the model with the train data. Finally, we print out the score of the model using the score function of the metrics library. A sketch of this pipeline is below.
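In this sketch, the three chosen features are assumed to be the strongest correlates (OverallQual, GrLivArea, GarageCars), and n_estimators=100 is an illustrative setting:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Three top-correlated features (assumed choice for illustration).
features = ["OverallQual", "GrLivArea", "GarageCars"]
data = train.dropna(subset=features + ["SalePrice"])

X_train, X_valid, y_train, y_valid = train_test_split(
    data[features], data["SalePrice"], test_size=0.3, random_state=42)

model = RandomForestRegressor(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

predictions = model.predict(X_valid)   # first part of the results
print(predictions[:10])
print(r2_score(y_valid, predictions))  # second part: the model score
```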

Result
The first part of the results is the prediction of the SalePrice.

The second part is the score of the model.

  • Define RandomForestRegressor() implementation

In this implementation, we define the function ourselves. The same as above, we print out the score of the model using the score function of the metrics library.
Code:
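One way to define such a regressor ourselves is to bag scikit-learn decision trees and average their predictions; this is a sketch, not necessarily the team's exact implementation:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class MyRandomForestRegressor:
    """A hand-rolled random forest: bagged decision trees, averaged."""

    def __init__(self, n_estimators=100, random_state=42):
        self.n_estimators = n_estimators
        self.rng = np.random.RandomState(random_state)
        self.trees = []

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        self.trees = []
        for _ in range(self.n_estimators):
            # Bootstrap sample: draw rows with replacement.
            idx = self.rng.randint(0, len(X), len(X))
            tree = DecisionTreeRegressor(
                max_features="sqrt",  # random feature subset per split
                random_state=self.rng.randint(1 << 30))
            tree.fit(X[idx], y[idx])
            self.trees.append(tree)
        return self

    def predict(self, X):
        # Average the predictions of all the trees.
        X = np.asarray(X)
        return np.mean([tree.predict(X) for tree in self.trees], axis=0)

# Usage mirrors the library version:
#   model = MyRandomForestRegressor().fit(X_train, y_train)
#   print(r2_score(y_valid, model.predict(X_valid)))
```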

Result:

Preliminary Results

The score of the model from the scikit-learn library reaches 0.94, which is high enough to show that the model fits well and can give good predictions.

Project Management

Implementation status report

In the first place, we could not apply the function because there were so many null values, which made our input set invalid. To fix this, we preprocessed the data; the activity log below tracks our attempts.

Activity / Result

  1. Run model with input data set. Error: the input is invalid.
  2. Run model with input data set. Error: the input is invalid.
  3. Run model with input data set. Error: No module named sklearn.
  4. Run model with input data set. Error: No module named pd.
  5. Run model with input data set. Successful.

We preprocessed the data based on some fundamental analysis. After that we met some problems importing packages in the IDE; the IDE provides an easy way to solve this, and then we could run the class successfully.

Work completed

Description

In the first place, we finished analyzing our original data. We displayed some plots and calculations and found the features that matter most for our further research.

Second, we did some preprocessing of the original data. Since there is a huge amount of data, null values and noise values are unavoidable. To solve this, we took measures to remove or replace them.

Next, we applied the data to train a model in two approaches. We used the training set to train the model and gave the validation set as input. For each method, we printed out the prediction of the dependent variable and the scores of the model.

Finally, we evaluated our results by comparing the predictions with the real values. This gives a result for the accuracy of the model; the higher the accuracy, the better the model.

Responsibility (Task, Person)

Gulnoza (Class ID: 13): Analyzed the original data, cleaned the dataset, implemented the random forest algorithm to train our model, and tested the model.
Han Zhou (Class ID: 30): Preprocessed the original data, used the RandomForestRegressor() function to implement the idea, evaluated scores on the results, and helped finish the report.
Yujing Wu (Class ID: )

Contributions (members/percentage)

Work to be completed

Description

We have to keep improving the model by adjusting its parameters. Also, we can dig into deeper relationships in the data so that we can apply more functions and obtain more significant information.

Issues/Concerns

Some problems related to big data are hard to make sense of and even harder to implement. Thus it is quite important for us to define good questions and apply what we have learned in class flexibly.

References/Bibliography