Wiki Report for ICP5 - NagaSurendraBethapudi/Python-ICP GitHub Wiki

Video Link : https://drive.google.com/file/d/17VqrIgRD3GzDLyr7LuAw-7BMsBh6y1Gu/view?usp=sharing


Question 1 :

Delete all the outlier data for the GarageArea field

Explanation :

  • Imported Libraries : 1. import pandas as pd 2. import numpy as np 3. import matplotlib.pyplot as plt
  • Imported Data : https://umkc.box.com/s/mn4mjpsq0pf0ql7prhetu534cxbxnkcu
  • Found the quartiles using boxplots
  • Removed outliers by computing the quartiles with np.percentile(data.GarageArea, 25) and np.percentile(data.GarageArea, 75), then keeping only the rows between the resulting bounds: data[(data.GarageArea>334) & (data.GarageArea<576)]
  • Output :
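The filter described above keeps only the rows between the 25th and 75th percentiles, which discards roughly half of the data. A more common convention is the 1.5 × IQR rule, sketched below as an alternative rather than as the method used in this report. The small data frame is a synthetic stand-in for the GarageArea column, and the names lower, upper, and filtered are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the GarageArea column (assumption: the real
# dataset is not loaded in this sketch).
data = pd.DataFrame(
    {"GarageArea": [0, 300, 334, 400, 480, 520, 576, 600, 1400]}
)

# Quartiles, computed the same way as in the report.
q1 = np.percentile(data.GarageArea, 25)
q3 = np.percentile(data.GarageArea, 75)
iqr = q3 - q1

# 1.5 * IQR fences: values outside [lower, upper] are treated as outliers.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

filtered = data[(data.GarageArea >= lower) & (data.GarageArea <= upper)]
print(filtered)
```

With this rule only the extreme value is dropped, while the values near the quartiles stay in the data.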

Question 2 : Restaurant Revenue Prediction using dataset: https://umkc.box.com/s/ac6vql1s466ss2b99ifetvh9g1yuj1uj

Explanation :

  • Imported Libraries : 1. import pandas as pd 2. import numpy as np 3. import matplotlib.pyplot as plt
  • Imported Data : https://umkc.box.com/s/ac6vql1s466ss2b99ifetvh9g1yuj1uj
  1. Performed basic analysis
  2. Converted string columns to int
  • convert = {"City Group": {"Big Cities": 0, "Other": 1}, "Type" : {"FC" : 0, "IL" : 1, "DT" : 2}}
  3. Split the data into train and test sets
  4. Evaluated the performance using R2 and RMSE
  • print("R squared score : \n", model.score(x_test, y_test))

  • predictions = model.predict(x_test)

  • from sklearn.metrics import mean_squared_error

  • print('RMSE error : \n', np.sqrt(mean_squared_error(y_test, predictions)))
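The steps above can be put together as a minimal end-to-end sketch. Several things here are assumptions, since the report does not state them: the model is taken to be scikit-learn's LinearRegression, the data frame is synthetic, the feature column P1 is hypothetical, and the split uses test_size=0.25 with a fixed random_state. The string columns are converted with Series.map, one column at a time.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the restaurant data (assumption).
rng = np.random.default_rng(0)
n = 40
data = pd.DataFrame({
    "City Group": rng.choice(["Big Cities", "Other"], size=n),
    "Type": rng.choice(["FC", "IL", "DT"], size=n),
    "P1": rng.normal(size=n),  # hypothetical numeric feature
})
data["revenue"] = 2.0 * data["P1"] + rng.normal(scale=0.1, size=n)

# Convert the string columns to integers with the mapping from the report.
convert = {"City Group": {"Big Cities": 0, "Other": 1},
           "Type": {"FC": 0, "IL": 1, "DT": 2}}
for column, mapping in convert.items():
    data[column] = data[column].map(mapping)

# Split into features/target, then into train and test sets.
x = data.drop(columns=["revenue"])
y = data["revenue"]
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=42
)

# Fit the model (LinearRegression is an assumption) and evaluate it.
model = LinearRegression().fit(x_train, y_train)
r2 = model.score(x_test, y_test)  # coefficient of determination
predictions = model.predict(x_test)
mse = mean_squared_error(y_test, predictions)
rmse = np.sqrt(mse)  # RMSE is the square root of the MSE

print("R squared score :", r2)
print("RMSE error :", rmse)
```

Note that mean_squared_error returns the mean squared error, so the square root is needed to report an RMSE.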

Output :


Question 3 : Restaurant Revenue Prediction using dataset: https://umkc.box.com/s/ac6vql1s466ss2b99ifetvh9g1yuj1uj with the top most correlated features

Explanation :

  • Imported Libraries : 1. import pandas as pd 2. import numpy as np 3. import matplotlib.pyplot as plt
  • Imported Data : https://umkc.box.com/s/ac6vql1s466ss2b99ifetvh9g1yuj1uj
  1. Performed basic analysis
  2. Converted string columns to int
  • convert = {"City Group": {"Big Cities": 0, "Other": 1}, "Type" : {"FC" : 0, "IL" : 1, "DT" : 2}}
  3. Split the data into train and test sets
  4. Evaluated the performance using R2 and RMSE
  • print("R squared score : \n", model.score(x_test, y_test))

  • predictions = model.predict(x_test)

  • from sklearn.metrics import mean_squared_error

  • print('RMSE error : \n', np.sqrt(mean_squared_error(y_test, predictions)))

  5. Found the top correlated features using :
  • numeric_features = data.select_dtypes(include=[np.number])

  • corr = numeric_features.corr()

  • print (corr['revenue'].sort_values(ascending=False)[:6], '\n')

  • print (corr['revenue'].sort_values(ascending=False)[-6:])
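The correlation step above can be turned into an explicit feature selection, sketched here under stated assumptions. The data frame is synthetic, the column names strong, weak, negative, and label are hypothetical, and ranking by absolute correlation is a design choice of this sketch (the report prints the six highest and six lowest signed correlations instead).

```python
import numpy as np
import pandas as pd

# Synthetic data (assumption): two columns track revenue closely,
# one is pure noise, and one is non-numeric.
rng = np.random.default_rng(1)
n = 200
base = rng.normal(size=n)
data = pd.DataFrame({
    "strong": base + rng.normal(scale=0.1, size=n),
    "weak": rng.normal(size=n),
    "negative": -base + rng.normal(scale=0.1, size=n),
    "label": ["a"] * n,  # dropped by select_dtypes below
})
data["revenue"] = base

# Same calls as in the report: keep numeric columns, then correlate.
numeric_features = data.select_dtypes(include=[np.number])
corr = numeric_features.corr()

# Rank features by absolute correlation with the target, excluding
# the target itself, and keep the top two.
ranked = corr["revenue"].drop("revenue").abs().sort_values(ascending=False)
top_features = ranked.index[:2].tolist()

# Feature matrix restricted to the selected columns.
x = data[top_features]
print(ranked)
print(top_features)
```

Taking the absolute value keeps strongly negative correlations, which carry as much linear signal as strongly positive ones.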

Output :


Conclusion : Using only the topmost correlated features, the R^2 score improved from -0.66 to 0.05 and the RMSE fell from 0.457 to 0.16.
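To help read these numbers, the sketch below computes both metrics by hand on a toy example. It is only an illustration with made-up values, and the helper names r2_manual and rmse_manual are hypothetical. It shows that predicting the mean of the targets gives an R^2 of 0, and that a negative R^2 means the model does worse than that baseline.

```python
import numpy as np

def r2_manual(y_true, y_pred):
    """R^2 = 1 - (residual sum of squares / total sum of squares)."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

def rmse_manual(y_true, y_pred):
    """Root of the mean squared difference between targets and predictions."""
    return np.sqrt(np.mean((y_true - y_pred) ** 2))

# Toy targets (made-up values for illustration).
y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Baseline: always predict the mean of the targets.
mean_baseline = np.full_like(y_true, y_true.mean())

# A model that is worse than the baseline: predictions in reverse order.
bad_pred = np.array([4.0, 3.0, 2.0, 1.0])

r2_baseline = r2_manual(y_true, mean_baseline)
r2_bad = r2_manual(y_true, bad_pred)
rmse_bad = rmse_manual(y_true, bad_pred)

print("R^2 of mean baseline :", r2_baseline)
print("R^2 of bad model     :", r2_bad)
print("RMSE of bad model    :", rmse_bad)
```

An R^2 close to 0, such as the 0.05 reported above, therefore indicates a model that is only slightly better than always predicting the mean.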


Learning :

  1. Learned about regression
  2. Converting strings to int
  3. Finding correlations between features

Challenges :

  1. Everything looks good.