Assignment Week 4 - agastya2002/IECSE-ML-Winter-2020 GitHub Wiki

Hearty Congratulations!

You guys should pat yourselves on the back for having the patience and perseverance to complete the first 4 weeks of the ML winter project! If you have followed along with us and put in a good amount of hard work, you should now have a thorough understanding of the supervised learning algorithms we have taught. With this week's assignment, we want to improve your intuition for when and where to use these algorithms.

DEADLINE 16 Feb 2021

But before you get started with this week's assignment, we want you to learn about a few important aspects that will help you build better machine learning models!

Pandas Library

Pandas is a Python library that is widely used in the data science community to handle data in tabular form. It has several useful features that make handling and preprocessing data easier. The tutorial linked here covers what you need to get started with pandas: https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
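To give a feel for what working with tabular data in pandas looks like, here is a minimal sketch. The table, its column names and its values are made up for illustration and are not part of any assignment dataset.

```python
import pandas as pd

# Hypothetical toy table; every value here is invented for illustration.
df = pd.DataFrame(
    {
        "name": ["a", "b", "c", "d"],
        "age": [21, 35, 42, 29],
        "score": [88.5, 92.0, 79.5, 85.0],
    }
)

print(df.head())      # first rows of the table
print(df.describe())  # summary statistics for the numeric columns

# Boolean filtering: keep only the rows where age is at least 30.
older = df[df["age"] >= 30]

# Add a derived column and compute the mean of an existing one.
df["score_scaled"] = df["score"] / 100
mean_score = df["score"].mean()
print(mean_score)
```

For the assignments you would normally start from `pd.read_csv(...)` on the downloaded file instead of building the table by hand.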

Data Preprocessing

As most of us end up learning the hard way, most of the time in any ML project is spent collecting data and converting it into a form that our ML model can use. You might skip the data collection part if you're using a dataset available online, but if you're working on a new ML project that no one has worked on before, you will have to create your own dataset manually. In either case you will have to perform data preprocessing, for example to handle NaN (Not a Number) values or to convert labels into one-hot vectors. The following links will help you with this: https://towardsdatascience.com/data-preprocessing-in-python-b52b652e37d5 https://analyticsindiamag.com/data-pre-processing-in-python/

NOTE: You can skip this if you have done the bonus task in week 2.
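The two preprocessing steps mentioned above, handling NaN values and one-hot encoding, can be sketched with pandas as follows. The small table is invented for illustration, and filling with the median is just one possible strategy (dropping the rows or using the mean are common alternatives).

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value and one categorical column.
df = pd.DataFrame(
    {
        "kms": [27000.0, np.nan, 6900.0, 42450.0],
        "fuel": ["Petrol", "Diesel", "Petrol", "CNG"],
    }
)

# Count the missing values in each column before deciding how to handle them.
missing = df.isna().sum()
print(missing)

# Replace the missing entries with the column median (an assumed choice).
df["kms"] = df["kms"].fillna(df["kms"].median())

# One-hot encode the categorical column: one indicator column per category.
encoded = pd.get_dummies(df, columns=["fuel"], dtype=float)
print(encoded)
```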

Assignment 1 - Classification Problem

This dataset is originally from the National Institute of Diabetes and Digestive and Kidney Diseases. The objective of the dataset is to diagnostically predict whether or not a patient has diabetes, based on certain diagnostic measurements included in the dataset. Several constraints were placed on the selection of these instances from a larger database. In particular, all patients here are females at least 21 years old of Pima Indian heritage.

https://www.kaggle.com/uciml/pima-indians-diabetes-database/data

Your objective for this task is to use each of the classification algorithms you have learned (Logistic Regression, KNN, SVM and Decision Trees) to create a classification model for the above dataset, and to infer by comparing the models which one works best. You may also explore other algorithms such as Naive Bayes and Random Forests, and you can approach any of your mentors for help.

You have to upload your source code, which must also clearly state the accuracy you obtained from each type of model. This will help you gain intuition as to which algorithm is more suitable for classification problems!
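One possible way to structure such a comparison with scikit-learn is sketched below. It is not the expected solution: synthetic data from `make_classification` stands in for the Kaggle CSV so that the snippet runs on its own, and the train/test split and all hyperparameters are arbitrary assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (assumption); replace with the real features and labels.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Distance- and margin-based models get feature scaling; the tree does not need it.
models = {
    "Logistic Regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "KNN": make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5)),
    "SVM": make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    "Decision Tree": DecisionTreeClassifier(max_depth=4, random_state=0),
}

# Fit every model on the same split and record its test accuracy.
accuracies = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracies[name] = model.score(X_test, y_test)
    print(f"{name}: {accuracies[name]:.3f}")
```

Evaluating every model on the same held-out split keeps the accuracies comparable. Cross-validation would give a more stable estimate, at the cost of a little more code.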

Assignment 2 - Regression Problem

This dataset contains information about used cars listed on www.cardekho.com. The data can be used for many purposes, such as price prediction to exemplify the use of linear regression in Machine Learning. The columns in the given dataset are as follows:

Car_Name
Year
Selling_Price
Present_Price
Kms_Driven
Fuel_Type
Seller_Type
Transmission
Owner

https://www.kaggle.com/nehalbirla/vehicle-dataset-from-cardekho

Your goal is to predict the selling price of the cars using the algorithms that you have learned (Linear Regression, KNN and SVM) and again list the scores obtained with each model in your source code. For a regression task, report a metric such as the R² score or the mean absolute error rather than classification accuracy. This will help you gain intuition as to which algorithm is more suitable for regression problems!
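A minimal sketch of such a regression comparison is given below, under several assumptions. The table is randomly generated and only borrows a few column names from the list above. The relationship between the columns is invented, and the hyperparameters are arbitrary. With the real data you would load the CSV and encode all categorical columns.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Randomly generated stand-in table (assumption); values are not real listings.
rng = np.random.default_rng(0)
n = 300
df = pd.DataFrame(
    {
        "Year": rng.integers(2005, 2019, size=n),
        "Present_Price": rng.uniform(1.0, 20.0, size=n),
        "Kms_Driven": rng.integers(1000, 150000, size=n),
        "Fuel_Type": rng.choice(["Petrol", "Diesel", "CNG"], size=n),
    }
)
# Invented relationship so that the models have something to learn.
df["Selling_Price"] = (
    0.6 * df["Present_Price"]
    + 0.2 * (df["Year"] - 2005)
    + rng.normal(0, 0.5, size=n)
)

# One-hot encode the categorical column and separate features from the target.
X = pd.get_dummies(df.drop(columns="Selling_Price"), columns=["Fuel_Type"], dtype=float)
y = df["Selling_Price"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "Linear Regression": LinearRegression(),
    "KNN": make_pipeline(StandardScaler(), KNeighborsRegressor(n_neighbors=5)),
    "SVM": make_pipeline(StandardScaler(), SVR(kernel="rbf", C=10.0)),
}

# Fit each model and record R^2 and mean absolute error on the held-out split.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    scores[name] = {
        "r2": r2_score(y_test, pred),
        "mae": mean_absolute_error(y_test, pred),
    }
    print(f"{name}: R2={scores[name]['r2']:.3f}, MAE={scores[name]['mae']:.3f}")
```

The `score` method of scikit-learn regressors also returns R², so `model.score(X_test, y_test)` would give the same first number.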

Here is the Starter Code

Note

You may use scikit-learn to implement the above algorithms.