ICP 4 - ntihindukkipati/CS5590_Python_DL GitHub Wiki

ICP4

1. Find the correlation between ‘survived’(target column) and ‘sex’ column for the Titanic use case in class. Do you think we should keep this feature?

Screenshot (367)

From the screenshot above we can see that females had a noticeably higher probability of surviving than males, so the 'sex' column is strongly correlated with the 'survived' target. We should therefore keep this feature.
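As a minimal sketch of how this check can be done (using a small hand-made toy DataFrame as a stand-in for the actual Titanic CSV used in class), the 'sex' column can be encoded numerically and correlated with 'survived':

```python
import pandas as pd

# Toy stand-in for the Titanic data; the real assignment uses the class CSV.
df = pd.DataFrame({
    "sex":      ["female", "male", "female", "male", "female", "male", "male", "female"],
    "survived": [1,        0,      1,        0,      1,        1,      0,      0],
})

# Encode sex as 0/1 so a Pearson correlation with the target is defined.
df["sex_code"] = df["sex"].map({"male": 0, "female": 1})
corr = df["sex_code"].corr(df["survived"])

# Survival rate per group makes the relationship easy to read off.
rates = df.groupby("sex")["survived"].mean()
print(f"correlation(sex, survived) = {corr:.2f}")
print(rates)
```

A positive correlation (and a higher survival rate for the female group) is the signal that justifies keeping the column.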

2. Implement the Naïve Bayes method using the scikit-learn library.

- Use the dataset available at https://umkc.box.com/s/anji6c8g6034ptm0hgii6fhcu919kx8x
- Use train_test_split to create the training and testing parts.
- Evaluate the model on the testing part using score and classification_report(y_true, y_pred).

Screenshot (368)
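A minimal sketch of the Naïve Bayes pipeline; since the Box download is not available here, scikit-learn's bundled iris dataset is substituted so the snippet stays self-contained:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

# Stand-in data: the assignment's Box dataset is swapped for iris here.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Gaussian Naive Bayes suits continuous numeric features like these.
model = GaussianNB()
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("accuracy:", model.score(X_test, y_test))
print(classification_report(y_test, y_pred))
```

The same three steps (split, fit, evaluate with score and classification_report) carry over unchanged to the assignment's dataset.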

3. Implement the linear SVM method using the scikit-learn library.

- Use the same dataset as above.
- Use train_test_split to create the training and testing parts.
- Evaluate the model on the testing part using score and classification_report(y_true, y_pred).
- Which algorithm gave you better accuracy? Can you justify why?

Screenshot (369)
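The same pipeline with a linear SVM, again using iris as a self-contained stand-in for the Box dataset:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import classification_report

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# kernel="linear" gives the linear SVM asked for; LinearSVC is an alternative.
model = SVC(kernel="linear")
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("accuracy:", model.score(X_test, y_test))
print(classification_report(y_test, y_pred))
```

Keeping the split identical (same random_state) is what makes the two models' scores directly comparable.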

In this experiment, Naive Bayes achieved higher accuracy than the support vector machine. From a theoretical point of view, the two methods are hard to compare directly: one is probabilistic in nature, while the other is geometric. However, it is easy to construct a function with dependencies between variables that Naive Bayes cannot capture (e.g., y(a, b) = ab), so Naive Bayes is not a universal approximator. SVMs with an appropriate choice of kernel are (as are 2- and 3-layer neural networks), so from that point of view the theory matches the practice. In the end, though, it comes down to performance on your specific problem: choose the simplest method that gives good enough results with acceptable performance. Spam detection, for example, has famously been solvable with plain Naive Bayes, and face recognition in images by a similar method enhanced with boosting.
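The y(a, b) = ab point can be illustrated concretely. On synthetic XOR-style data, where the label depends on the interaction of the two features and neither feature is informative on its own, Gaussian Naive Bayes performs near chance while a kernel SVM separates the classes easily (an RBF kernel is assumed here, since a purely linear SVM also fails on XOR):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

# Synthetic XOR-like data: the label is the sign of the product x0 * x1,
# so the class depends only on the interaction between the two features.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(600, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Naive Bayes treats the features as independent given the class, so it
# cannot model the interaction; the RBF kernel can.
nb_acc = GaussianNB().fit(X_train, y_train).score(X_test, y_test)
svm_acc = SVC(kernel="rbf").fit(X_train, y_train).score(X_test, y_test)

print(f"Naive Bayes accuracy: {nb_acc:.2f}, RBF SVM accuracy: {svm_acc:.2f}")
```

This shows why neither method dominates in general: which one wins depends on whether the dataset's class structure matches the model's assumptions.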

Dukkipati, Sri Sai Nithin Chowdary Class Id: 4 Team: 6