9.3.1.K Nearest Neighbors - sj50179/IBM-Data-Science-Professional-Certificate GitHub Wiki

Introduction to Classification

What is classification?

  • A supervised learning approach
  • Categorizing some unknown items into a discrete set of categories or "classel"
  • The target attribute is a categorical variable

How does classification work?

  • Classification determines the class label for an unlabeled test case.

Untitled

Given a set of training data points along with the target labels, classification determines the class label for an unlabeled test case. Let's explain this with an example. A good sample of classification is the loan default prediction. Suppose a bank is concerned about the potential for loans not to be repaid? If previous loan default data can be used to predict which customers are likely to have problems repaying loans, these bad risk customers can either have their loan application declined or offered alternative products. The goal of a loan default predictor is to use existing loan default data which has information about the customers such as age, income, education et cetera, to build a classifier, pass a new customer or potential future default to the model, and then label it, i.e the data points as defaulter or not defaulter. Or for example zero or one. This is how a classifier predicts an unlabeled test case. Please notice that this specific example was about a binary classifier with two values. We can also build classifier models for both binary classification and multi-class classification.

Example of multi-class classification

Untitled

For example, imagine that you've collected data about a set of patients, all of whom suffered from the same illness. During their course of treatment, each patient responded to one of three medications. You can use this labeled dataset with a classification algorithm to build a classification model. Then you can use it to find out which drug might be appropriate for a future patient with the same illness. As you can see, it is a sample of multi-class classification.

Question

What is a multi-class classifier?

  • A classifier that can predict a field with two discrete values, such as ”Defaulter" or "Not Defaulter”.
  • A classifier that can predict a field with multiple discrete values, such as ”DrugA", ”DrugX" or "DrugY”.
  • A classifier that can predict multiple fields with many discrete values.

Correct

Classification use cases

Untitled

  • Which category a customer belongs to?
  • Whether a customer switches to another provider/brand?
  • Whether a customer responds to a pacticular advertising campaign?

Classification applications

Untitled

Data classification has several applications in a wide variety of industries. Essentially, many problems can be expressed as associations between feature and target variables, especially when labelled data is available. This provides a broad range of applicability for classification. For example, classification can be used for email filtering, speech recognition, handwriting recognition, biometric identification, document classification and much more.

Classification algorithms in machine learning

  • Decision Trees (ID3, C4.5, C5.0)
  • Naive Bayes
  • Linear Discriminant Abalysis
  • k-Nearest Neighbor
  • Logistic Regression
  • Neural Networks
  • Support Vector Machines (SVM)

K-Nearest Neighbors

Imagine that a telecommunications provider has segmented his customer base by service usage patterns, categorizing the customers into four groups.

Untitled

If demographic data can be used to predict group membership, the company can customize offers for individual perspective customers. This is a classification problem. That is, given the dataset with predefined labels, we need to build a model to be used to predict the class of a new or unknown case. The example focuses on using demographic data, such as region, age, and marital status to predict usage patterns.

The target field called custcat has four possible values that correspond to the four customer groups as follows: Basic Service, E Service, Plus Service, and Total Service. Our objective is to build a classifier. For example, using the row zero to seven to predict the class of row eight. We will use a specific type of classification called K-Nearest Neighbor.

Determining the class using 1st KNN

Untitled

Determining the class using the 5 KNNs

Untitled

What is K-Nearest Neighbor (or KNN)?

Untitled

  • A method for classifying cases based on their similarity to other cases
  • Cases that are near each other are said to be "neighbors"
  • Based on similar cases with same class labels are near each other

The K-Nearest Neighbors algorithm

  1. Pick a value for K.
  2. Calculate the distance of unknown case from all cases.
  3. Select the K-observations in the training data that are "nearest" to the unknown data point.
  4. Predict the response of the unknown data point using the most popular response value from the K-Nearest neighbors.

What is the best value of K for KNN?

Untitled

Question

Which of the following statements are TRUE about kNN?

  • The kNN algorithm is a classification algorithm.
  • The kNN algorithm classify cases based on their similarity to other cases.
  • The kNN algorithm works only with Euclidian distance.
  • The bigger K is in KNN, the accuracy of the algorithm is better.

Correct

Evaluation Metrics in Classification

Classification accuracy

The question is, “How accurate is this model?” Basically, we compare the actual values in the test set with the values predicted by the model, to calculate the accuracy of the model. Evaluation metrics provide a key role in the development of a model, as they provide insight to areas that might require improvement.

Untitled

Jaccard index

Untitled

Untitled

F1-score

Untitled

  • Precision = TP / (TP + FP)
  • Recall = TP / (TP + FN)
  • F1-score = 2 x (prc x rec) / (prc + rec)

Untitled

Log loss

  • Performance of a classifier where the predicted output is a probability value between 0 and 1.

Untitled

Untitled

Question

Which one is the ideal classifier?

  • The classifier with F1-score close to one.
  • The classifier with LogLoss close to one.
  • The classifier with Jaccard-index close to zero.

Correct