# ICP4 - GeoSnipes/Big-Data GitHub Wiki

#### Sub-Team Members

5-2 15 Naga Venkata Satya Pranoop Mutha

5-2 23 Geovanni West

# Confusion Matrix

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.

• Accuracy: It determines the overall predictive accuracy of the model. It is calculated as Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives).

• Misclassification Rate: Also known as the "Error Rate". Overall, how often is the model wrong? (FP + FN)/total. Equivalent to 1 - Accuracy.

• True Positive Rate (TPR):  Indicates how many positive values, out of all the positive values, have been correctly predicted. The formula to calculate the true positive rate is (TP/(TP + FN)). Also, TPR =  1 - False Negative Rate.

• False Positive Rate (FPR): Indicates how many negative values, out of all the negative values, have been incorrectly predicted. The formula to calculate the false positive rate is (FP/(FP + TN)). Also, FPR = 1 - True Negative Rate.

• Specificity: When it's actually no, how often does the model predict no? TN/actual no. Equivalent to 1 - False Positive Rate.

• Precision: It indicates how many values, out of all the predicted positive values, are actually positive. It is formulated as TP / (TP + FP).

• F Score: The F score is the harmonic mean of precision and recall. It lies between 0 and 1; the higher the value, the better the model. It is formulated as 2 * ((precision * recall) / (precision + recall)).

• Prevalence: How often does the yes condition actually occur in our sample? actual yes/total
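The formulas above can be sketched in plain Python. This is an illustrative example (not part of the original assignment) using hypothetical counts TP = 3, TN = 4, FP = 1, FN = 2:

```python
# Hypothetical confusion-matrix counts for illustration only
TP, TN, FP, FN = 3, 4, 1, 2

total = TP + TN + FP + FN
accuracy = (TP + TN) / total                     # overall predictive accuracy
misclassification = (FP + FN) / total            # error rate, = 1 - accuracy
tpr = TP / (TP + FN)                             # true positive rate (recall)
fpr = FP / (FP + TN)                             # false positive rate
specificity = TN / (TN + FP)                     # = 1 - FPR
precision = TP / (TP + FP)                       # predicted positives that are correct
f_score = 2 * (precision * tpr) / (precision + tpr)  # harmonic mean of precision and recall
prevalence = (TP + FN) / total                   # how often yes actually occurs

print(accuracy, misclassification, tpr, fpr, specificity, precision, f_score, prevalence)
```

With these counts, accuracy is 0.7 and precision is 0.75; plugging in other counts shows how each metric responds to false positives versus false negatives.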

### Code Input

```
man woman
man man
woman woman
man man
woman man
woman woman
woman woman
man man
man woman
woman woman
```

### PySpark Code

```python
"""We have a test dataset of 10 records with expected outcomes and a set of predictions.
Calculate the Confusion Matrix for the below data (using code)."""
from pyspark import SparkContext, SparkConf
from pyspark.mllib.evaluation import MulticlassMetrics

conf = SparkConf().setAppName("ICP4").setMaster("local[*]")
sc = SparkContext(conf=conf)

# each line holds: expected predicted
data = sc.textFile("data.txt")

# MulticlassMetrics only accepts pairs of floats, so on each line of input
# we split the classes (man or woman) on the space,
# then encode man as 0.0 and woman as 1.0
exp_pre = data.map(lambda line: [(float(0) if x == 'man' else float(1)) for x in line.split(' ')]).cache()

# print the encoded pairs as a list;
# the data is small, so this is feasible
# print(exp_pre.collect())

# build the confusion matrix from the (expected, predicted) pairs
metrics = MulticlassMetrics(exp_pre)

# view the confusion matrix
print(metrics.confusionMatrix().toArray())
print("Accuracy: {}".format(metrics.accuracy))
print("Misclassification rate: {:.2f}".format(1 - metrics.accuracy))  # format to 2 decimal places
print("True positive rate: {:.2f}".format(metrics.truePositiveRate(0.0)))
print("False positive rate: {:.2f}".format(metrics.falsePositiveRate(0.0)))
print("Specificity: {:.2f}".format(1 - metrics.falsePositiveRate(0.0)))  # 1 - FPR of the positive class
print("Precision: {:.2f}".format(metrics.precision(0.0)))
```

MulticlassMetrics only accepts pairs of floats, so on each line of input I split the classes (man or woman) on the space, then return 0.0 for man and 1.0 for woman:

```python
exp_pre = data.map(lambda line: [(float(0) if x == 'man' else float(1)) for x in line.split(' ')]).cache()
```

I then convert the data into a confusion matrix:

```python
metrics = MulticlassMetrics(exp_pre)
```

I then use the built-in methods to produce the output, formatted to two decimal places:

```python
print("Accuracy: {}".format(metrics.accuracy))
print("Misclassification rate: {:.2f}".format(1 - metrics.accuracy))
print("True positive rate: {:.2f}".format(metrics.truePositiveRate(0.0)))
print("False positive rate: {:.2f}".format(metrics.falsePositiveRate(0.0)))
print("Specificity: {:.2f}".format(1 - metrics.falsePositiveRate(0.0)))
print("Precision: {:.2f}".format(metrics.precision(0.0)))
```
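As a quick sanity check (a sketch, not part of the original submission), the same confusion matrix can be recomputed from the 10-line dataset in plain Python, without Spark:

```python
# The same 10 records as in data.txt: "expected predicted" per line
raw = """man woman
man man
woman woman
man man
woman man
woman woman
woman woman
man man
man woman
woman woman"""

# Encode man -> 0.0 and woman -> 1.0, matching the PySpark lambda
pairs = [tuple(0.0 if w == 'man' else 1.0 for w in line.split())
         for line in raw.splitlines()]

# matrix[expected][predicted], labels ordered 0 (man), 1 (woman)
matrix = [[0, 0], [0, 0]]
for expected, predicted in pairs:
    matrix[int(expected)][int(predicted)] += 1

correct = matrix[0][0] + matrix[1][1]   # entries on the diagonal
print(matrix)                  # [[3, 2], [1, 4]]
print(correct / len(pairs))    # 0.7
```

The diagonal holds the correct predictions, so the accuracy here is 7/10 = 0.7, which should match the PySpark run.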

### Code Output

### Summary

Based on the output, we can conclude that:

* The model has a good, above-average accuracy of about 70%, though it is far from the most accurate model possible.
* The precision value further suggests that the model is not especially economical, since a noticeable share of its positive predictions are false positives.