ICP4 - GeoSnipes/Big-Data GitHub Wiki

Sub-Team Members

5-2 15 Naga Venkata Satya Pranoop Mutha

5-2 23 Geovanni West


Confusion Matrix

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known. The following metrics can be read directly off it; a small worked sketch of the formulas follows the list.

  • Accuracy:

    It measures the overall predictive accuracy of the model. It is calculated as Accuracy = (True Positives + True Negatives) / (True Positives + True Negatives + False Positives + False Negatives)

  • Misclassification Rate: Also known as "Error Rate". Overall, how often is it wrong? (FP+FN)/total. Equivalent to 1 - Accuracy

  • True Positive Rate (TPR):  Indicates how many positive values, out of all the positive values, have been correctly predicted. The formula to calculate the true positive rate is (TP/(TP + FN)). Also, TPR =  1 - False Negative Rate.

  • False Positive Rate (FPR): Indicates how many negative values, out of all the negative values, have been incorrectly predicted. The formula to calculate the false positive rate is (FP/(FP + TN)). Also, FPR = 1 - True Negative Rate.

  • Specificity: When it's actually no, how often does it predict no? TN/actual no. Equivalent to 1 - False Positive Rate.

  • Precision: It indicates how many values, out of all the predicted positive values, are actually positive. It is formulated as TP / (TP + FP).

  • F Score: The F score is the harmonic mean of precision and recall. It lies between 0 and 1; the higher the value, the better the model. It is formulated as 2 * ((precision * recall) / (precision + recall)).

  • Prevalence: How often does the yes condition actually occur in our sample? actual yes/total
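
As a quick sanity check on these formulas, here is a minimal pure-Python sketch that computes each metric from the four cells of a binary confusion matrix; the tp/fp/fn/tn counts below are hypothetical placeholders, not output from any model:

# hypothetical cell counts of a binary confusion matrix
tp, fp, fn, tn = 3, 1, 2, 4
total = tp + fp + fn + tn

accuracy = (tp + tn) / total                          # 0.70
error_rate = (fp + fn) / total                        # 0.30 = 1 - accuracy
tpr = tp / (tp + fn)                                  # 0.60 (recall/sensitivity)
fpr = fp / (fp + tn)                                  # 0.20
specificity = tn / (tn + fp)                          # 0.80 = 1 - fpr
precision = tp / (tp + fp)                            # 0.75
f_score = 2 * (precision * tpr) / (precision + tpr)   # ~0.67
prevalence = (tp + fn) / total                        # 0.50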

Code Input

Each line of data.txt holds the expected label followed by the predicted label, separated by a space:

man woman
man man
woman woman
man man
woman man
woman woman
woman woman
man man
man woman
woman woman

PySpark Code

"""We have a test dataset of 10 records with expected outcomes and a set of predictions.
 Calculate the Confusion Matrix for the below data(using code)."""

from pyspark import SparkContext, SparkConf from pyspark.mllib.evaluation import MulticlassMetrics

conf = SparkConf().setAppName("ICP4").setMaster("local[*]") sc = SparkContext(conf=conf)

#expected predicted data = sc.textFile("data.txt")

#as to what i understand, MultiClassMetrics only uses pairs of floats #so on each line of input I split the classes(man or woman) by space #then I return 0 if is a man or 1 if is a woman exp_pre = data.map(lambda line: [(float(0) if x == 'man' else float(1)) for x in line.split(' ')]).cache()

#print out how it looks in the array/list, #data is small so this is possible #print(exp_pre.collect())

#convert data into a confusion matrix metrics = MulticlassMetrics(exp_pre)

#view the confusion matrix print(metrics.confusionMatrix().toArray()) print("Accuracy: {}".format(metrics.accuracy)) print("Misclassification rate: {:.2f}".format(1-metrics.accuracy)) #format to 2 decimal places, python only print("True positive rate: {:.2f}".format(metrics.truePositiveRate(0.0))) print("False positive rate: {:.2f}".format(metrics.falsePositiveRate(0.0))) print("Specificity: {:.2f}".format(1 - metrics.falsePositiveRate(1.0))) print("Precision: {:.2f}".format(metrics.precision(0.0)))

MulticlassMetrics only works with pairs of floats, so on each line of input I split the classes (man or woman) by the space and return 0.0 for a man or 1.0 for a woman. Because MulticlassMetrics expects (prediction, label) pairs, the predicted class is put first:
exp_pre = data.map(lambda line: [0.0 if x == 'man' else 1.0
                                 for x in reversed(line.split(' '))]).cache()
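
For the ten sample lines above, print(exp_pre.collect()) should produce something like the following, with the predicted class first and the expected class second (man encoded as 0.0, woman as 1.0):

# (prediction, label) pairs built from the sample input
[[1.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [0.0, 1.0],
 [1.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0], [1.0, 1.0]]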

I then feed the pairs to MulticlassMetrics, which builds the confusion matrix:
metrics = MulticlassMetrics(exp_pre)
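
For the sample data, and given that MulticlassMetrics puts the actual classes in rows and the predicted classes in columns (ordered by label ascending), the printed matrix should come out as:

# rows = actual class, columns = predicted class; 0.0 = man, 1.0 = woman
# [[3. 2.]   3 men predicted as men, 2 men predicted as women
#  [1. 4.]]  1 woman predicted as a man, 4 women predicted as women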

I then use the built-in methods to produce the output, formatted to two decimal places:
print("Accuracy: {}".format(metrics.accuracy))
print("Misclassification rate: {:.2f}".format(1 - metrics.accuracy))
print("True positive rate: {:.2f}".format(metrics.truePositiveRate(0.0)))
print("False positive rate: {:.2f}".format(metrics.falsePositiveRate(0.0)))
print("Specificity: {:.2f}".format(1 - metrics.falsePositiveRate(0.0)))
print("Precision: {:.2f}".format(metrics.precision(0.0)))

Code Output

[Output screenshot: the confusion matrix and metric values printed by the script]

Summary

Based on the output, we can conclude that:

* The model has decent accuracy, at roughly 70%, but it is far from the most accurate model.
* From the precision value, we can also say that it is not an especially precise model.