Confusion Matrix

A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.

Accuracy:
It determines the overall predicted accuracy of the model. It is calculated as Accuracy = (True Positives + True Negatives)/(True Positives + True Negatives + False Positives + False Negatives)

Misclassification Rate: Also known as "Error Rate". Overall, how often is it wrong? (FP+FN)/total. Equivalent to 1 - Accuracy

True Positive Rate (TPR): Indicates how many positive values, out of all the positive values, have been correctly predicted. The formula to calculate the true positive rate is (TP/(TP + FN)). Also, TPR = 1 - False Negative Rate.

False Positive Rate (FPR): Indicates how many negative values, out of all the negative values, have been incorrectly predicted. The formula to calculate the false positive rate is (FP/(FP + TN)). Also, FPR = 1 - True Negative Rate.

Specificity: When it's actually no, how often does it predict no?o TN/actual. Equivalent to 1 - False Positive Rate.

Precision: It indicates how many values, out of all the predicted positive values, are actually positive. It is formulated as:(TP / TP + FP). F Score: F score is the harmonic mean of precision and recall. It lies between 0 and 1. Higher the value, better the model. It is formulated as 2((precision*recall) / (precision+recall)).

Prevalence: How often does the yes condition actually occur in our sample? actual yes/total

Code Input

man woman
man man
woman woman
man man
woman man
woman woman
woman woman
man man
man woman
woman woman

PySpark Code

"""We have a test dataset of 10 records with expected outcomes and a set of predictions.
 Calculate the Confusion Matrix for the below data(using code)."""
from pyspark import SparkContext, SparkConf
from pyspark.mllib.evaluation import MulticlassMetrics
conf = SparkConf().setAppName("ICP4").setMaster("local[*]")
sc = SparkContext(conf=conf)
#expected   predicted
data = sc.textFile("data.txt")
#as to what i understand, MultiClassMetrics only uses pairs of floats
#so on each line of input I split the classes(man or woman) by space
#then I return 0 if is a man or 1 if is a woman
exp_pre = data.map(lambda line: [(float(0) if x == 'man' else float(1)) for x in line.split(' ')]).cache()
#print out how it looks in the array/list,
#data is small so this is possible
#print(exp_pre.collect())
#convert data into a confusion matrix
metrics = MulticlassMetrics(exp_pre)
#view the confusion matrix
print(metrics.confusionMatrix().toArray())
print("Accuracy: {}".format(metrics.accuracy))
print("Misclassification rate: {:.2f}".format(1-metrics.accuracy)) #format to 2 decimal places, python only
print("True positive rate: {:.2f}".format(metrics.truePositiveRate(0.0)))
print("False positive rate: {:.2f}".format(metrics.falsePositiveRate(0.0)))
print("Specificity: {:.2f}".format(1 - metrics.falsePositiveRate(1.0)))
print("Precision: {:.2f}".format(metrics.precision(0.0)))

MultiClassMetrics takes only uses pairs of floats so on each line of input I split the classes(man or woman) by space then I return 0 if is a man or 1 if is a woman:
exp_pre = data.map(lambda line: [(float(0) if x == 'man' else float(1)) for x in line.split(' ')]).cache()

I then convert the data into a confusion matrix:
metrics = MulticlassMetrics(exp_pre)

I then use the inbuilt methods to produce the output formatted to two decimal places:
print("Accuracy: {}".format(metrics.accuracy))
print("Misclassification rate: {:.2f}".format(1-metrics.accuracy))
print("True positive rate: {:.2f}".format(metrics.truePositiveRate(0.0)))
print("False positive rate: {:.2f}".format(metrics.falsePositiveRate(0.0)))
print("Specificity: {:.2f}".format(1 - metrics.falsePositiveRate(1.0)))
print("Precision: {:.2f}".format(metrics.precision(0.0)))

Code Output

Summary

Based on the output, we can conclude that:

* It has a good accuracy, but not the best accurate model with an accuracy of above average of 70%.
* From the value of precision, we can also say that it is not so economic model.

ICP4 - GeoSnipes/Big-Data GitHub Wiki

Sub-Team Members

Confusion Matrix

Accuracy:

Code Input

PySpark Code

Code Output

⚠️ GitHub.com Fallback ⚠️

ICP4 - GeoSnipes/Big-Data GitHub Wiki

Sub-Team Members

Confusion Matrix

Accuracy:

Code Input

PySpark Code

Code Output

⚠️ **GitHub.com Fallback** ⚠️

⚠️ GitHub.com Fallback ⚠️