ICP4 - GeoSnipes/Big-Data GitHub Wiki
5-2 15 Naga Venkata Satya Pranoop Mutha
5-2 23 Geovanni West
A confusion matrix is a table that is often used to describe the performance of a classification model (or "classifier") on a set of test data for which the true values are known.
- Accuracy: It determines how often the model is correct overall. It is calculated as Accuracy = (True Positives + True Negatives)/(True Positives + True Negatives + False Positives + False Negatives).
- Misclassification Rate: Also known as "Error Rate". Overall, how often is the model wrong? (FP + FN)/total; equivalent to 1 - Accuracy.
- True Positive Rate (TPR): Indicates how many positive values, out of all the positive values, have been correctly predicted. The formula to calculate the true positive rate is (TP/(TP + FN)). Also, TPR = 1 - False Negative Rate.
- False Positive Rate (FPR): Indicates how many negative values, out of all the negative values, have been incorrectly predicted. The formula to calculate the false positive rate is (FP/(FP + TN)). Also, FPR = 1 - True Negative Rate.
- Specificity: When the actual value is no, how often does the model predict no? TN/actual no. Equivalent to 1 - False Positive Rate.
- Precision: It indicates how many values, out of all the predicted positive values, are actually positive. It is formulated as TP/(TP + FP).
- F Score: The F score is the harmonic mean of precision and recall. It lies between 0 and 1; the higher the value, the better the model. It is formulated as 2 * ((precision * recall)/(precision + recall)).
- Prevalence: How often does the yes condition actually occur in our sample? actual yes/total
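To make the formulas above concrete, the following plain-Python sketch computes each metric from a set of confusion-matrix counts (the counts here are made-up values for illustration, not taken from the lab data):

```python
# Hypothetical confusion-matrix counts (illustrative values only)
TP, TN, FP, FN = 40, 30, 10, 20
total = TP + TN + FP + FN

accuracy = (TP + TN) / total            # 0.7
error_rate = (FP + FN) / total          # 0.3, same as 1 - accuracy
tpr = TP / (TP + FN)                    # true positive rate (recall)
fpr = FP / (FP + TN)                    # false positive rate
specificity = TN / (TN + FP)            # same as 1 - FPR
precision = TP / (TP + FP)              # 0.8
f_score = 2 * (precision * tpr) / (precision + tpr)  # harmonic mean of precision and recall
prevalence = (TP + FN) / total          # how often "yes" actually occurs

print(accuracy, error_rate, tpr, fpr, specificity, precision, f_score, prevalence)
```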
The contents of data.txt, with each line holding an expected value followed by a predicted value:
```
man woman
man man
woman woman
man man
woman man
woman woman
woman woman
man man
man woman
woman woman
```
```python
"""We have a test dataset of 10 records with expected outcomes and a set of predictions.
Calculate the confusion matrix for the data above (using code)."""
from pyspark import SparkContext, SparkConf
from pyspark.mllib.evaluation import MulticlassMetrics

conf = SparkConf().setAppName("ICP4").setMaster("local[*]")
sc = SparkContext(conf=conf)

# Each line of data.txt holds an expected value followed by a predicted value
data = sc.textFile("data.txt")

# MulticlassMetrics only accepts pairs of floats, so on each line of input
# I split the classes (man or woman) by space, then map man to 0.0 and
# woman to 1.0. Because MulticlassMetrics expects (prediction, label) pairs,
# the predicted value is placed first and the expected value second.
exp_pre = data.map(lambda line: [0.0 if x == 'man' else 1.0 for x in line.split(' ')]) \
              .map(lambda p: (p[1], p[0])) \
              .cache()

# Print how the data looks as a list; the dataset is small, so this is feasible
# print(exp_pre.collect())

# Build the metrics, including the confusion matrix, from the data
metrics = MulticlassMetrics(exp_pre)

# View the confusion matrix
print(metrics.confusionMatrix().toArray())
print("Accuracy: {}".format(metrics.accuracy))
print("Misclassification rate: {:.2f}".format(1 - metrics.accuracy))  # formatted to 2 decimal places
print("True positive rate: {:.2f}".format(metrics.truePositiveRate(0.0)))
print("False positive rate: {:.2f}".format(metrics.falsePositiveRate(0.0)))
print("Specificity: {:.2f}".format(1 - metrics.falsePositiveRate(1.0)))
print("Precision: {:.2f}".format(metrics.precision(0.0)))
```
MulticlassMetrics only accepts pairs of floats, so on each line of input I split the classes (man or woman) by space, then map man to 0.0 and woman to 1.0. Because MulticlassMetrics expects (prediction, label) pairs, the predicted value is placed first:
```python
exp_pre = data.map(lambda line: [0.0 if x == 'man' else 1.0 for x in line.split(' ')]).map(lambda p: (p[1], p[0])).cache()
```
I then build the confusion matrix from the data:
```python
metrics = MulticlassMetrics(exp_pre)
```
I then use the built-in methods to produce the output, formatted to two decimal places:
```python
print("Accuracy: {}".format(metrics.accuracy))
print("Misclassification rate: {:.2f}".format(1 - metrics.accuracy))
print("True positive rate: {:.2f}".format(metrics.truePositiveRate(0.0)))
print("False positive rate: {:.2f}".format(metrics.falsePositiveRate(0.0)))
print("Specificity: {:.2f}".format(1 - metrics.falsePositiveRate(1.0)))
print("Precision: {:.2f}".format(metrics.precision(0.0)))
```
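As a quick sanity check that needs no Spark, the confusion matrix and accuracy can also be tallied in plain Python. The list below encodes the 10 records, assuming each record is an (expected, predicted) pair as in data.txt:

```python
# The 10 (expected, predicted) records from data.txt
pairs = [("man", "woman"), ("man", "man"), ("woman", "woman"), ("man", "man"),
         ("woman", "man"), ("woman", "woman"), ("woman", "woman"),
         ("man", "man"), ("man", "woman"), ("woman", "woman")]

# Tally a 2x2 confusion matrix with man = 0 and woman = 1
# (rows: expected class, columns: predicted class)
matrix = [[0, 0], [0, 0]]
for expected, predicted in pairs:
    e = 0 if expected == "man" else 1
    p = 0 if predicted == "man" else 1
    matrix[e][p] += 1

# Accuracy is the diagonal (correct predictions) over the total
accuracy = (matrix[0][0] + matrix[1][1]) / len(pairs)
print(matrix)    # [[3, 2], [1, 4]]
print(accuracy)  # 0.7
```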
Summary
Based on the output, we can conclude that:
* The model has reasonable accuracy, around 70%, which is above average but far from the most accurate.
* From the value of precision, we can also say that the model is not very precise.