Home - snakes-in-the-box/super-awesome-txt-classifier GitHub Wiki

Welcome to the super-awesome-txt-classifier wiki!

This wiki does not cover usage instructions. If you want to know how to use this system, please see the README that is part of this repository's source.

If you want to learn more about what this system is for and how it operates, then please read on!

This project is a distributed version of a Naive Bayes text classifier built with Scala and Spark. The Naive Bayes algorithm is frequently used in Machine Learning problems for three main reasons. The first is that it is an extremely simple algorithm, which makes it easy to implement and maintain. The second is that Naive Bayes is computationally efficient, often classifying much faster than more intensive approaches. The third is that, despite its simplicity, its accuracy can match or beat more complicated classifiers for some kinds of problems; in particular, it performs well in the Natural Language domain.

In order to perform Naive Bayes classification on a text document we need to calculate two probabilities for each category, the Prior and the Posterior. The Prior is simply the likelihood of a document belonging to a certain category, based on the proportion of documents of that category in the training set. Suppose our training set is 10 documents, 6 of category A and 4 of category B. The Prior probability of a test document being classified as A is P(A) = 6/10 = 0.6, or 60%.

For the Posterior probability we need to start looking at the contents of our documents. Suppose the document we are trying to classify contains the word "Dog". We need to know the "Dogishness" of each category we are considering. In this example that is simply the number of times "Dog" appears in a given category divided by the number of times "Dog" appears in all categories. The Posterior probability is the product of this number for every unique word in the document we are trying to classify. After we calculate the Posterior we multiply it by the Prior, and the result is the document's score for that particular category. We do this for every category and classify the document as the category with the highest score.
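The scoring rule described above can be sketched in a few lines of plain Scala. This is only a minimal, single-machine illustration under stated assumptions, not the repository's Spark implementation: the names `Doc`, `priors`, `wordCounts`, `score`, `classify`, and the toy training set are all hypothetical, words are assumed to be already tokenized and lower-cased, and words that never appear in the training set are simply skipped (another assumption).

```scala
object NaiveBayesSketch {
  // Hypothetical training record: a category label plus the document's words.
  final case class Doc(category: String, words: Seq[String])

  // Prior: fraction of training documents that belong to each category.
  def priors(training: Seq[Doc]): Map[String, Double] =
    training.groupBy(_.category).map { case (category, docs) =>
      category -> docs.size.toDouble / training.size
    }

  // Occurrences of every word per category: category -> (word -> count).
  def wordCounts(training: Seq[Doc]): Map[String, Map[String, Int]] =
    training.groupBy(_.category).map { case (category, docs) =>
      category -> docs.flatMap(_.words).groupBy(identity).map { case (word, hits) =>
        word -> hits.size
      }
    }

  // Score = Prior * product, over the unique words of the document, of
  // (count of the word in this category) / (count of the word in all categories).
  def score(category: String,
            words: Seq[String],
            prior: Map[String, Double],
            counts: Map[String, Map[String, Int]]): Double = {
    val wordTerms = words.distinct.map { word =>
      val inCategory = counts.getOrElse(category, Map.empty[String, Int]).getOrElse(word, 0)
      val inAll = counts.values.map(_.getOrElse(word, 0)).sum
      // Assumption: a word never seen in training contributes a neutral factor of 1.
      if (inAll == 0) 1.0 else inCategory.toDouble / inAll
    }
    prior.getOrElse(category, 0.0) * wordTerms.product
  }

  // Pick the category with the highest score (requires a non-empty training set).
  def classify(words: Seq[String], training: Seq[Doc]): String = {
    val prior = priors(training)
    val counts = wordCounts(training)
    prior.keys.maxBy(category => score(category, words, prior, counts))
  }

  // Toy data mirroring the example: 10 documents, 6 of category A and 4 of category B.
  // "dog" appears 3 times in A and once in B.
  val training: Seq[Doc] =
    Seq.fill(3)(Doc("A", Seq("dog", "park"))) ++
      Seq.fill(3)(Doc("A", Seq("park"))) ++
      Seq(Doc("B", Seq("dog", "cat"))) ++
      Seq.fill(3)(Doc("B", Seq("cat")))

  def main(args: Array[String]): Unit = {
    println(priors(training))                // Prior for A is 6/10, for B is 4/10
    println(classify(Seq("dog"), training))  // prints "A"
    println(classify(Seq("cat"), training))  // prints "B"
  }
}
```

With this toy data the "Dogishness" of A is 3/4 and of B is 1/4, so a document containing only "dog" scores roughly 0.6 × 0.75 for A and 0.4 × 0.25 for B and is classified as A. Note that in this sketch a word that appears in the training set but never in a given category drives that category's score to zero.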