Scikit Learn - lmucs/grapevine GitHub Wiki

# Scikit-Learn

Scikit-learn is an open-source Python library for data manipulation and data analysis. You can check out its site at http://scikit-learn.org/. Scikit-learn supports a wide range of machine learning techniques, including classification, regression, clustering, dimensionality reduction, model selection, preprocessing, and more.

Scikit-learn's documentation can be a little confusing at first, but once you understand the basic workflow it is extremely helpful.

We recommend starting with these resources:

## Scikit in Grapevine

Scikit-learn is the backbone of Grapevine's tag classifier, which sorts our posts into the following categories:
["food", "sports", "entertainment", "cultural", "deadline", "spirituality", "community service", "academic", "career development"]

We started by labeling a training set of 2,000 posts and tweets with their respective categories.
After preprocessing that training set, we use scikit-learn's feature extraction and classification tools to train a classifier, which we save in pickled form for the tag classifier.
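A minimal sketch of the train-then-pickle step, assuming a toy labeled set and `MultinomialNB` for brevity (all data and filenames here are illustrative, not Grapevine's actual code):

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Toy labeled set standing in for the real 2,000 labeled posts.
posts = ["free pizza friday", "intramural soccer signups"]
labels = ["food", "sports"]

# Fit the vectorizer and classifier on the training text.
vectorizer = TfidfVectorizer()
clf = MultinomialNB().fit(vectorizer.fit_transform(posts), labels)

# Persist both objects: the vectorizer is needed again at predict time
# to turn new text into the same feature space.
with open("tag_classifier.pkl", "wb") as f:
    pickle.dump((vectorizer, clf), f)

# Later, load the pickled pair and classify new text.
with open("tag_classifier.pkl", "rb") as f:
    vec2, clf2 = pickle.load(f)

print(clf2.predict(vec2.transform(["pizza party"])))  # → ['food']
```

Pickling both the vectorizer and the classifier together avoids a common mistake: a classifier restored without its matching vectorizer cannot reproduce the feature columns it was trained on.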

We first convert the textual data into vectors using scikit-learn's `TfidfVectorizer`. This algorithm is useful because it weights terms by term frequency and inverse document frequency (tf-idf): words that appear across most of the training documents are down-weighted because they are common and not distinctive. With these feature vectors in hand, we can train a classifier on them.
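The vectorization step looks roughly like this, with a tiny corpus standing in for the real posts and tweets (illustrative only):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Tiny stand-in corpus; each string plays the role of one post or tweet.
posts = [
    "free pizza at the club fair",
    "basketball game tonight in the gym",
    "career fair resume workshop",
]

# Learn the vocabulary and compute tf-idf weights in one pass.
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(posts)  # sparse matrix: one row per post

print(X.shape)  # (3 posts, number of distinct terms)
```

Note that `fit_transform` returns a sparse matrix, which scikit-learn's classifiers accept directly, so the vectors never need to be densified.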

Because our posts can belong to multiple categories (for example, a Facebook post can be about a sporting event that also has food), we needed multilabel classification, which allows a result to carry several labels at once.

We used scikit-learn's `OneVsRestClassifier`, which wraps any of scikit-learn's single-label classifiers and returns a multilabel classifier built on the same underlying algorithm.

After testing both a `MultinomialNB` (multinomial naive Bayes) classifier and a `LinearSVC` (linear support vector machine) classifier, we went with `LinearSVC`, which gave us better results.

Running our final classifier on our test data set, we got roughly 65% accuracy, which, while not ideal, is reasonable given our very small training set.
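For multilabel output, `accuracy_score` counts only exact tag-set matches, which makes it a strict metric. A small sketch with made-up indicator matrices (five test posts, three tags):

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Hypothetical label-indicator matrices: rows are posts, columns are tags.
Y_true = np.array([[1, 0, 0], [0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 0]])
Y_pred = np.array([[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]])

# Only rows where every tag matches count as correct: 3 of 5 here.
print(accuracy_score(Y_true, Y_pred))  # → 0.6
```

A prediction that gets two of a post's three tags right still scores zero under this metric, so a 65% exact-match figure understates how often individual tags are correct.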

Check out the training and classifier code on our repository.