Python NLTK
Python Natural Language ToolKit (NLTK) is available at http://www.nltk.org. It is an excellent framework for text processing and machine learning on textual data. It provides a multitude of important text manipulation functions and basic classifiers.
Here are some great tutorials to get you started:
- Natural Language Processing with Python (NLTK's open-source book)
- [Harrison Kinsley's excellent nltk video tutorials](https://pythonprogramming.net/dashboard/#tab_nltk)
NLTK in Grapevine
Python NLTK was a great resource for us. We started building our classifier using NLTK's text processing and its Naive Bayes classifier. We later realized that we needed a more complicated classifier and switched to Scikit-Learn to build it. However, we still preprocess our text using NLTK's text processing features.
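For readers who have not used it, here is a minimal sketch of how NLTK's Naive Bayes classifier is trained on bag-of-words features. It is not Grapevine's actual code: the `word_features` helper, the toy training data, and the `event`/`other` labels are all hypothetical and only illustrate the API.

```python
import nltk


def word_features(tokens):
    """Bag-of-words features: every token that is present maps to True."""
    return {token: True for token in tokens}


# Hypothetical toy training data; the labels are illustrative only.
train_set = [
    (word_features(["concert", "tonight"]), "event"),
    (word_features(["festival", "saturday"]), "event"),
    (word_features(["lunch", "tired"]), "other"),
    (word_features(["traffic", "tired"]), "other"),
]

# Train NLTK's built-in Naive Bayes classifier on (features, label) pairs.
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Classify a new, already tokenized piece of text.
predicted = classifier.classify(word_features(["concert"]))
print(predicted)
```

A real training set would be far larger, and the feature dictionaries would come from the preprocessed posts and tweets rather than hand-written token lists.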
NLTK allowed us to easily add text processing functionality, such as tokenizing posts and tweets into words, as well as removing custom stopwords and links that would clog our training set.
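A preprocessing step of this kind could look roughly like the sketch below. It assumes NLTK's `TweetTokenizer` for tokenization and a simple regular expression for link removal. The stopword set, the `preprocess` function name, and the sample text are hypothetical, not taken from Grapevine.

```python
import re

from nltk.tokenize import TweetTokenizer

# Assumption: links are plain http(s) URLs; anything up to the next
# whitespace character is treated as part of the link.
URL_PATTERN = re.compile(r"https?://\S+")

# Hypothetical custom stopword list; a real one would be tuned to the data.
CUSTOM_STOPWORDS = {"a", "at", "the", "rt"}

# TweetTokenizer is regex-based, so it needs no extra NLTK data downloads.
_tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True)


def preprocess(text):
    """Strip links, tokenize, and drop stopwords and non-alphabetic tokens."""
    text = URL_PATTERN.sub(" ", text)
    tokens = _tokenizer.tokenize(text)
    return [t for t in tokens if t.isalpha() and t not in CUSTOM_STOPWORDS]


tokens = preprocess(
    "Free concert tonight at the park! Details: https://example.com/event"
)
print(tokens)
```

Filtering with `str.isalpha()` also discards punctuation, numbers, and hashtags. Whether that is desirable depends on the classifier, so treat it as one possible design choice.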
Future Work
The NLTK preprocessing that we currently perform in the Grapevine classifier is very simple: only basic stopword removal and tokenization. In the future we would like to test how word lemmatization and multi-gram (n-gram) features may improve our classifier. Because we already use NLTK, we can easily extend our preprocessing step to include this functionality.
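As a rough sketch of what that extension might involve, the snippet below uses NLTK's `WordNetLemmatizer` and `nltk.util.ngrams`. The helper names `lemmatize_tokens` and `with_ngrams` are hypothetical. The lemmatizer needs the WordNet corpus (installed via `nltk.download("wordnet")`), so it is defined here but not called.

```python
from nltk.stem import WordNetLemmatizer
from nltk.util import ngrams


def lemmatize_tokens(tokens):
    """Map each token to its WordNet lemma (treated as a noun by default).

    Requires the WordNet corpus: run nltk.download("wordnet") once first.
    """
    lemmatizer = WordNetLemmatizer()
    return [lemmatizer.lemmatize(token) for token in tokens]


def with_ngrams(tokens, max_n=2):
    """Return the unigrams plus all n-grams up to max_n, joined by spaces."""
    features = list(tokens)
    for n in range(2, max_n + 1):
        features.extend(" ".join(gram) for gram in ngrams(tokens, n))
    return features


features = with_ngrams(["free", "concert", "tonight"])
print(features)
# ['free', 'concert', 'tonight', 'free concert', 'concert tonight']
```

Lemmatizing before building n-grams keeps the feature space smaller, although only an experiment on the real training set would show whether either step improves accuracy.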