Natural Language Processing

NLP Articles

Data Mining Novels Reveals the Six Basic Emotional Arcs of Storytelling Scientists at the Computational Story Laboratory have identified the six emotional arcs that form the building blocks of all stories. (July 6, 2016)
Everything You Always Wanted to Know About NLP but Were Afraid to Ask S. Butler
Language Translation with Deep Learning and the Magic of Sequences Adam Geitgey
Incremental knowledge base construction using DeepDive From the Shin et al. VLDB 2015 paper.

NLP Math Topics

Log Sum of Exponentials LingPipe Blog (NLP)
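
As a quick illustration of the log-sum-of-exponentials topic, here is a minimal sketch of the standard max-shift formulation, which computes log(exp(x_1) + ... + exp(x_n)) without overflowing when the inputs are large log-probabilities or scores. It does not reproduce the LingPipe post; the function name `log_sum_exp` is a hypothetical helper chosen for this example.

```python
import math


def log_sum_exp(log_values):
    """Return log(sum(exp(v) for v in log_values)) in a numerically stable way.

    Hypothetical helper for illustration; not taken from any linked library.
    """
    if not log_values:
        # An empty sum is 0, and log(0) is -inf.
        return float("-inf")
    m = max(log_values)
    if m == float("-inf"):
        # Every term is exp(-inf) = 0, so the sum is 0.
        return float("-inf")
    # Subtracting the maximum keeps every exponent <= 0, so exp() cannot overflow.
    return m + math.log(sum(math.exp(v - m) for v in log_values))
```

For example, `log_sum_exp([1000.0, 1000.0])` evaluates to 1000 + log 2, whereas calling `math.exp(1000.0)` directly raises an OverflowError.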

NLP Toolkits

NLTK 3.0
Text Analysis with NLTK Cheatsheet
MALLET A Java-based package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
Stanford CoreNLP A set of natural language analysis tools.
Stanford Sentiment Analysis Richard Socher and others. Background info, code, and web app.
DARPA Open Catalog: Deep Exploration and Filtering of Text (DEFT) A huge list of NLP software, papers, etc., mostly from academic institutions.

Deep Learning & NLP

Misha Denil's Deep Learning for NLP Project Page
Memory Networks for Language Understanding, ICML Tutorial 2016 Jason Weston, Facebook
Memory Networks for question answering, etc. Jason Weston, Facebook
The bAbI project Resources related to the bAbI project of Facebook AI Research which is organized towards the goal of automatic text understanding and reasoning.

Latent Dirichlet Allocation

Introduction to Latent Dirichlet Allocation
A Hybrid lda2vec Algorithm Chris Moody
pyLDAvis Designed to help users interpret the topics in a topic model that has been fit to a corpus of text data. The package extracts information from a fitted LDA topic model to inform an interactive web-based visualization.

Regular Expressions

Google Summary of Regular Expressions

Optical Character Recognition and Layout Parsing

Tesseract OCR Optical character recognition and layout parsing
Cuneiform Optical character recognition and layout parsing

Word Vectors

Voynich Manuscript: word vectors and t-SNE visualization of some patterns
GloVe: Global Vectors for Word Representation Pennington, Socher, Manning.
Skipgram isn’t Matrix Factorisation Ben Wilson
Introducing our Hybrid lda2vec Algorithm By Chris Moody at Stitch Fix.
Softmax parameterisation and optimisation Ben Wilson
Making Sense of Everything with words2map By Yhat
wevi: Word Embedding Visual Inspector
Sebastian Ruder's take on word embeddings Part 1, Part 2 on Softmax
Visualizing Clusters of Clickbait Headlines Using Spark, Word2vec, and Plotly Max Woolf
Whiskey Embeddings
Text Classification With Word2Vec Includes some examples of using the sklearn pipeline and multiple classifiers.

Facebook fastText

fastText GitHub GitHub repository for fastText, a library for efficient learning of word representations and sentence classification. A variation of word2vec.

DeepDive

DeepDive Home DeepDive is a system to extract value from dark data. Like dark matter, dark data is the great mass of data buried in text, tables, figures, and images, which lacks structure and so is essentially unprocessable by existing software.
DeepDive - CyberPunk Article
UW DeepDive Infrastructure GitHub
Hazy Research GitHub Chris Ré's GitHub repo
Incremental knowledge base construction using DeepDive Adrian Colyer's review of the Shin et al. 2015 paper.

NLP Apps or Projects

corpkit: a GUI tool for investigating text
lazysummary Summarize any PDF. Reduce by as much as 90%. Uses Chrome.
Gender Decoder for Job Ads A quick app to check whether a job advert has linguistic gender-coding.
Visual Genome A dataset and knowledge base, and an ongoing effort to connect structured image concepts to language. Uses graphs to express relationships.
AIVA A general-purpose AI virtual assistant for developers. Use it to create bots. Previously known as Jarvis. Mainly interfaces with Python and Node.js.
AIVA Contextual Graph Knowledge Base Information storage for the Turing machine. Example: CGKB(graphdb, fn = call, i_p = "John").
AIVA Human-Turing Machine Interface For question answering.

NLP Courses

Stanford - Foundations of Statistical Natural Language Processing Manning and Schütze.
Natural Language Processing Dan Jurafsky and Christopher Manning's Stanford Coursera course slides.
U. Washington - CSE 517: Natural Language Processing Winter 2016 Syllabus.

Converting PDFs to Text

Turning PDF documents into analyzable data Stanford Social Science Data and Software (SSDS) Group
PDFMiner For PDFMiner API and pdf2txt.py
Converting PDF to Text using Tesseract and ghostscript Hubbard, 2015.
pypdfocr 0.9.1 Takes a scanned PDF file and runs OCR on it (using the Tesseract OCR software from Google), generating a searchable PDF. Last updated 10/2016.
Tutorial: Text Extraction and OCR with Tesseract and ImageMagick Erick Peirson, Dec. 2015
PDFMiner GitHub PDFMiner is a tool, written entirely in Python, for extracting information from PDF documents.

Optical Character Recognition

Google's Tesseract OCR On GitHub.