Data - fcrimins/fcrimins.github.io GitHub Wiki

3 Million Instacart Orders, Open Sourced

Google's One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling (3/24/17)

NLTK has its own datasets (3/20/17)

  • downloaded locally to: ~/nltk_data
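To see which NLTK datasets are already on disk, something like the following could work. It is a minimal sketch using only the standard library, and it assumes the usual layout that `nltk.download()` produces: category folders such as `corpora` or `tokenizers` under `~/nltk_data`, each holding package folders and/or `.zip` archives. The helper name `list_nltk_packages` is made up for illustration and is not part of NLTK.

```python
from pathlib import Path


def list_nltk_packages(root=Path.home() / "nltk_data"):
    """Map each category folder (e.g. 'corpora') to the package names inside it.

    Hypothetical helper, not an NLTK API. Assumes the default layout
    <root>/<category>/<package> or <root>/<category>/<package>.zip.
    """
    root = Path(root)
    if not root.is_dir():
        # Nothing has been downloaded to this location yet.
        return {}

    packages = {}
    for category in sorted(p for p in root.iterdir() if p.is_dir()):
        names = set()
        for entry in category.iterdir():
            if entry.name.startswith("."):
                continue  # skip hidden files such as .DS_Store
            # 'brown.zip' and the unpacked 'brown/' folder count as one package.
            names.add(entry.stem if entry.suffix == ".zip" else entry.name)
        packages[category.name] = sorted(names)
    return packages


if __name__ == "__main__":
    for category, names in list_nltk_packages().items():
        print(f"{category}: {', '.join(names)}")
```

If the folder is missing or empty, the function returns an empty dict, which usually means nothing has been fetched with `nltk.download()` yet or the data lives in a different directory.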

First billion characters from Wikipedia

UCI Machine Learning Datasets

The top bestsellers of 1916

  • But what are the bestsellers from 1916 with sales normalized by year after publication?
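One possible reading of "normalized by year after publication" is total sales divided by the number of years a book has been on sale. The sketch below illustrates that reading only. The titles and sales figures are hypothetical placeholders, not data from the bestseller list, and the function names are made up for this example.

```python
def sales_per_year(total_sales, published, as_of):
    """Average copies sold per year on sale, counting the publication year."""
    years_on_sale = as_of - published + 1
    if years_on_sale < 1:
        raise ValueError("as_of must not be earlier than the publication year")
    return total_sales / years_on_sale


def rank_by_normalized_sales(books, as_of):
    """Sort (title, published, total_sales) tuples by sales per year, highest first."""
    return sorted(
        books,
        key=lambda book: sales_per_year(book[2], book[1], as_of),
        reverse=True,
    )


if __name__ == "__main__":
    # Hypothetical placeholder figures, for illustration only.
    books = [
        ("Hypothetical Title A", 1916, 120_000),
        ("Hypothetical Title B", 1912, 300_000),
    ]
    for title, published, total in rank_by_normalized_sales(books, as_of=1916):
        rate = sales_per_year(total, published, 1916)
        print(f"{title}: {rate:.0f} copies per year")
```

With these placeholder numbers, the older title leads on raw totals but drops to second place once its sales are spread over the five years it has been in print.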

Ratings datasets are figuratively just lying around the web these days, begging for someone to take notice and analyze them.

  • Movie reviews from the Netflix Prize dataset
  • Business reviews from the Yelp Academic Dataset, as summarized here
  • Amazon book reviews from the Multi-domain Sentiment Dataset
  • News ratings dataset from Reddit

41 Machine Learning Interview Questions (1/30/17)

Generating Factoid Questions With Recurrent Neural Networks: The 30M Factoid Question-Answer Corpus (4/5/16)

Yahoo released the largest-ever machine learning dataset (1/16/16)