Introduction - snakes-in-the-box/super-awesome-txt-classifier GitHub Wiki

For this project, we are using the Reuters Corpus, a set of news stories organized into a hierarchy of categories. Each document can carry multiple class labels, but for the sake of simplicity we’ll keep only the four labels ending in CAT and ignore the rest:

  1. CCAT: Corporate/Industrial
  2. ECAT: Economics
  3. GCAT: Government/Social
  4. MCAT: Markets
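
As a minimal sketch of the filtering rule above, assuming each document's labels arrive as a list of strings: the function name `cat_labels` and the non-CAT codes in the usage line are illustrative only, not part of the project code.

```python
def cat_labels(labels):
    """Keep only the labels that end in 'CAT'; drop every other label."""
    return [label for label in labels if label.endswith("CAT")]


# Hypothetical label list for one document; only the CAT labels survive.
print(cat_labels(["CCAT", "C15", "C151", "GCAT"]))
```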

Some documents have more than one CAT label. Treat each of those documents as if you had observed it once for each of its CAT labels (that is, add its counts to the counters for every CAT label observed on it).
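
The counting rule can be sketched as follows. This is one possible reading, not the project's actual implementation: it assumes the counters are per-label document counts and per-label word counts, which the text does not specify, and the names `update_counts`, `doc_counts`, and `word_counts` are hypothetical.

```python
from collections import Counter, defaultdict


def update_counts(doc_counts, word_counts, labels, tokens):
    """Count one document once for every CAT label it carries."""
    for label in labels:
        if not label.endswith("CAT"):
            continue  # ignore labels that do not end in CAT
        doc_counts[label] += 1             # one more document seen for this label
        word_counts[label].update(tokens)  # add this document's words to the label


doc_counts = Counter()
word_counts = defaultdict(Counter)

# A hypothetical document labeled with both CCAT and MCAT is counted under both.
update_counts(doc_counts, word_counts, ["CCAT", "C15", "MCAT"], ["oil", "prices", "oil"])
```

Because the document above carries two CAT labels, both `doc_counts["CCAT"]` and `doc_counts["MCAT"]` increase by one, and its words are added to both labels' word counters.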