Pruning
From the original WikiPrep implementation:
3.2 Implementation Details

We used the Wikipedia snapshot of November 11, 2005. After parsing the Wikipedia XML dump, we obtained 1.8 GB of text in 910,989 articles. Although Wikipedia has almost a million articles, not all of them are equally useful for feature generation. Some articles correspond to overly specific concepts (e.g., Metnal, the ninth level of the Mayan underworld), or are otherwise unlikely to be useful for subsequent text categorization (e.g., specific dates or a list of events that happened in a particular year). Other articles are simply too short, so we cannot reliably classify texts onto the corresponding concepts.
We developed a set of simple heuristics for pruning the set of concepts, discarding articles that have fewer than 100 non-stop words or fewer than 5 incoming and outgoing links. We also discarded articles that describe specific dates, as well as Wikipedia disambiguation pages, category pages, and the like.
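A minimal sketch of these pruning heuristics in Python follows. The `Article` fields, the date/year title pattern, and the reading of "fewer than 5 incoming and outgoing links" as requiring at least 5 of each are assumptions for illustration, not the exact WikiPrep code.

```python
import re
from dataclasses import dataclass

MIN_NON_STOP_WORDS = 100   # discard articles with fewer than 100 non-stop words
MIN_LINKS = 5              # discard articles with fewer than 5 incoming/outgoing links

# Rough pattern for date/year article titles such as "March 14", "1987", "2005 in film"
# (an assumption; the original heuristic is not spelled out in the text above).
DATE_TITLE_RE = re.compile(
    r"^(?:\d{1,4}(?:\s+BC)?|"
    r"(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2})(?:\s+in\b.*)?$"
)

@dataclass
class Article:
    title: str
    text: str
    incoming_links: int
    outgoing_links: int
    namespace: str = ""    # "" for the main namespace, "Category", etc.

def keep_article(article: Article, stop_words: set) -> bool:
    """Return True if the article survives pruning and can serve as a concept."""
    # Drop non-content pages: category pages, disambiguation pages and the like.
    if article.namespace or article.title.endswith("(disambiguation)"):
        return False
    # Drop articles that describe specific dates or years.
    if DATE_TITLE_RE.match(article.title):
        return False
    # Drop articles that are too short: fewer than 100 non-stop words.
    tokens = re.findall(r"[a-z]+", article.text.lower())
    non_stop = sum(1 for t in tokens if t not in stop_words)
    if non_stop < MIN_NON_STOP_WORDS:
        return False
    # Drop poorly connected articles: interpreted here as requiring at least
    # 5 incoming and 5 outgoing links.
    if article.incoming_links < MIN_LINKS or article.outgoing_links < MIN_LINKS:
        return False
    return True
```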
After pruning, 171,332 articles remained to define the concepts used for feature generation. We processed the text of these articles by tokenizing it, removing stop words and rare words (those occurring in fewer than 3 articles), and stemming the remaining words; this yielded 296,157 distinct terms.
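A minimal sketch of this preprocessing step, assuming NLTK's English stop-word list and Porter stemmer (the exact tokenizer and stemmer used originally are not specified here):

```python
import re
from collections import Counter
from nltk.corpus import stopwords      # requires nltk.download("stopwords")
from nltk.stem import PorterStemmer

MIN_DOC_FREQ = 3  # drop words occurring in fewer than 3 articles

def build_term_vocabulary(article_texts):
    """article_texts: iterable of raw article strings.
    Returns the set of distinct stemmed terms."""
    stop = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    # Pass 1: document frequency of non-stop words across all articles.
    doc_freq = Counter()
    tokenized = []
    for text in article_texts:
        words = {w for w in re.findall(r"[a-z]+", text.lower()) if w not in stop}
        tokenized.append(words)
        doc_freq.update(words)

    # Pass 2: remove rare words, then stem what remains.
    vocabulary = set()
    for words in tokenized:
        kept = (w for w in words if doc_freq[w] >= MIN_DOC_FREQ)
        vocabulary.update(stemmer.stem(w) for w in kept)
    return vocabulary
```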