Phase I Summary - roeiba/WikiRep GitHub Wiki
Overview
In the first phase we implemented a prototype that works on small dumps and represents the architectural skeleton of the analysis system.
We found a number of related projects that can be used as a base for our implementation. Some of them we used in this phase (mwparserfromhell, scipy), some we tried but rejected (WikiExtractor), and some we left for possible future use: [gensim](http://radimrehurek.com/gensim/), [scikit-learn](http://scikit-learn.org/)
The implemented functionality includes:
- Wiki pages downloader
  - Tool for building small Wikipedia dumps from selected articles
- Wiki pages parser
  - Tool for converting a Wikipedia page into our format, which carries more related data and contains plain text (not wiki markup); this tool can also handle links.
- Inverted Index builder
  - The core tool, which builds a table mapping every word (from the processed pages) to a relatedness vector describing how strongly the word relates to each concept.
- Semantic Comparer
  - A tool that uses the Inverted Index table to find the correlation between texts.
Mode of operation
- Downloading and building a Wikipedia dump from the provided article names.
- Parsing the Wikipedia dump into our format.
- Building the Inverted Index:
  - Stemming the articles
  - Extracting links from the articles
  - Simple pruning
  - Creating a DF (document frequency) index, which counts, for every term, the number of articles it appears in.
  - Building the TF-IDF index for all concepts
- Comparing texts:
  - Stemming the input text with the same stemmer that was used for building the Inverted Index
  - Calculating the centroid of the text according to the vector values of every actual term (not stemmed)
  - Comparing the centroids using the cosine metric
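The indexing and comparison steps above can be sketched in plain Python on a toy corpus. This is a simplified illustration of the idea, not the actual implementation (which stores the weights in a scipy CSR matrix and stems terms first); the corpus, weighting details, and function names here are invented for the example:

```python
import math
from collections import Counter

# Toy corpus: concept title -> already-cleaned article text
corpus = {
    "Dog": "dog barks dog loyal animal",
    "Cat": "cat meows cat independent animal",
    "Car": "car engine wheel road",
}

# DF index: number of articles each term appears in
df = Counter()
for text in corpus.values():
    df.update(set(text.split()))

n_docs = len(corpus)

# Inverted index: term -> relatedness vector {concept: TF-IDF weight}
inverted_index = {}
for concept, text in corpus.items():
    tf = Counter(text.split())
    for term, count in tf.items():
        idf = math.log(n_docs / df[term])
        inverted_index.setdefault(term, {})[concept] = count * idf

def centroid(text):
    """Average the relatedness vectors of the text's known terms."""
    terms = [t for t in text.split() if t in inverted_index]
    vec = Counter()
    for t in terms:
        for concept, w in inverted_index[t].items():
            vec[concept] += w / len(terms)
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

sim = cosine(centroid("loyal dog"), centroid("cat animal"))
print(round(sim, 3))
```

Here "loyal dog" and "cat animal" get a small but nonzero similarity because both texts touch animal-related concepts, while terms shared by all articles receive zero IDF weight and drop out.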
Data structure
- The Wikipedia dump comes in compressed XML format (.xml.bz2)
- The parsed dump is also an XML file, with 'doc' tags representing the parsed documents
  - The content of a 'doc' tag is plain text without Wikimedia markup
  - Attributes:
    - id - id of the original article in Wikipedia
    - rev_id - revision number
    - title - title of the article
- The built Inverted Index is a pickled dump of a DbContent object
  - weight_matrix - dump of a scipy CSR matrix
  - concepts_index - mapping from a concept (title) to its index in weight_matrix
  - words_index - mapping from a term (string) to its index in weight_matrix
  - stemmer_name - name of the stemmer used for stemming the article text
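The pickled layout can be sketched as follows. Only the four field names come from the description above; the class definition, constructor, file name, and sample values are assumptions for the example, and the real DbContent class in WikiRep may differ beyond these attributes:

```python
import pickle
from scipy.sparse import csr_matrix

class DbContent:
    """Sketch of the pickled container (field names as listed above)."""
    def __init__(self, weight_matrix, concepts_index, words_index, stemmer_name):
        self.weight_matrix = weight_matrix    # scipy CSR matrix: terms x concepts
        self.concepts_index = concepts_index  # concept title -> column index
        self.words_index = words_index        # term -> row index
        self.stemmer_name = stemmer_name      # e.g. "porter"

db = DbContent(
    weight_matrix=csr_matrix([[0.5, 0.0], [0.0, 1.2]]),
    concepts_index={"Dog": 0, "Cat": 1},
    words_index={"bark": 0, "meow": 1},
    stemmer_name="porter",
)

# Pickle the whole object, as the index builder does with its output
with open("inverted_index.pkl", "wb") as f:
    pickle.dump(db, f)

with open("inverted_index.pkl", "rb") as f:
    loaded = pickle.load(f)

# Look up the weight of a term for a concept via the two index mappings
row = loaded.words_index["meow"]
col = loaded.concepts_index["Cat"]
print(loaded.weight_matrix[row, col])
```

Storing the two string-to-index mappings alongside the CSR matrix keeps the matrix purely numeric and compact, while `stemmer_name` ensures query texts can later be stemmed consistently with the indexed articles.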