Phase I Summary - roeiba/WikiRep GitHub Wiki
Overview
In the first phase we implemented a prototype that works on small dumps and represents the architectural skeleton of the analysis system.
We found a number of related projects that can be used as a base for our implementation. Some of them we used in this phase (mwparserfromhell, scipy), some we tried but rejected (WikiExtractor), and some we left for possible future use: [gensim](http://radimrehurek.com/gensim/), [scikit-learn](http://scikit-learn.org/)
The implemented functionality includes:
- Wiki pages downloader
  - Tool for building small Wikipedia dumps from selected articles
- Wiki pages parser
  - Tool for converting a Wikipedia page into our format, which carries more related data and contains plain text (not wiki markup); this tool can also handle links.
- Inverted Index builder
  - The core tool, which builds a table mapping every word (from the processed pages) to a relatedness vector describing how strongly the word relates to each concept.
- Semantic Comparer
  - A tool that uses the Inverted Index table to find the correlation between texts.
Mode of operation
- Downloading and building a Wikipedia dump from the provided article names.
- Parsing the Wikipedia dump into our format.
- Building the Inverted Index:
  - Stemming the articles
  - Extracting links from the articles
  - Simple pruning
  - Creating a DF (document frequency) index, which counts, for every term, the number of articles it appears in.
  - Building the TF-IDF index for all concepts
- Comparing texts:
  - Stemming the input text with the same stemmer that was used for building the Inverted Index
  - Calculating the centroid of the text according to the vector values of every actual term (not stemmed)
  - Comparing the centroids using the cosine metric
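The indexing and comparison steps above can be sketched in plain Python on a toy corpus. This is a simplified illustration of the idea, not the actual implementation (which stores the weights in a scipy CSR matrix and stems terms first); the corpus, weighting details, and function names here are invented for the example:

```python
import math
from collections import Counter

# Toy corpus: concept title -> already-cleaned article text
corpus = {
    "Dog": "dog barks dog loyal animal",
    "Cat": "cat meows cat independent animal",
    "Car": "car engine wheel road",
}

# DF index: number of articles each term appears in
df = Counter()
for text in corpus.values():
    df.update(set(text.split()))

n_docs = len(corpus)

# Inverted index: term -> relatedness vector {concept: TF-IDF weight}
inverted_index = {}
for concept, text in corpus.items():
    tf = Counter(text.split())
    for term, count in tf.items():
        idf = math.log(n_docs / df[term])
        inverted_index.setdefault(term, {})[concept] = count * idf

def centroid(text):
    """Average the relatedness vectors of the text's known terms."""
    terms = [t for t in text.split() if t in inverted_index]
    vec = Counter()
    for t in terms:
        for concept, w in inverted_index[t].items():
            vec[concept] += w / len(terms)
    return vec

def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

sim = cosine(centroid("loyal dog"), centroid("cat animal"))
print(round(sim, 3))
```

Here "loyal dog" and "cat animal" get a small but nonzero similarity because both texts touch animal-related concepts, while terms shared by all articles receive zero IDF weight and drop out.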
Data structure
- The Wikipedia dump comes in compressed XML format (.xml.bz2)
- The parsed dump is also an XML file, with 'doc' tags representing the parsed documents
  - The content of a 'doc' tag is plain text without Wikimedia markup
  - Attributes:
    - id - id of the original article in Wikipedia
    - rev_id - revision number
    - title - title of the article
- The built Inverted Index is a pickled dump of a DbContent object
  - weight_matrix - dump of a scipy CSR matrix
  - concepts_index - mapping from a concept (title) to its index in weight_matrix
  - words_index - mapping from a term (string) to its index in weight_matrix
  - stemmer_name - name of the stemmer used for stemming the article text
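The pickled layout can be sketched as follows. Only the four field names come from the description above; the class definition, constructor, file name, and sample values are assumptions for the example, and the real DbContent class in WikiRep may differ beyond these attributes:

```python
import pickle
from scipy.sparse import csr_matrix

class DbContent:
    """Sketch of the pickled container (field names as listed above)."""
    def __init__(self, weight_matrix, concepts_index, words_index, stemmer_name):
        self.weight_matrix = weight_matrix    # scipy CSR matrix: terms x concepts
        self.concepts_index = concepts_index  # concept title -> column index
        self.words_index = words_index        # term -> row index
        self.stemmer_name = stemmer_name      # e.g. "porter"

db = DbContent(
    weight_matrix=csr_matrix([[0.5, 0.0], [0.0, 1.2]]),
    concepts_index={"Dog": 0, "Cat": 1},
    words_index={"bark": 0, "meow": 1},
    stemmer_name="porter",
)

# Pickle the whole object, as the index builder does with its output
with open("inverted_index.pkl", "wb") as f:
    pickle.dump(db, f)

with open("inverted_index.pkl", "rb") as f:
    loaded = pickle.load(f)

# Look up the weight of a term for a concept via the two index mappings
row = loaded.words_index["meow"]
col = loaded.concepts_index["Cat"]
print(loaded.weight_matrix[row, col])
```

Storing the two string-to-index mappings alongside the CSR matrix keeps the matrix purely numeric and compact, while `stemmer_name` ensures query texts can later be stemmed consistently with the indexed articles.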