Wikipedia Parsing Tools

A list of tools that can be used to parse Wikipedia.

Tools used

Wikipedia resources

Wikipedia Extractor: [Online demo](https://zvulon.pythonanywhere.com/wiki_xml_view?title=Knowledge)

The project uses the Italian Wikipedia as a source of documents for several purposes: as training data and as a source of data to be annotated.

The Wikipedia maintainers provide, each month, an XML dump of all documents in the database: a single XML file containing the whole encyclopedia, which can be used for various kinds of analysis, such as statistics and service lists.
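Such a dump is far too large to load into memory at once, but it can be streamed page by page. The sketch below is not part of any tool listed here; it is a minimal illustration using only the Python standard library, with a hypothetical local dump filename, and the `{*}` namespace wildcard requires Python 3.8+.

```python
# Minimal sketch: stream <page> elements out of a (possibly bz2-compressed)
# MediaWiki XML dump without loading the whole file into memory.
import bz2
import xml.etree.ElementTree as ET

DUMP_PATH = "itwiki-latest-pages-articles.xml.bz2"  # hypothetical local file

def iter_pages(path):
    """Yield (title, wikitext) pairs from a MediaWiki XML export."""
    opener = bz2.open if path.endswith(".bz2") else open
    with opener(path, "rb") as f:
        for _, elem in ET.iterparse(f):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
            if tag == "page":
                title = elem.findtext(".//{*}title")
                text = elem.findtext(".//{*}text") or ""
                yield title, text
                elem.clear()  # free the element once it has been consumed

if __name__ == "__main__":
    for title, text in iter_pages(DUMP_PATH):
        print(title, len(text))
        break  # just show the first page
```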

The Wikipedia extractor tool generates plain text from a Wikipedia database dump, discarding any other information or annotation present in Wikipedia pages, such as images, tables, references and lists.
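As a rough illustration of the kind of cleanup the extractor performs (this is not the tool's own code, only a toy approximation), a few regular expressions already turn simple wikitext into something close to plain text:

```python
# Toy illustration of markup stripping, not the Wikipedia Extractor itself.
import re

def rough_plaintext(wikitext):
    text = re.sub(r"\{\{[^{}]*\}\}", "", wikitext)                   # drop simple templates
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", text)    # keep only link labels
    text = re.sub(r"<ref[^>]*>.*?</ref>", "", text, flags=re.S)      # drop references
    text = re.sub(r"<[^>]+>", "", text)                              # drop remaining HTML tags
    text = re.sub(r"'{2,}", "", text)                                # bold/italic quote marks
    return re.sub(r"\n{3,}", "\n\n", text).strip()

print(rough_plaintext("'''Knowledge''' is a [[belief|justified belief]].<ref>...</ref>"))
# -> Knowledge is a justified belief.
```

The real tool handles many more cases (nested templates, tables, lists, images), which is exactly why it is worth reusing rather than reimplementing.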

[Pywikipediabot](http://www.mediawiki.org/wiki/Manual:Pywikipediabot)

The Python Wikipedia Robot Framework is a collection of tools made to fit the maintenance needs of Wikipedia, but it can also be used on other MediaWiki sites. Originally designed for Wikipedia, it is now used throughout the Wikimedia Foundation's projects and on many other MediaWiki wikis.

Resources

There is a trunk release that supports older MediaWiki software, and a rewrite branch of the Python Wikipedia Robot Framework that features several improvements, such as full API usage and a Pythonic package layout.
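Fetching a page with the rewrite branch (distributed today as the pywikibot package) looks roughly like the sketch below; the site and page title are arbitrary examples, and a generated user-config.py is assumed to exist.

```python
# Minimal sketch using the rewrite branch (the modern "pywikibot" package).
import pywikibot

site = pywikibot.Site("en", "wikipedia")   # English Wikipedia
page = pywikibot.Page(site, "Knowledge")   # arbitrary example title

print(page.title())
print(page.text[:200])                     # first 200 characters of the wikitext
for linked in page.linkedPages(total=5):   # a few pages this article links to
    print(" ->", linked.title())
```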

[WikiPrep](http://www.cs.technion.ac.il/~gabr/resources/code/wikiprep/)

Wikipedia is a terrific knowledge resource, and many recent studies in artificial intelligence, information retrieval and related fields have used Wikipedia to endow computers with (some) human knowledge. Wikipedia dumps are publicly available in XML format, but they have a few shortcomings. First, they contain a lot of information that is often not needed when Wikipedia texts are used as knowledge (e.g., IDs of users who changed each article, timestamps of article modifications). On the other hand, the XML dumps do not contain a lot of useful information that could be inferred from the dump, such as link tables, the category hierarchy, and the resolution of redirection links.

Its author developed it in the course of his Ph.D. work as a fairly extensive preprocessor of the standard Wikipedia XML dump into his own extended XML format, which eliminates some information and adds other useful information.
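WikiPrep itself is a Perl tool; the Python sketch below is only a toy illustration of two of the derived structures it adds on top of the raw dump, a redirect map and a link table with redirects resolved, over a hypothetical in-memory dict of pages.

```python
# Toy illustration (not WikiPrep itself) of deriving a redirect map and a
# resolved link table from wikitext. `pages` is a hypothetical title -> text dict.
import re

pages = {
    "Knowledge": "Justified [[belief]]; see also [[Epistemology]].",
    "Belief": "A mental state.",
    "Epistemology": "#REDIRECT [[Theory of knowledge]]",
    "Theory of knowledge": "Study of [[knowledge]].",
}

LINK_RE = re.compile(r"\[\[([^|\]#]+)")

def norm(title):
    """MediaWiki titles are case-insensitive in their first letter."""
    title = title.strip()
    return title[:1].upper() + title[1:]

# 1. Redirect map: pages whose body starts with #REDIRECT point to their target.
redirects = {
    title: norm(LINK_RE.search(text).group(1))
    for title, text in pages.items()
    if text.lstrip().upper().startswith("#REDIRECT")
}

def resolve(title):
    """Follow redirects (avoiding cycles) to the final article title."""
    seen = set()
    while title in redirects and title not in seen:
        seen.add(title)
        title = redirects[title]
    return title

# 2. Link table: outgoing links per article, with redirects resolved.
link_table = {
    title: sorted({resolve(norm(m)) for m in LINK_RE.findall(text)})
    for title, text in pages.items()
    if title not in redirects
}

print(redirects)   # {'Epistemology': 'Theory of knowledge'}
print(link_table)  # e.g. 'Knowledge' -> ['Belief', 'Theory of knowledge']
```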