Parser Wikipedia Overview - nielsenbe/Spark-Wiki-Parser GitHub Wiki

Overview

At its core, a Wikipedia page can be broken into six fundamental elements: headers, text, templates, links, tags, and tables.

Headers: These render as the traditional HTML heading elements (H1, H2, H3, etc.) and divide the text into sections.

Text: The natural-language part of the page. This also includes formatting elements such as bold, italics, and lists. This parser focuses on content rather than formatting, so the text is retained but the formatting is discarded.
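To illustrate what "retain the text, discard the formatting" means, here is a minimal sketch. It is not how the parser does it (the parser relies on the Sweble engine); it only assumes the common wikitext conventions of quote runs for bold and italics and `*`, `#`, `;`, `:` prefixes for lists. The function name `strip_formatting` is hypothetical.

```python
import re

# Runs of 2 to 5 apostrophes mark italics and bold in wikitext.
QUOTE_RUNS = re.compile(r"'{2,5}")
# List and indentation markers at the start of a line.
LIST_PREFIX = re.compile(r"^[*#;:]+[ \t]*", re.MULTILINE)

def strip_formatting(wikitext):
    """Hypothetical helper: drop bold/italic/list markup, keep the words."""
    text = QUOTE_RUNS.sub("", wikitext)
    return LIST_PREFIX.sub("", text)
```

For example, `strip_formatting("'''Bold''' and ''italic''")` returns the plain words with the quote runs removed. Real wikitext has many more cases (nested markup, nowiki sections), which is why a full parser is used in practice.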

Templates: These are a way to share common code among pages. For example, a template called example1 may be defined as:

[WikiLink][WikiLink2]This is example text

Whenever a page calls {{example1}}, the renderer substitutes the defined code for the template call. By default, this parser does not expand templates. We provide code that can achieve this to a limited degree in post-processing.
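The substitution described above can be sketched in a few lines. This is an illustration only, not the post-processing code shipped with the project: the `TEMPLATES` dictionary, the `expand_templates` function, and the `max_depth` parameter are all hypothetical, and only parameterless calls such as `{{example1}}` are handled.

```python
import re

# Hypothetical template store; in practice the definitions would come
# from the template pages in the dump.
TEMPLATES = {"example1": "[WikiLink][WikiLink2]This is example text"}

# Matches a parameterless call such as {{example1}} or {{ example1 }}.
TEMPLATE_CALL = re.compile(r"\{\{\s*([^{}|]+?)\s*\}\}")

def expand_templates(text, templates, max_depth=5):
    """Replace known template calls; leave unknown or parameterized ones as-is."""
    def substitute(match):
        return templates.get(match.group(1), match.group(0))

    # Repeat so that templates that call other templates are expanded too,
    # with a depth limit to guard against self-referencing definitions.
    for _ in range(max_depth):
        expanded = TEMPLATE_CALL.sub(substitute, text)
        if expanded == text:
            break
        text = expanded
    return text
```

Calls with arguments (for example `{{cite|url=...}}`) are deliberately left untouched here, which matches the "limited degree" caveat above.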

Tags: MediaWiki has the ability to define custom HTML tags. For the most part, the Sweble parser is not able to handle them, and they end up in this bucket. The most common ones are pre, ref, and math.

Tables: These are converted to standard HTML tables. Most tables are designed for human rather than machine consumption. Rather than trying to parse every possible combination, the parser leaves the table in its HTML format and lets the caller deal with it.
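Since tables are handed to the caller as HTML, a caller needs some way to turn them into rows. The sketch below shows one possible approach using only the Python standard library. The class `TableRows` and function `table_to_rows` are hypothetical names, and the code assumes a simple, non-nested table without merged cells.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collects the text of each td/th cell, grouped by tr row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def table_to_rows(html):
    """Return the table as a list of rows, each a list of cell strings."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows
```

Tables with rowspan or colspan attributes, or with tables nested inside cells, would need extra handling, which is part of why the parser does not attempt a general solution.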

Parser workflow

  1. Decompress dump file
  2. Separate XML elements into pages
  3. For each page, extract metadata and parse using the Sweble engine
  4. Break page into components (headers, links, text, templates, etc.)
  5. Use Spark SQL to clean and format
  6. Save to Spark tables (default is Parquet; the user can override this)
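The first two steps (decompress, then split the XML into pages) can be sketched without Spark. This is not the project's implementation; it is a standard-library illustration that assumes a bzip2-compressed dump with one revision per page, and the names `iter_pages`, `iter_dump`, and `local_name` are hypothetical.

```python
import bz2
import xml.etree.ElementTree as ET

def local_name(tag):
    """Strip the XML namespace, e.g. '{http://...}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]

def iter_pages(source):
    """Yield (title, wikitext) pairs from a file-like object of export XML."""
    title, text = None, None
    # Streaming parse: elements are handled as soon as their end tag is read,
    # so the whole dump never has to fit in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            text = elem.text or ""
        elif name == "page":
            yield title, text
            title, text = None, None
            elem.clear()  # release the finished page's subtree

def iter_dump(path):
    """Decompress a .bz2 dump on the fly and yield its pages."""
    with bz2.open(path, "rb") as handle:
        yield from iter_pages(handle)
```

Each yielded pair would then go to the parsing step (step 3). Dumps that contain full revision history have several text elements per page, in which case this sketch keeps only the last one read.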

Other resources

Paper: Wikipedia as an NLP source