Parser Wikipedia Data Dictionary - nielsenbe/Spark-Wiki-Parser GitHub Wiki

General structure

Redirect
Page
- Header
  - External Link
  - Wiki Link
  - Table
  - Tag
  - Template
  - Template Parameters
  - Text

Initial parsing / cleaning notes

Formatting tags (bold, italic, etc) are removed
Lists are turned into tables
Image links are transformed into Wiki links
The Sweble parser has issues with some tags. Re-parsing those tags to extract their elements would considerably increase parsing time, so that option is off by default.
Templates are macros to other Wikipedia pages. By default they are left un-expanded.
Nested templates are separated.
Wikipedia links are added to the text as they often are part of the paragraph:
- Albedo is an important concept in [[climatology]], [[astronomy]], and environmental management
Links and templates are extracted from tables but the text is not.
Tables are converted to HTML. Wiki tables are tricky to capture in a common structured form: columns and rows can be merged, and table header tags can be abused. We default to leaving the table in HTML and letting the caller deal with it.

Intermediate Dataset format

The parser produces a Dataset of Java Objects. In the default process this is immediately flattened into data frames, cleaned, and sent to files. The format of this Dataset is detailed in the domain class:

SQL for cleaning / transformation

Most of the cleaning happens post parsing in Spark SQL. The definitions can be found here:

Data Dictionary

This is the format of the database when the default settings are used.

wkp_page

Top level information for a page item in the Wiki dump. Could be an article or a template. Redirects are stored in a separate file.

id -> unique identifier generated by Wikipedia
title -> official Wikipedia name
entity -> left part of the entity_(sense) format, for example Cars in Cars_(film)
sense -> right part of the entity_(sense) format, for example film in Cars_(film)
name_space_num -> Wikipedia namespace number of the article
name_space_text -> textual representation of the name space
page_type -> similar to namespace, but Articles are split into ARTICLE, REDIRECT, DISAMBIGUATION, and LIST. Based on heuristics.
revision_id -> unique identifier for page revision. If history is not being used this can be ignored.
revision_date -> date of revision
text_length -> estimate of text length of entire article
wiki_link_count -> number of links to Wikipedia articles
template_count -> number of templates in article
cleanup_template_count -> number of templates with type: "CLEANUP", "CLEANUP AFD", "CLEANUP SECTION", "CLEANUP-REORGANIZE", "CLEANUP-REWRITE", "REQUIRES ATTENTION", "CLEANUP TRANSLATION", "CLEANUP-PR", "COPY EDIT", "COPYEDIT", "COPY EDIT-SECTION"

The major post-processing cleaning steps are splitting the title into entity and sense and providing the English names for the namespaces.
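As an illustration of the title split, here is a minimal Python sketch. It is not the project's transform SQL; the helper name `split_title` and the regular expression are assumptions made for this example, and it only handles a single trailing parenthetical.

```python
import re

# Hypothetical pattern: an entity followed by a trailing "(sense)" part,
# optionally joined by an underscore or a space.
_TITLE_RE = re.compile(r"^(?P<entity>.+?)_?\((?P<sense>[^()]+)\)$")


def split_title(title):
    """Return (entity, sense); sense is None when there is no trailing parenthetical."""
    cleaned = title.strip()
    match = _TITLE_RE.match(cleaned)
    if match is None:
        return cleaned, None
    # Drop the separator that sat between the entity and the parenthesis.
    entity = match.group("entity").rstrip("_ ")
    return entity, match.group("sense")


print(split_title("Cars_(film)"))
print(split_title("Albedo"))
```

Titles without a parenthetical keep the whole title as the entity, which matches the idea that sense is optional.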

wkp_redirect

Redirects allow editors to rename articles without breaking existing links. They also provide a form of disambiguation for titles.

target_page_id -> Id of the page that the redirect points to
redirect_page_id -> Id for the redirect
redirect_title -> Name of the redirect

wkp_header

Wikipedia articles are divided into sections separated by headers. Much semantic meaning can be derived from the header.

parent_page_id -> Wikimedia Id for the page
parent_revision_id -> Revision Id the element is associated with
header_id -> Unique (to the page) identifier for a header.
title -> Text of the header
header_level -> Header depth. 1 is the lead section; H2 = 2, H3 = 3, etc.
is_ancillary -> Ancillary sections do not have semantic content. Things like references, external links, notes, etc. If you are doing natural language work then it is good to exclude these sections.
- REFERENCES
- EXTERNAL LINKS
- SEE ALSO
- NOTES
- LICENSING
- BIBLIOGRAPHY
- FURTHER READING
- SOURCES
- FOOTNOTES
- PUBLICATIONS
- USERS
- LINKS
text_length -> The rough character count for the section. (Excludes sub sections)
wiki_link_count -> Number of links in this section. (Excludes sub sections)
template_count -> Number of templates in this section. (Excludes sub sections)
cleanup_template_count -> Number of templates with the following types:
- CLEANUP
- CLEANUP AFD
- CLEANUP SECTION
- CLEANUP-REORGANIZE
- CLEANUP-REWRITE
- REQUIRES ATTENTION
- CLEANUP TRANSLATION
- CLEANUP-PR
- COPY EDIT
- COPYEDIT
- COPY EDIT-SECTION
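
The two list-driven fields above can be sketched in Python as simple set lookups. This is a hedged illustration rather than the project's SQL: the function names are hypothetical, and the case-insensitive exact match is an assumption about how titles and template types are compared.

```python
# Header titles treated as ancillary, copied from the list above.
ANCILLARY_TITLES = frozenset({
    "REFERENCES", "EXTERNAL LINKS", "SEE ALSO", "NOTES", "LICENSING",
    "BIBLIOGRAPHY", "FURTHER READING", "SOURCES", "FOOTNOTES",
    "PUBLICATIONS", "USERS", "LINKS",
})

# Template types counted as cleanup templates, copied from the list above.
CLEANUP_TEMPLATE_TYPES = frozenset({
    "CLEANUP", "CLEANUP AFD", "CLEANUP SECTION", "CLEANUP-REORGANIZE",
    "CLEANUP-REWRITE", "REQUIRES ATTENTION", "CLEANUP TRANSLATION",
    "CLEANUP-PR", "COPY EDIT", "COPYEDIT", "COPY EDIT-SECTION",
})


def is_ancillary(header_title):
    """Assumption: compare the upper-cased, trimmed title against the list."""
    return header_title.strip().upper() in ANCILLARY_TITLES


def cleanup_template_count(template_types):
    """Count the template types in a section that appear in the cleanup list."""
    return sum(
        1 for t in template_types
        if t.strip().upper() in CLEANUP_TEMPLATE_TYPES
    )


print(is_ancillary("See also"))
print(cleanup_template_count(["Copy edit", "Infobox film", "cleanup"]))
```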

wkp_external_link

For any link that is not internal to Wikimedia.

parent_page_id -> Wikimedia Id for the page
parent_revision_id -> Revision Id the element is associated with
parent_header_id -> The header the element is a child of.
element_id -> Unique (to the page) integer for an element.
destination -> The target URL
link_text -> The hyperlink text for the link
domain -> The inner domain of a link
page_bookmark -> The page's bookmark, if any.
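
To make the relationship between destination, domain, and page_bookmark concrete, here is a small Python sketch using the standard library. It assumes that domain corresponds to the URL's host name and that page_bookmark corresponds to the URL fragment; the project may derive these differently, and `describe_external_link` is a hypothetical name.

```python
from urllib.parse import urlsplit


def describe_external_link(destination):
    """Derive domain and page_bookmark from a target URL (assumed mapping)."""
    parts = urlsplit(destination)
    return {
        "destination": destination,
        # Assumption: "domain" is the host name of the URL.
        "domain": parts.hostname,
        # Assumption: "page_bookmark" is the part after '#', if any.
        "page_bookmark": parts.fragment or None,
    }


print(describe_external_link("https://www.example.org/docs/page.html#History"))
```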

wkp_wiki_link

Any link that is internal to Wikimedia.

parent_page_id -> Wikimedia Id for the page
parent_revision_id -> Revision Id the element is associated with
parent_header_id -> The header the element is a child of.
element_id -> Unique (to the page) integer for an element.
destination -> The canonical destination name.
destination_page_id -> The canonical destination page id.
wiki_name_space -> The Wikimedia name space for the target page.
link_text -> The hyperlink text for the link
page_bookmark -> The page's bookmark, if any.
page_exists -> A binary yes or no indicating whether the target page exists.

We do two very important cleaning steps with internal Wikimedia links. The first is that we normalize for redirects. For example, [Apple tree], [Apple (Fruit)], and [Apples] all redirect to [Apple]. If you do not account for redirects, these links would all count toward different articles. The second is that editors often use an en.wikipedia.org/link URL instead of the proper Article Name format. The transform SQL does its best to convert those mistakes back.
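
The two link-cleaning steps can be sketched as follows. This is an illustrative Python version, not the transform SQL itself: the function names, the in-memory `redirects` dictionary, and the `/wiki/` path prefix check are assumptions made for the example.

```python
from urllib.parse import unquote, urlsplit


def title_from_url(destination):
    """Turn an en.wikipedia.org/wiki/... URL into a title; leave other input unchanged."""
    if "wikipedia.org/" not in destination.lower():
        return destination
    # Add a scheme-less prefix so urlsplit recognises the host part.
    parts = urlsplit(destination if "//" in destination else "//" + destination)
    if parts.hostname == "en.wikipedia.org" and parts.path.startswith("/wiki/"):
        return unquote(parts.path[len("/wiki/"):]).replace("_", " ")
    return destination


def resolve_link(destination, redirects):
    """Normalise a link target: fix URL-style links, then follow one redirect."""
    title = title_from_url(destination)
    # `redirects` maps a redirect title to its target title (hypothetical input).
    return redirects.get(title, title)


redirects = {"Apple tree": "Apple", "Apple (Fruit)": "Apple", "Apples": "Apple"}
print(resolve_link("Apple tree", redirects))
print(resolve_link("https://en.wikipedia.org/wiki/Apples", redirects))
```

With this normalisation, every variant counts toward the same canonical destination.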

wkp_table

This table holds all tables and lists (ordered, unordered, and definition lists).

parent_page_id -> Wikimedia Id for the page
parent_revision_id -> Revision Id the element is associated with
parent_header_id -> The header the element is a child of.
element_id -> Unique (to the page) integer for an element.
tableHtmlType -> The primary HTML element of the table: TABLE, OL, UL, or DL
caption -> Table title (if any).
html -> The table in HTML form, using <DL>, <OL>, or <TABLE> tags depending on type. Wiki tables are tricky to capture in a common structured form: columns and rows can be merged, and table header tags can be abused. We default to leaving the table in HTML and letting the caller deal with it. The transform SQL also excludes many blank or empty tables from the final result.
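
Because the html column is left for the caller to interpret, one possible starting point is a small parser built on Python's standard `html.parser` module. The sketch below is an assumption about how a caller might consume the column, not part of the project; it ignores merged cells (rowspan and colspan) and nested tables, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of each cell, row by row, from a simple HTML table."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text pieces of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag names, so <TR> and <tr> both arrive as "tr".
        if tag == "tr":
            self._row = []
            self._cell = None
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def table_rows(html):
    parser = TableRows()
    parser.feed(html)
    parser.close()
    return parser.rows


sample = ("<TABLE><TR><TH>Year</TH><TH>Film</TH></TR>"
          "<TR><TD>2006</TD><TD>Cars</TD></TR></TABLE>")
print(table_rows(sample))
```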

wkp_tag

Contains info about an HTML tag. Mostly these are tags that Sweble cannot parse: special XML tags that are not handled elsewhere in the code. For the most part, ref and math are the main ones.

parent_page_id -> Wikimedia Id for the page
parent_revision_id -> Revision Id the element is associated with
parent_header_id -> The header the element is a child of.
element_id -> Unique (to the page) integer for an element.
tag -> tag name (without brackets)
tag_value -> contents inside of the tags

wkp_template

Templates are macros for Wikipedia. They allow common code to be shared among articles. For example {{Global warming}} will create a table with links that are common to all GW related pages. Any change to the template will reflect on all pages that use that template.

parent_page_id -> Wikimedia Id for the page
parent_revision_id -> Revision Id the element is associated with
parent_header_id -> The header the element is a child of.
element_id -> Unique (to the page) integer for an element.
template_type -> name of the template
is_info_box -> Info boxes are a special kind of Wikipedia template. They are tables of information found on the upper right hand side of an article. They contain valuable key value pair information.

wkp_template_param

A template may have 0 to many parameters. They can be named parameters or positional.

parent_page_id -> Wikimedia Id for the page
parent_revision_id -> Revision Id the element is associated with
parent_header_id -> The header the element is a child of.
element_id -> Unique (to the page) integer for an element.
param_name -> Name of the parameter. If positional, it will be *POS_[order]
param_value -> value of the parameter
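
To show the shape of the rows this produces, here is a minimal Python sketch that splits a flat template call into named and positional parameters. The real parsing is done by Sweble, so this is only an illustration: it does not handle nested templates or wiki links that contain pipes, the function name is hypothetical, and numbering positional parameters from 1 is an assumption (the source only states the *POS_[order] pattern).

```python
def template_params(wikitext):
    """Parse a flat {{...}} call into (template_type, [(param_name, param_value), ...])."""
    body = wikitext.strip()
    if not (body.startswith("{{") and body.endswith("}}")):
        raise ValueError("expected a {{...}} template call")
    name, *raw_params = body[2:-2].split("|")
    params = []
    position = 0
    for raw in raw_params:
        key, sep, value = raw.partition("=")
        if sep:
            # Named parameter: key=value
            params.append((key.strip(), value.strip()))
        else:
            # Positional parameter: assumed to be numbered from 1.
            position += 1
            params.append((f"*POS_{position}", raw.strip()))
    return name.strip(), params


print(template_params("{{Convert|100|km|abbr=on}}"))
```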

wkp_text

The natural language text of a Wikipedia article. The wikicode parsing process isn't an exact process and some artifacts and junk are to be expected.

parent_page_id -> Wikimedia Id for the page
parent_revision_id -> Revision Id the element is associated with
parent_header_id -> The header the element is a child of.
text -> text fragment
text_length -> character count of the text fragment

Sections that are empty or have fewer than 20 characters are removed.
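
For natural language work, the text rows can be joined back to their headers so that ancillary sections are skipped. The Python sketch below assumes rows are available as dictionaries keyed by the column names in this dictionary and that parent_header_id in wkp_text matches header_id in wkp_header for the same page; the function name and sample rows are hypothetical.

```python
def natural_language_text(headers, texts):
    """Return text fragments whose parent header is not flagged as ancillary."""
    # Headers are unique per page, so the join key is (page id, header id).
    ancillary = {
        (h["parent_page_id"], h["header_id"])
        for h in headers
        if h["is_ancillary"]
    }
    return [
        t["text"]
        for t in texts
        if (t["parent_page_id"], t["parent_header_id"]) not in ancillary
    ]


headers = [
    {"parent_page_id": 1, "header_id": 0, "title": "LEAD", "is_ancillary": False},
    {"parent_page_id": 1, "header_id": 1, "title": "References", "is_ancillary": True},
]
texts = [
    {"parent_page_id": 1, "parent_header_id": 0, "text": "Albedo is an important concept."},
    {"parent_page_id": 1, "parent_header_id": 1, "text": "A list of cited sources and notes."},
]
print(natural_language_text(headers, texts))
```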