Home - nielsenbe/Spark-Wiki-Parser GitHub Wiki

What does this app do?

This application uses Apache Spark to parse a Wikipedia dump, clean it, and convert it into a flattened series of files.

How is this software different from other parsers?

There are many Wikipedia parsers out there, but this parser offers the following:

  • Native Spark integration
  • Does significant cleaning and formatting of the parser output
  • Flattens the output into more manageable formats

Does this application make sense for your needs?

This application is designed to be used with Apache Spark on a cluster. It will also work in local mode, but you will probably want to use a partial dump file for testing. You can parse a full dump on a single node; it will just take a while (8+ hours). If you do not intend to use a cluster, or do not need the data cleaning this application provides, then you may want to look into other Wikipedia parsing projects.
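For the local-mode case, here is a minimal sketch of creating a local Spark session for testing against a partial dump. This is standard Spark setup, not this project's actual entry point, and the application name is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: a local-mode session that uses all cores on one machine.
// Handy for testing against a partial dump file.
val spark = SparkSession.builder()
  .appName("wiki-parser-local-test")  // hypothetical name
  .master("local[*]")
  .getOrCreate()
```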

If you just need a very specific component (just links or just templates), then DBPedia might be a better fit.

Why Spark?

Apache Spark was chosen for a number of reasons. The primary reason is that it offers excellent support for a wide range of data formats: reading the compressed BZ2 Wikipedia dump takes just a few lines of code. The second reason is to take advantage of Spark's DataFrame and machine learning libraries. The DataFrame/Dataset API makes analyzing and joining the data simple, and the machine learning library offers a wealth of useful tools. Lastly, Spark makes the dump's size (15 GB compressed, ~80 GB uncompressed) manageable.
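To illustrate the first point, here is a minimal sketch of reading the dump, assuming the Databricks spark-xml package is on the classpath and using a hypothetical file path (this is not necessarily how this project wires it up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("wiki-dump-read")  // hypothetical name
  .getOrCreate()

// BZ2 is a splittable codec, so Spark decompresses the dump transparently.
// spark-xml turns each <page> element into one row.
val pages = spark.read
  .format("xml")
  .option("rowTag", "page")
  .load("/data/enwiki-latest-pages-articles.xml.bz2")  // hypothetical path

pages.printSchema()
pages.select("title").show(10, truncate = false)
```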

Project Goals

  • Grant researchers easy access to the data stored in the Wikipedia data dump.
    • This project aims to abstract away the complexity of data preparation so that researchers can focus on analysis.
    • A heavy focus is put on taking advantage of Spark's dataframe/dataset API.
  • Focus on content and semantics over formatting and syntax.
    • This project assumes that the user is less concerned with formatting (bold, italics, element positioning) and more concerned with the content itself.
    • We also assume that this data will be used in bulk and that minor parsing errors are not a huge concern.
  • Take advantage of Spark's built-in data import and export functionality (see the sketch after this list).
    • No need to reinvent the wheel.
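
For example, once the parsed pages are in a DataFrame (such as `pages` from the read sketch above), Spark's built-in writers can persist them in common formats with no extra code in this project. The output paths are hypothetical:

```scala
// Minimal sketch: persist a parsed DataFrame with Spark's built-in writers.
// Paths are hypothetical.
pages.write.parquet("/output/pages.parquet")   // columnar, good for downstream Spark analysis
pages.write.json("/output/pages.json")         // line-delimited JSON

// CSV cannot hold nested columns, so select simple fields first.
pages.select("id", "title").write.option("header", "true").csv("/output/page_titles.csv")
```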