Home - nielsenbe/Spark-Wiki-Parser GitHub Wiki

What does this app do?

This application uses Apache Spark to parse a Wikipedia dump, clean it, and convert it into a flattened series of files.

How is this software different from other parsers?

There are many Wikipedia parsers out there, but this parser offers the following:

  • Native Spark integration
  • Does significant cleaning and formatting of the parser output
  • Flattens the output into more manageable formats

Does this application make sense for your needs?

This application is designed to be used with Apache Spark on a cluster. It will also work in local mode, but you will probably want to use a partial dump file for testing. You can parse a full dump on a single node; it will just take a while (8+ hours). If you do not intend to use a cluster, or do not need the data cleaning this application provides, then you may want to look into other Wikipedia parsing projects.
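For the local-mode case, here is a minimal sketch of creating a local Spark session for testing against a partial dump. This is standard Spark setup, not this project's actual entry point, and the application name is hypothetical.

```scala
import org.apache.spark.sql.SparkSession

// Minimal sketch: a local-mode session that uses all cores on one machine.
// Handy for testing against a partial dump file.
val spark = SparkSession.builder()
  .appName("wiki-parser-local-test")  // hypothetical name
  .master("local[*]")
  .getOrCreate()
```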

If you just need a very specific component (just links or just templates), then DBPedia might be a better fit.

Why Spark?

Apache Spark was chosen for a number of reasons. The primary reason is that it offers excellent support for a wide range of data formats: reading the compressed BZ2 Wikipedia dump takes just a few lines of code. The second reason is to take advantage of Spark's DataFrame and machine learning libraries. The DataFrame/Dataset API makes analyzing and joining the data simple, and the machine learning library offers a wealth of useful tools. Lastly, Spark makes the dump's size (15 GB compressed, ~80 GB uncompressed) manageable.
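To illustrate the first point, here is a minimal sketch of reading the dump, assuming the Databricks spark-xml package is on the classpath and using a hypothetical file path (this is not necessarily how this project wires it up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("wiki-dump-read")  // hypothetical name
  .getOrCreate()

// BZ2 is a splittable codec, so Spark decompresses the dump transparently.
// spark-xml turns each <page> element into one row.
val pages = spark.read
  .format("xml")
  .option("rowTag", "page")
  .load("/data/enwiki-latest-pages-articles.xml.bz2")  // hypothetical path

pages.printSchema()
pages.select("title").show(10, truncate = false)
```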

Project Goals

  • Grant researchers easy access to the data stored in the Wikipedia data dump.
    • This project aims to abstract away the complexity of data preparation so that researchers can focus on analysis.
    • A heavy focus is put on taking advantage of Spark's dataframe/dataset API.
  • Focus on content and semantics over formatting and syntax.
    • This project assumes that the user is less concerned with formatting (bold, italics, element positioning) and more concerned with the content itself.
    • We also assume that this data will be used in bulk and that minor parsing errors are not a huge concern.
  • Take advantage of Spark's built-in data import and export functionality (see the sketch after this list).
    • No need to reinvent the wheel.
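
For example, once the parsed pages are in a DataFrame (such as `pages` from the read sketch above), Spark's built-in writers can persist them in common formats with no extra code in this project. The output paths are hypothetical:

```scala
// Minimal sketch: persist a parsed DataFrame with Spark's built-in writers.
// Paths are hypothetical.
pages.write.parquet("/output/pages.parquet")   // columnar, good for downstream Spark analysis
pages.write.json("/output/pages.json")         // line-delimited JSON

// CSV cannot hold nested columns, so select simple fields first.
pages.select("id", "title").write.option("header", "true").csv("/output/page_titles.csv")
```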