# Spark-Wiki-Parser Wiki Home
## What does this app do?
This application uses Apache Spark to parse a Wikipedia dump, clean it, and convert it into a flattened series of files.
## How is this software different from other parsers?
There are many Wikipedia parsers available. This parser offers the following advantages:
- Native Spark integration
- Significant cleaning and formatting of the parser output
- Flattened output in more manageable formats
## Does this application make sense for your needs?
This application is designed to be used with Apache Spark on a cluster. It will also work in local mode, but you'll probably want to use a partial dump file for testing (a local-mode sketch follows). You can parse a full dump on a single node; it will just take a while (8+ hours). If you have no intention of using a cluster, or don't need the data cleaning this app provides, then you may want to look into other projects.
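If you do want to experiment on a single machine first, a local-mode session might look like the following. This is a minimal sketch: the app name and configuration values are illustrative, not part of this project's actual setup.

```scala
import org.apache.spark.sql.SparkSession

// Build a local-mode session for testing against a partial dump.
val spark = SparkSession.builder()
  .appName("wiki-parser-local-test")       // illustrative name, not the project's
  .master("local[*]")                      // use all local cores instead of a cluster
  .config("spark.sql.shuffle.partitions", "8") // fewer partitions suit small test data
  .getOrCreate()
```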
If you just need a very specific component (just links or just templates), then DBPedia might be a better fit.
## Why Spark?
Apache Spark was chosen for a number of reasons. The primary reason is its excellent support for a wide range of data formats: reading the compressed BZ2 Wikipedia dump takes just a few lines of code. The secondary reason is access to Spark's data frame and machine learning libraries. The DataFrame/Dataset API makes analyzing and joining data simple, and the machine learning library offers a wealth of useful tools. Lastly, Spark makes the dump's size (15 GB compressed, ~80 GB uncompressed) manageable.
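As an illustration of how little code the read requires, here is a sketch using the spark-xml package (com.databricks:spark-xml) and the session from the earlier example. The dump path is a placeholder, and this is not necessarily how this project wires up its own input.

```scala
import org.apache.spark.sql.DataFrame

// Spark decompresses the BZ2 file transparently; with spark-xml on the
// classpath, each <page> element of the dump becomes one row.
val pages: DataFrame = spark.read
  .format("xml")
  .option("rowTag", "page")
  .load("/data/enwiki-latest-pages-articles.xml.bz2") // placeholder path

pages.printSchema()
```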
## Project Goals
- Grant researchers easy access to the data stored in the Wikipedia data dump.
  - This project aims to abstract away the complexity of data preparation so that researchers can focus on analysis.
  - A heavy focus is put on taking advantage of Spark's DataFrame/Dataset API.
- Focus on content and semantics over formatting and syntax.
  - This project assumes that the user is less concerned with formatting (bold, italics, element positioning) and more concerned with the content itself.
  - We also assume that the data will be used in bulk and that minor parsing errors are not a huge concern.
- Take advantage of Spark's built-in data import and export functionality (a sketch follows this list).
  - No need to reinvent the wheel.
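For example, once the parser has written its flattened output, downstream analysis can lean entirely on Spark's built-in readers, writers, and DataFrame API. The sketch below is hypothetical: the parquet paths and column names (id, title, articleId) are assumptions for illustration, not the parser's actual output schema.

```scala
// Read two hypothetical flattened outputs back in as DataFrames.
val articles = spark.read.parquet("/output/articles.parquet")
val links = spark.read.parquet("/output/links.parquet")

// Count outgoing links per article with a plain DataFrame join and aggregation.
val linkCounts = articles
  .join(links, articles("id") === links("articleId"))
  .groupBy(articles("title"))
  .count()

// Spark's built-in writers export to common formats in one call.
linkCounts.write.mode("overwrite").csv("/output/link_counts")
```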