Parser Wikipedia Run it - nielsenbe/Spark-Wiki-Parser GitHub Wiki

Get the dump file.

This application reads data from the MediaWiki XML dump files, which are BZ2-compressed XML. The English Wikipedia dump is the largest (~15 GB compressed). Wikimedia also publishes the same dump split into 50 smaller parts, which are useful for testing.
Main site:

Wikimedia Downloads

Mirrors

  • enwiki = English Wikipedia
  • enwiki-[date]-pages-articles.xml.bz2 = Full backup
  • enwiki-[date]-pages-articles1.xml-p[id].bz2 = Full backup divided into 50 pieces

Use caution with this file format:

  • pages-meta-history = Contains all revisions for a page. This greatly increases the size (into terabytes) and for most tasks is not needed.

Do not use this file format:

  • pages-meta-current = Contains user pages and discussion pages. These are not supported by the parser.
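The naming rules above can be checked before a job is submitted. The following is a minimal sketch, not part of the parser: the `DumpFileCheck` object and its `classify` function are hypothetical names, and the check relies only on the file name substrings listed above.

```scala
object DumpFileCheck {
  // Hypothetical categories mirroring the guidance in the lists above.
  sealed trait DumpKind
  case object Supported extends DumpKind      // pages-articles
  case object UseWithCaution extends DumpKind // pages-meta-history
  case object Unsupported extends DumpKind    // pages-meta-current
  case object Unknown extends DumpKind        // anything else

  // Classify a dump by the marker substring in its file name.
  def classify(fileName: String): DumpKind =
    if (fileName.contains("pages-meta-history")) UseWithCaution
    else if (fileName.contains("pages-meta-current")) Unsupported
    else if (fileName.contains("pages-articles")) Supported
    else Unknown
}
```

For example, `DumpFileCheck.classify("enwiki-20190101-pages-articles.xml.bz2")` returns `Supported`, while a `pages-meta-current` file returns `Unsupported`.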

The basic element of the dump files is the 'page'. All pages have roughly the same element layout. A page can be a redirect (no wikitext), a standard Wikipedia article, a template, or a variety of other constructs.

Cluster Recommendations

The parsing phase is much more compute and memory intensive than the raw file size (15 GB compressed) would indicate. The parser is compatible with Spark 2.0 and later.

With 20 cores and 100 GB of RAM, parsing takes 2-3 hours. Adding more cores reduces that time roughly linearly.
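As a rough planning aid, the scaling rule above can be written as a small calculation. This is only a sketch under the assumption of ideal linear scaling from the 20-core baseline; real clusters will deviate, and the `RuntimeEstimate` object is a hypothetical helper, not part of the parser.

```scala
object RuntimeEstimate {
  // Baseline from the text: 20 cores finish in roughly 2 to 3 hours.
  val baselineCores: Int = 20
  val baselineHours: (Double, Double) = (2.0, 3.0)

  // Assumes run time is inversely proportional to the core count.
  // Returns a (low, high) estimate in hours.
  def estimateHours(cores: Int): (Double, Double) = {
    require(cores > 0, "cores must be positive")
    val factor = baselineCores.toDouble / cores
    (baselineHours._1 * factor, baselineHours._2 * factor)
  }
}
```

Under that assumption, `RuntimeEstimate.estimateHours(40)` gives `(1.0, 1.5)`, that is, about 1 to 1.5 hours on 40 cores.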

Required Jars

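The package versions depend on your Spark cluster version. The _2.11 suffix in each coordinate is the Scala version the artifact was built for, which must match the Scala version of the cluster.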

com.databricks:spark-xml_2.11:0.5.0
com.github.nielsenbe:spark-wiki-parser_2.11:1.0

Arguments

--dumpfile

Full path of the dump file. The path format varies with the source system (HDFS, S3, DFS, etc.).

Examples:

s3://[bucketname]/enwiki-20190101-pages-articles-multistream.xml.bz2

adl://[account].azuredatalakestore.net/wiki/enwiki-20190101-pages-articles-multistream.xml.bz2

hdfs://wiki/enwiki-20190101-pages-articles-multistream.xml.bz2

/dbfs/wiki/enwiki-20190101-pages-articles-multistream.xml.bz2

--destloc

The path to where the parser should store the files.

--destformat (optional)

File format for DB tables. Default is parquet.

Examples: parquet, json, orc, csv

--lowmemorymode (optional)

Caches the intermediate dataset on disk instead of in memory. You will need to clean up the temporary files when the run is complete. Files are stored in [--destformat]/stg

Values: true/false
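To make the required and optional arguments concrete, here is a minimal sketch of how such flag/value pairs could be read and defaulted. It is not the parser's own argument handling: `ArgSketch`, `ParsedArgs`, and `parse` are hypothetical names, and only the flag names and the parquet default come from the descriptions above (defaulting the low-memory flag to false is an assumption).

```scala
object ArgSketch {
  // Hypothetical container; the parser's own Arguments class may differ.
  final case class ParsedArgs(
    dumpFile: String,
    destLoc: String,
    destFormat: String,
    lowMemoryMode: Boolean)

  def parse(args: Seq[String]): Either[String, ParsedArgs] = {
    // Pair each "--flag" with the value that follows it.
    val pairs: Map[String, String] = args
      .grouped(2)
      .collect { case Seq(k, v) if k.startsWith("--") => (k.drop(2), v) }
      .toMap

    (pairs.get("dumpfile"), pairs.get("destloc")) match {
      case (Some(dump), Some(dest)) =>
        Right(ParsedArgs(
          dump,
          dest,
          // Optional: default format is parquet.
          pairs.getOrElse("destformat", "parquet"),
          // Optional: assumed to be off unless explicitly "true".
          pairs.get("lowmemorymode").exists(_.equalsIgnoreCase("true"))))
      case (None, _) => Left("--dumpfile is required")
      case _         => Left("--destloc is required")
    }
  }
}
```

Calling `ArgSketch.parse(Seq("--dumpfile", "/dbfs/wiki/dump.xml.bz2", "--destloc", "/dbfs/wkp/"))` yields a `Right` with `destFormat` set to `"parquet"` and `lowMemoryMode` set to `false`. Omitting `--dumpfile` yields a `Left` with an error message.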

Spark submit

Spark Submit Documentation

Example:

spark-submit \
[Cluster settings (--master local, --master yarn, etc.)] \
--class "com.github.nielsenbe.sparkwikiparser.wikipedia.sparkdbbuild.DatabaseBuildMain" \
--packages com.databricks:spark-xml_2.11:0.5.0,com.github.nielsenbe:spark-wiki-parser_2.11:1.0 \
"" \
"--dumpfile" "[dump file location]" \
"--destloc" "[dest file location]"

We are running the job from the Maven package instead of a supplied jar, which is why the application jar argument is an empty string (""). This is a little nonstandard, but it works.

Launch in notebook

Load the Spark XML and Spark-Wiki-Parser Maven coordinates to the environment.

import com.github.nielsenbe.sparkwikiparser.wikipedia.sparkdbbuild._

// Give the location of the dump file and the destination folder
val dumpFile = "s3://[bucket name]/[dump file name]"
val destinationFolder = "s3://[bucket]/wkp/"
val destinationFormat = "parquet"
val lowMemoryMode = false
val args = Arguments(dumpFile, destinationFolder, destinationFormat, lowMemoryMode)

val wf = new DatabaseBuildFull()
wf.parseFileAndCreateDatabase(spark, args)

Launch in notebook. Retrieve intermediate Dataset.

import com.github.nielsenbe.sparkwikiparser.wikipedia.sparkdbbuild._
import com.github.nielsenbe.sparkwikiparser.wikipedia.WikipediaPage

val dumpFile = "s3://[bucket name]/[dump file name]"
val destinationFolder = "s3://[bucket]/wkp/"
val destinationFormat = "parquet"
val lowMemoryMode = false

val args = Arguments(dumpFile, destinationFolder, destinationFormat, lowMemoryMode)

val parser = new ParserFunctions()
val parsedItems = parser.getWikipediaAsDataSet(spark, args)

Helpful Links

Load external libraries for Apache Zeppelin Link Azure/Livy Specific

Load external libraries for Jupyter Link

Load external libraries for Databricks [Link]